February 7, 2025
Topic Modeling is a machine learning-powered topic detection function that automatically classifies message content into predefined categories.
Transparency and granularity are two words we think about often. While detection efficacy will always be our primary focus, we place a high priority on keeping Sublime understandable and actionable. This is why every malicious Attack Score verdict comes with clear explanations of the top signals, and why our Message Query Language (MQL) makes Detection Rules easy to parse and easy to adapt. Effective and transparent (a friendlier take on “trust, but verify”).
With transparency and granularity in mind, we’re excited to announce the release of Topic Modeling in beta. Topic Modeling is a machine learning-powered topic detection function that automatically classifies message content into predefined categories.
The release comes with 27 predefined topic categories, confidence scoring (low/medium/high), and flexible input support that includes text from attachments (extracted directly or via OCR). In this post, we’ll take a look at what Topic Modeling does, how we trained it, and a practical usage example.
Topic Modeling is the technology that powers new granular categories that were previously impossible. Being able to interact with message categories enables hyper-targeted detection, leading to fewer false negatives and false positives.
Topic Modeling is also a key component of our upcoming spam and graymail Attack Score verdicts (keep an eye out for that announcement). Spam and graymail occupy a wide space in between benign, suspicious, and malicious.
We define spam as email that’s unwanted, unsolicited, and typically high volume, including borderline scams (such as fake sweepstakes winner notifications) and unwanted direct solicitation. We define graymail as legitimate bulk email or business-to-business communication that sits on a spectrum from wanted to unwanted: it is plausibly wanted by a significant subset of recipients, even if another subset would not want it.
With Topic Modeling, Sublime can better ensure that these two types of mail end up exactly where you want them, with less risk of misclassification.
With constraints in mind, the team explored a variety of approaches for building the model, each with its own pros and cons. Ultimately, we landed on an approach that combines few-shot classification with supervised learning. Our process began by crafting precise prompts to guide an open-source large language model (LLM) in classifying texts into topics using a few examples (few-shot classification).
In collaboration with our Detection team, we then refined these prompts to ensure the LLM’s output aligned closely with our detection criteria. Once we had a prompt whose output consistently matched the Detection team’s results, we used the LLM to generate a large volume of labeled data. That data was then used to train a faster, more cost-effective NLU-style classifier for topics. This approach allowed us to achieve high accuracy and efficiency by investing more compute up front in classification and supervision.
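The two-stage pattern described above (few-shot labeling with an LLM, then training a smaller supervised classifier on those labels) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the production pipeline: the topic names, the example messages, the `label_with_llm` stand-in (a keyword heuristic in place of a real LLM call), and the Naive Bayes model are all hypothetical choices made so the example runs on its own.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical few-shot examples and topic names, for illustration only.
FEW_SHOT_EXAMPLES = [
    ("Your verification code is 482913", "security_and_authentication"),
    ("Flash sale: 40% off everything today", "promotions"),
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_prompt(text, examples=FEW_SHOT_EXAMPLES):
    """Assemble a few-shot classification prompt for an LLM."""
    lines = ["Classify each message into exactly one topic.", ""]
    for example_text, topic in examples:
        lines += [f"Message: {example_text}", f"Topic: {topic}", ""]
    lines += [f"Message: {text}", "Topic:"]
    return "\n".join(lines)


def label_with_llm(text):
    """Stand-in for the LLM call (assumption: a real system would send
    build_prompt(text) to a model and parse the topic from its reply)."""
    security_words = {"password", "verification", "login", "code"}
    if security_words & set(tokenize(text)):
        return "security_and_authentication"
    return "promotions"


class NaiveBayesTopicClassifier:
    """Small multinomial Naive Bayes model with add-one smoothing."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in tokenize(text):
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Stage 1: label raw messages with the (stand-in) LLM.
unlabeled = [
    "Reset your password using this login link",
    "Your verification code expires in ten minutes",
    "Huge discount on summer shoes this weekend",
    "New sale: save 20% on all orders",
]
labels = [label_with_llm(text) for text in unlabeled]

# Stage 2: train the small, fast classifier on the generated labels.
classifier = NaiveBayesTopicClassifier().fit(unlabeled, labels)
prediction = classifier.predict("Please confirm your password and login")
```

The expensive model is only needed once, at labeling time. At inference time only the small classifier runs, which is where the speed and cost savings in this kind of setup come from.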
Below are the 27 topics that we’ve included in this beta. They cover what we have found to be the most common message types for graymail and spam. Anything that falls outside of these topics is left uncategorized. For more info, see our Topic Modeling documentation.
We can use Topic Modeling to greatly simplify behavioral detection. As an example, let’s look at credential phishing. While there is a wide variety of attack techniques, a common approach is for attackers to send an official-looking email with a link to a well-constructed page that mimics a common login page (Microsoft, Google, etc.). Here’s a simple MQL statement for detecting this approach:
This query uses Topic Modeling to easily identify security/authentication messages coming from scammers. Here’s what it’s doing:
The power of Topic Modeling can be seen in lines 2–5. Those few, short lines of MQL replace the need for long lists and complex regex statements while maintaining a high level of confidence in the results.
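For readers who think more readily in general-purpose code, the same idea can be sketched in Python: a single check on a predicted topic and its confidence level stands in for a long keyword list or regex. This is an illustration only, not MQL and not Sublime's API. The `TopicPrediction` type, the `has_topic` helper, and the topic names are hypothetical; only the low/medium/high confidence levels come from the release described above.

```python
from dataclasses import dataclass

# Confidence levels from the release, ordered so they can be compared.
CONFIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}


@dataclass(frozen=True)
class TopicPrediction:
    """Hypothetical shape of one topic prediction for a message."""
    name: str
    confidence: str  # "low", "medium", or "high"


def has_topic(predictions, name, min_confidence="high"):
    """Return True if any prediction matches the topic name at or above
    the requested confidence level."""
    threshold = CONFIDENCE_RANK[min_confidence]
    return any(
        p.name == name and CONFIDENCE_RANK[p.confidence] >= threshold
        for p in predictions
    )


# Hypothetical predictions for a single message.
predictions = [
    TopicPrediction("security_and_authentication", "high"),
    TopicPrediction("newsletters", "low"),
]

is_auth_themed = has_topic(predictions, "security_and_authentication")
is_newsletter = has_topic(predictions, "newsletters")
```

In a detection rule, a topic check like this would typically be combined with other signals, such as sender reputation or link analysis, rather than used on its own.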
Stay tuned to see how this functionality powers our upcoming graymail and spam Attack Score verdicts.
For advanced users, Topic Modeling is now available in beta for use within custom Detection Rules, threat hunts, and Automations. Review our Topic Modeling usage docs for more information.