Topic Modeling is a machine learning-powered topic detection function that automatically classifies message content into predefined categories.

Transparency and granularity are two words we think about often. While detection efficacy will always be our primary focus, we place a high priority on keeping Sublime understandable and actionable. This is why every malicious Attack Score verdict comes with clear explanations of the top signals, and why our Message Query Language (MQL) makes Detection Rules easy to parse and easy to adapt. Effective and transparent (a friendlier take on “trust, but verify”).

With transparency and granularity in mind, we’re excited to announce the release of Topic Modeling in beta. Topic Modeling is a machine learning-powered topic detection function that automatically classifies message content into predefined categories.

The release comes with 27 predefined topic categories, confidence scoring (low/medium/high), and flexible input support that includes text from attachments (extracted directly or via OCR). In this post, we’ll take a look at what Topic Modeling does, our training method, and a practical example of usage.

Topic Modeling enables a new level of granularity

Topic Modeling powers granular message categories that were not previously possible. Being able to interact with message categories enables hyper-targeted detection, leading to fewer false negatives and false positives.

Topic Modeling is also a key component of our upcoming spam and graymail Attack Score verdicts (keep an eye out for that announcement). Spam and graymail occupy a wide space in between benign, suspicious, and malicious.

We define spam as email that’s unwanted, unsolicited, and typically high volume, including borderline scams (such as fake sweepstakes winners) and unwanted direct solicitation. We define graymail as legitimate bulk email or business-to-business communication that, while operating on a spectrum of wanted to unwanted, is plausibly wanted by a significant subset of recipients, even if it would be unwanted by another subset.

With Topic Modeling, Sublime can better ensure that these two types of mail end up exactly where you want them, with less risk of misclassification.

How we built Topic Modeling

With our constraints in mind, the team explored a variety of approaches for building the model, each with its own list of pros and cons. Ultimately, we landed on an approach that combines few-shot classification with supervised learning. Our process began by crafting precise prompts to guide an open-source large language model (LLM) in classifying texts into topics using only a few examples (few-shot classification).

In collaboration with our Detection team, we then refined these prompts to ensure the LLM’s output aligned closely with our detection criteria. Once we had a prompt that consistently matched the Detection team’s results, we used the LLM to generate a large volume of labeled data. That data was then used to train a faster, more cost-effective NLU-style classifier for topics. This approach allowed us to achieve high levels of accuracy and efficiency by investing more compute in upfront classification and supervision.

List of Topics

Below are the 27 topics that we’ve included in this beta. They cover what we have found to be the most common message types for graymail and spam. Anything that falls outside of these topics is left uncategorized. For more info, see our Topic Modeling documentation.

  • Financial Communications: Banking, investments, bills, invoices, financial services
  • Legal and Compliance: Legal matters, terms of service, privacy policies, compliance
  • Customer Service and Support: Support tickets, inquiries, feedback requests
  • Professional and Career Development: Job opportunities, training, industry insights
  • Security and Authentication: Account security, password resets, 2FA, login alerts
  • Software and App Updates: Software changes, new features, bug fixes
  • File Sharing and Cloud Services: Shared files, storage notifications, collaboration
  • Secure Message: Encrypted messaging and confidential communications
  • Newsletters and Digests: Regular content compilations and updates
  • Reminders and Notifications: Event/task reminders, calendar notifications
  • Out of Office and Automatic Replies: Absence notifications, auto-responses
  • Bounce Back and Delivery Failure Notifications: Failed email delivery notices
  • Voicemail Call and Missed Call Notifications: Alerts for voicemails, calls, and missed call notifications
  • Advertising and Promotions: Marketing emails, sales, product launches
  • Events and Webinars: Event invitations, RSVPs, online/offline gatherings
  • Travel and Transportation: Trip planning, bookings, travel updates
  • Government Services: Official government communications
  • Emergency Alerts: Urgent notifications, weather, public safety
  • News and Current Events: News updates and current affairs
  • Political Mail: Campaign messages, political updates
  • Charity and Non-Profit: Fundraising, volunteer opportunities
  • Environmental and Sustainability: Updates on environmental initiatives and sustainability efforts
  • Health and Wellness: Medical appointments, health insurance, wellness
  • Educational and Research: Learning materials, academic announcements
  • E-Signature: Electronic document signing requests and updates
  • Entertainment and Sports: Movies, music, games, sports updates
  • Social Media and Networking: Social network notifications, connections

Example: Detecting fake authentication requests

We can use Topic Modeling to greatly simplify behavioral detection. As an example, let’s look at credential phishing. While there are a wide variety of attack techniques, a common approach is for attackers to send an official-looking email with a link to a well-constructed page that mimics a common login page (Microsoft, Google, etc.). Here’s a simple MQL statement for detecting this approach:
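A sketch of such a rule follows, laid out so that its line numbers match the line-by-line walkthrough after it. The identifiers used here (`ml.nlu_classifier(...).topics`, the `cred_theft` intent name, `$free_email_providers`, and the `profile.by_sender()` fields) are assumptions based on common MQL conventions rather than something this post confirms, so check them against the current MQL reference before relying on the rule.

```
type.inbound  // sketch only; identifiers below are assumptions
and any(ml.nlu_classifier(body.current_thread.text).topics,
        .name == "Security and Authentication"
        and .confidence == "high"
)
and any(ml.nlu_classifier(body.current_thread.text).intents,
        .name == "cred_theft" and .confidence == "high"
)
and sender.email.domain.root_domain in $free_email_providers
and not profile.by_sender().any_false_positives
and not profile.by_sender().solicited
```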

This query uses Topic Modeling to identify security/authentication messages coming from scammers. Here’s what it’s doing:

  • Line 1: Only include incoming messages
  • Lines 2–5: Only include messages that Topic Modeling has categorized as “Security and Authentication” with a high level of confidence
  • Lines 6–8: Only include messages that NLU has classified as "Credential Theft" with a high level of confidence
  • Line 9: Only include messages sent from a free email provider (Gmail, Proton, etc.)
  • Line 10: Exclude messages that have been classified as false positives previously
  • Line 11: Exclude solicited messages (senders that have been previously corresponded with)

The power of Topic Modeling can be seen in lines 2–5. Those few, short lines of MQL replace the need for long lists and complex regex statements while maintaining a high level of confidence in the results.

Start using Topic Modeling

Stay tuned to see how this functionality powers our upcoming graymail and spam Attack Score verdicts.

For advanced users, Topic Modeling is now available in beta for use within custom Detection Rules, threat hunts, and Automations. Review our Topic Modeling usage docs for more information.
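As a possible starting point for a threat hunt or custom rule, the sketch below surfaces inbound messages that fall into two graymail-style topics. The topic names come from the list in this post; the field path `ml.nlu_classifier(...).topics` and the choice to accept medium and high confidence are assumptions for illustration, not a confirmed recipe.

```
// Hypothetical hunt: topic names are from this post's list,
// field paths are assumptions to verify against the MQL reference.
type.inbound
and any(ml.nlu_classifier(body.current_thread.text).topics,
        .name in ("Newsletters and Digests", "Advertising and Promotions")
        and .confidence in ("medium", "high")
)
```

Tightening the confidence check to `"high"` only would trade coverage for fewer borderline matches.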

About the Author

Aryan Luthra

ML Researcher

Aryan is a Machine Learning Researcher at Sublime, where he focuses on the intersection of AI, ML, and cybersecurity. He holds degrees in Computer Science and Physics from UC Berkeley and has previously developed ML-focused threat actor tracking algorithms at Microsoft.
