Machine Learning
March 30, 2023
Learn how Sublime uses Siamese Neural Networks, Object Detection, and other signals within MQL to identify credential phishing attacks.
Our first blog provided an overview of Sublime's Message Query Language (MQL), an interface that allows our Detection and Machine Learning (ML) Teams and any Sublime user to craft sophisticated logic to detect, prevent, and hunt for a range of email attacks.
Today, we begin our series on MQL's Enrichment Functions. These functions expose vital capabilities that work with core MQL logic to create robust attack-specific detectors.
First is LinkAnalysis, one of our most effective defenses against credential phishing.
Credential harvesting is an effective phishing technique where attackers seek to deceive users into providing their login credentials via a seemingly authentic login portal. Attacks aimed at stealing credentials often impersonate brands using high-quality logos on phishing pages to enhance credibility and increase the likelihood of the victim perceiving the site as authentic.
The LinkAnalysis function in MQL takes a link as input, navigates to the web page, and takes a screenshot. Then, it analyzes the screenshot using techniques like computer vision and assesses whether it is a likely credential phishing site. This process is described in detail below:
The first component of credential phishing detection is the Context Classifier, which distinguishes our approach from other methods that rely on a single indicator to determine suspiciousness. Instead, we use a range of signals to evaluate a message's authenticity. We combine WHOIS and Machine Learning to analyze URLs for patterns commonly found in phishing links.
Our Natural Language Understanding (NLU) engine measures the email's tone (e.g., is the sender making an urgent request?) and intent (e.g., does our model classify the body text as `cred_theft`?). We analyze email header metadata to uncover impersonation attempts. We also maintain a blocklist of exploited domains to catch potential threats that may bypass the other signals.
Together, these features form a robust and thorough approach to identifying suspicious links for deeper analysis.
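As a rough illustration of how several weak signals can gate a more expensive analysis step, here is a minimal Python sketch. The field names, the `min_signals` parameter, and the rule that a blocklist hit alone is sufficient are all hypothetical choices made for this example; they are not Sublime's actual Context Classifier logic.

```python
from dataclasses import dataclass


@dataclass
class LinkContext:
    # Hypothetical signal names, loosely mirroring the signals described above.
    url_looks_suspicious: bool   # e.g. from WHOIS data and a URL model
    urgent_tone: bool            # NLU tone signal
    cred_theft_intent: bool      # NLU intent signal
    sender_impersonation: bool   # header metadata signal
    domain_blocklisted: bool     # blocklist lookup


def needs_deeper_analysis(ctx: LinkContext, min_signals: int = 2) -> bool:
    """Decide whether a link should be sent on for deeper analysis.

    Assumption for this sketch: a blocklist hit is enough on its own;
    otherwise at least `min_signals` of the softer signals must agree.
    """
    if ctx.domain_blocklisted:
        return True
    soft_signals = [
        ctx.url_looks_suspicious,
        ctx.urgent_tone,
        ctx.cred_theft_intent,
        ctx.sender_impersonation,
    ]
    return sum(soft_signals) >= min_signals


suspicious = LinkContext(True, True, False, False, False)
benign = LinkContext(False, True, False, False, False)
print(needs_deeper_analysis(suspicious), needs_deeper_analysis(benign))
```

The point of the sketch is only that no single indicator decides the outcome unless it is a strong one; a production system would likely weight or learn these combinations rather than count them.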
We use an automated headless browser to visit a website, render the DOM, and capture artifacts of interest. We employ a variety of mechanisms to combat client-side cloaking strategies to maximize the likelihood of successfully visiting a phishing page.
Next up in the LinkAnalysis workflow is our object detection model. Object detection models are computer vision algorithms that identify and locate specific objects within an image or video. Object detection models gained popularity when they became foundational for self-driving car technology to detect road signs, pedestrians, and other automobiles on the road.
We use object detection to detect logos, input boxes, and buttons on a website. We chose Phishpedia, an excellent open-source deep learning project (based on the Faster-RCNN implementation in Detectron2), as our baseline model architecture.
The goal of the object detection model is simple: take an image as input, featurize it, and attempt to predict bounding boxes around the objects (e.g., logos, input boxes) we are interested in detecting. Each bounding box has a confidence score that estimates the probability that the box contains the object of interest.
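To make the idea of scored bounding boxes concrete, the following Python sketch filters a list of detections by confidence. The `Detection` type, the sample boxes, and the 0.5 threshold are assumptions for illustration; they do not reflect the output format of Phishpedia, Detectron2, or Sublime's model.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Detection:
    label: str                               # e.g. "logo", "input", "button"
    box: Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max) in pixels
    score: float                             # model confidence in [0, 1]


def keep_confident(detections: List[Detection], min_score: float = 0.5) -> List[Detection]:
    """Return detections at or above the threshold, highest confidence first."""
    kept = [d for d in detections if d.score >= min_score]
    return sorted(kept, key=lambda d: d.score, reverse=True)


# Hypothetical model output for a single screenshot.
detections = [
    Detection("input", (60, 150, 340, 190), 0.88),
    Detection("logo", (40, 30, 200, 90), 0.97),
    Detection("button", (60, 260, 180, 300), 0.31),
]

confident = keep_confident(detections, min_score=0.5)
print([d.label for d in confident])
```

Raising the threshold trades recall for precision: fewer spurious boxes reach the later stages, at the cost of occasionally dropping a real logo or input field.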
The final piece of the puzzle consists of detecting known brands and logins. We take a three-step approach to do this effectively: Siamese Networks, optical character recognition (OCR), and pixel math.
Once the object detection model detects logos, we crop each one into its own image and run it through a Siamese Neural Network to generate a feature vector. We then compare this feature vector to a database of known brand logos using a similarity calculation. If the resulting score exceeds our predetermined threshold, we can confidently conclude it is a brand impersonation.
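The comparison step can be sketched in Python as a nearest-match lookup over reference embeddings. This example assumes cosine similarity as the similarity calculation and a threshold of 0.9; the post does not specify either, and the three-dimensional vectors and brand names are made up for readability (real embeddings have many more dimensions).

```python
import math
from typing import Dict, List, Optional, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_brand(
    embedding: List[float],
    known_logos: Dict[str, List[float]],
    threshold: float = 0.9,  # assumed value, for illustration only
) -> Tuple[Optional[str], float]:
    """Return (brand, score) for the closest reference logo, or (None, score) below threshold."""
    best_brand, best_score = None, -1.0
    for brand, reference in known_logos.items():
        score = cosine_similarity(embedding, reference)
        if score > best_score:
            best_brand, best_score = brand, score
    if best_score >= threshold:
        return best_brand, best_score
    return None, best_score


# Hypothetical reference database and a hypothetical embedding of a cropped logo.
known_logos = {
    "brand_a": [0.9, 0.1, 0.0],
    "brand_b": [0.0, 0.2, 0.95],
}
cropped_logo_embedding = [0.88, 0.15, 0.02]

brand, score = match_brand(cropped_logo_embedding, known_logos)
print(brand, round(score, 3))
```

Because the network is trained so that images of the same logo map to nearby vectors, adding a new brand in a setup like this only requires adding its reference embedding, not retraining a classifier.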
For brands with text-based logos (e.g., Hulu or Wells Fargo), we employ OCR, a computer vision technique for extracting text from images. Combined with Siamese Networks, this approach provides comprehensive coverage for detecting logos.
Pixel Math is our name for the logic that determines whether an identified input box is associated with a login. We take all detected input boxes and calculate a distance metric to determine whether two boxes are stacked or side-by-side, a layout commonly seen in login portals.
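One way to express a stacked or side-by-side check is with interval overlap and edge gaps, as in the Python sketch below. The box format, the `max_gap` value of 80 pixels, and the function names are assumptions for this example; the actual distance metric used by Pixel Math is not described in this post.

```python
from itertools import combinations
from typing import Iterable, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def overlap(a_min: float, a_max: float, b_min: float, b_max: float) -> float:
    """Length of the overlap between two 1-D intervals (0 if they are disjoint)."""
    return max(0.0, min(a_max, b_max) - max(a_min, b_min))


def layout(box_a: Box, box_b: Box, max_gap: float = 80.0) -> Optional[str]:
    """Classify two boxes as 'stacked', 'side_by_side', or None."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    x_overlap = overlap(ax0, ax1, bx0, bx1)
    y_overlap = overlap(ay0, ay1, by0, by1)
    # Gap between the nearest edges on each axis (0 when the boxes overlap on that axis).
    x_gap = max(0.0, max(ax0, bx0) - min(ax1, bx1))
    y_gap = max(0.0, max(ay0, by0) - min(ay1, by1))
    # Stacked: share horizontal extent, separated by a small vertical gap.
    if x_overlap > 0 and y_overlap == 0 and y_gap <= max_gap:
        return "stacked"
    # Side-by-side: share vertical extent, separated by a small horizontal gap.
    if y_overlap > 0 and x_overlap == 0 and x_gap <= max_gap:
        return "side_by_side"
    return None


def looks_like_login(input_boxes: Iterable[Box], max_gap: float = 80.0) -> bool:
    """True if any pair of input boxes is stacked or side-by-side."""
    return any(layout(a, b, max_gap) for a, b in combinations(list(input_boxes), 2))


username = (60, 150, 340, 190)
password = (60, 210, 340, 250)
print(layout(username, password), looks_like_login([username, password]))
```

A lone search box would not trigger this check, while a typical username and password pair would, which is the distinction the heuristic is meant to capture.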
Now, let's see LinkAnalysis in action with MQL! To detect a credential phishing page, you can use the following snippet:
any(body.links,
    beta.linkanalysis(.).credphish.disposition == "phishing"
    and beta.linkanalysis(.).credphish.brand.confidence
        in ("medium", "high")
)
Step-by-step explanation:
any(body.links, ...)
– Checks whether any link in the email body satisfies the conditions that follow.
beta.linkanalysis(.)
– Invokes LinkAnalysis on the link being evaluated, potentially navigating to it and taking screenshots.
.credphish.disposition == "phishing"
– Checks if the disposition of the link is "phishing", meaning that the link is trying to trick the recipient into revealing sensitive information.
and .credphish.brand.confidence in ("medium", "high")
– Checks if the confidence level of the brand associated with the link is either "medium" or "high". A higher confidence level means that the brand is most likely being impersonated.
In one detection rule for Credential Phishing, we combine the MQL above with our First-Time Sender logic to effectively defend against the most commonly observed credential phishing techniques.
LinkAnalysis, combined with MQL's versatility, offers a formidable defense against credential phishing attacks. Our computer vision-based approach automatically identifies, flags, and prevents these threats before they can cause harm.