Detecting Credential Phishing using Deep Learning + MQL

Authors

Bobby Filar

Machine Learning

Our first blog post provided an overview of Sublime's Message Query Language (MQL), an interface that allows our Detection and Machine Learning (ML) teams, as well as any Sublime user, to craft sophisticated logic to detect, prevent, and hunt for a range of email attacks.

Today, we begin our series on MQL's Enrichment Functions. These functions expose vital capabilities that work with core MQL logic to create robust attack-specific detectors.

First is LinkAnalysis, one of our most effective defenses against credential phishing.

Background: An attack story

Credential harvesting is an effective phishing technique where attackers seek to deceive users into providing their login credentials via a seemingly authentic login portal. Attacks aimed at stealing credentials often impersonate brands using high-quality logos on phishing pages to enhance credibility and increase the likelihood of the victim perceiving the site as authentic.

LinkAnalysis

The LinkAnalysis function in MQL takes a link as input, navigates to the web page, and takes a screenshot. Then, it analyzes the screenshot using techniques like computer vision and assesses whether it is a likely credential phishing site. This process is described in detail below:

An inbound email message is sent to our context classifier to determine whether the links may be suspicious.
Suspicious links are sent to a headless browser to resolve the effective URL (i.e., the URL of the final page after following all redirects) and a screenshot is taken of the website.
An object detection model determines whether logos, captchas, or input boxes are in the screenshot.
If a logo is present, it is sent to a siamese network to determine if it matches a commonly impersonated brand using our internal BrandDB.
Finally, LinkAnalysis compiles this information to return the disposition for each URL.
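The control flow of the steps above can be sketched in Python. This is only an illustration, not Sublime's implementation: every name here (`LinkResult`, `analyze_links`, the injected callables) is hypothetical, the suspicion check is applied per link for simplicity, and the final decision rule and the "benign" and "not_analyzed" labels are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class LinkResult:
    url: str
    effective_url: Optional[str] = None
    objects: List[str] = field(default_factory=list)  # e.g. ["logo", "input_box"]
    brand: Optional[str] = None
    disposition: str = "not_analyzed"  # hypothetical label for skipped links


def analyze_links(
    links: List[str],
    is_suspicious: Callable[[str], bool],             # stand-in for the context classifier
    browse: Callable[[str], Tuple[str, bytes]],       # returns (effective_url, screenshot)
    detect_objects: Callable[[bytes], List[str]],     # stand-in for object detection
    match_brand: Callable[[bytes], Optional[str]],    # stand-in for the brand lookup
) -> List[LinkResult]:
    results = []
    for url in links:
        result = LinkResult(url=url)
        if is_suspicious(url):
            # Only suspicious links are rendered in the headless browser.
            result.effective_url, screenshot = browse(url)
            result.objects = detect_objects(screenshot)
            # Brand matching only runs when a logo was detected.
            if "logo" in result.objects:
                result.brand = match_brand(screenshot)
            # Assumed decision rule: a known brand plus an input box.
            is_login = result.brand is not None and "input_box" in result.objects
            result.disposition = "phishing" if is_login else "benign"
        results.append(result)
    return results
```

Passing each stage in as a callable keeps the sketch self-contained and makes the ordering explicit: cheap context checks gate the expensive browser and vision steps.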

Context Classification

The first component of credential phishing detection is the Context Classifier, which distinguishes our approach from other methods that rely on a single indicator to determine suspiciousness. Instead, we use a range of signals to evaluate a message's authenticity. We combine WHOIS and Machine Learning to analyze URLs for patterns commonly found in phishing links.

Our Natural Language Understanding (NLU) engine measures the email's tone (e.g., is the sender making an urgent request?) and intent (e.g., does our model classify the body text as `cred_theft`?). Email header metadata is analyzed to uncover impersonation attempts. We also maintain a blocklist of exploited domains to catch potential threats that may bypass the other methods.

Together, these features form a robust and thorough approach to identifying suspicious links for deeper analysis.

Headless Browser

We use an automated headless browser to visit a website, render the DOM, and capture artifacts of interest. We employ a variety of mechanisms to combat client-side cloaking strategies to maximize the likelihood of successfully visiting a phishing page.

Detecting Objects using Deep Learning

Next up in the LinkAnalysis workflow is our object detection model. Object detection models are computer vision algorithms that identify and locate specific objects within an image or video. Object detection models gained popularity when they became foundational for self-driving car technology to detect road signs, pedestrians, and other automobiles on the road.

We use object detection to detect logos, input boxes, and buttons on a website. We chose Phishpedia, an excellent open-source deep learning project (based on the Faster-RCNN implementation in Detectron2), as our baseline model architecture.

*End-to-end workflow of the object detection model*

The goal of the object detection model is simple: take an image as input, featurize it, and attempt to predict bounding boxes around objects (e.g., logos, input boxes) we are interested in detecting. Each bounding box has a confidence score that states the probability of the box containing the object you want.
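As a rough illustration of how those confidence scores might be consumed downstream, the sketch below keeps only detections above a cutoff. The `Detection` type, the `(x_min, y_min, x_max, y_max)` box layout, and the 0.5 default are assumptions for the example, not details from the model described here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    label: str                               # e.g. "logo" or "input_box"
    box: Tuple[float, float, float, float]   # assumed (x_min, y_min, x_max, y_max) in pixels
    score: float                             # model confidence in [0, 1]


def keep_confident(detections: List[Detection], min_score: float = 0.5) -> List[Detection]:
    """Drop bounding boxes whose confidence falls below min_score."""
    return [d for d in detections if d.score >= min_score]


detections = [
    Detection("logo", (40, 30, 200, 90), 0.97),
    Detection("input_box", (60, 200, 360, 240), 0.88),
    Detection("logo", (500, 700, 520, 715), 0.12),  # likely a false positive
]
confident = keep_confident(detections)
```

Raising `min_score` trades recall for precision: fewer spurious boxes reach the later stages, at the cost of missing faint or partially rendered logos.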

Detecting Brands and Logins

The final piece of the puzzle consists of detecting known brands and logins. We take a three-step approach to do this effectively: Siamese Networks, optical character recognition (OCR), and pixel math.

Siamese Neural Networks

Once the object detection model detects logos, we crop each one into its own image and run it through a Siamese Neural Network to generate a feature vector. We then compare this feature vector to a database of known brand logos using a similarity calculation. If the resulting score exceeds our predetermined threshold, we conclude that the logo matches that brand and is likely a brand impersonation.
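The comparison step can be sketched as a nearest-match lookup. The post does not say which similarity calculation is used, so cosine similarity is an assumption here, as are the `match_brand` helper, the toy two-dimensional vectors, and the 0.8 threshold.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_brand(
    embedding: Sequence[float],
    brand_db: Dict[str, Sequence[float]],  # hypothetical: brand name -> reference vector
    threshold: float = 0.8,                # assumed cutoff, for illustration only
) -> Tuple[Optional[str], float]:
    """Return (brand, score) for the closest reference logo, or (None, score) below threshold."""
    best_brand, best_score = None, -1.0
    for brand, reference in brand_db.items():
        score = cosine_similarity(embedding, reference)
        if score > best_score:
            best_brand, best_score = brand, score
    if best_score >= threshold:
        return best_brand, best_score
    return None, best_score


brand_db = {"BrandA": [1.0, 0.0], "BrandB": [0.0, 1.0]}
brand, score = match_brand([0.9, 0.1], brand_db)
```

In practice the vectors would come from the network's embedding layer and have hundreds of dimensions; the lookup logic stays the same.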

Optical Character Recognition (OCR)

For brands with text-based logos (e.g., Hulu or Wells Fargo), we employ OCR, a computer vision technique for extracting text from images. Combined with Siamese Networks, this approach provides comprehensive coverage for detecting logos.

Pixel Math

Pixel Math is a moniker describing our logic for determining whether an identified input box is associated with a login. We take all detected input boxes and calculate a distance metric to determine whether two boxes are stacked or side-by-side, a layout commonly seen in login portals.
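One way such a check could look is sketched below. The exact distance metric is not described in this post, so the overlap and gap tests, the `(x_min, y_min, x_max, y_max)` box layout, and the 60-pixel `max_gap` are all assumptions chosen for illustration.

```python
from itertools import combinations
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max) in pixels


def _overlap(a_min: float, a_max: float, b_min: float, b_max: float) -> float:
    """Length of the shared interval between [a_min, a_max] and [b_min, b_max]."""
    return max(0.0, min(a_max, b_max) - max(a_min, b_min))


def _gap(a_min: float, a_max: float, b_min: float, b_max: float) -> float:
    """Distance between two intervals (0.0 if they touch or overlap)."""
    return max(0.0, max(a_min, b_min) - min(a_max, b_max))


def are_stacked(a: Box, b: Box, max_gap: float = 60.0) -> bool:
    """One box sits above the other: shared horizontal extent, small vertical gap."""
    shares_columns = _overlap(a[0], a[2], b[0], b[2]) > 0
    separate_rows = _overlap(a[1], a[3], b[1], b[3]) == 0
    return shares_columns and separate_rows and _gap(a[1], a[3], b[1], b[3]) <= max_gap


def are_side_by_side(a: Box, b: Box, max_gap: float = 60.0) -> bool:
    """Boxes sit on the same row: shared vertical extent, small horizontal gap."""
    shares_rows = _overlap(a[1], a[3], b[1], b[3]) > 0
    separate_columns = _overlap(a[0], a[2], b[0], b[2]) == 0
    return shares_rows and separate_columns and _gap(a[0], a[2], b[0], b[2]) <= max_gap


def looks_like_login_form(boxes: List[Box]) -> bool:
    """True if any pair of input boxes is stacked or side-by-side."""
    return any(
        are_stacked(a, b) or are_side_by_side(a, b)
        for a, b in combinations(boxes, 2)
    )


username = (100.0, 200.0, 400.0, 240.0)
password = (100.0, 260.0, 400.0, 300.0)
is_login = looks_like_login_form([username, password])
```

A fixed pixel gap is the simplest choice; scaling `max_gap` by box height would make the check less sensitive to screenshot resolution.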

Putting It All Together: Detecting Brand Impersonation in MQL

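Now, let's see LinkAnalysis in action with MQL! To detect a credential phishing page, you can use a snippet of the following shape (reassembled here from the parts explained below): `any(body.links, beta.linkanalysis(.).credphish.disposition == "phishing" and beta.linkanalysis(.).credphish.brand.confidence in ("medium", "high"))`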

Step-by-step explanation:

any(body.links, ...) – Checks whether any link in the email body satisfies the condition that follows.
beta.linkanalysis(.) – Invokes LinkAnalysis on the current link in the email body, potentially navigating to it and taking a screenshot.
.credphish.disposition == "phishing" – Checks if the disposition of the link is "phishing", meaning that the linked page is trying to trick the recipient into revealing sensitive information.
and .credphish.brand.confidence in ("medium", "high") – Checks if the confidence level of the brand associated with the link is either "medium" or "high". A higher confidence level means it is more likely that the brand is being impersonated.
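For readers more comfortable with general-purpose code, the same boolean logic can be mirrored in Python. This is an analogy only: `rule_matches` and the `link_analysis` callable are hypothetical, and the nested dictionary merely imitates the `credphish.disposition` and `credphish.brand.confidence` fields named above.

```python
from typing import Any, Callable, Dict, List


def rule_matches(
    links: List[str],
    link_analysis: Callable[[str], Dict[str, Any]],  # hypothetical stand-in for beta.linkanalysis
) -> bool:
    """True if any link is classified as phishing with a medium or high brand confidence."""
    def is_credphish(link: str) -> bool:
        credphish = link_analysis(link)["credphish"]
        return (
            credphish["disposition"] == "phishing"
            and credphish["brand"]["confidence"] in ("medium", "high")
        )

    # Mirrors any(body.links, ...): one matching link is enough.
    return any(is_credphish(link) for link in links)


# Canned results standing in for real analysis output.
fake_results = {
    "https://news.example/article": {
        "credphish": {"disposition": "benign", "brand": {"confidence": "low"}}
    },
    "https://login.example/o365": {
        "credphish": {"disposition": "phishing", "brand": {"confidence": "high"}}
    },
}
flagged = rule_matches(list(fake_results), fake_results.__getitem__)
```

Both conditions must hold for the same link, which is why they sit inside the per-link check rather than being evaluated across the whole message.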

In one detection rule for Credential Phishing, we combine the MQL above with our First-Time Sender logic to effectively defend against the most commonly observed credential phishing techniques.

*Fake Microsoft O365 login screen after being analyzed by LinkAnalysis*

Wrapping up

LinkAnalysis, combined with MQL's versatility, offers a formidable defense against credential phishing attacks. Our computer vision-based approach automatically identifies, flags, and prevents these threats before they can cause harm.

About the authors

Bobby Filar

Machine Learning

Bobby leads AI initiatives at Sublime as the Head of Machine Learning. He has numerous publications and patents in both offensive and defensive applications of AI and machine learning. Prior to Sublime, he led security data science at Elastic.