AI safety evolved: secure-by-design, safe-by-measurement

Authors

Bobby Filar

Head of AI

Aryan Luthra

ML Research

In the modern, AI-augmented work environment, adversaries are no longer just targeting your user – they're targeting your models.

Last week, we shipped an Attack Spotlight about a credential phishing attack that used prompt injection to attempt to confuse and bypass AI security analysis. The attack is explained in the blog post, but here’s the message with the prompt injection called out:

This prompt injection is attempting to get an AI security scanner to not see the credential phishing payload using a combination of context poisoning and self-negation techniques (full analysis in the blog).

While this type of attack is scary, it isn’t representative of what we most often see in the wild (it’s more representative of what we see in a security research POC). More often than not, we see indirect prompt injection attacks that simply attempt to fool AI email security into thinking messages are benign.

These attacks will often combine social engineering tactics to bypass traditional rule-based email security along with indirect prompt injection to bypass modern AI email security systems. To learn more about these tactics, check out our recent blog about prompt injection attacks we see regularly.

At Sublime, we are well aware of attack evolutions designed to target AI-powered security. We run autonomous AI agents on attacker-authored text, at scale, in production. It’s an unusual deployment surface and it forces a stricter notion of AI safety than most products need. Unlike most AI solutions that assume input is safe (”summarize this document”), security solutions need to assume input is unsafe (”find malicious signals in this email”).

Over the past few weeks, we’ve written pieces about how Sublime’s AI is secure by design, as well as a trust-based framework for agentic autonomy. Those posts laid out what we think AI safety should look like in security. This is the post about how we audit ourselves against it.

Security AI model maturity checklist

Autonomous AI in security is now common enough that we should stop grading it on a curve. The standards that apply to a chat assistant are not sufficient for a system that quarantines mail, drafts detections, or takes action on customer infrastructure.

Here is a maturity checklist for deploying autonomous AI against attacker-authored content. Some controls are table stakes before any automation. Others become mandatory as the system moves from advisory workflows to constrained or high-impact autonomous action.

Baseline controls

These should exist before an AI system is trusted to analyze adversarial content in production.

Prompt-injection eval corpus. Not just a policy document, but a corpus of real and synthetic adversarial examples that can be used to test how the system handles attacker-authored instructions.
Decision provenance. Every meaningful AI-assisted security decision should leave enough of a trail to be reviewed later: what input was analyzed, what the model concluded, what confidence it had, and what evidence or rubric informed the decision.
Internal red team aimed at the agent. Red teams need to attack the agents as hard as they attack the detection engine. Confirmed bypasses should become future test cases.

Controls for higher autonomy

These become increasingly important as the system moves from advisory workflows into actions that affect users, mailboxes, detections, or customer environments.

Calibration report. When the model says High confidence, how often is it right? If a vendor can’t show how confidence maps to correctness, they haven’t measured whether the system is safe to automate.
Autonomy tier policy. There must be controls over which actions the agent can take unilaterally, which require shadow mode, and which require a human. The system needs to provide evidence of success and failure so practitioners can decide when to move between levels.

Advanced operational controls

These are signs of a mature AI safety program, especially for systems operating with meaningful autonomy.

Circuit breaker. Models should have named thresholds where the agent stops itself, escalates, or returns an error instead of forcing a low-confidence decision. There should also be a named human or team responsible for reviewing those failures.

If a security-AI vendor can't produce proof of these, they are not running an AI safety program – they're running a demo. And any customer that purchases a solution that doesn’t meet these requirements will bear the engineering costs that have been deferred by the vendor.

4 pillars of AI safety at Sublime

AI safety in product contexts is too often discussed as values or principles. At Sublime, we treat AI safety as an engineering discipline with artifacts, gates, and numbers.

Artifacts: Inputs and outputs must be contextualized and secure throughout the agent loop. All decisions must be auditable.
Gates: There must be deterministic checks throughout the entire agent loop.
Numbers: There must be evals in place that measure AI efficacy, speed, and safety to enable continuous improvement.

With those disciplines in mind, we operate under four pillars of AI safety: adversarial input, calibrated output, tiered autonomy, symmetric red-teaming. Each pillar names what we already do, what we measure, and our plans for the future. Let’s look at each.

Pillar	What exists	What we measure	What failure looks like
Adversarial input	MDM trust boundaries and a corpus of in-the-wild prompt-injection samples	Whether model, prompt, or agent changes improve or regress on adversarial email content	Attacker-authored content influences the agent instead of being analyzed as evidence
Calibrated output	Verdicts with confidence, unknown handling, canary sets, and anti-rationalization rubrics	Calibration, Brier score, verdict accuracy, and uncertainty behavior	The model sounds confident when it should ask for review
Tiered autonomy	Passive/active modes, review paths, and configurable autonomy	Accuracy, false positives, regression deltas, and safe movement between autonomy levels	The system acts with more autonomy than the evidence supports
Symmetric red-teaming	Internal testing, third-party pentests, and production false-negative review	Bypasses, rubric failures, trust-boundary failures, and regression examples	Attackers manipulate the agent surface, not just the detection engine

Pillar 1: The input is the adversary

Sublime analyzes malicious emails 24/7. Most attacks are targeting inboxes, but some are targeting our agents. We don't ask our agents to be perfect and unbreakable. We ask them to fail in known, measured, recoverable ways.

What we do today: We use a variety of methods to keep our model evasion-aware and secure from prompt injections.

Agent guardrails: Email content is structurally separated from agent instructions at the prompt layer. The model is explicitly told to treat anything inside the email as evidence to analyze, never as a command to follow (i.e. prompt injection), regardless of how it's phrased (see next item). Tools are read-only by design: URL reputation lookups, header parsing, sender analysis – no write access, no external API calls. Agents cannot be weaponized.
Trust boundaries: Sublime parses messages into a JSON-style Message Data Model (MDM). When agents interact with the MDM, attacker-controlled fields like subject, body, links, and attachments are marked as untrusted content. These fields are analyzed as evidence, not treated as instructions. We test for cases where attacker-authored content attempts to influence control flow, override the agent’s task, or change how the message is evaluated.
Typed refusal: When an agent doesn’t have confidence in its analysis, it returns an unknown verdict for review by a human analyst. Automations exist to enable automatic alerting of analysts.
Behavioral-over-authority (BOA) hierarchy: Signals of authority (trusted brands, trusted infrastructure) never outweigh behavioral analysis. Agents operate under the understanding that trust is not a shortcut for analysis.

What we measure: We actively maintain a prompt injection corpus built from in-the-wild malicious emails we identify through threat hunting and production analysis. When we evaluate changes to models, prompts, or agent versions, we test against this corpus to look for regressions in how well the system recognizes adversarial instructions embedded in email content.

What’s on the roadmap: We plan to publish our prompt injection taxonomy and invite practitioners to use and contribute to it. Topic-based guardrails for future agents.

Pillar 2: Calibrated output, not confident output

AI is known for its misplaced confidence, so one of our design properties is honest uncertainty. To ensure our agents don’t confidently give wrong answers, we determine a confidence level for each verdict. This confidence level is just as important as the verdict, as it has the ability to flip verdicts. It also gives our model the ability to ask for help when needed.

What we do today: We allow our model to be candid about uncertainty rather than feign certainty.

Verdicts with confidence: Every verdict comes with a confidence value. When a confidence level is below a certain threshold, the verdict automatically returns as unknown.
Unknown matters: Verdicts of unknown get the same level of automation and alerting as any other verdict. It is not a catch-all dumping ground, it is a first-class output indicative of potentially novel TTPs.
Anti-rationalization rubrics (ARRs): As part of our CI, we evaluate decisions against ARRs to uncover and resolve logical fallacies and hallucinations.

What we measure: We use a variety of techniques to decrease uncertainty and keep our model calibrated.

We use a calibration curve to measure reliability per verdict. We measure how correct we are in order to improve our model. For example, if an agent marks 1,000 messages as verdict:malicious + confidence:high in a held-out set, we measure how often those verdicts are confirmed by human-labeled ground truth. If high-confidence malicious is correct 97% of the time one month, but only 91% after a model change, that regression blocks release. Calibration boundaries are applied to all confidence scores to ensure precision and recall for each level.
We perform Brier scoring to measure how close we are to a predicted outcome. This means that when a verdict is wrong, we measure how wrong it is.
We use counterfactual stress testing to see what changes to an email change agent verdicts. This helps us find gaps before attackers do.
We use a held-out canary set that the AI model never sees, creating a baseline for evaluating model decisions.

What's next: We plan to publish the calibration report cadence and commit to regressions blocking releases.

Pillar 3: Tiered autonomy, applied to ourselves

Autonomy isn't a slider you just turn up. It's a spectrum with levels you earn – per agent, per action, with evidence. Our agents aren’t a binary, they give teams the ability to build trust and then ramp up autonomy on their own timeline.

What we do today: Our agents have configurable autonomy settings:

All agents are transparent, explainable, and auditable to ensure trust can be built and autonomy grown.
Autonomous Security Analyst (ASA): Passive mode guides human analysts, but doesn’t take actions. Active mode lets teams ramp up autonomy from providing recommendations to automated actions within user-defined bounds to full autonomy within the agent harness.
Autonomous Detection Engineer (ADÉ): Analysis can run automatically or be manually triggered. By default, newly generated coverage must be reviewed by an analyst, but can be configured to auto-accept for full autonomy.

What we measure: We continuously evaluate verdict accuracy and model calibration. For our agents, we generate custom evals for specific security tasks and continuously benchmark performance against them. We use a regression delta to measure prompt injection safety efficacy. We measure false positive rates during the iterative process of new coverage generation to maximize efficacy.

What's next: We plan to add:

Different action tiers for ADÉ-generated rules, from draft to evaluate (analyze without action) to production.
Even more auditability with full decision provenance for every verdict, including model, prompt, rubric, and tools used. This would remain internal for the initial implementation, meaning our support engineers would be able to work with you to troubleshoot agent decisions.
A circuit breaker that pauses the agent loop on out-of-distribution decision rates (this could also fall under Pillar 2). This would return an error rather than provide a hallucinated decision.

Pillar 4: Symmetric red-teaming

Most security vendors test their detection engines with red-teaming. However, attackers do more than just try to avoid detection. They also try to manipulate the systems that analyze their messages. This is why we also red-team our agents. For us, red-teaming the agent means checking if the model not only makes the right decision, but also whether attacker-created content can affect how it gets to that decision.

What we do today: We regularly review production false negatives and threat-hunting results to find cases where attacker-created content tries to manipulate AI analysis. We also test the agent surface directly, such as checking prompt-injection and decision manipulation resistance, MDM trust boundaries, tool-use limits, and autonomy controls, using both internal tests and third-party pentests. When we confirm a bypass, it becomes an agent-safety finding and is used to improve prompts, rubrics, harness controls, and regression examples.

What’s next: We plan to:

Build an internal attacker agent designed to create emails that ASA marks as safe, but contain malicious payloads. This attacker agent will focus on specific parts of the agent surface, such as prompt-injection handling, confidence calibration, MDM trust boundaries, and rubric weaknesses. Any successful bypasses will be reviewed, labeled, and added to the adversarial corpus as regression examples.
Add capability forecasts for model upgrades that explain what has changed in safety-related behavior.

Living our AI truth

We taught our model to be honest, so we have to practice what we preach. Now that we’ve covered the steps we’re taking to improve AI safety, it’s time to talk about where we have even more room to improve.

Interpretability of individual verdicts: Currently, we do not rank detection signals for verdicts. This means we can’t say, “Because of X signal, we know that this is a malicious message.” Adding this functionality has been a lower priority due to our detection engine being designed for signal completeness (maximum security data) over signal analysis short circuits (minimum compute).
Guaranteed tool scopes: We don’t fully spec out tool-call authorization scopes for our agents. This would likely be included in external documentation of our AI harness.
Complete prompt injection prevention: No AI model is completely safe from prompt injection because adversaries (and their adversarial AI agents) are constantly developing novel attack methodologies. Pillar 3 addresses this fact.

Ending where we started

To close, let’s go back to the email from the top of the blog. With attacks like these, ASA doesn’t just provide a verdict, it provides transparency and explainability that goes a long way with security teams. Let’s look at the executive summary from analysis.

With all of this information, you don’t have to have faith that the verdict is correct. You can see exactly why Sublime came to that verdict, and if you disagree, we can audit the reasoning. This is how AI security vendors earn trust.

Expect to see more posts as we address the different What’s next features we listed for each of the four pillars. If you want to see Sublime’s agents in action, book a live demo today.

Get the latest

Sublime releases, detections, blogs, events, and more directly to your inbox.

Thank you!

Thank you for reaching out. A team member will get back to you shortly.

Oops! Something went wrong while submitting the form.