By Bobby Filar, Machine Learning

AI agents in production environments process sensitive data, make consequential decisions, and interact with critical infrastructure. The upside is obvious: faster triage, fewer blind spots, and less toil.

The risks are just as real. In the past year, the OpenClaw agent framework gained millions of users while granting broad permissions across email, file systems, and shells. Since then, researchers have documented both vulnerable skills and active malicious use. Separately, indirect prompt injection research has shown that agents can be manipulated into actions their operators never intended.

The common pattern is consistent: agents designed for maximum capability first, with security considered later. They connect to arbitrary tools, operate with implicit trust, and when something goes wrong, the blast radius can be enormous.

For security teams evaluating agents, the question is straightforward: can we trust these agents?

At Sublime, we designed our agents to answer that at the architecture level, not the policy level.

Guiding principles for AI at Sublime

Before we get into the agents themselves, here are the guiding principles of the Sublime AI/ML team, which serve as a North Star for building and maintaining our agents:

  • Privacy-first. Sublime does not train agents on customer email data. Any data used for evaluation has PII removed before use. Customer data processed by the agents is not retained after the API call completes.
  • Human oversight by default. All AI features are disabled by default and come with multiple levels of autonomy (full or human-in-the-loop). No AI system makes automated decisions affecting customers unless explicitly activated by their organization.
  • Transparency + explainability. Sublime agents don't just output conclusions, they produce structured reasoning tied to verifiable evidence (e.g., tool outputs, citations, and decision traces). This makes results reviewable by humans and auditable for security operations.
  • Continuous model monitoring. Models are monitored daily for drift and degradation, both globally and at the organizational level.

The agents: ASA and ADÉ

In 2025, Sublime released its first two AI agents for email security: ASA (Autonomous Security Analyst) and ADÉ (Autonomous Detection Engineer).

  • ASA triages user-reported emails and helps remediate threats.
  • ADÉ turns missed attacks into new, org-specific detections in hours.

Together, they work like a digital SOC team inside your environment. We've written extensively about their efficacy (including a published paper), so this post focuses on the other half of the story: the guardrails that keep them secure.

| What it can do | ASA | ADÉ |
| --- | :---: | :---: |
| Analyze messages and produce triage decisions | ✓ | |
| Triage user-reported emails autonomously | ✓ | |
| Generate detection rules in MQL | | ✓ |
| Run backtests against your environment | | ✓ |
| Produce structured verdicts with citations | ✓ | |
| Operate within customer-defined precision thresholds | | ✓ |

| What it cannot do (enforced by the platform, not the model) | ASA | ADÉ |
| --- | :---: | :---: |
| Delete emails or modify system configuration | ✓ | ✓ |
| Call arbitrary external APIs or URLs | ✓ | ✓ |
| Access systems or data outside Sublime platform scope | ✓ | ✓ |
| Install plugins or connect to external services | ✓ | ✓ |
| Take irreversible action without human approval | ✓ | ✓ |
| Send customer data to third-party model providers | ✓ | ✓ |

Built for one job, not every job

ASA and ADÉ were purpose-built for email security. That constraint is the foundation of their security posture.

Unlike general-purpose agents designed to roam across an enterprise, Sublime’s agents operate within the Sublime platform boundary. They cannot call arbitrary APIs, browse the web, install plugins, or connect to external services. Their scope is the email data already in your environment.

There is no plugin marketplace to poison. There is no tool chain to exploit. The attack surface is fixed, and the boundary is enforced by infrastructure, not by the agent's own judgment.

A fixed, known set of tools

ASA analyzes suspicious emails using the same Sublime platform tools available to human analysts: file explosion, link analysis, natural language understanding, sender history, logo detection, and screenshot analysis. These are internal capabilities, not external integrations.

ADÉ generates detections in Message Query Language (MQL), Sublime’s domain-specific detection language. It produces human-readable logic that can be reviewed, edited, and backtested before it touches a single mailbox.

Each agent operates with a carefully scoped set of capabilities and privilege matched to its function:

  • ASA can analyze messages and produce triage decisions. It cannot create coverage or modify system configuration.
  • ADÉ can generate detections and run backtests. It cannot call arbitrary URLs or access systems outside its scope.
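This kind of scoping can be enforced entirely outside the model. The sketch below is a hypothetical illustration (the names `ASA_TOOLS`, `ADE_TOOLS`, and `dispatch_tool` are ours, not Sublime's API): each agent has a fixed tool allowlist, and the dispatcher rejects anything outside it no matter what the model requested.

```python
# Hypothetical sketch: a fixed, per-agent tool allowlist enforced by the
# dispatch layer, not by the model. All names here are illustrative.

ASA_TOOLS = {"file_explosion", "link_analysis", "sender_history", "screenshot_analysis"}
ADE_TOOLS = {"generate_mql", "run_backtest"}

AGENT_TOOLS = {"asa": ASA_TOOLS, "ade": ADE_TOOLS}

def dispatch_tool(agent: str, tool: str, payload: dict) -> dict:
    """Reject any tool call outside the agent's fixed allowlist,
    regardless of what the model asked for."""
    allowed = AGENT_TOOLS.get(agent, set())
    if tool not in allowed:
        raise PermissionError(f"{agent} may not call {tool}")
    return {"tool": tool, "status": "dispatched"}
```

Because the allowlist is static and lives in the dispatcher, a manipulated model cannot expand its own capabilities; the worst it can do is ask and be refused.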

Security is enforced by the platform, not by the agents

This is the architectural decision that matters most: all security controls are enforced by the platform, not by the AI agents themselves.

Multi-tenancy isolation, role-based access control, and data sovereignty boundaries are enforced by the platform. An AI agent making a request is subject to the exact same permission checks as a human analyst.

We don’t rely on model alignment to enforce security policies. The authorization layer doesn’t care how clever an attempted prompt injection is. It only cares whether the request has proper permissions. The agents can reason differently, but they can’t act differently than their permissions allow.
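A minimal sketch of that idea, with illustrative role and action names (not Sublime's actual RBAC schema): the authorization check looks only at the principal's granted permissions, so an agent and a human analyst with the same role get identical answers.

```python
# Hypothetical sketch: one authorization check for humans and agents alike.
# Roles, actions, and grants are illustrative, not Sublime's schema.

ROLE_PERMISSIONS = {
    "analyst": {"read_message", "run_analysis"},
    "admin":   {"read_message", "run_analysis", "modify_config"},
}

def authorize(principal_role: str, action: str) -> bool:
    """The check never inspects who (or what) is asking,
    only what the role has been granted."""
    return action in ROLE_PERMISSIONS.get(principal_role, set())
```

The point of the design is that no amount of clever prompting changes the return value of `authorize`; only an explicit grant does.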

Prompt injection is mitigated, not ignored

Indirect prompt injection, where malicious content in an email manipulates the agent analyzing it, is the risk that makes security AI categorically different from other AI deployments. Most automated systems face a static threat environment. Security AI doesn't. An attacker who understands how your agent reasons can probe its decision boundaries, identify what it ignores, and craft inputs designed to exploit the gap. The threat model adapts to the system.

There is no perfect defense against prompt injection today, but we make it significantly harder to accomplish and limit the damage if an attack succeeds.

We layer protections:

  • Architectural separation between system instructions and user input through AWS Bedrock’s Converse API, so email content is isolated from the agent’s control flow.
  • Structured prompts with clear boundaries between instructions and data.
  • Controlled input sources: all input comes through well-defined tool outputs with expected formats, not freeform text.

Each layer makes an attack harder. And critically, even if prompt injection succeeds in manipulating an agent's reasoning, the platform access controls above limit what the compromised agent can actually do.
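To make the first layer concrete, here is a hedged sketch of instruction/data separation. It mirrors the shape of role-separated chat APIs like Bedrock's Converse (a dedicated system field, distinct user messages), but the exact field names below are illustrative rather than the Converse schema: trusted instructions travel in one channel, and untrusted email content is passed only as clearly delimited data.

```python
# Hypothetical sketch of instruction/data separation. System instructions go
# in a dedicated trusted field; untrusted email content is passed only as
# tagged data. Field names are illustrative, not the Bedrock Converse schema.

def build_request(system_prompt: str, email_body: str) -> dict:
    return {
        "system": [{"text": system_prompt}],  # trusted control channel
        "messages": [{
            "role": "user",
            "content": [{
                # Untrusted content is delimited as data to analyze,
                # never concatenated into the instruction channel.
                "text": f"<untrusted_email>\n{email_body}\n</untrusted_email>"
            }],
        }],
    }
```

Whatever the email says, it arrives inside the data channel; it never rewrites the instructions the agent is following.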

Human-in-the-loop for consequential actions

Not all autonomy is equal, and we don't treat it as such. By default, both ASA and ADÉ operate in a mode where the agent makes recommendations and a human approves before any consequential action is taken. Moving to higher levels of autonomy is something organizations do deliberately, based on demonstrated evidence, not something that happens automatically.

Configurable thresholds and operational guardrails give organizations full control over what level of autonomy they want our agents to have.

ASA produces a verdict and a detailed report. It does not delete emails, modify configurations, or take irreversible actions on its own. Additionally, ASA has three modes: autonomous with remediation, autonomous without remediation (analyst-in-the-loop), and disabled.

ADÉ proposes detection rules. Even with auto-activation enabled, it operates within customer-defined precision thresholds. If a recommendation doesn't clear your quality bar, it routes for human review. You set the bar. ADÉ respects it.
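The routing logic above can be sketched as a simple gate. This is an illustration, not ADÉ's implementation, and the default threshold value is a made-up example: a detection auto-activates only when auto-activation is enabled and its backtest precision clears the customer-configured bar; everything else goes to human review.

```python
# Hypothetical sketch of a precision-threshold gate. The threshold default
# and function names are illustrative, not ADÉ's actual configuration.

def route_detection(backtest_precision: float,
                    auto_activate: bool,
                    precision_threshold: float = 0.99) -> str:
    """Auto-activate only when the customer opted in AND the detection
    clears the customer-defined precision bar; otherwise route to review."""
    if auto_activate and backtest_precision >= precision_threshold:
        return "auto_activate"
    return "human_review"
```

Note that disabling auto-activation dominates: even a perfect backtest routes to review unless the organization has opted in.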

Transparent, auditable, and recoverable by default

ASA generates structured verdicts with tool-by-tool citations, showing what evidence it found and how it reached its conclusion.

ADÉ produces detections with supporting backtest results, precision scores, and a full reasoning summary.

Every step is logged. Every decision is traceable. If an agent’s reasoning seems off, whether from drift, injection, or any other cause, the explanation makes the problem visible. Visibility without control isn’t enough. We treat detection, response, and rollback as core safety layers. Teams can monitor agent behavior, intervene before high-impact actions, and quickly remediate if needed.
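An audit trail with those properties can be sketched as an append-only decision trace: every tool call records its evidence, the verdict is appended as the final step, and the whole trace serializes for storage and replay. The field names here are illustrative, not Sublime's log format.

```python
# Hypothetical sketch of an append-only decision trace: each tool call and
# the final verdict are logged so a reviewer can replay the reasoning.
# Field names are illustrative, not Sublime's log schema.

import json
import time

def log_step(trace: list, step: str, evidence: dict) -> None:
    """Append one timestamped, evidence-bearing step to the trace."""
    trace.append({"ts": time.time(), "step": step, "evidence": evidence})

def finalize(trace: list, verdict: str) -> str:
    """Record the verdict as the last step and serialize for audit storage."""
    log_step(trace, "verdict", {"value": verdict})
    return json.dumps(trace)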

Your data stays in your environment

ASA and ADÉ run on AWS Bedrock within the customer’s deployment region. No email content or analysis results are sent to third-party model providers. No data goes to any external service. Customer data remains resident in the Sublime instance.

Five questions to ask any vendor shipping AI agents

If you’re evaluating agents for your security stack, ask these questions:

  1. What can the agent access? Sublime’s agents access only the email data in your environment.
  2. What actions can it take? ASA produces reports. ADÉ produces detections. Neither takes destructive actions outside your configured policies.
  3. Can you see exactly what it did? Every verdict and recommendation includes reasoning with evidence.
  4. Where does my data go? It stays in your Sublime instance. No third-party model providers.
  5. Who controls the agent’s autonomy? You do. Disabled by default with a human-in-the-loop option. Configurable auto-activation with your thresholds when you’re ready.

Agent security can’t be an afterthought

The last thing you need is for an AI security agent to be compromised by an adversary. That's why Sublime's agents are designed with narrow scope, enforced boundaries, and human oversight. These aren't limitations on what ASA and ADÉ can do; they're what makes them trustworthy to deploy in your security environment.

See how Sublime's agents protect your environment without asking you to trust a black box. Get a live demo.
