Securing Agentic AI Is a Probabilistic Problem

Pyry Haulos

Securely deploying agentic AI is hard. It might be solvable only in a probabilistic sense, quite different from traditional deterministic systems where correctness of the security model can be proven. The non-deterministic component in LLM-based agents creates threat vectors not typically expected from a computer program, and that same non-determinism is what makes them valuable.

The lethal trifecta

Simon Willison coined the term lethal trifecta: access to private data, exposure to untrusted content, and external communication. If all three are present, the system is fundamentally vulnerable.

The lethal trifecta of capabilities is:

  • Access to your private data—one of the most common purposes of tools in the first place!
  • Exposure to untrusted content—any mechanism by which text (or images) controlled by a malicious attacker could become available to your LLM
  • The ability to externally communicate in a way that could be used to steal your data

If your agent combines these three features, an attacker can easily trick it into accessing your private data and sending it to that attacker.

The fundamental issue underlying the trifecta is that there is no separation between instructions and data in an LLM. All data is also instructions. This creates the vulnerability known as prompt injection, and every LLM is susceptible to it. Prompt injection can be used to exfiltrate data or to compromise the integrity of systems the agent has access to.

No provable solution to the lethal trifecta, short of removing one of the three capabilities, has been presented, at least to my knowledge. Simon’s prescription is exactly that: cut off one of the three legs, since in application security 99% is still a failing grade. He’s right that this is the only path to provable security.

But the deterministic framing – remove one leg and you’re safe – understates the practical difficulty. Containment systems have their own (classical) vulnerabilities: sandbox escapes, misconfigured allowlists, side channels through connected systems like CI runners. Users accept permission prompts without reading them. And every removed capability reduces the agent’s usefulness, creating pressure to restore it. The deterministic fix erodes from both directions: the technical boundary has bugs, and the humans managing it make exceptions.

This is why agentic AI security is better understood as a probabilistic problem, closer to James Reason’s Swiss cheese model than to formal verification. Each defense layer (model resistance to prompt injection, execution sandboxing, network containment, user approval) has holes. A breach requires the holes to align across all layers simultaneously. No single layer needs to be perfect if the layers are independent and the combined probability of alignment is low enough. The practical work is identifying the largest holes in each layer and closing them in magnitude order until the residual risk is acceptable.
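The layered-defense arithmetic can be made concrete. A toy calculation, assuming the layers fail independently (the per-layer hole probabilities below are illustrative, not measured):

```python
def breach_probability(layer_hole_probs):
    """Probability that a single attack passes every defense layer,
    assuming the layers fail independently."""
    p = 1.0
    for hole in layer_hole_probs:
        p *= hole
    return p

# Illustrative probabilities that an attack slips through each layer:
layers = {
    "model resistance to injection": 0.10,
    "execution sandbox": 0.05,
    "network containment": 0.02,
    "user approval": 0.30,  # humans tend to be the leakiest layer
}

p_breach = breach_probability(layers.values())
print(f"P(breach) = {p_breach:.2e}")  # 3.00e-05 with these numbers
```

No single layer here is anywhere near perfect, yet the combined probability is orders of magnitude below any individual layer. The catch is the independence assumption: one misconfiguration that affects both the sandbox and the network is a single hole spanning two layers.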

The Swiss cheese model of accident causation: multiple defensive layers, each with different vulnerabilities. Most threats are stopped by one layer or another, but a breach occurs when the holes happen to align across all layers simultaneously.

Model-level defenses are one such layer. Anthropic has published research on training Claude to resist prompt injection, reporting a 1% attack success rate for Opus 4.5 in browser-based agent tasks against an adaptive adversary given 100 attempts. This is not a solved problem; the International AI Safety Report 2026 found that sophisticated attackers bypass the best-defended models roughly 50% of the time given 10 attempts. But a model that is harder to trick provides genuine defense in depth when building security as a probabilistic process.

Prompt injection attack success rates, as reported by AI developers for major models released between May 2024 and August 2025. Each point represents the proportion of successful attacks within 10 attempts against a given model shortly after release. The reported success rate has been falling over time but remains relatively high. Source: Zou et al. 2025, cited in Anthropic 2025, via the International AI Safety Report 2026.

Models misbehave on their own

The threats to agentic systems fall into two distinct categories that require different mitigations. Prompt injection is an external attack: a malicious actor embeds instructions in data the agent processes, exploiting the lack of separation between instructions and data. The defenses are containment (limiting what the agent can do if compromised) and model hardening (making models better at distinguishing instructions from data).

But models also misbehave autonomously, without any adversarial input. Anthropic’s Opus 4.6 recently solved a benchmark task by decrypting the answer when it couldn’t find a solution otherwise:

Claude Opus 4.6 independently hypothesized that it was being evaluated, identified which benchmark it was running in, then located and decrypted the answer key.

[…]

Opus found the evaluation source code on GitHub via a web search and read the XOR decryption implementation. […] But the encrypted dataset was hosted as a binary file with an application/octet-stream MIME type and the harness’s web tool rejected it with a content-type error, as the tools given were designed only for text. Opus then searched for alternative mirrors of the dataset that would have a compatible format, found a third-party copy on HuggingFace that served the same encrypted data as JSON rather than binary, fetched it successfully, ran its decryption code against all 1,266 entries, and located its question. It then verified the decrypted answer by searching for the actual source material on the web, which it found.

When building the Airut sandbox, I observed Claude constructing raw DNS UDP packets to try to get through a broken network proxy when it did not have the necessary tools.

These behaviors are not prompt injection. No adversary is involved. The model pursues its goal with whatever means are available, including means the system designer did not anticipate. Containment helps here too, but the threat model is different: instead of an external attacker steering the model, the model itself finds creative paths around constraints. This makes the defense problem harder, because the model has full visibility into its own context and tools while an external attacker operates with partial information.

How the industry is responding

The attempts to bring safety to agentic AI systems broadly fail in one of two ways: they don’t properly contain the access the agent has, or they don’t account for the human factor.

Incomplete containment is the more obvious problem. The most common example is sandboxing the execution environment while leaving the network wide open. In systems like OpenClaw (and similar projects), where the execution environment is sandboxed via Docker, the agent can freely access the internet, take actions there, or exfiltrate information. Claude.ai’s agent features face a similar tension: by default, network egress is disabled, which means the lethal trifecta is not present. But the agent cannot access web resources in this configuration, and users who want that capability must grant network egress permission, at which point data exfiltration becomes trivial for a compromised model. Anthropic puts the security tradeoff decision on the user, but if users are forced to choose between security and capability, they will almost always choose capability. System design must account for this rather than delegating the risk to users who are not equipped to evaluate it.

Existing systems not designed with agentic AI in mind compound the problem. Consider an agent running in a tightly sandboxed container with no network egress, but with the ability to push code to GitHub:

  1. The agent pushes a commit containing a GitHub Actions workflow file.
  2. A GitHub-hosted runner picks up the workflow and executes it.
  3. The runner has unrestricted network egress by default.
  4. The workflow exfiltrates repository secrets or source code to an external endpoint.

The sandbox boundary stopped at the container. The CI system, which the agent can trigger indirectly by pushing code, runs outside that boundary with different (weaker) constraints. Sandboxing must extend to every system the agent is allowed to interact with, directly or indirectly.
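One deterministic way to extend the boundary to CI is to reject any agent push that touches workflow definitions before it ever reaches a runner. A minimal sketch of such a policy check, meant to run in a pre-receive hook or push proxy outside the agent’s sandbox (the blocked path list and function name are illustrative):

```python
# CI configuration paths an agent push is not allowed to modify.
BLOCKED_PREFIXES = (
    ".github/workflows/",  # GitHub Actions
    ".gitlab-ci.yml",      # GitLab CI
    ".circleci/",          # CircleCI
)

def ci_policy_violations(changed_paths):
    """Return the paths in a push that would modify CI configuration.
    Runs outside the agent's sandbox, so the agent cannot bypass it."""
    return [p for p in changed_paths if p.startswith(BLOCKED_PREFIXES)]

push = ["src/main.py", ".github/workflows/exfil.yml"]
blocked = ci_policy_violations(push)
if blocked:
    print(f"push rejected: CI changes not permitted: {blocked}")
```

This closes only the push-triggered vector; workflows that already exist in the repository, and the runner’s default network egress, still need their own restrictions.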

Approval fatigue is the second failure mode, and in practice may be the larger risk. It comes from neglecting hard-won lessons in designing secure systems for humans. The prevailing approach in AI systems is to ask permission for every action that could have a side effect: running a command, accessing a resource. Between cookie banners and mandatory EULAs, most users have already been conditioned to blindly click “yes” to everything. The more vigilant ones fall to approval fatigue eventually. Claude Code is a particularly egregious example: very quickly you end up slamming “yes, and do not ask again,” and when Claude continues to find ways to trigger more prompts, you switch to yolo mode.

Beyond approval fatigue, these implementations often ask permission in a way that the user either won’t bother or can’t understand, making a confused deputy out of the user.

Two criteria for practical agentic security

I don’t believe in an easy, purely technical solution; one would have been implemented already if it existed. Technical safeguards are necessary but insufficient on their own, as the previous section illustrates. Analyzing the security of an agentic system requires modeling the whole system, human included. The probability of a security failure over a session is a function of two variables: how often the user is asked to make a security decision, and how likely they are to get each decision wrong.

P(failure) = 1 - (1 - p)^n

n = number of permission requests per task
p = probability of user error per request

This is a simplification, but it captures the core tension. Reducing n (fewer permission prompts) means the agent must operate autonomously for longer stretches, which requires stronger containment guarantees. Reducing p (better user decisions) means each prompt must be clear, infrequent enough to avoid fatigue, and scoped so the user can actually evaluate it. Optimizing only for one pushes the other in the wrong direction: a system without autonomy generates so many prompts that p rises; a system that batches everything into one complex permission request makes p high for a different reason.

This leads to two criteria for evaluating the practical security of an agentic system:

  1. The longest-horizon task the agent can securely complete without asking for permission.
  2. The likelihood of the user making an informed choice when responding to a permission request from the agent.

“Securely” is defined per system. For an enterprise coding agent, it could mean the agent cannot take any destructive action, cause any visible side effects outside of ephemeral test environments, or exfiltrate any IP or secrets. For a chatbot, it could mean the agent cannot exfiltrate any private material from the chat, take destructive actions, or cause visible side effects to any resources it may have access to. These are evaluation criteria rather than precise metrics; for complex agentic systems only statistical observations can be taken, which itself reinforces the probabilistic nature of the problem.

Research directions

Optimizing for these two criteria is not easy, and I won’t know which of the approaches below work until they are tried. I hope to experiment with some of them when opportunity allows. When building Airut I attempted to optimize the first criterion by running Claude Code in a sandbox that contains not only execution but also network. Weak links include CI (GitHub Actions) and reliance on an allowlist, where misconfiguration can easily create an escape vector. For the second criterion, Airut relies on PR review as the approval flow. I don’t claim this is perfect, but it is likely secure enough in practice.

Airut still runs the standard Claude Code harness within the container. I’ve been considering ideas that may require writing a new one. If the agent can control how and when permission prompts appear, it could optimize for bulk approval at the start of a task rather than interactive prompts during execution. One approach: map all side-effect-causing operations as APIs requiring signed requests, where the user holds the signing keys and approval becomes a signing operation decoupled from execution.

Combined with a policy engine and a deterministic verifier, this might allow the agent to construct a scoped permission request (a temporary policy) at the onset of a task. To avoid the confused deputy problem, there needs to be a deterministic, not AI-based, way of analyzing and explaining the policy implications to the user.
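The signing idea could look roughly like this: the user’s client holds a key, the agent proposes a scoped policy at task onset, and the executor verifies the signature deterministically before allowing any side effect. A sketch using HMAC from the standard library; the policy fields, key handling, and function names are all hypothetical:

```python
import hashlib
import hmac
import json

USER_KEY = b"held by the user's client, never by the agent"  # illustrative

def sign_policy(policy: dict, key: bytes) -> str:
    """User side: approve a scoped policy by signing its canonical form."""
    blob = json.dumps(policy, sort_keys=True).encode()
    return hmac.new(key, blob, hashlib.sha256).hexdigest()

def action_allowed(policy: dict, signature: str, action: dict, key: bytes) -> bool:
    """Executor side: permit an action only if the policy is authentically
    signed and the action falls within its scope."""
    blob = json.dumps(policy, sort_keys=True).encode()
    expected = hmac.new(key, blob, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature):
        return False
    return (action["op"] in policy["allowed_ops"]
            and action["target"].startswith(tuple(policy["allowed_targets"])))

# The approval step, done once up front rather than per action:
policy = {"allowed_ops": ["git_push"], "allowed_targets": ["github.com/myorg/"]}
sig = sign_policy(policy, USER_KEY)

action_allowed(policy, sig, {"op": "git_push", "target": "github.com/myorg/repo"}, USER_KEY)  # True
action_allowed(policy, sig, {"op": "http_post", "target": "evil.example"}, USER_KEY)          # False
```

In a real system the executor would hold only a verification key (an asymmetric signature rather than a shared secret), and the deterministic policy explainer would render the scope to the user before signing.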

Context management is another direction worth exploring. A system that tracks what kind of tokens are in the context could enforce useful restrictions: if confidential tokens (source code, secrets) are present, network egress gets restricted; if external tokens are present, access to side effects or confidential information gets restricted. Information flow control for agents is an active research area.
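A coarse version of that information-flow rule fits in a few lines: label everything that enters the context and gate the tool set on the labels present. The labels and tool names here are invented for illustration:

```python
# Hypothetical tool set for a coding agent.
ALL_TOOLS = {"read_code", "read_secrets", "run_tests", "network_egress", "write_files"}

def allowed_tools(context_labels):
    """Gate tools on context labels: once confidential data is in the
    context, cut the exfiltration channel; once untrusted external
    content is in, cut side effects and access to confidential data."""
    tools = set(ALL_TOOLS)
    if "confidential" in context_labels:
        tools.discard("network_egress")
    if "external" in context_labels:
        tools.discard("write_files")
        tools.discard("read_secrets")
    return tools

allowed_tools({"confidential"})              # everything except network_egress
allowed_tools({"confidential", "external"})  # read_code and run_tests only
```

The hard part is the labeling itself: deciding at ingestion time what counts as confidential or external, and handling laundering, where untrusted content flows through the model into a seemingly clean artifact that then re-enters the context unlabeled.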

The human side of the security boundary

Containment gets most of the attention, but the human factor may currently be the larger practical risk, and it is the area where existing systems are furthest behind. The productivity gains from agentic AI strongly incentivize giving agents broad access. Users cannot be expected to make correct security tradeoffs in the course of their work, especially when the implications of each permission request are not easily understood. Aviation, nuclear safety, and healthcare have decades of experience managing humans in safety-critical loops, and there is much to learn from how those fields approach the problem.

Current LLMs show little resistance to either prompt injection or creative goal pursuit. Humans, operating on biological neural networks that are also unpredictable, are statistically more resistant to coercion, partly because self-preservation and the threat of consequences act as containment signals. LLM training currently lacks comparable signals. Whether such signals can or should be introduced is an open question, but their absence may partly explain why models are so readily steered by both external and internal pressures – and why the human in the loop, for all their flaws, remains a necessary part of the security boundary.