How does the LLM-as-judge evaluation pipeline avoid the cyclical dependency problem?

The cyclical dependency arises because you're using an LLM to evaluate an LLM. Stripe broke the cycle by anchoring evaluation in human judgment: security engineers created golden standard test cases with expected outputs based on real past security reviews. The LLM judge is only tasked with the narrower question of semantic equivalence between the agent's output and the human-defined expected output — not with determining correctness from scratch.

What accuracy threshold should an AI security agent reach before going to production?

The right threshold depends entirely on your output path. If the agent outputs directly to engineering teams with no human review, inaccurate findings create immediate noise and erode trust — requiring a much higher bar. If a security engineer reviews and confirms outputs before they reach engineering teams, you can launch at around 80% accuracy and iterate in production, because the human-in-the-loop gate absorbs the remaining uncertainty.

Why did Alpha Evolve fail to improve the threat modeling agent's prompt?

Alpha Evolve uses natural-selection-style mutation to evolve prompts, which works well in constrained computational domains where small changes produce measurably different outputs. For open-ended language tasks, the mutation engine generates variations that are semantically equivalent to the original — trivially adding two words or paraphrasing without changing meaning. There is no fitness gradient to climb, so the evolutionary process stalls without producing meaningful improvements within practical cost constraints.

AI Security Agents in Production: Stripe’s Guardrails Playbook

Why did Stripe choose a sequential multi-agent pipeline instead of a fully autonomous orchestrator for threat modeling?

When the orchestrator was given full autonomy to decide which specialized agents to invoke, it inconsistently skipped relevant agents depending on prompt phrasing and non-deterministic LLM behavior. A sequential pipeline with constrained orchestration — input agents first, security agents in parallel, output agents last — guarantees predictable execution order, which is non-negotiable for security workflows where missing a risk category is a meaningful failure.

When a security team at Stripe faced a backlog of hundreds of threat modeling tickets and routing requests, they didn’t hire more engineers. They shipped AI security agents in production. But deploying LLMs for real security decisions with no guardrails beyond vibes is a recipe for recreating the exact problem you’re trying to solve: hallucinated risks, misrouted tickets, and engineers spending more time cleaning up AI output than doing security work.

This post breaks down exactly how Stripe’s security engineers Jeffrey Zhang and Siddh Shah designed, evaluated, and rolled out two production AI agents — a threat modeling agent and a security routing agent — including the multi-agent architecture choices, the LLM-as-judge evaluation pipeline, and the hard-won lessons about when to trust AI output and when to put a human back in the loop.

Key Takeaways

You'll learn how to architect production-grade multi-agent security systems that balance determinism with LLM flexibility — including why sequential pipelines outperform fully autonomous orchestrators for predictable security workflows.
You'll be able to build a rigorous LLM evaluation pipeline using gold-standard test cases and semantic equivalence scoring, so you can measure and improve agent accuracy before changes reach production.
You'll be able to apply a phased rollout and human-in-the-loop model to safely deploy AI security agents: starting in shadow mode, iterating on feedback, and setting accuracy thresholds based on your specific failure-mode risk.

Multi-Agent Architecture for Security Automation

When Stripe’s security team faced two persistent scaling problems (a growing backlog of threat modeling tickets and developer confusion about which security team to contact), they turned to AI security agents in production. The architectural decisions they made for each agent were deliberately different, driven by the distinct nature of each task.

The Core Design Decision: Async vs. Conversational

Before writing a single line of code, the team asked a foundational question: does this agent need to be fast and conversational, or can it run as a long-running async background process?

For the threat modeling agent, they chose async. A security review ticket arrives in a queue, and the agent can take the time it needs — minutes, not seconds — to produce a thorough threat model. This trade-off freed them to prioritize accuracy and depth over response latency.

For the security routing agent, the expectation was conversational: a developer asks a question and expects a near-immediate answer. This constraint drove an entirely different architecture.

Threat Modeling Agent: Sequential Multi-Agent Pipeline

Sequential multi-agent pipeline for threat modeling: orchestrator, input agents, specialized security agents, output agents

The threat modeling agent uses a modular, sequential multi-agent security architecture with four layers:

Orchestrator agent — Accepts inputs from the security review intake form (e.g., review category) and coordinates the overall pipeline.
Input agents — Retrieve additional context referenced in tickets: Google Doc links, Slack threads, supplementary materials. Their sole job is enriching the context before analysis begins.
Specialized security agents — Domain-specific agents targeting distinct review categories (e.g., a third-party vendor review agent). Each is scoped narrowly to its category and must cover a defined baseline of required questions — data sensitivity, transport protocols, authentication story — regardless of what else it chooses to evaluate.
Output agents — Transform results into audience-appropriate formats. Security engineers want a summarized human-readable format; internal tooling may require structured MITRE ATT&CK^[1]-aligned output for metrics.

Why Sequential, Not Autonomous?

The team initially considered giving the orchestrator agent full autonomy to decide which specialized agents to invoke. They found a critical problem: when the orchestrator had too much agency, it didn’t consistently invoke the relevant specialized agents. Predictability collapsed.

Their solution was a sequential pipeline with constrained orchestration: input agents always run first, specialized security agents run in parallel where possible, output agents run last. This deterministic ordering produces predictable behavior — a non-negotiable property for security workflows.
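A minimal sketch of that fixed ordering, assuming each agent is an async function over a shared context dictionary. The agent names (`fetch_linked_docs`, `vendor_review`, `auth_review`, `summarize`) and their return values are hypothetical stand-ins for illustration, not Stripe's actual agents:

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical agent signature: takes the shared context, returns new keys for it.
Agent = Callable[[dict], Awaitable[dict]]

async def fetch_linked_docs(ctx: dict) -> dict:
    # Stand-in for an input agent that resolves links found in the ticket.
    return {"docs": f"contents of links in ticket {ctx['ticket_id']}"}

async def vendor_review(ctx: dict) -> dict:
    # Stand-in for a specialized security agent.
    return {"vendor_review": ["risk: vendor receives sensitive data"]}

async def auth_review(ctx: dict) -> dict:
    # Stand-in for a second specialized security agent.
    return {"auth_review": ["risk: authentication story not described"]}

async def summarize(ctx: dict) -> dict:
    # Stand-in for an output agent that formats results for humans.
    findings = ctx["vendor_review"] + ctx["auth_review"]
    return {"summary": f"{len(findings)} findings for {ctx['ticket_id']}"}

async def run_stage(agents: list[Agent], ctx: dict) -> dict:
    # Agents inside a stage run concurrently; the stage itself acts as a barrier.
    results = await asyncio.gather(*(agent(ctx) for agent in agents))
    merged = dict(ctx)
    for result in results:
        merged.update(result)
    return merged

async def run_pipeline(ticket_id: str) -> dict:
    ctx = {"ticket_id": ticket_id}
    # Fixed order: input agents, then security agents in parallel, then output agents.
    # Nothing here lets a model decide to skip a stage or an agent.
    stages = ([fetch_linked_docs], [vendor_review, auth_review], [summarize])
    for stage in stages:
        ctx = await run_stage(stage, ctx)
    return ctx

result = asyncio.run(run_pipeline("TICKET-1"))
print(result["summary"])
```

The design choice worth noticing is that the stage list is plain data written by a human. The LLM does its work inside each agent, but the control flow around the agents stays deterministic.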

They are exploring a hybrid approach for the future: a required baseline set of security agents always runs (human-defined), and the orchestrator can optionally layer on additional specialized agents for vague or ambiguous review categories. This preserves predictability at the core while allowing flexibility at the edges.

Balancing Determinism and LLM Flexibility

Each specialized security agent operates with a defined set of required coverage areas — a baseline of questions that must be addressed for the threat model to be considered complete. This isn’t a constraint that limits what the LLM can reason about; it’s a floor that ensures completeness. The agent can and should identify risks beyond the baseline, but it cannot skip the baseline.

This approach directly mitigates a key reliability concern: an LLM left entirely unconstrained may omit critical risk categories, not because it can’t reason about them, but because nothing in the context made them salient.
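The baseline-as-a-floor idea can be sketched as a completeness check that runs after an agent finishes. The three required areas come from the baseline questions named earlier; the finding format (a dict with an `area` key) and the function name are assumptions made for illustration:

```python
# Required coverage areas, taken from the baseline questions described above.
REQUIRED_AREAS = {"data sensitivity", "transport protocols", "authentication"}

def missing_baseline(findings: list[dict]) -> set[str]:
    """Return the required areas that no finding addresses."""
    covered = {finding["area"] for finding in findings}
    return REQUIRED_AREAS - covered

# Hypothetical agent output: it found an extra risk but skipped authentication.
findings = [
    {"area": "data sensitivity", "note": "handles customer PII"},
    {"area": "transport protocols", "note": "internal traffic uses TLS"},
    {"area": "dependency hygiene", "note": "unpinned third-party package"},
]

gaps = missing_baseline(findings)
print(sorted(gaps))
```

Extra findings such as the dependency one pass through untouched, which matches the floor-not-ceiling intent: the check only complains about what is missing.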

Internal Guidance Tools as a Force Multiplier

A deliberate architectural choice was providing specialized agents with company-specific internal guidance tools rather than relying solely on the LLM’s general security knowledge. LLMs are capable of generic security reasoning; what they cannot do without help is reason about Stripe-specific infrastructure, internal tooling, and organizational risk tolerances.

By equipping agents with tools that surface internal documentation and guidance, the team made actionable, context-specific risk identification possible — risks with concrete mitigations relevant to how Stripe actually builds software, not textbook recommendations.

Security Routing Agent: Iterative Tool-Use Architecture

Three-phase routing agent evolution: one-step LLM call, full agentic structure, two-tool pruned architecture

The security routing agent had a simpler goal — route a developer’s question to the correct internal security team — but the open-ended nature of that task made it architecturally distinct.

The team’s evolution went through three phases:

Phase 1 — One-step LLM call (fast, inaccurate): A single prompt loaded with static pre-contextual information about Stripe’s security teams. No tools, no internet access, no internal document lookup. It ran fast but hallucinated regularly on any question involving internal Stripe terminology, tooling names, or team structures. The LLM had no way to resolve ambiguity — it invented answers.

Phase 2 — Full agentic structure (accurate, slow): The team removed the static context from the prompt and gave the agent a broad set of tools to research on its own. Accuracy improved substantially, but runtime ballooned to approximately 10 minutes — completely unusable for a conversational use case.

Phase 3 — Minimum viable tool set (accurate, fast): The team ran a systematic reduction process: starting from a baseline set of test questions with known correct answers, they removed tools one by one, retested accuracy, and kept only the tools whose removal degraded accuracy. After this iterative pruning, they arrived at two tools with a runtime of approximately 30 seconds — a 20× performance improvement over Phase 2, with accuracy preserved.

This reduction process is a reusable methodology: define a golden baseline of test cases, measure accuracy, prune tools incrementally, stop when accuracy drops. The minimum viable tool set for your agent may be much smaller than you think.
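As a rough sketch, the pruning loop looks like the following. The tool names and the `evaluate` stub are hypothetical; in practice `evaluate` would run the agent with the given tools against the golden questions and return measured accuracy:

```python
from typing import Callable, Sequence

def prune_tools(
    tools: Sequence[str],
    evaluate: Callable[[list[str]], float],
    tolerance: float = 0.0,
) -> list[str]:
    """Drop each tool whose removal does not reduce accuracy beyond tolerance."""
    kept = list(tools)
    baseline = evaluate(kept)
    for tool in tools:
        candidate = [t for t in kept if t != tool]
        if evaluate(candidate) >= baseline - tolerance:
            kept = candidate  # removal was harmless, so make it permanent
    return kept

def evaluate(tools: list[str]) -> float:
    # Stub: pretend only two of the tools actually contribute to accuracy.
    useful = {"doc_search", "team_directory"}
    hits = len(useful & set(tools))
    return {0: 0.3, 1: 0.6, 2: 0.9}[hits]

all_tools = ["web_search", "doc_search", "code_search", "team_directory", "ticket_history"]
print(prune_tools(all_tools, evaluate))
```

This version makes a single pass over the tools. The `tolerance` parameter is an assumption standing in for "did not degrade meaningfully", and a real run would want repeated evaluations per candidate to keep LLM non-determinism from driving the decision.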

Security Routing Agent Evolution: From One-Step LLM Call to Two-Tool Agentic Structure

Proof of Concept

Phase 1 — One-Step LLM Call (Baseline): The initial implementation was a single LLM call with no tool access, no internet connectivity, and no access to internal Stripe systems. The prompt contained a large block of pre-loaded contextual information describing the various security teams at Stripe and their responsibilities. This approach was fast and architecturally simple — no orchestration, no tool calls, no latency beyond the model inference time. However, it had a fundamental knowledge gap: the prompt context could not cover all internal Stripe-specific terminology, tools, or software projects. When users asked questions referencing internal systems or terms not present in the pre-loaded context, the model hallucinated — fabricating team assignments or routing guidance with no grounding in Stripe’s actual structure. This directly recreated the core problem the agent was meant to solve: developers were routed to the wrong team, forcing re-engagement and wasted time.
Phase 2 — Full Agentic Structure (Maximum Tool Access): To address the hallucination problem caused by missing internal context, the team shifted to a fully agentic architecture. The pre-loaded contextual information was removed from the prompt entirely. Instead, the model was instructed to research on its own using a broad set of tools — as many tools as could be provided to ensure complete information coverage. The agent was now able to look up internal documentation, terminology, and team structures dynamically. Accuracy improved significantly. However, the runtime cost was severe: the agent took approximately 10 minutes per query. For a security routing use case where developers need a near-real-time answer to get unblocked, a 10-minute response time is not conversational and not production-viable.
Phase 3 — Iterative Tool Pruning (Minimum Viable Tool Set): With accuracy validated at full tool access, the team began a slow, deliberate pruning process to reduce the tool set without sacrificing accuracy. The methodology was: establish a baseline set of known questions with known correct routing answers (ground truth); run the agent against this test set; remove one tool from the available set; retest accuracy against the same baseline questions; if accuracy did not degrade meaningfully, permanently remove that tool; repeat until further removal causes accuracy to drop. This iterative, measurement-anchored pruning process — tool by tool, test by test — is what distinguished this from guesswork. Each removal decision was grounded in empirical accuracy data, not intuition.
Phase 4 — Final Configuration (Two Tools, ~30 Seconds): After completing the pruning process, the team converged on a configuration with exactly two tools. The runtime dropped from 10 minutes to approximately 30 seconds — a 20x improvement — while maintaining the accuracy gains from the agentic phase. This configuration is the production-deployed version of the security routing agent.
Key Architectural Principle: The routing agent’s evolution illustrates a broader principle explicitly stated by the Stripe team: agent architecture depends on the task. The threat modeling agent, with its structured multi-step pipeline and parallel specialized agents, would have been wrong for security routing. Security routing involves open-ended questions from any developer about any security topic — a domain too wide for a pre-loaded prompt and too conversational for a 10-minute agentic run. The two-tool, 30-second agentic structure is right-sized for this task. There is no one-size-fits-all answer.

Architecture Depends on the Task

The central lesson from both agents is that there is no one-size-fits-all agent architecture for security automation. The threat modeling agent needed structure, layering, and sequential predictability because the output would drive consequential security decisions. The security routing agent needed speed and breadth because it was handling open-ended conversational queries.

Choosing the right architecture requires answering: What is the failure mode? What is the acceptable latency? How predictable does the pipeline need to be? The answers drive the design.

Actionable Takeaways

When designing a security automation agent, decide early whether it runs async (optimize for accuracy and depth) or conversational (optimize for latency and conciseness) — this single decision shapes every downstream architectural choice including tool selection, agent structure, and acceptable accuracy thresholds.
Use a sequential multi-agent pipeline with a defined baseline of required coverage areas for high-stakes security tasks like threat modeling. Give the orchestrator bounded agency, not full autonomy, to preserve predictable behavior across runs.
To find the minimum viable tool set for an agentic workflow, build a golden baseline of test cases with known correct answers, then prune tools one by one and retest — stop when accuracy degrades. This prevents bloated, slow agent pipelines.

Common Pitfalls

Giving the orchestrator agent full autonomy to select specialized sub-agents leads to inconsistent invocation — the agent may skip relevant specialized agents depending on prompt phrasing, ticket wording, or non-deterministic LLM behavior. The result is incomplete threat models that vary across runs for the same input.
Starting with a maximal tool set for an agentic workflow (providing every available tool "just in case") produces accurate but unusably slow agents. The security routing agent took 10 minutes to run in this configuration. Always prune toward a minimum viable tool set using accuracy-anchored testing.

Building an LLM Evaluation Pipeline for Security Agents

Why Deterministic Scoring Fails for Threat Modeling

When Stripe’s engineers first attempted to validate their threat modeling agent, they looked to deterministic matching: compare MITRE ATT&CK^[1] categories in the agent’s output against an expected set. The approach broke down immediately. The same underlying risk would be assigned to different MITRE categories across separate runs — not because the agent was wrong about the risk, but because category labeling is inherently ambiguous. Keyword matching fared no better. Neither method captures what actually matters in a threat model: the semantic meaning of the risk being communicated.

This is a fundamental challenge for AI security agents in production. Threat modeling is more art than science — there is no single correct answer, and any scoring system that treats it as a lookup problem will produce misleading accuracy numbers.
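To see why label matching misleads, consider a toy scorer that compares category labels by set overlap. The labels and the two runs below are illustrative assumptions, not data from Stripe's evaluation:

```python
def label_overlap(expected: set[str], actual: set[str]) -> float:
    """Jaccard overlap between expected and actual category labels."""
    if not expected and not actual:
        return 1.0
    return len(expected & actual) / len(expected | actual)

# Illustrative case: the reviewer filed a leaked-credential risk under one label.
expected = {"credential-access"}

# Two runs of the agent describe the same underlying risk,
# but the second run files it under a different category.
run_1 = {"credential-access"}
run_2 = {"initial-access"}

print(label_overlap(expected, run_1))
print(label_overlap(expected, run_2))
```

The second run scores zero even though, by assumption, it communicates the same risk. A scorer like this measures labeling consistency, not whether the threat model is right.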

LLM-as-Judge: Breaking the Cyclical Dependency

The team turned to LLM-as-judge^[2] evaluation — using a language model to score the semantic equivalence between the agent’s output and an expected result. But this surfaces a real epistemological problem: you’re using an LLM to evaluate the output of an LLM, which creates a cyclical dependency. If you don’t fully trust the agent to produce a correct threat model, why trust a separate LLM to correctly evaluate it?

Stripe’s answer was to anchor the evaluation in human judgment. Rather than asking the LLM to evaluate in isolation, they built a pipeline where humans define the ground truth:

Human-curated golden standard test cases — Security engineers created a set of expected outputs based on real, past security reviews. These gold standard cases represent what a competent security engineer would produce for a given input ticket.
LLM scores semantic equivalence — Given the gold standard expected output and the agent’s actual output, the LLM judge is tasked with a much narrower question: are these semantically equivalent? It evaluates whether the risks and mitigations in the agent’s output match the meaning of those in the gold standard — not whether they use the same labels or wording.
Accuracy is aggregated across the golden set — Each test case receives a score; overall accuracy is derived from the full set.

This hybrid model takes advantage of what humans do well (defining what a correct answer looks like) and what LLMs do well (semantic reasoning at scale across many test cases).
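The three steps above can be sketched as follows. Everything here is a hedged illustration: the judge prompt wording, the `complete` callable (standing in for whatever LLM client is used), and the fake agent and fake judge in the demo are all assumptions, not Stripe's implementation:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GoldenCase:
    ticket: str    # input the agent sees
    expected: str  # threat model written by a security engineer

def build_judge_prompt(expected: str, actual: str) -> str:
    # The judge gets a narrow question: equivalence, not correctness from scratch.
    return (
        "You are comparing two threat models.\n"
        f"EXPECTED (human-written):\n{expected}\n\n"
        f"ACTUAL (agent-written):\n{actual}\n\n"
        "Do they describe the same risks and mitigations, regardless of "
        "labels or wording? Answer YES or NO."
    )

def is_equivalent(expected: str, actual: str, complete: Callable[[str], str]) -> bool:
    answer = complete(build_judge_prompt(expected, actual))
    return answer.strip().upper().startswith("YES")

def accuracy(
    cases: list[GoldenCase],
    agent: Callable[[str], str],
    complete: Callable[[str], str],
) -> float:
    passed = sum(is_equivalent(c.expected, agent(c.ticket), complete) for c in cases)
    return passed / len(cases)

# Demo with canned stand-ins so the sketch runs without any model access.
cases = [
    GoldenCase("ticket A", "Data is sent unencrypted between services; enforce TLS."),
    GoldenCase("ticket B", "Vendor stores card data; require contractual controls."),
]

def fake_agent(ticket: str) -> str:
    if ticket == "ticket A":
        return "Traffic between services travels in plaintext; require TLS."
    return "No risks identified."

def fake_complete(prompt: str) -> str:
    # Canned judge: says YES only for the output that matches in meaning.
    return "YES" if "plaintext" in prompt else "NO"

print(accuracy(cases, fake_agent, fake_complete))
```

Because the human-written `expected` field carries the notion of correctness, swapping the judge model changes how equivalence is assessed but not what counts as a right answer.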

Using the Eval Pipeline to Gate Production Changes

The eval pipeline became the control gate for all changes to the threat modeling agent before they were released. The workflow operated in offline mode: any change to the agent’s prompt or architecture had to pass through the evaluation pipeline against the golden set before a pull request could be merged.
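A merge gate of this kind can be as small as a comparison that produces a process exit code. This is a sketch under assumptions: the accuracy values and the `max_drop` allowance are made-up parameters, and how the gate is wired into a CI system is left out:

```python
def gate_exit_code(candidate: float, baseline: float, max_drop: float = 0.02) -> int:
    """Return 0 (pass) if candidate accuracy stays within max_drop of baseline, else 1."""
    return 0 if candidate >= baseline - max_drop else 1

# Hypothetical numbers: a change that holds accuracy, and one that regresses it.
print(gate_exit_code(candidate=0.81, baseline=0.80))
print(gate_exit_code(candidate=0.70, baseline=0.80))

# In a CI job, the result would typically be passed to sys.exit() so that
# a nonzero code blocks the pull request.
```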

Two concrete outcomes from running this pipeline illustrate its value:

Prompt domain injection — +10% accuracy improvement: The team analyzed test cases with particularly low scores and used them to guide prompt improvements. Rather than overfitting the prompt to those edge cases, they used the failures to identify patterns: the agent wasn’t reliably considering certain security domains. Adding explicit security domains to always consider — such as authorization and single sign-on — produced a roughly 10% increase in overall accuracy across the golden set.

LLM model selection — +10% accuracy improvement: The team used the eval pipeline to systematically compare baseline LLM models for the threat modeling task. To reduce noise from the non-deterministic nature of AI outputs, they duplicated each golden test case to create a larger dataset, then ran each model candidate through the full set. The model with the highest aggregate score was selected. This comparison, which would have been impossible with manual review, yielded another ~10% accuracy gain.

The hidden regression catch: Perhaps the most important demonstration of the eval pipeline’s value came from a change that looked correct in isolation. A prompt update instructed the agent on how to format its JSON output. On a single manual run, the output appeared fine: correctly structured JSON, reasonable security advice. But the eval pipeline revealed that overall accuracy had dropped 10% across the golden set. The reason: the agent was now devoting attention to the JSON formatting instructions at the expense of the quality and depth of its actual security guidance. Without the pipeline, this regression would have shipped to production.

JSON Formatting Prompt Change That Silently Reduced Threat Model Accuracy by 10%

Proof of Concept

Baseline prompt in place: The threat modeling agent was operating with an existing prompt that instructed the LLM to act as a security engineer and generate threat models in a structured format. At this stage, the evaluation pipeline showed acceptable accuracy against the golden standard test cases.
Prompt change introduced: An engineer added a section to the prompt describing how to properly format the JSON output for the threat model. The intent was correct — the agent needed to emit valid, structured JSON consumed by Stripe’s internal threat modeling tool. On the surface, this looked like a helpful clarification.
Spot-check passes: On individual spot-check runs, the modified prompt appeared to work correctly. The agent was outputting a correctly formatted JSON threat model and was still producing security advice. No obvious regression was visible from manual inspection of a handful of outputs.
Evaluation pipeline reveals the regression: When the full golden standard test suite was run through the LLM-as-judge evaluation pipeline, the aggregate accuracy score dropped by 10% compared to the prior prompt version. The eval pipeline compared each agent output against the expected output in the golden standard test case using semantic equivalence scoring — not just syntactic correctness.
Root cause identified: The LLM was now spending a disproportionate share of its attention budget on satisfying the JSON formatting instructions — correctly structuring output keys, nesting, and format constraints — at the expense of the depth and accuracy of the actual security guidance it was generating. The model’s behavior shifted from “reason about security risks and express them” to “produce correctly formatted JSON that contains security-adjacent text.” The risks and mitigations communicated were less precise, less complete, and less accurate overall, even though the output was syntactically valid.
Remediation path: The prompt change was caught before merging to production because the offline evaluation pipeline was a required gate on pull requests touching the agent prompt. The engineer could iterate on prompt structure — for instance, separating JSON formatting concerns from security reasoning instructions, or using system-level formatting constraints rather than inline prompt instructions — and re-run the eval pipeline until accuracy was restored or exceeded baseline.
Broader lesson applied: This incident became a core argument for investing in the evaluation pipeline early. Without it, this type of silent regression — where individual outputs look fine but aggregate security quality degrades — would reach production and erode the value of the threat modeling agent without any clear signal of what changed or when.
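One of the remediation options mentioned above, separating formatting from security reasoning, could look roughly like this: the model is asked only for simple `risk | mitigation` lines, and ordinary code produces the JSON. The line format and the function name are assumptions for illustration; the source does not say which remediation the team ultimately chose:

```python
import json

def findings_to_json(raw: str) -> str:
    """Convert 'risk | mitigation' lines from the model into structured JSON."""
    findings = []
    for line in raw.splitlines():
        if "|" not in line:
            continue  # ignore free-form commentary between findings
        risk, mitigation = (part.strip() for part in line.split("|", 1))
        findings.append({"risk": risk, "mitigation": mitigation})
    return json.dumps({"findings": findings}, indent=2)

# Hypothetical model output: light structure, no JSON rules to follow.
model_output = """\
Service accepts unauthenticated requests | Require SSO in front of the endpoint
Some reasoning the model added between findings
Card data logged in plaintext | Redact sensitive fields before logging
"""

print(findings_to_json(model_output))
```

The formatting rules now live in code that cannot drift, so the prompt can stay focused on security reasoning. Any such change would still need to go through the eval pipeline before being trusted.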

The Security Routing Evaluation Approach

The evaluation model for the security routing agent was different by necessity. Security routing handles open-ended user questions with no predictable structure — there is no fixed expected output to compare against. Deterministic or even semantic scoring against a golden set is not practical.

Instead, the team used an iterative, user-feedback-driven evaluation loop:

Phase 1: Released the routing agent on an internal web page accessible to team members across Stripe. Collected qualitative feedback on routing accuracy.
Phase 2: Made the agent invocable in Slack to test real-world, context-dependent usage. Slack conversations introduced more ambiguous, incomplete queries — closer to how developers actually ask security questions.
Phase 3: After validating confidence from Phase 2, released the agent to all developers in Stripe’s internal chat UI for agents.
Ongoing: Ran demos and direct outreach to gather structured feedback from representative users.

This progressive bubble rollout — small internal group, Slack context, full developer population — gave the team a controlled feedback signal at each stage before expanding exposure. The open-ended nature of routing made user feedback the only reliable signal for accuracy.

Accuracy Thresholds and the Human Review Gate

One question the eval pipeline raises is: what accuracy threshold is good enough to ship? The answer depends directly on the deployment model and the cost of failure.

For the threat modeling agent, Stripe chose a target of approximately 80% accuracy for the initial rollout — lower than you might expect — but this threshold was made safe by a human-in-the-loop review gate. The agent gets security engineers roughly 90% of the way to a complete threat model; a human reviewer then confirms which risks and mitigations are applicable before the output is acted on.

This is a deliberate trade-off: if the agent were autonomously sending threat model outputs directly to engineering teams with no human review, the accuracy bar would need to be significantly higher, or those teams would be flooded with noisy, incorrect findings — recreating the exact problem the agent was meant to solve.

Scaling the Golden Set with Test Case Duplication

A practical limitation of LLM evaluation is non-determinism: the same input can produce meaningfully different outputs across runs, making single-run accuracy scores noisy. Stripe’s solution was simple and effective: duplicate each golden test case multiple times to create a larger evaluation dataset. Running the agent across many copies of each test case means accuracy scores reflect genuine capability rather than sampling variance. This technique is directly applicable to any team building eval pipelines for security agents.
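A small simulation illustrates why duplication helps. It assumes, purely for illustration, ten golden cases that each pass 80% of the time, and compares how much the measured accuracy wobbles across repeated eval runs with one copy versus ten copies of each case:

```python
import random
import statistics

def measured_accuracy(rng: random.Random, pass_rates: list[float], copies: int) -> float:
    # Each copy of a case is an independent, noisy run of the agent.
    results = [rng.random() < rate for rate in pass_rates for _ in range(copies)]
    return sum(results) / len(results)

def score_spread(copies: int, trials: int = 200, seed: int = 0) -> float:
    """Standard deviation of the accuracy score across repeated eval runs."""
    rng = random.Random(seed)
    pass_rates = [0.8] * 10  # assumed: 10 golden cases, each passing 80% of the time
    scores = [measured_accuracy(rng, pass_rates, copies) for _ in range(trials)]
    return statistics.pstdev(scores)

single = score_spread(copies=1)
duplicated = score_spread(copies=10)
print(f"spread with 1 copy per case:   {single:.3f}")
print(f"spread with 10 copies per case: {duplicated:.3f}")
```

The true capability is identical in both settings; only the measurement noise shrinks. That matters when comparing two prompts or two models whose real difference is a few points of accuracy.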

Actionable Takeaways

Anchor LLM-as-judge evaluation in human-curated golden standard test cases rather than letting the LLM evaluate in isolation — this breaks the cyclical dependency and ensures your accuracy scores reflect real-world security judgment, not just model self-consistency.
Run every prompt and architecture change through your eval pipeline against the full golden set before merging — spot-check runs on individual outputs are not sufficient and will miss regressions like the JSON formatting change that silently dropped accuracy by 10%.
Duplicate golden test cases to create a larger evaluation dataset, which reduces scoring noise from LLM non-determinism and makes model comparison (e.g., selecting the best flagship LLM for your task) statistically reliable rather than a single-run coin flip.

Common Pitfalls

Relying on deterministic scoring methods (MITRE category matching, keyword matching) for open-ended security tasks like threat modeling — these methods penalize correct risks that use different labels or terminology, producing misleading accuracy numbers that don't reflect actual agent quality.
Skipping the eval pipeline and trusting manual spot-checks: a prompt change that looks correct on a single run can silently reduce overall accuracy across the full task distribution, as Stripe's JSON formatting example demonstrated — the only way to catch this class of regression is aggregate scoring across a golden set.

Guardrails Design: Hallucination Mitigation and Input Quality

Hallucination as a Self-Defeating Failure Mode

One of the most critical challenges in deploying AI security agents in production is hallucination, and at Stripe this wasn’t treated as an abstract risk. It was framed as a system-level threat to the value of the entire program. As Zhang and Shah put it directly: hallucinations can recreate the exact problem you were trying to solve in the first place.

For the security routing agent, a hallucinating model that sent developers to the wrong team would force engineers back into the loop to untangle the mess — exactly the friction the agent was designed to eliminate. The failure mode wasn’t just incorrect output; it was compounding overhead that made the system net-negative.

For the threat modeling agent, the problem had a different shape: garbage input tickets. When a ticket was sparse or vague — missing details about data sensitivity, transport protocols, or the software system being reviewed — the LLM would hallucinate plausible-sounding but fabricated findings. In one example from the talk, a low-detail ticket caused the model to produce threat model entries mentioning encryption and key rotation, despite neither topic appearing anywhere in the ticket.

Teaching the Agent to Say “I Don’t Know”

The fix required a behavioral shift in the model, away from confident fabrication and toward the epistemic humility of a careful security engineer. The team explicitly wrote the prompt to treat incomplete information as a signal to surface uncertainty rather than fill gaps with invented content.

Concretely, this meant:

Flagging missing fields as unknown status — if a required piece of information (e.g., data sensitivity classification, authentication mechanism) wasn’t present in the ticket, the threat model would call it out explicitly as unknown rather than assume a value.
Setting up handoff context — rather than producing a complete-looking but fabricated threat model, the agent would output a partially complete model with clear notes on what a security engineer would need to resolve to assess each flagged risk.
Framing incompleteness as actionable — the agent’s output was designed to read: “if we can answer this question, we can determine whether this risk is mitigated or not.” That framing transforms a hallucination guard into a useful next-step cue.

This design shift reflects a core principle: the agent doesn’t need to produce a perfect output on every run. It needs to produce a trustworthy output — one a security engineer can act on without first having to audit it for fabrications.
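The output shape this implies can be sketched as a risk record with an explicit unknown status and an attached open question. In the system described here the behavior comes from prompt instructions rather than code like this, so treat the types, field names, and the `encrypts_in_transit` ticket field as hypothetical:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Status(Enum):
    MITIGATED = "mitigated"
    UNMITIGATED = "unmitigated"
    UNKNOWN = "unknown"

@dataclass
class Risk:
    title: str
    status: Status
    open_question: Optional[str] = None  # what a human must answer to resolve it

def assess_transit_encryption(ticket_fields: dict) -> Risk:
    title = "Data exposed in transit"
    value = ticket_fields.get("encrypts_in_transit")
    if value is None:
        # The ticket is silent, so report the gap instead of assuming a value.
        return Risk(title, Status.UNKNOWN, "Does this service encrypt data in transit?")
    return Risk(title, Status.MITIGATED if value else Status.UNMITIGATED)

sparse = assess_transit_encryption({})
detailed = assess_transit_encryption({"encrypts_in_transit": True})
print(sparse.status.value, "|", sparse.open_question)
print(detailed.status.value)
```

Making unknown a first-class status means a reviewer can filter for open questions directly, rather than auditing every finding to work out which ones were grounded in the ticket.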

Garbage Input Ticket Hallucination: Teaching the Threat Model Agent to Say “I Don’t Know”

Proof of Concept

Observe the failure mode on a garbage input ticket: A security review ticket is submitted with minimal specifics — no description of data sensitivity, no mention of data transport protocols, no authentication details, and no concrete system context. The ticket is functionally empty of the information a real security engineer would need.
Run the unmodified threat model agent against the sparse ticket: The LLM, prompted to produce a complete threat model, fills the information vacuum by hallucinating. It generates findings referencing concepts like encryption, key rotation, and authorization that are not mentioned anywhere in the ticket. The output looks superficially complete but is entirely fabricated: a threat model for a system that was never described.
Identify the root cause — hallucination under uncertainty: The core problem is that the agent’s prior prompt design rewarded producing a full, structured threat model regardless of input quality. With insufficient grounding data, the LLM did what LLMs do: generated statistically plausible content based on common security patterns, not the actual ticket.
Apply prompt engineering to enforce epistemic honesty: The prompt is modified to instruct the agent to act like a security engineer who is comfortable saying “I don’t know.” Concretely, the agent is instructed to: identify risks where the ticket provides insufficient information to determine exploitability or mitigation status; mark those risks with a status of unknown rather than asserting a finding; output the specific question (or questions) that, if answered, would allow the risk to be properly assessed. For example: “If we can determine whether this service encrypts data in transit, we can assess whether this risk is mitigated.”
Validate the fix against the garbage input ticket: The revised agent, when run against the same sparse ticket, no longer fabricates specifics. Instead it produces a threat model that calls out the information gaps, lists the risks that cannot be assessed given the available context, and sets the security engineer up for a productive follow-up conversation — acting as a structured starting point rather than a hallucinated conclusion.
Connect to the human-in-the-loop and conversational agent handoff: The output from this corrected behavior is intentionally designed to feed into Stripe’s conversational agent for follow-up. The threat model agent’s acknowledgment of unknowns — with the reasoning and open questions preserved — provides the necessary context and “proof of work” for the conversational agent to continue the review productively, rather than inheriting fabricated assumptions.
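
The behavior described in the steps above can be illustrated with a small sketch. It is not Stripe's implementation, and the fix in the talk was made in the prompt rather than in code; this only shows the shape of the desired output. The ticket field names (`transport`, `authentication`, `data_sensitivity`) and the risk labels are hypothetical stand-ins.

```python
# Hypothetical mapping: ticket field -> (risk it grounds, question that resolves it).
REQUIRED_CONTEXT = {
    "transport": (
        "Data exposed in transit",
        "Does this service encrypt data in transit?",
    ),
    "authentication": (
        "Unauthenticated access",
        "How do callers authenticate to this service?",
    ),
    "data_sensitivity": (
        "Sensitive data exposure",
        "What classification of data does this service handle?",
    ),
}


def flag_unknowns(ticket: dict) -> list[dict]:
    """Return one 'unknown' finding per context field the ticket leaves empty."""
    findings = []
    for field_name, (risk, question) in REQUIRED_CONTEXT.items():
        # Missing keys, None, and whitespace-only values all count as "not provided".
        if not str(ticket.get(field_name) or "").strip():
            findings.append(
                {"risk": risk, "status": "unknown", "open_questions": [question]}
            )
    return findings


sparse_ticket = {"title": "Review new service", "transport": ""}
for finding in flag_unknowns(sparse_ticket):
    print(f"{finding['risk']}: {finding['status']} -> {finding['open_questions'][0]}")
```

Each finding carries the question that would resolve it, which is the part a follow-up conversation can pick up.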

The Prompt Over-Specification Trap

A subtler hallucination-adjacent failure mode emerged during prompt iteration: over-specifying prompt instructions silently degrades overall accuracy. This was demonstrated through a concrete example in the talk.

The team added instructions to the threat modeling prompt directing the model on how to properly format a JSON output. On individual test runs, the change looked great — the model produced correctly structured JSON with security guidance included. But when run through the full evaluation pipeline against the golden test set, overall accuracy had dropped by 10%.

The cause: the model was now allocating more of its attention to satisfying the JSON formatting requirement and less to generating accurate security findings. The prompt had inadvertently shifted the model’s optimization target.

This is why spot-checking prompt changes is not sufficient for production security agents. A change that looks correct on three or four manual runs can still be silently degrading performance across the full distribution of inputs — a problem that only becomes visible when you have an eval pipeline measuring accuracy at scale.
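
A minimal sketch of such a gate, assuming each eval run has already been reduced to a pass/fail verdict per golden test case. The function names, the case IDs, and the `max_drop` parameter are illustrative and do not come from the talk.

```python
def accuracy(results: dict[str, bool]) -> float:
    """Fraction of golden test cases judged as passing."""
    if not results:
        raise ValueError("empty result set")
    return sum(results.values()) / len(results)


def gate_prompt_change(
    baseline: dict[str, bool],
    candidate: dict[str, bool],
    max_drop: float = 0.0,
) -> tuple[bool, list[str]]:
    """Accept the candidate prompt only if full-set accuracy does not regress.

    Also returns the case IDs that passed before and fail now, so the
    regression can be inspected rather than guessed at.
    """
    if baseline.keys() != candidate.keys():
        raise ValueError("both runs must cover the same golden test cases")
    regressed = sorted(c for c in baseline if baseline[c] and not candidate[c])
    ok = accuracy(candidate) >= accuracy(baseline) - max_drop
    return ok, regressed


# Toy data: the candidate prompt fails one case the baseline passed.
baseline = {f"case-{i}": i < 8 for i in range(10)}   # 8 of 10 pass
candidate = {f"case-{i}": i < 7 for i in range(10)}  # 7 of 10 pass

spot_check = ["case-0", "case-1", "case-2"]
print("spot check clean:", all(candidate[c] for c in spot_check))
print("gate result:", gate_prompt_change(baseline, candidate))
```

In the toy data the three spot-checked cases all pass, while the full-set comparison rejects the change and names the regressed case.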

Prompt Specificity as a Double-Edged Tool

This tension between specificity and generalization runs through the entire guardrails design philosophy at Stripe. On one hand, specificity helped: adding explicit security domains for the model to always consider (authorization, single sign-on) improved accuracy by roughly 10%, because it gave the model concrete anchors for what a complete threat model should cover.

On the other hand, over-specification in the wrong dimension (output format rather than security reasoning) actively harmed accuracy. The lesson is directional: prompt specificity should target the reasoning behavior you want, not the output mechanics. Telling the model to think like a security engineer improves security output. Telling the model how to format JSON shifts focus away from security reasoning.

Guardrails as System Design, Not Afterthought

What ties these observations together is a systems-design framing: guardrails aren’t a layer you bolt on after the agent is working. They are decisions about what the agent is allowed to say, how it handles uncertainty, and what counts as a complete versus incomplete output.

For teams building hallucination mitigation into security AI:

Define what “I don’t know” looks like in your output schema and make it a valid, expected response.
Treat prompt changes as potentially breaking changes — measure them against your full eval set, not just spot checks.
Distinguish between specificity that improves reasoning (security domain anchors) and specificity that crowds out reasoning (formatting directives).
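
For the first point, one way to make "I don't know" a first-class response is to encode it in the output schema. The sketch below is an assumption about what such a schema could look like, not the schema used at Stripe; the `evidence` and `open_questions` fields are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class RiskStatus(Enum):
    MITIGATED = "mitigated"
    UNMITIGATED = "unmitigated"
    UNKNOWN = "unknown"  # a valid, expected answer, not an error state


@dataclass
class RiskFinding:
    risk: str
    status: RiskStatus
    evidence: list[str] = field(default_factory=list)        # quotes from the ticket
    open_questions: list[str] = field(default_factory=list)  # what would resolve it

    def __post_init__(self) -> None:
        # An unknown finding must state what information is missing.
        if self.status is RiskStatus.UNKNOWN and not self.open_questions:
            raise ValueError("unknown findings must list at least one open question")
        # An assessed finding must be grounded in the ticket text.
        if self.status is not RiskStatus.UNKNOWN and not self.evidence:
            raise ValueError("assessed findings must cite evidence from the ticket")


finding = RiskFinding(
    risk="Data exposed in transit",
    status=RiskStatus.UNKNOWN,
    open_questions=["Does this service encrypt data in transit?"],
)
print(finding.status.value, finding.open_questions)
```

The two validation rules push the agent's output in the direction described above: ungrounded assertions are rejected, and unknowns are accepted only when they come with a question.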

Actionable Takeaways

Build an explicit "unknown" state into your threat model output schema. When the agent lacks sufficient input to assess a risk, it should surface the missing field and explain what answering it would resolve — this converts hallucination guards into actionable handoff notes for security engineers.
Treat every prompt change as a potentially breaking change and gate it against your full golden test set before merging. Spot-checking on a few manual runs is insufficient — silent accuracy regressions from well-intentioned changes (like adding JSON formatting instructions) are only detectable at scale.
When adding specificity to a security agent prompt, target the reasoning behavior (e.g., "always consider authorization and SSO") rather than output mechanics (e.g., "format your response as JSON with these fields"). Specificity that improves domain coverage helps; specificity that redirects attention to formatting hurts.

Common Pitfalls

Hallucinating on sparse input tickets: when a ticket lacks key details, the LLM will fabricate plausible-sounding but ungrounded findings (e.g., inventing encryption risks not mentioned anywhere). The fix is prompt-level instruction to flag missing information as unknown rather than infer values.
Over-specification prompt rot: adding formatting or structural directives to a prompt can silently shift the model's attention away from the core task. A prompt that produces correctly formatted output on individual runs may still be underperforming across the full input distribution — and this is only detectable with an eval pipeline.

Phased Rollout and Human-in-the-Loop for AI Security Agents

Phased Rollout: Starting in Shadow Mode

Deploying AI security agents in production is not a single event — it is a staged progression designed to build confidence before any output reaches end users. Stripe’s approach to rolling out the threat modeling agent illustrates a disciplined deployment ladder that security teams can directly replicate.

The first critical decision was domain scoping. Rather than deploying across all security review categories immediately, the team deliberately selected a constrained subcategory of security reviews — specifically those with highly similar risks and mitigations — where the agent was most likely to perform consistently. This narrowing of scope is not a limitation; it is a deliberate risk reduction strategy. Reviews with predictable, repeating risk patterns create a more stable target for automation and reduce the likelihood of hallucinated or irrelevant findings surfacing to engineers.

With the subcategory selected, the agent was deployed entirely in shadow mode. During this phase:

The agent runs against incoming real tickets in parallel with the human process.
Its output is never shown to end users.
Every output is evaluated against the LLM-as-judge pipeline using golden standard test cases.
The team iterates on prompt design and agent architecture based on accuracy scores without any production exposure.
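
The shadow-mode loop can be sketched as follows. This is a simplified illustration under stated assumptions: `agent` and `judge` are placeholder callables standing in for the real agent and the LLM judge, and the human reviewer's threat model for the same ticket is assumed to serve as the reference output.

```python
from typing import Callable


def run_shadow(
    tickets: list[dict],
    agent: Callable[[dict], str],
    judge: Callable[[str, str], bool],
    human_reviews: dict[str, str],
) -> dict:
    """Run the agent beside the human process and record scores only.

    There is deliberately no code path here that delivers agent output to users.
    """
    records = []
    for ticket in tickets:
        agent_output = agent(ticket)
        reference = human_reviews[ticket["id"]]
        records.append(
            {
                "ticket_id": ticket["id"],
                "equivalent": judge(agent_output, reference),
                "shown_to_users": False,
            }
        )
    accuracy = sum(r["equivalent"] for r in records) / len(records) if records else 0.0
    return {"accuracy": accuracy, "records": records}


# Stubs for illustration only.
def stub_agent(ticket: dict) -> str:
    return f"review authorization for {ticket['service']}"


def stub_judge(output: str, reference: str) -> bool:
    return output.strip().lower() == reference.strip().lower()


tickets = [{"id": "T-1", "service": "billing"}, {"id": "T-2", "service": "ledger"}]
human_reviews = {
    "T-1": "Review authorization for billing",
    "T-2": "Review encryption for ledger",
}
report = run_shadow(tickets, stub_agent, stub_judge, human_reviews)
print(report["accuracy"])
```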

This shadow mode phase is the right place to invest heavily in the evaluation pipeline — not after deployment, when the cost of errors is visible to end users.

Setting the Right Accuracy Threshold

One of the most practically useful insights from this deployment is how the required accuracy threshold depends directly on the failure mode of your agent’s output path.

Two deployment models were compared:

Fully autonomous output: If the threat modeling agent directly sends its findings as final threat models to engineering teams — without human review — then any inaccuracy is immediately visible and disruptive. Engineering teams receive noisy, incorrect security guidance, lose trust in the system, and escalate. In this model, a very high accuracy threshold is required before launch.

Human-in-the-loop output: If a security engineer reviews the agent’s output before it reaches engineering teams — confirming which threats are applicable and dismissing those that are not — the acceptable accuracy threshold drops significantly. The agent no longer needs to be perfect; it needs to be useful. Getting an engineer 90% of the way there is enough if the human makes the final call.

Stripe chose the human-in-the-loop model, targeting approximately 80% accuracy at initial launch and committing to continued iteration once the agent became visible to users. This is a deliberate and defensible choice: the human review gate absorbs the remaining uncertainty while still delivering material time savings on the security review queue.
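
The dependence of the launch bar on the output path can be written down as a small gate. In this sketch the 0.80 figure is the approximate target quoted above; the 0.95 value for autonomous output is a placeholder, since the talk only says the bar must be very high.

```python
from enum import Enum


class OutputPath(Enum):
    HUMAN_REVIEWED = "human_reviewed"  # a security engineer confirms findings first
    AUTONOMOUS = "autonomous"          # findings go straight to engineering teams


LAUNCH_THRESHOLDS = {
    OutputPath.HUMAN_REVIEWED: 0.80,  # approximate launch target from the talk
    OutputPath.AUTONOMOUS: 0.95,      # placeholder assumption, not a quoted number
}


def ready_to_launch(measured_accuracy: float, path: OutputPath) -> bool:
    """Compare eval-pipeline accuracy against the bar for the chosen output path."""
    return measured_accuracy >= LAUNCH_THRESHOLDS[path]


print(ready_to_launch(0.82, OutputPath.HUMAN_REVIEWED))
print(ready_to_launch(0.82, OutputPath.AUTONOMOUS))
```

The same measured accuracy clears one path and not the other, which is the point of tying the threshold to the failure mode rather than to an abstract quality bar.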

Output Format Differentiation for Different Audiences

A practical deployment challenge that emerged was that different consumers of the threat model require different output formats. The agent’s internal semantic representation — optimized for the LLM-as-judge evaluation pipeline — was not directly suitable for all downstream use cases.

Three distinct output formats were identified and supported:

Human-readable summary: Security engineers reviewing the output need a concise, narrative format that highlights the most relevant risks and mitigations clearly. Dense structured data does not serve this audience.
MITRE-structured format: Stripe’s internal threat modeling tool expected output mapped to the MITRE framework for metrics and tracking purposes. The agent’s output agents were configured to produce this structured format for tool consumption.
Conversational handoff format: For threat models on vague or underspecified review categories, the output is intentionally designed as a starting point for a follow-on conversation with a conversational AI agent. The key design requirement here is that the handoff payload must include sufficient context and reasoning — essentially proof of work from the threat modeling agent — so the conversational agent can continue effectively without re-processing the original ticket.

This output differentiation pattern is handled by dedicated output agents in the sequential pipeline, which receive the core analysis from security agents and render it into the appropriate format for each target audience.
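
A rough sketch of that rendering step, assuming the core analysis is a list of plain dictionaries. The field names (`mitre_id`, `reasoning`, `open_questions`) and the sample finding are illustrative, and the real output agents are LLM-driven rather than simple functions.

```python
import json


def render_summary(findings: list[dict]) -> str:
    """Short narrative list for the reviewing security engineer."""
    return "\n".join(f"- {f['risk']} ({f['status']})" for f in findings)


def render_structured(findings: list[dict]) -> str:
    """Machine-readable rows keyed by a MITRE identifier for internal tooling."""
    rows = [
        {"mitre_id": f["mitre_id"], "risk": f["risk"], "status": f["status"]}
        for f in findings
    ]
    return json.dumps(rows, sort_keys=True)


def render_handoff(findings: list[dict]) -> dict:
    """Payload for a conversational agent: keeps reasoning and open questions."""
    return {
        "findings": findings,
        "open_questions": [q for f in findings for q in f.get("open_questions", [])],
    }


RENDERERS = {
    "summary": render_summary,
    "structured": render_structured,
    "handoff": render_handoff,
}


def render(findings: list[dict], audience: str):
    return RENDERERS[audience](findings)


core_analysis = [
    {
        "risk": "Unauthenticated access",
        "status": "unknown",
        "mitre_id": "T1078",  # ATT&CK "Valid Accounts"; the mapping is illustrative
        "reasoning": "Ticket does not describe how callers authenticate.",
        "open_questions": ["How do callers authenticate to this service?"],
    }
]
print(render(core_analysis, "summary"))
```

Keeping one core analysis and several renderers means the security reasoning is produced once and is not re-derived per audience.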

Handling Incomplete Information Explicitly

A final production consideration was how the agent should behave when it encounters a ticket with insufficient information to complete a full threat model. The initial behavior — hallucinating risks based on assumed context — recreated the core problem the agent was meant to solve.

The correct behavior, modeled on how an experienced security engineer would respond, is to:

Explicitly flag where information is missing rather than filling in gaps with assumed context.
Mark the status of unknown risks clearly, so the reviewing engineer knows which findings are grounded in the ticket versus which require follow-up.
Frame the missing information as actionable questions — e.g., “If we can confirm the authentication mechanism, we can determine whether this risk is mitigated or not.”

This approach preserves the value of the threat model even when the input is incomplete. It sets the security engineer up for a productive next step rather than sending them back to verify whether the AI’s assumptions were valid.

Progressive Bubble Rollout for the Security Routing Agent

The security routing agent followed a different deployment ladder suited to its open-ended, conversational nature. Because security routing handles a much wider range of possible inputs than threat modeling, it relied more heavily on real user feedback than on offline accuracy metrics.

The rollout proceeded in three phases:

Internal web page access: The agent was made available via a dedicated web page accessible to team members and any Stripe employee who sought it out. This low-friction entry point allowed the team to gather initial feedback from engaged early adopters without broad exposure.
Slack invocable: After iterating on web page feedback, the agent was integrated into Slack as an invocable command. This tested context-dependence — how the agent handled questions embedded in real engineering workflows — and expanded the feedback pool to a broader internal audience.
Developer chat UI integration: Following Slack validation, the agent was released to all developers through Stripe’s internal chat UI for agents. This is “meeting developers where they are” — integrating the tool into the environment developers already use rather than requiring behavior change.

Throughout all three phases, the team actively ran demos and outreach to solicit direct feedback, compensating for the inherent difficulty of measuring open-ended routing accuracy through automated pipelines alone.

Actionable Takeaways

Deploy AI security agents in shadow mode first — run the agent against real tickets in parallel with the human process, evaluate every output against your golden test set, and iterate on accuracy before any output reaches end users. Do not expose agent output until you are confident in the accuracy threshold.
Set your accuracy threshold based on your output path, not an abstract quality bar. If a human reviews agent output before it reaches engineering teams, you can launch at ~80% accuracy and iterate in production. If the agent outputs directly to engineers without review, you need a much higher bar — and the human-in-the-loop gate is the mechanism that makes early launch viable.
Design your agent to handle incomplete input like a security engineer — explicitly flag missing information, mark unknown-status risks, and frame gaps as actionable questions. Never allow the agent to hallucinate assumptions into a threat model to fill in sparse tickets.

Common Pitfalls

Deploying the agent with fully autonomous output before it reaches a high accuracy threshold. Without a human review gate, the first wave of inaccurate findings will generate noise for engineering teams, erode trust in the system, and require significant cleanup effort — recreating the exact problem the agent was meant to solve.
Failing to differentiate output formats for different downstream consumers. A single output format optimized for LLM evaluation will not serve human reviewers, internal tooling with MITRE expectations, and conversational agent handoffs equally. Output agents that render the core analysis into audience-appropriate formats are a production necessity, not an enhancement.

Lessons Learned Shipping AI Agents for Security Teams

After shipping both the threat modeling agent and the security routing agent at Stripe, the team accumulated a set of hard-won lessons that cut across architecture, evaluation, and operational discipline. For security engineers planning their own AI security agents in production, these learnings are the difference between a maintainable system and one that slowly degrades into an unmanageable mess.

Alpha Evolve Prompt Optimization Does Not Work for Open-Ended Language Tasks

One of the more counterintuitive findings was that Alpha Evolve [3], Google DeepMind’s tool for evolving prompts through a natural-selection-style iteration process, failed to produce meaningful improvements for the threat modeling agent.

The theory is sound: take a base prompt, generate variations, score each variation through the eval pipeline, keep the best performer, repeat until the prompt is fully optimized. This works well for computational and mathematical problems where there are a finite number of meaningful permutations to explore.

For open-ended language tasks, the failure mode is that the mutation engine doesn’t know how to make semantically meaningful changes. In practice, the variations Alpha Evolve generated either:

Added two words without changing the semantic meaning of the prompt at all
Paraphrased the entire prompt — again, no net semantic change

Neither type of mutation changed the prompt’s guidance in a way that would alter what the agent actually outputs. The lesson: automated prompt evolution tools built for algorithmic domains don’t generalize to language task optimization, at least not within practical cost constraints. Manual, hypothesis-driven prompt engineering guided by your eval pipeline is more effective.

Alpha Evolve Prompt Optimization Applied to Security Agent Prompts

Proof of Concept

Baseline prompt established: The team started with a working base prompt for their threat modeling agent, which had already been tuned using their LLM-as-judge evaluation pipeline and golden standard test cases. This served as Generation 0 for the Alpha Evolve process.
Mutation generation initiated: Alpha Evolve was used to generate multiple variations of the base prompt. The tool’s natural-evolution mechanism produced mutations by making small changes to the prompt text — analogous to genetic mutation in biological evolution — with the goal of discovering a version that scored higher through the evaluation pipeline.
Variation 1 — Trivial word addition: The first mutation produced by Alpha Evolve added two words to the existing prompt. The addition did not semantically alter the intent or behavior of the prompt. No meaningful change in how the agent reasoned about security risks was introduced. This mutation failed to improve accuracy in any measurable way.
Variation 2 — Full paraphrase: The second mutation paraphrased the entire prompt — rewording sentences while preserving the same semantic content. Again, no new security guidance, no restructured reasoning strategy, and no additional context was introduced. The core meaning and instruction set remained identical to the original.
Evaluation pipeline scoring: Each mutation was scored against the golden standard test cases using the LLM-as-judge evaluation pipeline. Neither mutation produced a meaningful delta in semantic equivalence scores versus the baseline prompt. The evolutionary process stalled because there was no fitness gradient to climb — the mutations were too semantically similar to the parent to produce differentiated outcomes.
Root cause identified — open-ended language problem: The team concluded that Alpha Evolve’s strength lies in constrained computational and mathematical search spaces, where a finite set of permutations exists and even small changes can produce measurably different outputs. Natural language prompts are unbounded: the space of semantically meaningful prompt variations is enormous, and random mutations are overwhelmingly likely to be semantically equivalent to the original. Alpha Evolve cannot distinguish a truly novel prompt strategy from a trivial paraphrase.
Cost constraint compounds the failure: Even if longer evolutionary runs might eventually surface useful mutations, the cost of running each variation through the evaluation pipeline (LLM inference costs for both the agent and the judge) made it economically impractical to run the large number of generations required to escape the local optimum.
Outcome: manual prompt engineering retained. The team abandoned Alpha Evolve for prompt optimization and reverted to human-guided prompt iteration: using low-scoring test cases from the golden set to identify specific failure modes, then crafting targeted prompt changes (such as explicitly listing security domains like authorization and SSO to always consider). That targeted change alone improved accuracy by roughly 10%, a result that automated prompt evolution failed to produce.
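
The stall described in the steps above can be reproduced with a toy model. This is not Alpha Evolve's algorithm and not the team's setup: the fitness function is a deliberately crude stand-in for an eval score that responds only to semantic content, and the filler words, synonyms, and domain list are all invented for illustration.

```python
import random

DOMAINS = ("authorization", "single sign-on", "encryption in transit")
FILLERS = ("please", "carefully", "always", "thoroughly")
SYNONYMS = {"identify": "find", "risks": "threats", "review": "assess"}


def fitness(prompt: str) -> int:
    """Toy score: how many security domains the prompt names."""
    text = prompt.lower()
    return sum(domain in text for domain in DOMAINS)


def mutate(prompt: str, rng: random.Random) -> str:
    """Either insert a filler word or paraphrase with synonyms."""
    words = prompt.split()
    if rng.random() < 0.5:
        words.insert(rng.randrange(len(words) + 1), rng.choice(FILLERS))
    else:
        words = [SYNONYMS.get(w, w) for w in words]
    return " ".join(words)


def evolve(base: str, generations: int, population: int, seed: int = 0):
    rng = random.Random(seed)
    best = base
    history = [fitness(best)]
    for _ in range(generations):
        candidates = [mutate(best, rng) for _ in range(population)]
        top = max(candidates, key=fitness)
        if fitness(top) > fitness(best):  # keep a mutation only if it scores higher
            best = top
        history.append(fitness(best))
    return best, history


base_prompt = "review the ticket and identify risks including authorization"
best, history = evolve(base_prompt, generations=5, population=8)
print(history)

# A targeted, hand-written change moves the score where the mutations cannot.
print(fitness(base_prompt + " and single sign-on"))
```

Because neither mutation type can introduce a new domain, every candidate scores the same as its parent and selection has nothing to select on.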

Invest in Your Eval Pipeline Early — Before You Need It

The JSON formatting example is the clearest illustration of why investing in your LLM evaluation pipeline early matters. When the team added a prompt section describing how to properly format JSON output, individual spot checks looked fine — the agent was producing correctly structured JSON and still giving security advice. Without the eval pipeline, that change would have shipped.

With the eval pipeline in place, the data showed a 10% reduction in overall accuracy. The agent was now prioritizing JSON formatting instructions over the quality and completeness of the actual security guidance it generated.

The broader anti-pattern this reveals: without an eval pipeline, prompt changes optimize for edge cases that are visible and easy to check, while silently degrading performance on the general case. Over time, this accumulates into prompt rot — a bloated, over-specified prompt that handles a handful of known scenarios well but generalizes poorly.

The recommendation is direct: build your golden standard test cases and your LLM-as-judge scoring mechanism before you start iterating on prompts. Every prompt change should be validated against the full test set, not just the cases that motivated the change.
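
A minimal harness along these lines might look like the sketch below. The `GoldenCase` fields and the keyword-based `keyword_judge` are assumptions made for illustration; in the pipeline described here the judge is an LLM asked whether two outputs are semantically equivalent.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    ticket: str
    expected: str  # written by security engineers from real past reviews


Agent = Callable[[str], str]
Judge = Callable[[str, str], bool]  # (agent_output, expected) -> equivalent?


def evaluate(agent: Agent, judge: Judge, cases: list[GoldenCase]) -> dict[str, bool]:
    """Score every golden case; the judge only compares, it never decides correctness."""
    return {c.case_id: judge(agent(c.ticket), c.expected) for c in cases}


def keyword_judge(output: str, expected: str) -> bool:
    """Stand-in for the LLM judge: every expected keyword must appear in the output."""
    text = output.lower()
    return all(word in text for word in expected.lower().split())


def stub_agent(ticket: str) -> str:
    return "Check authorization and SSO configuration"


cases = [
    GoldenCase("g1", "New login flow for internal admin tool", "authorization sso"),
    GoldenCase("g2", "Nightly export of customer records", "encryption"),
]
results = evaluate(stub_agent, keyword_judge, cases)
print(results, sum(results.values()) / len(results))
```

Because the expected outputs are human-authored, swapping the judge or the agent does not change what counts as correct.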

Agent Architecture Must Match the Task Structure

A key architectural lesson is that there is no one-size-fits-all agent structure — the architecture must match the inherent structure of the task.

For threat modeling, a multi-step sequential child agent architecture worked well because:

The task is long-running and async — response latency is acceptable
The workflow has a natural sequential dependency (input context → specialized security analysis → formatted output)
Predictability matters — you always want input agents to run first, security agents in parallel, output agents last
Fully autonomous orchestration led to the orchestrator skipping specialized agents it should have run
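
The constrained execution order can be sketched with the standard library. This is an illustration of the ordering guarantee only; the agent functions are trivial stubs, and their names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_pipeline(
    ticket: str,
    input_agents: list[Callable[[dict], dict]],
    security_agents: list[Callable[[dict], str]],
    output_agents: list[Callable[[dict], str]],
) -> list[str]:
    """Input agents in order, every security agent in parallel, output agents last."""
    context = {"ticket": ticket}
    for agent in input_agents:  # stage 1: sequential context gathering
        context.update(agent(context))

    snapshot = dict(context)  # security agents only read this shared snapshot
    with ThreadPoolExecutor() as pool:  # stage 2: no agent can be skipped
        findings = list(pool.map(lambda agent: agent(snapshot), security_agents))

    analysis = {"context": snapshot, "findings": findings}
    return [agent(analysis) for agent in output_agents]  # stage 3: rendering


# Stub agents for illustration only.
def parse_ticket(ctx: dict) -> dict:
    return {"service": ctx["ticket"].split()[0]}


def authz_agent(ctx: dict) -> str:
    return f"authz review for {ctx['service']}"


def sso_agent(ctx: dict) -> str:
    return f"sso review for {ctx['service']}"


def summary_agent(analysis: dict) -> str:
    return "; ".join(analysis["findings"])


result = run_pipeline(
    "payments-api handles card data",
    [parse_ticket],
    [authz_agent, sso_agent],
    [summary_agent],
)
print(result)
```

The list of security agents is fixed by the caller, so which agents run does not depend on prompt phrasing or on an orchestrator's judgment.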

For security routing, a simpler agentic structure with minimal tools worked better because:

The task space is open-ended — any question could be a routing request
A sequential pipeline would be over-engineered for what amounts to a research-and-classify task
The iterative tool pruning process (starting broad, then removing tools one by one while monitoring accuracy) got the runtime from 10 minutes down to 30 seconds with no meaningful accuracy loss
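
The pruning loop in the last point can be sketched as a greedy ablation. The tool names, the toy scoring function, and the `max_accuracy_loss` tolerance are assumptions; in practice `evaluate` would be a full run of the eval pipeline with the given tool set.

```python
from typing import Callable


def prune_tools(
    tools: list[str],
    evaluate: Callable[[list[str]], float],
    max_accuracy_loss: float = 0.01,
) -> list[str]:
    """Drop tools one at a time, keeping each removal only if accuracy holds."""
    kept = list(tools)
    baseline = evaluate(kept)
    for tool in list(tools):
        trial = [t for t in kept if t != tool]
        if trial and evaluate(trial) >= baseline - max_accuracy_loss:
            kept = trial  # the tool was not pulling its weight
    return kept


def toy_evaluate(tools: list[str]) -> float:
    """Toy accuracy: only two of the hypothetical tools contribute anything."""
    contribution = {"code_search": 0.5, "team_directory": 0.35}
    return sum(contribution.get(t, 0.0) for t in tools)


all_tools = ["wiki_search", "code_search", "ticket_history", "team_directory"]
print(prune_tools(all_tools, toy_evaluate))
```

Fewer tools means fewer tool calls per request, which is where the runtime savings come from.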

The design heuristic: start by characterizing whether your task has natural sequential dependencies and whether predictability or flexibility is the higher priority — then choose the architecture accordingly.

Garbage Input Produces Garbage Output — Teach the Agent to Say “I Don’t Know”

The final and arguably most operationally important lesson is the garbage-in/garbage-out problem for AI threat modeling.

When the threat modeling agent received vague, sparse input tickets — tickets that lacked specifics about data sensitivity, transport protocols, authentication mechanisms, or system boundaries — the initial behavior was to hallucinate. The agent would generate threat models that referenced encryption, key rotation, and other specific risks that simply were not mentioned in or implied by the ticket. This is exactly the failure mode that erodes trust in AI security outputs and forces security engineers back into the loop to clean up fabricated findings.

The fix required an explicit behavioral change in the agent’s prompting: teach it to act like a security engineer, not like an LLM that must always produce an answer. A security engineer reviewing a vague ticket would say “there’s not enough information here to assess this risk” — not invent plausible-sounding threats to fill the gap.

Concretely, the agent was updated to:

Flag missing information explicitly in the output
Mark the status of risks as “unknown” when the input doesn’t support a determination
Set up the security engineer to ask the right follow-up questions rather than acting on fabricated findings

This also connects to the human-in-the-loop requirement: even at 80% accuracy, humans must review AI-generated threat models before they are acted on. The risk of a hallucinated threat model being treated as authoritative is too high — especially when the input ticket quality is variable and outside the agent’s control.

Actionable Takeaways

Do not use Alpha Evolve or similar automated prompt mutation tools for open-ended language tasks — they generate syntactic variations without semantic changes and will not improve agent performance within realistic cost budgets. Use hypothesis-driven manual prompt engineering guided by your eval pipeline instead.
Build your LLM evaluation pipeline with golden standard test cases before you begin iterating on prompts. Every prompt change must be scored against the full test set — not just the cases that motivated the change — to detect silent accuracy regressions like the JSON formatting example.
Explicitly prompt your threat modeling agent to handle sparse input by expressing uncertainty rather than hallucinating findings. Instruct it to flag missing information, mark unknown risks, and surface the questions a security engineer would need answered, rather than generating plausible-sounding but fabricated threats.

Common Pitfalls

Relying on spot checks and individual run quality instead of an eval pipeline to validate prompt changes. A change that looks correct on a handful of cases can reduce overall accuracy by 10% or more — as the JSON formatting prompt demonstrated — while appearing fine in manual review.
Assuming a fully autonomous orchestrator will always invoke the right specialized agents. When given too much agency, the orchestrator agent at Stripe skipped relevant specialized security agents. Predictable sequential pipelines with defined execution order are more reliable than fully autonomous orchestration for structured security workflows.

Conclusion

Stripe’s journey to production AI security agents reveals a clear through-line: reliability in AI security automation comes from treating every component of the system as a design decision, not a default. The choice between sequential and agentic pipelines, the investment in a golden standard evaluation framework, the deliberate use of shadow mode and human review gates — each decision was grounded in the specific failure modes and latency requirements of the task.

The most transferable insight is the eval pipeline as a production control gate. Building human-curated golden test cases and LLM-as-judge scoring before touching the prompt is the single intervention that catches silent regressions, enables model selection, and prevents prompt rot. Without it, AI security agents drift toward unreliability in ways that are invisible until they surface as user trust failures.

For security teams evaluating their own AI agent programs, the question isn’t whether to build evaluation infrastructure — it’s when. Stripe’s answer, borne out by the JSON formatting regression and the Alpha Evolve failure: build it first.

References & Tools

[1] MITRE ATT&CK Framework: Structured adversarial tactics and techniques framework used as an output format for threat model findings consumed by Stripe's internal threat modeling tool.
[2] LLM as Judge: Evaluation methodology using a language model to score semantic equivalence between agent-generated outputs and golden standard expected outputs, anchored in human-curated test cases.
[3] Alpha Evolve: Google DeepMind tool for evolving algorithms and prompts via natural-selection-style iteration; found ineffective for open-ended language prompt optimization due to lack of meaningful semantic variation in generated mutations.
