The Cyber Archive

FENRIR: AI Hunting for AI Zero-Days at Scale | [un]prompted...

Discover how Trend Micro's FENRIR engine chains SAST tools, fast LLM triage, and agentic sandboxes to find 60+ CVEs at $8.80 per true positive.


Peter Girnus & Derek Chen presenting talk - FENRIR: AI Hunting for AI Zero-Days at Scale at unprompted 2026

AI zero-day vulnerability discovery is no longer a research experiment — while your team manually triages 500 static analysis alerts per week, autonomous LLM agents are quietly churning through entire open-source ecosystems and finding critical CVEs that human researchers miss. FENRIR, Trend Micro’s production zero-day engine, has already submitted over 60 CVEs — all high or critical — with 100 more in pre-disclosure and 3,000 pending review.

For security engineers who want to understand how AI-driven vulnerability research actually works at production scale, this post breaks down FENRIR’s three-stage cascade architecture: from YARA/CodeQL signal collection, through fast LLM false-positive pruning, to deep agentic sandbox verification with auto-generated disclosure packages — and the hard-won lessons about model sizing, force-reflection, and context efficiency that make it reliable.

Key Takeaways

  • You'll learn how to architect a multi-stage cascade pipeline that combines traditional SAST tools with LLM triage to eliminate over 60% of false positives before any expensive model inference — drastically reducing cost and analyst fatigue.
  • You'll be able to apply a deliberate asymmetry strategy: bias your fast L1 triage toward recall (never drop true positives) and reserve heavyweight agentic reasoning for the small subset of candidates that survive, which is what keeps a cost of about $8.80 per true positive economically viable.
  • You'll see how to apply force-reflection and reachability-first filtering in your agentic verification stage to prevent shallow LLM reasoning, reduce hallucinations, and surface only high/critical CVEs with full auto-generated disclosure packages ready for vendor submission.

The FENRIR Pipeline Architecture: How AI Zero-Day Discovery Works at Scale

AI zero-day vulnerability discovery at production scale requires more than feeding a repository into an LLM and hoping for results. FENRIR, Trend Micro’s zero-day vulnerability discovery engine, is built on a deceptively simple principle: collect signal cheaply, denoise it aggressively, and only spend real compute on candidates that survive that gauntlet. The result is a system that has submitted over 60 high/critical CVEs — with 100 more in pre-disclosure and 3,000 pending review — while keeping costs economically viable.

The Three-Stage Cascade: Signal → Triage → Verify

FENRIR three-stage cascade pipeline: SAST signal collection, L1 LLM triage, and L2 agentic sandbox verification with cost annotations

FENRIR’s pipeline is organized into three discrete stages, each with a specific responsibility and cost profile:

Stage 1 — Static Analysis (Signal Collection) The pipeline begins with traditional deterministic tooling. YARA-X[1], Semgrep[2], CodeQL[3], and SpotBugs[4] each scan the target repository for different classes of vulnerability signal. This stage is intentionally cheap — no LLM tokens are spent here, and tools like YARA-X can scan millions of lines of code in seconds. The goal is broad signal collection, not precision. Raw findings at this stage number in the hundreds.

Stage 2 — L1 LLM Triage (Denoise and Refine) Surviving candidates move to a fast LLM triage layer. This stage makes no claims of verification — its sole job is to prune obvious false positives before anything expensive runs. A single LLM inference call with a pre-allocated 50-line context window eliminates over 60% of findings. The stage is deliberately biased toward recall: it is acceptable for a few false positives to survive, but dropping a true positive here is not an acceptable outcome.

Stage 3 — L2 Deep Agentic Verification (Classify and Present) The survivors of L1 triage reach the most expensive stage. A heavyweight agentic system — Claude Opus[5] deployed into an isolated secure sandbox — performs deep, multi-turn analysis with full code context, execution privileges, and built-in force-reflection. This is where a “maybe” becomes a verified “yes” or “no” with supporting evidence. The output is a high-confidence finding with an auto-generated disclosure package ready for human review.

The entire flow reduces 500 raw SAST alerts down to 10–25 high-confidence true positives that a human researcher can validate, assess for exploitability, and submit to a vendor.
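As a rough illustration of the cascade shape, the sketch below chains stages so that each one only sees the survivors of the previous one. The `Finding` type, `run_cascade`, and both stage functions are hypothetical stand-ins, not FENRIR's code; the real L1 and L2 stages are LLM calls rather than the toy filters used here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Finding:
    file: str
    line: int
    cwe: str


# A stage takes candidate findings and returns the ones that survive it.
Stage = Callable[[List[Finding]], List[Finding]]


def run_cascade(findings: Iterable[Finding], stages: List[Stage]) -> List[Finding]:
    """Apply stages in order (cheapest first); stop early if nothing survives."""
    survivors = list(findings)
    for stage in stages:
        if not survivors:
            break
        survivors = stage(survivors)
    return survivors


def l1_triage(findings: List[Finding]) -> List[Finding]:
    # Toy stand-in for the fast LLM filter: drop only obvious noise (test files).
    return [f for f in findings if not f.file.startswith("tests/")]


def l2_verify(findings: List[Finding]) -> List[Finding]:
    # Toy stand-in for agentic verification: pretend only CWE-78 is confirmed.
    return [f for f in findings if f.cwe == "CWE-78"]


raw = [
    Finding("app/run.py", 10, "CWE-78"),
    Finding("tests/test_run.py", 5, "CWE-78"),
    Finding("app/view.py", 40, "CWE-79"),
]
verified = run_cascade(raw, [l1_triage, l2_verify])
```

Ordering the list of stages from cheapest to most expensive is the whole design: the expensive stage never runs on a finding that a cheaper stage could have removed.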

The Bidirectional Intelligence Loop: FENRIR and Mimir

FENRIR and Mimir bidirectional intelligence loop: zero-day discovery feeds n-day tracking which re-triggers FENRIR on patch release

FENRIR does not operate in isolation. It sits within a broader unified platform that includes Mimir, a component focused on n-day vulnerability research and threat intelligence. The two systems form a bidirectional intelligence feedback loop that compounds the value of each discovery.

The NVIDIA Isaac robotics framework command injection vulnerability illustrates how this loop works in practice. When FENRIR discovered and submitted that zero-day to NVIDIA, Mimir immediately began tracking it as an n-day. When NVIDIA released the patch, 23 autonomous LangGraph[6] agents on the Mimir side automatically retrieved the advisory page and the exact code commit, then re-scanned the repository. FENRIR was then able to process that updated codebase and discover two additional bugs: one newly introduced regression and one patch bypass. A single zero-day submission seeded the intelligence pipeline that found two more.

This feedback loop matters because the value of AI-driven zero-day discovery is not purely in the initial find — it compounds as patched code generates new attack surface that autonomous agents can immediately re-examine.

Why Separating Cheap Signal from Expensive Inference Is the Core Architectural Insight

The most important design decision in FENRIR is not any individual tool or model choice — it is the strict separation of signal collection from LLM inference, and the aggressive filtering between every stage.

Teams that skip this separation and feed entire repositories directly to an LLM encounter two predictable problems: the model returns an enormous volume of findings, most of which are not vulnerabilities, and the token cost scales with repository size rather than finding quality. FENRIR inverts this by using deterministic tools to create a precisely targeted context window before any LLM call is made.

The cascade architecture — quick elimination up front, deep analysis only for survivors — is what makes the cost model work. By the time a finding reaches the $0.61 median-cost L2 agentic stage, it has already survived two prior filters. The expensive reasoning is reserved for the small population of candidates where that reasoning actually changes the outcome.

Origins and Evolution: From MCP Server to Production Agentic System

FENRIR began in early 2025 as an MCP (Model Context Protocol)[7] server component — a natural starting point, given that MCP had just been released and represented a clean transition from direct tool-calling to agent-oriented architectures. The system was designed from the start with context efficiency and cost as first-class constraints, since the early reasoning models available at that time had limited context windows and high per-token costs.

As context windows expanded past one million tokens and more capable reasoning models entered production, FENRIR evolved to take advantage of those capabilities — particularly at the L2 agentic stage. The architecture has survived multiple generations of underlying LLMs because the pipeline itself is model-agnostic: each stage specifies the capability class it needs (fast and cheap, or powerful and thorough), not a specific model version. That design decision is what allows the system to improve as the underlying models improve without requiring a full redesign.
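One way to picture a model-agnostic pipeline is a small indirection between stages and models. The sketch below is an assumption about how that could look, not FENRIR's configuration: the `Capability` enum, the stage names, and the placeholder model identifiers are all invented for illustration.

```python
from enum import Enum


class Capability(Enum):
    FAST_RECALL_BIASED = "fast_recall_biased"  # cheap, high-throughput triage
    DEEP_AGENTIC = "deep_agentic"              # multi-turn, tool-using reasoning


# Stages declare what they need, never which model they use.
STAGE_REQUIREMENTS = {
    "l1_triage": Capability.FAST_RECALL_BIASED,
    "l2_verify": Capability.DEEP_AGENTIC,
}

# The only place a concrete model appears (placeholder identifiers).
MODEL_REGISTRY = {
    Capability.FAST_RECALL_BIASED: "sonnet-class-model",
    Capability.DEEP_AGENTIC: "opus-class-model",
}


def model_for(stage: str, registry: dict = MODEL_REGISTRY) -> str:
    """Resolve a stage to a model through its capability class."""
    return registry[STAGE_REQUIREMENTS[stage]]


# Swapping in a newer model is a registry change, not a pipeline change.
upgraded = dict(MODEL_REGISTRY)
upgraded[Capability.DEEP_AGENTIC] = "next-generation-model"
```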

Actionable Takeaways

  • Architect your AI-assisted vulnerability pipeline as a cascade with explicit cost gates: run deterministic SAST tools first to generate broad signal at near-zero cost, apply a fast recall-biased LLM filter to eliminate 60%+ of findings, and reserve expensive multi-turn agentic reasoning only for the small surviving set where deep analysis changes the outcome.
  • Build a bidirectional intelligence loop between your zero-day and n-day workflows: when a zero-day is submitted and a vendor patches it, automatically re-scan the patched repository. Patch regressions and bypass vulnerabilities are consistently discoverable this way and represent high-value, low-competition targets.
  • Design your pipeline stages around capability classes rather than specific model versions. Specify that Stage 2 requires "fast and recall-biased" and Stage 3 requires "best security domain knowledge and agentic capability" — this allows you to swap models as the ecosystem evolves without redesigning the pipeline logic.

Common Pitfalls

  • Feeding entire repositories directly to an LLM without a prior deterministic filtering stage produces overwhelming noise. Most signals returned are not vulnerabilities, and the cost scales with repository size rather than finding quality. The cascade architecture exists precisely to prevent this failure mode.
  • Treating the L1 triage stage as a verification step rather than a noise-reduction step leads to incorrect architectural decisions — such as using an overly large model at L1 or requiring high precision when the stage should be optimized for recall. L1's only job is to not drop true positives; false positives surviving L1 are expected and handled downstream.

SAST Toolchain and Multi-Scanner Correlation: Building the Signal Layer

The Case for Deterministic Signal Before LLM Inference

The most consequential design decision in AI zero-day vulnerability discovery is what happens before a single token is sent to a language model. FENRIR’s static application security testing (SAST) layer is not a preprocessing step; it is the primary cost-control mechanism for the entire pipeline. Every false positive eliminated here avoids an LLM inference call that can cost anywhere from a few cents at L1 to several dollars at L2 agentic verification.

The core insight the FENRIR team validated is simple: giving an LLM an entire repository and asking it to “find vulnerabilities” does not work. The model returns an enormous volume of findings, most of which are not real vulnerabilities, and the context window fills with irrelevant code. Traditional SAST tools, by contrast, are deterministic, fast, and designed to surface structured signals (specific line numbers, CWE classes, and code patterns) that give LLMs the precise context they actually need.

The Four-Tool Cascade

FENRIR’s static analysis layer is deliberately structured as a cascade, not a parallel scan. Each tool in the chain is faster and broader than the next, and the output of each stage feeds the next as a narrowed candidate set.

1. YARA-X — The Fastest Pre-Filter

YARA-X is the entry point in the cascade. Its defining characteristic is raw scanning speed: it can process millions of lines of code in a matter of seconds. FENRIR uses it as a broad initial sweep to generate minute signals — pattern matches that indicate potentially interesting code regions. The FENRIR team has previously used YARA-based rules to find real vulnerabilities in the wild, establishing confidence in it as a production-grade pre-filter.

The tradeoff is precision. YARA-X is intentionally imprecise at this stage; its job is not to confirm vulnerabilities but to eliminate the vast bulk of code that cannot possibly be interesting. Speed is the priority.

2. Semgrep — Pattern Matching With Rule Set Precision

After YARA-X narrows the candidate set, Semgrep runs against the filtered results. Semgrep is significantly more precise than YARA-X, particularly when backed by a large, well-maintained rule set. It operates on syntax-aware pattern matching and can express more complex structural conditions in code.

The cascade design here is deliberate: Semgrep would be slower and more resource-intensive if run against a full repository cold. By receiving a pre-filtered candidate set from YARA-X, it operates at much higher efficiency while delivering better signal quality.

3. CodeQL — Data-Flow and Taint Analysis

CodeQL performs taint analysis and data-flow tracking — the most compute-intensive analysis in the static layer. Unlike YARA-X and Semgrep, which primarily match patterns, CodeQL can follow how untrusted input propagates through a codebase, crossing function boundaries and tracking variable assignments across files.

This is where the cascade architecture pays off most clearly. CodeQL’s depth comes at a cost: it is substantially heavier than either YARA-X or Semgrep. Running it against millions of lines of unfiltered code would be prohibitively expensive. Running it against the narrowed candidate set from the previous two stages produces high true-positive conversion rates without wasted compute.

4. SpotBugs — Java Binary Analysis

SpotBugs (used via the FindSecBugs extension in FENRIR’s configuration) handles a specific gap in the toolchain: compiled Java binaries. When a target repository includes Java artifacts where source-level analysis is unavailable or incomplete, SpotBugs provides static analysis at the bytecode level.

This makes the SAST layer comprehensive across language surfaces, not just source code repositories.

Multi-Scanner Correlation: Turning Redundancy Into Confidence

Running four tools against the same codebase is not redundant — it is the foundation of FENRIR’s multi-scanner correlation algorithm, one of the features the team identifies as a key differentiator.

The correlation logic is straightforward but powerful: if two or more independent tools flag the same approximate location in code (within plus or minus 15 lines) for the same CWE class, that convergence is treated as high-confidence signal. The reasoning mirrors how experienced security researchers think: when separate analysis methods arrive at the same finding, the probability that it represents a real vulnerability increases substantially.

This cross-tool agreement serves two purposes:

  • Confidence scoring: Correlated findings are weighted more heavily in downstream prioritization.
  • Context enrichment: The correlation data becomes metadata attached to the finding as it progresses through L1 and L2 triage. The LLM triager at L1 does not receive a raw code snippet — it receives a finding that already carries information about which tools flagged it and why. This structured context improves the quality of the LLM’s judgment without requiring it to re-derive that context from scratch.
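The correlation rule described above (same CWE class, within 15 lines, two or more distinct tools) can be sketched in a few lines. This is a minimal illustration under assumptions: `ToolFinding` and `correlate` are hypothetical names, matching on the same file is assumed, and the quadratic comparison is chosen for clarity rather than speed.

```python
from dataclasses import dataclass
from typing import Dict, List

LINE_WINDOW = 15  # plus or minus 15 lines, as described in the text


@dataclass(frozen=True)
class ToolFinding:
    tool: str
    file: str
    line: int
    cwe: str


def correlate(findings: List[ToolFinding], window: int = LINE_WINDOW) -> Dict[ToolFinding, List[str]]:
    """Map each finding to the sorted list of distinct tools that agree with it."""
    result = {}
    for f in findings:
        tools = {
            g.tool
            for g in findings
            if g.file == f.file and g.cwe == f.cwe and abs(g.line - f.line) <= window
        }
        result[f] = sorted(tools)  # becomes metadata for L1 and L2
    return result


findings = [
    ToolFinding("semgrep", "app/run.py", 100, "CWE-78"),
    ToolFinding("codeql", "app/run.py", 110, "CWE-78"),
    ToolFinding("yara-x", "app/run.py", 300, "CWE-78"),   # too far away
    ToolFinding("semgrep", "app/run.py", 105, "CWE-79"),  # different CWE class
]
agreement = correlate(findings)
high_confidence = [f for f, tools in agreement.items() if len(tools) >= 2]
```

Because agreement is stored as a set of tool names, one tool reporting the same spot twice does not count as corroboration.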

Zero Token Cost at This Stage

A critical operational point: the entire static analysis layer, including all four tools and the correlation pass, runs without spending a single token on LLM inference. Everything at this stage is deterministic, fast, and reproducible.

This is not incidental — it is a deliberate architectural constraint. The FENRIR team was explicit that context efficiency and cost control were primary design requirements from the moment they conceptualized the system. By exhausting deterministic signal collection before any LLM call, they ensure that when tokens are spent, they are spent on findings that have already survived multiple independent analytical filters.

The practical outcome: raw repository findings that number in the hundreds are reduced to a much smaller candidate set before L1 triage even begins, without any LLM cost whatsoever.

Actionable Takeaways

  • Structure your SAST toolchain as a cascade ordered by speed and compute cost — run the fastest, broadest tools first (YARA-X for pattern sweeps) and reserve expensive data-flow analysis (CodeQL) for the already-narrowed candidate set. This prevents wasted compute on unfiltered repositories.
  • Implement multi-scanner correlation by tracking which findings are independently flagged by two or more tools within the same CWE class and approximate code location (a 15-line window is a practical heuristic). Use correlated findings as higher-confidence inputs to downstream triage stages, and attach the correlation metadata as structured context for any LLM that processes them.
  • Enforce a strict separation between deterministic signal collection and LLM inference. Never send an LLM a raw repository or unfiltered finding list. The static analysis layer should reduce the candidate set to the point where every LLM call is operating on pre-validated signal — this is the primary mechanism for controlling inference cost at scale.

Common Pitfalls

  • Running heavy analysis tools like CodeQL against unfiltered repositories bypasses the cost-saving logic of the cascade entirely. If CodeQL is run first or in parallel without pre-filtering, the compute cost and time-to-completion scale with total codebase size rather than with the number of plausible vulnerabilities — negating most of the efficiency gains the cascade architecture is designed to deliver.
  • Treating multi-tool agreement as confirmation rather than signal elevation is a category error. The multi-scanner correlation algorithm increases confidence and prioritizes findings — it does not verify them. Skipping L1 and L2 triage for correlated findings would re-introduce false positives that the rest of the pipeline is designed to eliminate.

L1 LLM Triage: Asymmetric False-Positive Reduction Without Dropping True Positives

The Asymmetry Principle: Recall Over Precision at L1

The L1 triage stage is the most misunderstood component of FENRIR’s pipeline — and getting its design philosophy wrong is one of the most expensive mistakes a team can make. The L1 stage is not a decision-maker. It does not verify vulnerabilities. It does not confirm exploitability. Its sole mandate is to eliminate obvious false positives before they consume expensive multi-turn LLM compute at the L2 stage.

This deliberate asymmetry is the key insight: the L1 stage is biased toward recall. If it is uncertain, it must let the finding through. What it cannot do — under any circumstances — is drop a true positive. A missed vulnerability at L1 means it never reaches the agentic verification stage. Falsely discarding a real CVE is the worst possible outcome; tolerating a few extra false positives at L1 is entirely acceptable, because L2 will catch them.

FENRIR’s production evaluation confirms this design holds: through their evaluation harness, L1 has not dropped a single finding that L2 would have classified as a true positive. That is the bar. Build to that bar.
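A harness that enforces this bar can be very small. The sketch below assumes a hypothetical three-way L1 verdict and treats the L2 result as ground truth; the verdict labels, `route`, and `evaluate` are illustrative names, and the sample records are made up.

```python
from typing import Dict, List, Tuple


def route(l1_verdict: str) -> bool:
    """Forward to L2 unless L1 explicitly says false positive.

    Anything else (including 'uncertain') passes through, which is the
    recall bias described above.
    """
    return l1_verdict != "false_positive"


def evaluate(records: List[Tuple[str, bool]]) -> Dict[str, float]:
    """records: (l1_verdict, l2_says_true_positive) pairs from a labeled run."""
    if not records:
        return {"dropped_true_positives": 0, "elimination_rate": 0.0}
    dropped = [r for r in records if not route(r[0])]
    dropped_tp = sum(1 for _, is_tp in dropped if is_tp)
    return {
        "dropped_true_positives": dropped_tp,
        "elimination_rate": len(dropped) / len(records),
    }


# Made-up sample: three clear false positives, one uncertain, one real bug.
sample = [
    ("false_positive", False),
    ("false_positive", False),
    ("false_positive", False),
    ("uncertain", False),
    ("plausible", True),
]
report = evaluate(sample)
```

The pass criterion is `dropped_true_positives == 0`; the elimination rate is only worth optimizing once that holds.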

Pre-Allocated Static Context: Why 50 Lines Is Enough

One of the most practically impactful decisions in the L1 design is the pre-allocated 50-line static context window. Rather than feeding the LLM full file contents or entire repository context, the stage provides exactly 50 lines of code surrounding the flagged location.

This constraint is intentional and consequential:

  • Token cost drops dramatically. A single Claude Sonnet[8] call with 50 lines of context is a fraction of the cost of a multi-turn Opus session with full file access.
  • Latency decreases. Smaller context means faster inference, which means higher throughput across the full pipeline.
  • Precision improves. The 50-line window gives the model exactly what it needs to assess whether a finding is plausibly real — not so little that it lacks context, not so much that signal is buried in noise.

The earlier SAST stages (YARA-X, Semgrep, CodeQL) have already pinpointed the suspicious code location. By the time a finding reaches L1, you know the file, the line range, and the CWE class. The 50-line context captures that signal cleanly.
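Extracting a fixed window around a flagged line is straightforward. The sketch below is one possible reading of "50 lines surrounding the flagged location": it centers the window on the flagged line and shifts it when the line is near the start or end of the file. The centering choice and the `context_window` name are assumptions.

```python
def context_window(source: str, flagged_line: int, total: int = 50) -> str:
    """Return up to `total` lines around a 1-indexed flagged line."""
    lines = source.splitlines()
    half = total // 2
    start = max(0, flagged_line - 1 - half)
    end = min(len(lines), start + total)
    # If we hit the end of the file, pull the start back to keep the window full.
    start = max(0, end - total)
    return "\n".join(lines[start:end])


# Synthetic 200-line file so the sketch runs without any real repository.
demo_source = "\n".join(f"line {n}" for n in range(1, 201))
middle = context_window(demo_source, 100).splitlines()
near_end = context_window(demo_source, 199).splitlines()
```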

CWE-Specific Prompting: Narrow the Target

Generic prompting — “is this a vulnerability?” — is the wrong approach at L1. FENRIR uses CWE-specific prompts that ask a narrower question: “Can this possibly be CWE-79?” (or whichever class the static analysis flagged).

This matters because it:

  • Reduces the model’s output space. Instead of reasoning about all possible vulnerability classes, the model evaluates one specific class, which lowers the cognitive load and improves accuracy.
  • Aligns with the upstream SAST signal. The static analysis chain already identified a CWE class. The L1 prompt reinforces that classification hypothesis rather than asking the model to derive it from scratch.
  • Produces consistently structured outputs. A narrower question yields a cleaner yes/no triage decision, which is exactly what the pipeline needs to route findings forward or discard them.

The lesson is generalizable: specific prompting outperforms generic prompting in security triage tasks. This was confirmed through FENRIR’s own evaluation harness.
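A CWE-specific prompt can be assembled from what the static layer already knows. The wording below is purely illustrative (FENRIR's actual prompts were not shown), and `build_l1_prompt`, the answer labels, and the short CWE name table are assumptions.

```python
from typing import List

# Short names for a few well-known CWE classes; extend as needed.
CWE_NAMES = {
    "CWE-78": "OS Command Injection",
    "CWE-79": "Cross-site Scripting",
    "CWE-89": "SQL Injection",
}


def build_l1_prompt(cwe: str, snippet: str, flagged_by: List[str]) -> str:
    """Ask one narrow question about one CWE class, biased toward recall."""
    name = CWE_NAMES.get(cwe, "unlisted weakness class")
    tools = ", ".join(flagged_by) or "unknown"
    return (
        f"Static analysis ({tools}) flagged the code below as possible {cwe} ({name}).\n"
        f"Can this possibly be {cwe}? Answer POSSIBLE or NOT_POSSIBLE.\n"
        "If you are unsure, answer POSSIBLE.\n\n"
        f"{snippet}"
    )


prompt = build_l1_prompt("CWE-79", "return '<p>' + name + '</p>'", ["semgrep", "codeql"])
```

Constraining the answer to two labels also makes the routing decision trivial to parse.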

Model Sizing: The Critical Tradeoff

Choosing the right model for L1 is not obvious, and the FENRIR team learned this through direct experimentation. The failure modes at both ends of the size spectrum are instructive.

Too small — Qwen3 0.6B: The team tried a compact 0.6B parameter model expecting it would be “quick” and serve as a cost-effective classifier. It failed completely. The model was constantly dropping true positives — exactly the behavior L1 must never exhibit. A model that lacks sufficient domain knowledge and reasoning capacity cannot reliably distinguish “obviously not a vulnerability” from “plausibly a vulnerability but hard to confirm.” It collapses borderline cases in the wrong direction.

Too large — Claude Opus: Opus would technically work. It has the security domain knowledge and reasoning capability to handle L1 triage accurately. But using Opus at L1 is a waste of resources. It is slower than Sonnet, limiting throughput across the pipeline. It costs significantly more per call. For a stage that processes hundreds of findings before expensive verification begins, the economics are wrong.

The sweet spot — Claude Sonnet: Sonnet delivers the right balance: fast enough to maintain throughput, capable enough to reliably catch true positives, and cheap enough that the cost savings at L1 justify the entire cascade architecture. The operational impact is substantial: over 60% of findings are eliminated by a single Sonnet call, preventing them from reaching the L2 agentic stage with its multi-turn tool calling and Opus-level costs.

The contrast the FENRIR team highlights is stark: 50 lines of code to a Sonnet call versus 12 turns of Opus with full context. For findings that are clearly false positives, that cost difference is orders of magnitude.

Operational Impact: What 60% Elimination Actually Means

The numbers tell the pipeline’s story. Starting from the static analysis output, L1 triage cuts the finding volume by more than half before any expensive inference occurs. In the live AON UI demo shown during the talk, 122 static analysis findings were reduced to a smaller candidate set by L1 — with all true positives preserved for L2 verification.

This reduction has compounding effects across the pipeline:

  • Cost reduction: Fewer findings entering L2 means fewer expensive Opus sessions with multi-turn tool calling. Given that L2 runs at a median of $0.61 per finding, eliminating 60%+ of candidates before L2 creates substantial savings at scale.
  • Throughput increase: Faster processing per finding at L1 combined with fewer findings entering L2 means the system can process more repositories in parallel.
  • Analyst focus: The human-in-the-loop review stage receives only high-confidence findings. The L1 filter is one of several steps that compress 500 raw SAST alerts down to 10–25 actionable, high-confidence true positives.

The FENRIR team characterizes this as a “no-brainer” — the asymmetric cost structure makes L1 optimization one of the highest-leverage improvements available in an AI-assisted vulnerability pipeline.

Actionable Takeaways

  • Design your LLM triage stages with explicit asymmetry: bias toward recall (letting borderline findings through) and reserve precision judgment for the more expensive downstream stage. Define "success" at L1 as zero true positive drop rate, not overall accuracy.
  • Use CWE-specific prompts at triage rather than generic vulnerability questions. If your SAST layer has already classified a finding by CWE class, your LLM triage prompt should ask "can this possibly be CWE-X?" — this narrower framing consistently outperforms broad queries.
  • Benchmark model sizing empirically before committing to a tier. Test both ends of the size spectrum (too small drops true positives; too large wastes cost and throughput). For a recall-biased fast triage task, a mid-tier model like Sonnet is likely the sweet spot — but verify with your own evaluation harness using known true positives as the test set.

Common Pitfalls

  • Using a model that is too small for the triage task. The FENRIR team found that a 0.6B parameter model (Qwen3 0.6B) constantly dropped true positives at L1 — the exact failure mode the stage must never produce. A model without sufficient security domain knowledge will collapse ambiguous borderline cases in the wrong direction, silently discarding real vulnerabilities.
  • Conflating the L1 triage stage with a verification or decision stage. L1 is not making any claim about whether a finding is a real vulnerability. Treating it as one — and raising the precision bar by excluding more findings — risks dropping true positives to save cost, which defeats the purpose of the entire cascade. The cost savings must come from volume reduction, not from accepting false negatives.

L2 Deep Agentic Verification: Sandboxed Reasoning and Force-Reflection

The L2 Stage: Where “Maybe” Becomes Evidence

After the L1 triage prunes over 60% of static analysis findings with a single fast inference call, the survivors reach L2 deep agentic verification — the most expensive, most capable stage in the FENRIR pipeline. This is the stage that turns ambiguous candidates into verified yes/no decisions backed by real execution evidence. As the presenters describe it: “This is where we actually slow down and get serious. We turn a maybe into a clear yes or no with evidence.”

The L2 stage uses Claude Opus as its reasoning engine, chosen deliberately for best-in-class security domain knowledge, reasoning depth, and agentic function-calling capabilities. The agent is deployed directly into an isolated, secure sandbox environment with full execution and write privileges. It can run Bash commands, write Python scripts, and execute them — whatever it needs to do to verify a finding, it can do. The architecture trusts the model to drive its own investigation without pre-allocated context or rigid tool sequences.

Reachability-First Filtering

The first thing the L2 agent does before investing in deep analysis is reachability triage. If the flagged code is unreachable — a dead code path, a documentation comment, a test fixture, or an unused function — it is not an exploitable vulnerability. The presenters are explicit: “If that code is not interesting, a documentation, a test case, unreachable code — that’s not reachable is not interesting from a vulnerability hunt perspective. So we filter that out immediately.”

This reachability check prevents the agent from burning compute on code that could never be triggered by an attacker. It is a quick, logic-based gate that preserves both cost efficiency and analytical quality at the L2 stage.
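In FENRIR the agent performs this check itself, but the underlying idea can be shown with an ordinary graph search. The sketch below assumes a call graph is already available as a dictionary from caller to callees; the function name and the example graph are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List


def reachable(call_graph: Dict[str, List[str]], entry_points: Iterable[str], target: str) -> bool:
    """Breadth-first search: can any entry point reach the flagged function?"""
    seen = set(entry_points)
    queue = deque(seen)
    while queue:
        fn = queue.popleft()
        if fn == target:
            return True
        for callee in call_graph.get(fn, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return False


# Toy call graph: `legacy_exec` is only called from code nothing else calls.
graph = {
    "main": ["handle_request"],
    "handle_request": ["run_cmd"],
    "unused_helper": ["legacy_exec"],
}
```

A finding in `legacy_exec` would be filtered out here, while one in `run_cmd` would proceed to deep analysis.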

Autonomous Context Collection

A key architectural insight in FENRIR’s L2 stage is that the agent collects its own context dynamically, rather than relying on pre-allocated static windows. At L1, context was constrained to 50 lines of code — a deliberate trade-off for speed and cost. At L2, the agent is given the full repository and autonomy to explore it.

The presenters explain why this works: “Since the agents now are super good, they are intelligent, they know domain knowledge, we can just let it roam. We can just ask it and let it collect context on its own. It knows what more context it needs. It knows how to trace data flow. It knows how to use code graph tools to build a call graph. Just let it roam. We don’t need to pre-allocate context.”

The agent autonomously:

  • Traverses call graphs to trace data flows across multiple files
  • Builds its own understanding of the attack surface
  • Uses available tools (Bash, Python execution, code graph utilities) to gather evidence
  • Determines on its own whether additional files or execution traces are needed

The result is multi-turn, multi-step reasoning that mirrors what a skilled human vulnerability researcher would do — but at machine speed and scale.

Force-Reflection: Keeping the Model Honest

One of the most operationally significant lessons from FENRIR’s production deployment is the force-reflection mechanism. Without it, the team observed a consistent failure mode: the model performs shallow reasoning and declares a finding a vulnerability without adequate rigor. The presenters describe this as the model “cheating” — doing just enough to produce a confident-sounding answer without fully interrogating the evidence.

Force-reflection directly counteracts this. The mechanism requires the agent to argue against its own conclusion before finalizing a verdict: “We have built-in multi-step reasoning and built-in reflection to keep the model honest. Sometimes what we are observing is the model kind of cheat. They do a really shallow reasoning and just call it a day. But this built-in reflection really reduces that because it has to fight itself on proving this is not a false positive.”

In practice, this means the agent must:

  1. Produce its initial analysis and tentative verdict
  2. Actively attempt to disprove its own finding — generating the strongest possible counter-argument for why this is not a vulnerability
  3. Resolve the conflict with final, evidence-backed reasoning

This adversarial self-interrogation pattern maps directly to how rigorous security researchers operate. The presenters also demonstrate a related capability in their live demo: a second LLM can be invoked to challenge the first agent’s findings, functioning like a bug bounty program reviewer who scrutinizes a submission by trying to find reasons to reject it. Hallucination mitigation in a production agentic security system requires this kind of structural pressure — the model must earn its confidence score.
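The three-step pattern can be expressed as a fixed sequence of model calls in which the second call is forced to argue the opposite side. This is a sketch of the general pattern, not FENRIR's implementation: `force_reflect`, the prompt wording, and the `ask` callable (any function that sends a prompt to a model and returns text) are assumptions, and a stub model is used so the example runs offline.

```python
from typing import Callable, Dict, List


def force_reflect(finding: str, ask: Callable[[str], str]) -> Dict[str, str]:
    """Run analysis, forced rebuttal, then a final verdict that sees both."""
    analysis = ask("Analyze this finding and give a tentative verdict.\n" + finding)
    rebuttal = ask(
        "Make the strongest case that this finding is NOT a vulnerability.\n"
        "FINDING:\n" + finding + "\nANALYSIS:\n" + analysis
    )
    final = ask(
        "Weigh the analysis against the rebuttal and give a final, "
        "evidence-backed verdict.\n"
        "ANALYSIS:\n" + analysis + "\nREBUTTAL:\n" + rebuttal
    )
    return {"analysis": analysis, "rebuttal": rebuttal, "final": final}


# Stub model: records each prompt and returns a numbered placeholder response.
prompts_seen: List[str] = []


def fake_model(prompt: str) -> str:
    prompts_seen.append(prompt)
    return f"response-{len(prompts_seen)}"


result = force_reflect("os.system(user_input) in app/run.py:10", fake_model)
```

Passing a different `ask` callable for the rebuttal step would give the second-model reviewer variant mentioned above.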

Cost Model: $0.61 Median, $8.80 Per True Positive

L2 is expensive by design. The cost reflects the quality of what it produces. The presenters provide specific production numbers:

  • Median cost per finding: $0.61, corresponding to roughly 100,000 tokens consumed
  • Cost per true positive: approximately $8.80
  • Complex cases (multi-file data flow in large repositories) can easily scale to over 1 million tokens

At $8.80 per verified, high-confidence true positive with a complete disclosure package, FENRIR’s L2 stage is dramatically more cost-effective than having a human researcher manually investigate every raw static analysis alert. The cascade architecture is what makes this viable — by the time a finding reaches L2, it has already survived SAST correlation and L1 fast triage. The L2 agent is not working through 500 noisy alerts; it is working through the small subset of candidates that have already earned their way through two prior filtering stages.
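The two headline metrics measure different things: the median describes a typical L2 run, while cost per true positive divides total L2 spend (including every run that ended in a "no") by the number of confirmed findings. The sketch below shows how both could be computed from per-run records; the `l2_cost_summary` name and the sample figures are invented for illustration and are not FENRIR's data.

```python
from statistics import median
from typing import Dict, List, Optional, Tuple


def l2_cost_summary(runs: List[Tuple[float, bool]]) -> Dict[str, Optional[float]]:
    """runs: (cost_in_usd, verified_true_positive) for each L2 session."""
    costs = [cost for cost, _ in runs]
    true_positives = sum(1 for _, is_tp in runs if is_tp)
    total = sum(costs)
    return {
        "median_cost": median(costs) if costs else None,
        "total_cost": total,
        # Rejected findings still cost money, so they are included in the total.
        "cost_per_true_positive": total / true_positives if true_positives else None,
    }


# Invented sample: four L2 sessions, one confirmed vulnerability.
sample_runs = [(0.25, False), (0.50, True), (0.75, False), (2.50, False)]
summary = l2_cost_summary(sample_runs)
```

In this made-up sample the median is well below the cost per true positive, which is the same shape as the reported $0.61 versus $8.80.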

The cost asymmetry between L1 and L2 is intentional and structural. As the presenters note: “50 lines of code instead of 12 turns of Opus — that is really a no-brainer.” L1 handles volume; L2 handles depth.

Auto-Generated Disclosure Packages and Human-in-the-Loop

When L2 produces a verified positive finding, it does not simply output a boolean. It generates a complete disclosure package that includes:

  • Full vulnerability report with specific lines of code
  • Counter-argument (what would make this a false positive)
  • Exploitation impact assessment
  • Recommended remediation
  • Crafted PoC narrative

This package is what the human-in-the-loop reviewer receives. Their role in the workflow is sharply reduced: they verify exploitability, review severity, and submit to the vendor. They are not triaging 500 raw findings — they are reviewing 10 to 20 high-confidence, pre-analyzed reports with all supporting evidence already assembled.
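One way to picture the package is as a structured record whose fields mirror the list above. The class and field names here are hypothetical, chosen for illustration; the talk does not describe FENRIR's internal schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DisclosurePackage:
    report: str                  # full vulnerability write-up
    vulnerable_lines: List[str]  # e.g. ["server/run.py:41-47"] (format is an assumption)
    counter_argument: str        # what would make this a false positive
    impact: str                  # exploitation impact assessment
    remediation: str             # recommended fix
    poc_narrative: str           # crafted PoC narrative
    confidence: str = "medium"   # e.g. "medium" or "high"

    def missing_sections(self) -> List[str]:
        """Return the names of sections that are empty or whitespace-only."""
        text_fields = ("report", "counter_argument", "impact",
                       "remediation", "poc_narrative")
        missing = [name for name in text_fields if not getattr(self, name).strip()]
        if not self.vulnerable_lines:
            missing.append("vulnerable_lines")
        return missing
```

A completeness check like `missing_sections()` lets the hand-off reject a package before it reaches the reviewer, so the human never receives a bare verdict.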

The presenters frame this explicitly: “The human can just verify, validate, verify exploitability, review severity, and submit to the vendor.” FENRIR is designed as a force multiplier for human researchers, not a replacement. The L2 stage maximizes the quality of what the human sees; the human provides final judgment and accountability.

The dynamic confidence scoring system also feeds into this hand-off. Each verified finding carries a confidence level (e.g., medium, high) derived from the agent’s multi-turn reasoning. Researchers can interact directly with the L2 agent through a conversational interface — treating the LLM as a security researcher colleague — allowing humans to interrogate the reasoning before accepting a finding.

Weighted Context Allocation and Dynamic Priority Scoring

Two additional mechanisms operate within and around the L2 stage. The weighted context generation algorithm dynamically allocates tokens across different parts of the codebase based on relevance signals from earlier pipeline stages. Rather than spreading context budget uniformly across an entire repository, tokens are concentrated where the evidence points — maximizing analysis quality per dollar spent.
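The talk does not spell out the weighting algorithm, so the following is only a sketch of the general idea: split a fixed token budget across files in proportion to a relevance score. The function name, the score values, and the proportional rule are assumptions; scores are assumed to be non-negative.

```python
from typing import Dict


def allocate_tokens(relevance: Dict[str, float], budget: int) -> Dict[str, int]:
    """Split `budget` tokens across files in proportion to their relevance."""
    total = sum(relevance.values())
    if total <= 0 or budget <= 0:
        return {path: 0 for path in relevance}
    # Integer share for each file, rounded down.
    shares = {path: int(budget * score / total) for path, score in relevance.items()}
    # Hand any rounding leftover to the most relevant file.
    top = max(relevance, key=relevance.get)
    shares[top] += budget - sum(shares.values())
    return shares


# Hypothetical relevance signals from earlier pipeline stages.
plan = allocate_tokens(
    {"handlers/exec.py": 6.0, "utils/shell.py": 3.0, "README.md": 1.0},
    budget=100_000,
)
# plan == {"handlers/exec.py": 60000, "utils/shell.py": 30000, "README.md": 10000}
```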

The dynamic priority scoring engine surfaces the most severe and most critical vulnerabilities for immediate disclosure. FENRIR focuses exclusively on high and critical CVEs — medium-severity findings are deliberately excluded. As the presenters state: “Those are all CVEs that are high or critical. We don’t really waste our time with the medium level severities. We use the other like script kiddies to find those. We’re looking for meaningful security impact.”

This severity-gating is not just a business preference — it is an architectural constraint that keeps human analyst bandwidth focused on findings with genuine organizational risk.
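Severity gating plus prioritization can be expressed in a few lines. The ranking tables and the sort key below are illustrative assumptions, not FENRIR's scoring formula: findings below the gate are dropped, and the rest are ordered by severity and then by confidence.

```python
from typing import Dict, List

SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}
CONFIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}


def disclosure_queue(findings: List[Dict[str, str]],
                     min_severity: str = "high") -> List[Dict[str, str]]:
    """Drop findings below the severity gate, then sort most urgent first."""
    gate = SEVERITY_RANK[min_severity]
    kept = [f for f in findings if SEVERITY_RANK[f["severity"]] >= gate]
    return sorted(
        kept,
        key=lambda f: (SEVERITY_RANK[f["severity"]], CONFIDENCE_RANK[f["confidence"]]),
        reverse=True,
    )


queue = disclosure_queue([
    {"id": "A", "severity": "medium", "confidence": "high"},
    {"id": "B", "severity": "high", "confidence": "medium"},
    {"id": "C", "severity": "critical", "confidence": "medium"},
    {"id": "D", "severity": "high", "confidence": "high"},
])
# Order of ids: C, D, B (A is gated out as medium severity)
```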

NVIDIA Isaac Robotics Framework: Command Injection Zero-Day Found by FENRIR

Proof of Concept

  1. Signal Collection via SAST Cascade: FENRIR ingested the NVIDIA Isaac robotics framework repository and ran it through the four-tool static analysis chain. YARA-X performed the initial broad sweep, flagging signals at speed. Semgrep applied more precise pattern matching, and CodeQL conducted deeper data-flow and taint analysis to track how user-controlled input flowed toward dangerous sinks.

  2. Multi-Scanner Correlation: When multiple tools independently flagged overlapping findings in approximately the same code location (within a ±15 line window) and attributed them to the same CWE class — consistent with a command injection pattern (CWE-77 or CWE-78) — the multi-scanner correlation algorithm elevated the confidence score.

  3. L1 LLM Triage (False-Positive Pre-Filter): The correlated finding was passed to the L1 triage stage, where Claude Sonnet evaluated a pre-allocated 50-line context window around the flagged code. The CWE-specific prompt asked whether the pattern could plausibly represent a command injection flaw — not whether it was definitively exploitable. The recall-biased judgment allowed the finding to pass through.

  4. L2 Deep Agentic Verification in Isolated Sandbox: The surviving candidate was routed to the L2 stage. Claude Opus was deployed into an isolated sandbox with full execution and write privileges. The agent autonomously collected broader code context, traced data-flow paths from user-controlled input sources to shell execution sinks, and wrote and executed Python and Bash scripts as needed to validate reachability and exploitability.

  5. Force-Reflection Anti-Hallucination Check: FENRIR’s built-in force-reflection mechanism required the agent to argue against its own conclusion — challenging whether the finding was truly exploitable versus a false positive. The agent resolved this internal adversarial challenge before issuing a final verdict.

  6. Verdict and Auto-Generated Disclosure Package: After passing force-reflection, the L2 agent produced a verified true positive with a complete auto-generated vulnerability report — including the specific vulnerable code lines, the injection mechanism, assessed impact, recommended remediation, and an exploitability analysis. This package was surfaced to a human researcher for final review and severity validation.

  7. Vendor Submission and Bidirectional Intelligence Loop: The human researcher validated the finding and submitted the command injection CVE to NVIDIA. Upon patch release, FENRIR’s n-day tracking component (23 autonomous LangGraph agents on the Mimir platform) immediately pivoted to the advisory page and the exact patch commit, re-scanning the repository. This re-scan uncovered two additional vulnerabilities: one newly introduced by the patch itself, and one representing a patch bypass for the original command injection flaw.
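The correlation rule in step 2 can be sketched as follows. This is a simplified reading, not the production algorithm: it requires an exact CWE match rather than a CWE class, anchors each cluster at its first line and admits signals within 15 lines of that anchor, and keeps only clusters flagged by at least two distinct tools. The signal dictionary layout is an assumption.

```python
from collections import defaultdict
from typing import Dict, List

WINDOW = 15  # line window quoted in the talk


def correlate(signals: List[Dict], window: int = WINDOW) -> List[List[Dict]]:
    """Cluster signals sharing file and CWE; keep clusters seen by 2+ tools."""
    by_key = defaultdict(list)
    for s in signals:
        by_key[(s["file"], s["cwe"])].append(s)

    clusters = []
    for group in by_key.values():
        group.sort(key=lambda s: s["line"])
        current = [group[0]]
        for s in group[1:]:
            # Stay in the cluster while within `window` lines of its first signal.
            if s["line"] - current[0]["line"] <= window:
                current.append(s)
            else:
                clusters.append(current)
                current = [s]
        clusters.append(current)

    return [c for c in clusters if len({s["tool"] for s in c}) >= 2]


hits = correlate([
    {"tool": "semgrep", "file": "run.py", "line": 100, "cwe": "CWE-78"},
    {"tool": "codeql", "file": "run.py", "line": 110, "cwe": "CWE-78"},
    {"tool": "yara-x", "file": "run.py", "line": 300, "cwe": "CWE-78"},
])
# One cluster survives: the semgrep and codeql signals at lines 100 and 110.
```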

Patch Bypass Discovery: How FENRIR Found Two New Bugs in a Vendor-Patched CVE

Proof of Concept

  1. Initial zero-day discovery and disclosure: FENRIR’s three-stage pipeline identified a command injection vulnerability in the NVIDIA Isaac robotics framework. The finding was submitted through the standard vendor triage process to the NVIDIA security team.

  2. Automatic n-day tracking via Mimir: Upon submission, FENRIR’s companion platform Mimir began tracking the open vulnerability as an active n-day. No manual intervention was required; the bidirectional intelligence feedback loop between FENRIR and Mimir handled the handoff automatically.

  3. Patch release triggers autonomous re-scan: When NVIDIA released the patch for the original CVE, 23 autonomous LangGraph agents running inside Mimir immediately activated. These agents navigated directly to the vendor’s security advisory page and to the exact code commit that implemented the fix — without human direction.

  4. Targeted re-analysis of the patched code: The 23 agents re-scanned the updated repository, focusing analysis on the code changed by the patch commit. This targeted re-scan is more efficient than a full-repository sweep because the diff surface is constrained to the vendor’s remediation changes.

  5. FENRIR surfaces two new vulnerabilities from the patched codebase: The re-scan results were fed back into FENRIR’s pipeline. FENRIR’s L1 and L2 triage stages processed the new findings and identified two distinct issues:

    • A newly introduced bug: A vulnerability that did not exist before the patch — inadvertently introduced by the vendor’s own remediation code.
    • A patch bypass: A vulnerability demonstrating that the original CVE’s root cause was not fully addressed, allowing an attacker to achieve the same or equivalent impact via a different code path the patch left unprotected.
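Step 4's idea of constraining analysis to the patch diff can be illustrated by extracting the changed line ranges from a unified diff. This is a generic sketch, not Mimir's implementation: it reads only `+++` file headers and `@@` hunk headers, and it does not handle every corner of the diff format (for example, an added line whose content itself begins with `++ `).

```python
import re
from typing import Dict, List, Tuple

# Hunk header, e.g. "@@ -10,3 +10,5 @@"; the "+" side describes the patched file.
HUNK = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@")


def changed_ranges(diff_text: str) -> Dict[str, List[Tuple[int, int]]]:
    """Map each patched file to the (start, end) line ranges its hunks cover."""
    ranges: Dict[str, List[Tuple[int, int]]] = {}
    current = None
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            path = line[4:].strip()
            if path == "/dev/null":  # file deleted by the patch
                current = None
                continue
            current = path[2:] if path.startswith("b/") else path
            ranges.setdefault(current, [])
        elif current is not None:
            m = HUNK.match(line)
            if m:
                start = int(m.group(1))
                length = int(m.group(2)) if m.group(2) is not None else 1
                if length > 0:
                    ranges[current].append((start, start + length - 1))
    return ranges
```

Findings whose file and line fall inside these ranges (or within some margin of them) would then be the first candidates for re-triage.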

Actionable Takeaways

  • Implement force-reflection as a structural prompt pattern in any agentic security reasoning system: require the model to produce and resolve a counter-argument before finalizing a verdict. This single mechanism materially reduces hallucinated confidence scores and shallow-reasoning failures in production.
  • Apply reachability-first filtering before committing expensive multi-turn LLM inference to a candidate finding. Dead code, test fixtures, and unreachable paths should be eliminated at the start of L2 triage — not after the agent has spent compute tracing a path that could never be triggered by an attacker.
  • Design your agentic verification stage around autonomous context collection rather than pre-allocated context windows. Provide the agent with full execution privileges in an isolated sandbox and let it drive its own investigation — call graphs, data flow traces, and multi-file traversal are capabilities modern reasoning models apply correctly when given the tools and latitude to do so.
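The reachability-first takeaway can be sketched as a breadth-first walk over a call graph, discarding findings in functions no entry point can reach. The graph representation, the function names, and the `function` field on each finding are assumptions for illustration; a real pipeline would derive the graph from its static analysis tooling.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def reachable(call_graph: Dict[str, List[str]], entry_points: Iterable[str]) -> Set[str]:
    """Return every function reachable from the entry points (BFS)."""
    seen = set(entry_points)
    queue = deque(seen)
    while queue:
        fn = queue.popleft()
        for callee in call_graph.get(fn, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


def drop_unreachable(findings: List[Dict], call_graph: Dict[str, List[str]],
                     entry_points: Iterable[str]) -> List[Dict]:
    """Keep only findings located in functions an attacker-facing entry can reach."""
    live = reachable(call_graph, entry_points)
    return [f for f in findings if f["function"] in live]


graph = {
    "main": ["parse", "run"],
    "run": ["shell_exec"],
    "test_helper": ["shell_exec", "debug_dump"],  # only called from tests
}
kept = drop_unreachable(
    [{"id": 1, "function": "shell_exec"}, {"id": 2, "function": "debug_dump"}],
    graph,
    entry_points=["main"],
)
# kept contains only finding 1; debug_dump is unreachable from main.
```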

Common Pitfalls

  • Allowing the agent to perform shallow reasoning without adversarial self-interrogation. Without a built-in force-reflection or counter-argument step, LLM agents in vulnerability verification tasks consistently produce overconfident verdicts based on surface-level pattern matching rather than execution-verified evidence. This directly inflates false positive rates at the most expensive stage of the pipeline.
  • Routing all findings to the heavyweight L2 stage without a cost-efficient upstream filter. The L2 stage's $0.61 median cost per finding is only economically viable because L1 has already eliminated 60%+ of candidates. Skipping L1 and sending raw SAST output directly to an agentic Opus session would multiply costs by 2–3x while flooding the agent with noise that degrades overall accuracy.
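The 2–3x figure follows from simple arithmetic. If L1 removes a fraction p of candidates, then skipping L1 sends 1 / (1 - p) times as many findings to L2. The sketch below assumes the per-finding L2 cost is the same for findings L1 would have removed and ignores the cost of L1 itself; both are simplifications.

```python
def l2_cost_multiplier(l1_elimination_rate: float) -> float:
    """How many times more findings L2 must process if L1 is skipped."""
    if not 0 <= l1_elimination_rate < 1:
        raise ValueError("elimination rate must be in [0, 1)")
    return 1 / (1 - l1_elimination_rate)


for rate in (0.5, 0.6, 2 / 3):
    print(f"{rate:.0%} eliminated -> {l2_cost_multiplier(rate):.1f}x L2 load")
# 50% eliminated -> 2.0x L2 load
# 60% eliminated -> 2.5x L2 load
# 67% eliminated -> 3.0x L2 load
```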

Production Results, Lessons Learned, and the Future of AI-Driven Vulnerability Research

Verified Production Metrics: What FENRIR Delivers at Scale

Since deploying FENRIR to production, the Trend Micro AI Zero Day Initiative team has measured four concrete outcomes that validate the AI zero-day vulnerability discovery pipeline design. These are not projected or theoretical — they reflect real results from a system actively submitting CVEs to vendors:

  • 2.5x more vulnerabilities discovered compared to the pre-FENRIR baseline — the same team, finding significantly more high and critical bugs in the same time window.
  • 80% reduction in false positive rate — analysts are no longer buried in noise. The cascade from 500 raw static analysis findings down to 250 post-L1, and finally to 10–25 high-confidence true positives, is working as designed.
  • 70% faster disclosure rate — because auto-generated vulnerability reports with full impact, PoC, and disclosure package are produced by the L2 agent, human reviewers skip straight to validation and vendor submission.
  • 3x overall team productivity increase — the combined effect of fewer false positives, faster triage, and automated documentation generation means the team operates at a fundamentally different leverage point.

As of the talk, FENRIR has submitted over 60 CVEs — all high or critical severity. There are 100+ more in ZDI pre-disclosure and 3,000 pending review. The deliberate decision to ignore medium-severity findings means every CVE represents a meaningful security impact.

Live Demo: FENRIR Scanning the AON UI Repository — 122 Findings to 21 True Positives

Proof of Concept

  1. Scan initiation: At the beginning of the talk, the speakers launched a live FENRIR scan against the AON UI repository from their local demo console. The scan ran autonomously in the background while the presentation continued — demonstrating that the pipeline operates asynchronously and does not require analyst attention during execution.

  2. Static analysis phase (L0 — signal collection): FENRIR’s four-tool SAST cascade ran first. YARA-X performed the broadest and fastest sweep, generating initial signals across millions of lines of code in seconds. Semgrep and CodeQL followed with progressively deeper analysis — taint tracking and data-flow analysis — to refine the candidate set. The cascade produced 122 raw findings as its output. No LLM tokens were consumed during this phase.

  3. L1 LLM triage (false-positive pruning): The 122 raw findings were routed to FENRIR’s fast L1 triage stage. Claude Sonnet evaluated each finding using a pre-allocated 50-line code context window and CWE-specific prompts. In this scan run, the L1 stage filtered out 37 findings, reducing the candidate set from 122 to 85. The speakers noted this was lower than typical performance — normally L1 eliminates over 60% — indicating the AON UI scan was a harder-than-average case with denser true signal.

  4. L2 deep agentic verification (sandbox reasoning): The surviving candidates were forwarded to the L2 deep agentic triage stage. Claude Opus was deployed into an isolated secure sandbox with full execution and write privileges. The agent autonomously collected code context, traced data flows using call-graph tools, executed bash commands and Python scripts, and applied force-reflection. This stage produced 21 verified true positives from the 85 L1 survivors.

  5. Result structure — what the console showed: Each verified finding was presented with:
    • Specific vulnerable lines of code identified within the AON UI repository
    • Detailed reasoning explaining why the code is exploitable
    • Counter-argument (the force-reflection output — what the agent considered as reasons it might NOT be a vulnerability)
    • Recommendation (suggested remediation or mitigation)
    • Impact assessment (severity and attack surface implications)
    • Example: one finding type mentioned was a path traversal driver vulnerability, illustrating that the true positives included high/critical severity classes

  6. Interactive LLM window: The console also surfaced a chat interface allowing human analysts to interact directly with the L2 agent. One technique proved highly effective at pruning remaining false positives before final human review: having a separate LLM attempt to disprove each finding, mirroring a bug bounty program’s triage scrutiny.

  7. Pipeline throughput summary:
    • Raw SAST findings: 122
    • Post-L1 triage (false positives pruned): 85 (37 eliminated — ~30% reduction; below typical 60%+ due to high signal density in this repo)
    • Verified true positives (post-L2 agentic verification): 21
    • Final signal-to-noise ratio: approximately 1-in-6 raw findings were true positives — all high or critical severity
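The throughput summary can be checked directly from the three counts reported in the demo:

```python
raw, post_l1, true_positives = 122, 85, 21  # counts from the live demo

l1_eliminated = raw - post_l1
print(f"L1 eliminated {l1_eliminated} findings ({l1_eliminated / raw:.0%})")
# L1 eliminated 37 findings (30%)

print(f"L2 confirmed {true_positives / post_l1:.0%} of L1 survivors")
# L2 confirmed 25% of L1 survivors

print(f"1 in {raw / true_positives:.1f} raw findings was a true positive")
# 1 in 5.8 raw findings was a true positive
```

The middle figure is worth noting: roughly three out of four candidates reaching the expensive stage were still rejected in this run, which is the work force-reflection and sandbox verification are there to do.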

LangChain CVE Patched Mid-Talk: Real-Time Zero-Day Disclosure in Action

Proof of Concept

  1. FENRIR targets the AI ecosystem itself: Rather than focusing exclusively on traditional software, the FENRIR team intentionally pointed their automated zero-day discovery engine at AI frameworks and components — including LangChain — because “the attack surface for AI components is absolutely massive.”

  2. Autonomous pipeline runs without human intervention: The FENRIR pipeline — static analysis cascade, L1 LLM triage, and L2 deep agentic sandbox verification — runs autonomously in production. No manual triggering or oversight was required for the LangChain scan to progress from initial signal collection through to a fully verified, high-confidence CVE finding.

  3. CVE submitted and tracked through disclosure: Once FENRIR’s L2 stage verified the LangChain vulnerability and generated a disclosure package, the finding was submitted to the vendor through the ZDI pre-disclosure process. The team submitted the CVE and began tracking it using the Mimir n-day research platform.

  4. Real-time patch alert received on stage: While Girnus and Chen were presenting, they received a live alert that the LangChain CVE had been patched. They announced it from the stage: “As we’re up here talking, we just got an alert that one of our CVEs in LangChain got patched. So as we’re talking, FENRIR is doing work for us.”

  5. Bidirectional intelligence loop activated post-patch: Per FENRIR’s documented architecture, a vendor patch event triggers Mimir’s autonomous LangGraph agents to analyze the advisory page and the exact code commit. That analysis can surface patch bypass vulnerabilities or newly introduced bugs, as demonstrated with the NVIDIA Isaac robotics framework CVE.

Hard-Won Lessons from a Year of Production Operation

Model sizing matters more than most teams realize. The L1 triage lesson applies broadly: a 0.6B parameter model was tested as the fast filter and failed — it constantly dropped true positives. Opus works but wastes resources at L1 and slows throughput. Sonnet is the right tool for L1. At L2, Opus is justified because you need the best security domain knowledge, the best reasoning, and the best agentic capabilities for a stage that costs $0.61 median per finding regardless.

Retrospective analysis closes the loop. The team keeps records of all pipeline runs — every LLM interaction is stored as potential future context for improving evaluation harnesses and reinforcing static analysis rule sets. When asked about retrospective analysis to catch true positives filtered out early, the answer was clear: the system actively tracks what was dropped and uses that data to tune upstream rule sets. This read-measure-improve cycle is what keeps recall high over time.
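A minimal version of this record-keeping could look like the sketch below, which uses SQLite from the Python standard library. The table layout, function names, and rule identifiers are hypothetical; the talk does not describe the team's storage format. The idea is to log every keep/drop decision, mark dropped findings later confirmed as real, and count misses per upstream rule.

```python
import sqlite3


def open_log(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS decisions ("
        "finding_id TEXT, rule_id TEXT, stage TEXT, kept INTEGER, "
        "later_confirmed INTEGER DEFAULT 0)"
    )
    return db


def record(db, finding_id: str, rule_id: str, stage: str, kept: bool) -> None:
    """Log one keep/drop decision made by a pipeline stage."""
    db.execute(
        "INSERT INTO decisions (finding_id, rule_id, stage, kept) VALUES (?, ?, ?, ?)",
        (finding_id, rule_id, stage, int(kept)),
    )


def mark_confirmed(db, finding_id: str) -> None:
    """Flag a finding that later turned out to be a real vulnerability."""
    db.execute(
        "UPDATE decisions SET later_confirmed = 1 WHERE finding_id = ?", (finding_id,)
    )


def missed_by_rule(db):
    """Count dropped-but-real findings per rule, worst offenders first."""
    return db.execute(
        "SELECT rule_id, COUNT(*) FROM decisions "
        "WHERE kept = 0 AND later_confirmed = 1 "
        "GROUP BY rule_id ORDER BY COUNT(*) DESC, rule_id"
    ).fetchall()
```

Rules that appear at the top of `missed_by_rule()` are the ones whose prompts or patterns most need retuning.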

Extended thinking levels are on the roadmap. The architecture was designed before extended thinking capabilities existed, and “high async level” is currently used. Extended thinking level tradeoffs are explicitly slated for FENRIR 2.0 — a sign that the team is watching model capability frontiers carefully and building an architecture flexible enough to adopt them.

The system has survived multiple LLM generations. Context windows have expanded from constrained early-2025 limits to over 1 million tokens, and FENRIR has adapted through each generation. The cascade architecture’s key advantage is that it doesn’t depend on any single model: the orchestration logic and signal collection layers remain stable while the LLM components can be upgraded as better models ship.
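One common way to get this kind of model independence is to have the orchestration code depend on a narrow interface rather than a vendor client. The sketch below is an assumption about how such a seam could look, not a description of FENRIR's code; `Model`, `StubModel`, and `Cascade` are hypothetical names.

```python
from dataclasses import dataclass, replace
from typing import Protocol


class Model(Protocol):
    """Anything that turns a prompt into text can fill an LLM slot."""
    def complete(self, prompt: str) -> str: ...


@dataclass(frozen=True)
class StubModel:
    name: str

    def complete(self, prompt: str) -> str:
        # Placeholder behavior; a real adapter would call a model API here.
        return f"{self.name} saw {len(prompt)} chars"


@dataclass(frozen=True)
class Cascade:
    l1: Model  # fast triage slot
    l2: Model  # deep verification slot

    def verify(self, finding: str) -> str:
        return self.l2.complete(finding)


cascade = Cascade(l1=StubModel("fast-model"), l2=StubModel("deep-model-v1"))
# Upgrading one stage leaves the rest of the pipeline untouched.
upgraded = replace(cascade, l2=StubModel("deep-model-v2"))
```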

Targeting AI to Secure AI: The Logical Next Attack Surface

One of the most strategically significant points in the talk: the team deliberately pointed FENRIR at the AI ecosystem itself. AI components (model serving frameworks, agent orchestration libraries, vector databases, MCP servers, robotics control systems) are being deployed rapidly into high-value infrastructure, their attack surface is massive, and traditional security audits aren’t keeping pace.

The NVIDIA Isaac robotics framework command injection CVE is one example. The LangChain CVE patched live during the talk is another. Using AI to find vulnerabilities in AI is not just clever positioning; it reflects where the highest-density vulnerability surface currently exists.

FENRIR 2.0: What’s Coming Next

Organizational-level scanning. The current focus is open-source repositories, but the next frontier is vulnerability discovery across entire organizations. Organizations have their own security boundaries, their own quirky internal frameworks, and their own trust assumptions between components — all of which represent attack surface that per-repo scanning misses.

Memory corruption bugs via adversarial LLM interrogation. Memory corruption was deliberately deprioritized at launch because early context window sizes made it impractical. With models like Claude Opus 4.6 offering dramatically expanded context and stronger reasoning, the team can now use the LLM as an adversarial interrogation component — actively probing heap layout assumptions, integer overflow conditions, and use-after-free paths in ways that weren’t feasible on earlier models.

Extended thinking level evaluation. Model-level thinking configuration — budget tokens, extended reasoning chains — is explicitly on the evaluation list for FENRIR 2.0 now that these capabilities are mainstream.

The core insight remains constant: AI zero-day discovery at scale requires a cascade architecture that separates cheap signal collection from expensive reasoning, keeps humans in the loop for high-confidence final decisions, and treats every pipeline run as data for improving the next iteration.

Actionable Takeaways

  • Track all pipeline runs and LLM outputs as a permanent record — use this data retroactively to tune upstream static analysis rule sets, identify any true positives that leaked out early, and build evaluation harnesses that prevent regression as models and rules evolve.
  • Point your AI-assisted vulnerability tooling at AI ecosystem components (model serving frameworks, agent orchestration libraries, vector databases) — this is currently the highest-density vulnerability surface and the least-audited attack area in most organizations.
  • Design your agentic triage architecture to be model-agnostic at the LLM layer — the cascade orchestration and signal collection must remain stable across model generations so you can upgrade LLM components as capabilities improve without rebuilding the pipeline.

Common Pitfalls

  • Treating medium-severity findings as worth investigating in an AI-driven pipeline wastes analyst time and dilutes the signal-to-noise ratio. FENRIR explicitly filters for high and critical only — medium vulnerabilities are discoverable by less expensive means and should be handled separately or deprioritized.
  • Assuming that once a pipeline is tuned for current model capabilities it will stay optimal — context window sizes, thinking levels, and reasoning capabilities are changing rapidly. A production system that doesn't continuously evaluate new model generations against its own evaluation harness will fall behind and begin missing bugs that newer models would catch.

Conclusion

FENRIR demonstrates that production-grade AI zero-day discovery is achievable today — not as a research prototype but as a system actively submitting high/critical CVEs to vendors at scale. The key is architectural discipline: three separate stages with explicit cost gates, each optimized for its role rather than trying to do everything in one pass. The SAST cascade handles volume cheaply, L1 triage handles noise reduction with recall-bias, and L2 agentic verification handles depth with force-reflection and autonomous context collection.

The lessons from a year of production operation are immediately applicable to any team building AI-assisted security tooling: match model capability to stage requirements rather than defaulting to the largest available model everywhere, treat the human-in-the-loop not as a bottleneck but as the final accountability layer that makes the system trustworthy, and point the tooling at AI ecosystem components — where the attack surface density currently exceeds what traditional auditing can cover.

For deeper context, see the topic pages on this site covering the AI security threat landscape, which makes FENRIR’s target selection strategically sound, and security automation patterns that complement cascade pipeline design; both collect coverage from related talks. For the vulnerability research methodology underpinning FENRIR’s disclosure process, see vulnerability research.


References & Tools

  1. YARA-X — Next-generation YARA implementation; used as the fastest broad-sweep pre-filter in FENRIR's static analysis cascade.
  2. Semgrep — Syntax-aware static analysis tool; used after YARA-X filtering for higher-precision pattern matching in the SAST cascade.
  3. CodeQL — Data-flow and taint analysis engine from GitHub; performs deep inter-procedural tracking of untrusted input to dangerous sinks.
  4. SpotBugs — Static analysis for Java bytecode; used via the FindSecBugs extension to cover compiled Java artifacts in FENRIR's SAST layer.
  5. Claude Opus — Anthropic's most capable model; powers FENRIR's L2 deep agentic verification stage for best-in-class security domain knowledge and multi-step reasoning.
  6. LangGraph — Agent orchestration framework; powers the 23 autonomous Mimir agents that automatically track vendor patches and trigger re-scans.
  7. Model Context Protocol (MCP) — Specification for tool-calling interfaces in LLM agent architectures; FENRIR originated as an MCP server component.
  8. Claude Sonnet — Anthropic's mid-tier model; used at FENRIR's L1 triage stage as the fast, capable model that eliminates 60%+ of false positives at a fraction of Opus cost.

Frequently asked

Questions from the audience

What is the FENRIR pipeline and how does it work?
FENRIR is a three-stage cascade pipeline that starts with traditional SAST tools (YARA-X, Semgrep, CodeQL, SpotBugs) to collect vulnerability signal cheaply, then applies a fast LLM triage layer (Claude Sonnet) to eliminate 60%+ of false positives with a single inference call, and finally routes survivors to a heavyweight agentic stage (Claude Opus in an isolated sandbox) that produces verified findings with complete disclosure packages.
How much does FENRIR cost per verified vulnerability?
FENRIR's L2 agentic stage costs a median of $0.61 per finding processed, and approximately $8.80 per verified true positive. This cost is economically viable because the upstream SAST and L1 triage stages eliminate over 60% of candidates before any expensive Opus inference occurs — so the $0.61 median applies only to the small set of candidates that have already survived two prior filters.
What is force-reflection and why does it matter for AI vulnerability research?
Force-reflection is a structural prompt mechanism that requires the L2 agent to argue against its own conclusion before issuing a final verdict. Without it, LLM agents in vulnerability verification tasks tend to perform shallow reasoning and produce overconfident verdicts. By forcing the agent to generate the strongest counter-argument for why a finding is NOT a vulnerability, force-reflection materially reduces hallucinated confidence scores and false positive rates at the most expensive pipeline stage.
Why did FENRIR target AI ecosystem components like LangChain and NVIDIA Isaac?
The FENRIR team deliberately pointed their automated pipeline at AI frameworks and components because the attack surface for AI infrastructure (model serving frameworks, agent orchestration libraries, vector databases, robotics control systems) is massive and largely unaudited by traditional security reviews. AI components are being deployed rapidly into high-value systems, making them a high-density target for automated zero-day discovery.
Watch on YouTube
FENRIR: AI Hunting for AI Zero-Days at Scale | [un]prompted 2026
Peter Girnus, Derek Chen · 25 min