The Cyber Archive

Code Is Free: Securing Software | [un]prompted 2026

Learn how OpenAI engineers built LLM-powered security reviewers, living threat models, and a daily dependency scanner using ~40 lines of GitHub Actions YAML and checked-in Markdown files.


Paul McMillan and Ryan Lopopolo presenting talk - Code Is Free: Securing Software at unprompted 2026

Most teams know what good security looks like — they just never have enough engineering hours to enforce it everywhere. AI-powered security guardrails in CI/CD pipelines are changing that equation: instead of waiting for a quarterly vendor review or a manual audit, you can run a bespoke LLM security reviewer on every single pull request for roughly the cost of a few API tokens. Paul McMillan and Ryan Lopopolo from OpenAI demonstrated exactly that at [un]prompted 2026, showing how their team built a million-line codebase — 250,000 lines of which are prompts — with near-zero manual security overhead.

This post breaks down the “code is free” philosophy they presented: how to turn threat models into checked-in living documents, how to wire LLM agents into your CI pipeline with ~40 lines of YAML, and how an afternoon of work produced a daily dependency scanner covering 1,500 packages. If you’ve been putting supply chain hardening and automated security reviews below the line, this is the practical playbook to move them above it.

Key Takeaways

  • You'll learn how to encode your organization's security expertise directly into the codebase as living documents and executable guardrails — making that knowledge available to every engineer and coding agent on every PR, not just during periodic reviews.
  • You'll be able to build and deploy custom LLM-powered security reviewers inside your existing CI/CD pipeline in under an afternoon, eliminating the need for expensive vendor contracts or complex integrations.
  • You'll be able to apply the AI-assisted dependency scanning approach to get daily, actionable risk profiles of your entire dependency tree, including transitive dependencies, and feed those findings into automated upgrade or removal workflows.

Why Traditional Security Tooling Falls Short — And Why AI Changes the Equation

The Vendor Trap: Why Expensive Security Programs Still Leave You Exposed

DevSecOps and AI-powered security guardrails in CI/CD didn’t emerge in a vacuum — they’re a direct response to the chronic failure mode of traditional vendor-driven security programs. Paul McMillan, security engineer at OpenAI, opens with a blunt diagnosis that most practitioners will recognize immediately: you identify a need, you evaluate a handful of options, you spend months integrating one into your environment, you pay a vendor a lot of money, and then you spend even more time pushing that integration everywhere else. Months later, the question is whether you’re actually getting value. The honest answer, as McMillan frames it, is “probably some” — compliance teams are satisfied, but the tooling is stiff, inflexible, and full of special edge cases that vendor feature requests won’t resolve for quarters at minimum.

The Real Bottleneck Is Not Knowledge — It’s Engineering Time

What makes this problem particularly frustrating is that the knowledge gap isn’t the issue. Security teams already have a long list of best practices they want to enforce. There are endless lists of what good looks like, and innumerable vendors willing to sell solutions that are — as McMillan notes — almost as much work to integrate as building the thing yourself. This is especially true in any organization with legacy infrastructure: the assumptions baked into commercial security tooling rarely mesh cleanly with environments that weren’t founded yesterday.

The core constraint, as McMillan states directly, is not expertise — it’s engineering hands and friction. Every team has a long list of things they wish they had time to do. Traditional tooling doesn’t reduce that list; it adds to it, because every new platform requires its own integration, its own dashboard, its own maintenance burden.

“Code Is Free” — The Thesis That Changes Everything

The central claim of this talk is straightforward and deliberately provocative: software doesn’t cost anything to build anymore. That includes security software. If you want a security outcome, you can ask for it. You don’t need to wire up a giant vendor harness and dashboard. You don’t need to deeply integrate with someone else’s vision of what your security program should look like.

This is the shift that agentic AI coding models enable. Your developers are already using these models to build your product code. The logical extension is to apply the same capability to security enforcement. The mechanics McMillan describes are intentionally minimal:

  • Add a new test to your existing CI/CD pipeline
  • Commit a text file containing the things you want checked
  • Write a tool call that returns zero or one, plus some text explaining why
  • Feed that back into your CI framework

No vendor contracts. No frameworks. No open-source project to maintain. As McMillan puts it: “I’m not even releasing anything open source here because you should just ask Codex[1] or Claude to write this integration for you, customized to your own environment.”
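The four mechanics listed above amount to a very small contract: a policy text, a diff, and a call that returns an exit code plus a reason. The Python sketch below is one way that contract could look, not the speakers' implementation. The names run_policy_check and stub_reviewer are hypothetical, and the stub is a plain string match standing in for whatever model call your environment uses.

```python
from typing import Callable, Tuple

# A reviewer takes (policy_text, diff_text) and returns (ok, reason).
# In a real pipeline this would wrap a model call; that part is assumed here.
Reviewer = Callable[[str, str], Tuple[bool, str]]


def run_policy_check(policy_text: str, diff_text: str,
                     reviewer: Reviewer) -> Tuple[int, str]:
    """Return a CI-style exit code (0 = pass, 1 = fail) plus an explanation."""
    ok, reason = reviewer(policy_text, diff_text)
    return (0 if ok else 1), reason


def stub_reviewer(policy_text: str, diff_text: str) -> Tuple[bool, str]:
    """Placeholder reviewer (not an LLM): flags added lines that call eval()."""
    for line in diff_text.splitlines():
        if line.startswith("+") and "eval(" in line:
            return False, f"Added line uses eval(): {line[1:].strip()}"
    return True, "No policy violations found."
```

A CI step would print the reason and exit with the returned code, which is all the CI framework needs in order to treat the review like any other test.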

Why Bespoke Beats Vendor at the Edge Cases

The deeper insight here is about fit. Vendor tools are built for the median enterprise. They optimize for broad applicability, which means they’re often weakest exactly where your environment is most unusual — and every organization with meaningful legacy infrastructure is unusual in some way. A bespoke LLM-driven security reviewer built against your actual codebase, your actual threat model, and your actual risk tolerance doesn’t have that problem. It knows your environment because it was built for your environment.

Ryan Lopopolo reinforces this from the product engineering side: his team built a product with approximately one million lines of code — 250,000 of which are prompts — with near-zero manual security overhead. The key enabling insight is that what remains legible both organizationally and textually in the codebase is what everyone, including AI agents, can actually act on. A threat model locked in a Slack thread from three months ago is not something that informs how code gets written. A threat model checked into the repository as a Markdown file is.

Shifting Left Without Shifting Burden

Shift-left security has been a mantra for years, but the traditional implementation — earlier manual reviews, more developer training, heavier pre-commit tooling — tends to trade one kind of friction for another. The approach described here shifts the burden to the LLM, not to the developer. The developer gets precise, contextual feedback at PR time. The security engineer gets coverage across every PR without having to personally review each one. The CI pipeline gets a new class of checks that are as cheap to run as any other test.

This reframes the economics entirely. The constraint on traditional security tooling is that human attention doesn’t scale. LLM inference does. Tokens, as Lopopolo observes, are cheap relative to engineering time — and that asymmetry is what makes the entire approach viable at scale.

Actionable Takeaways

  • Audit your current security tooling stack for integration overhead: if any tool requires more time to maintain and integrate than it would take to build a bespoke LLM-based equivalent, that's a candidate for replacement. Start with the tool your team complains about most.
  • Start with a single CI/CD check: pick one security property you wish were enforced on every PR, write a plain-text description of what "pass" and "fail" look like, and ask a coding agent to generate the GitHub Actions[2] workflow that runs an LLM against that check non-interactively. This is the minimum viable implementation of the "code is free" thesis.
  • Move your security knowledge into the codebase: any security principle currently living in a Slack thread, a Google Doc, or an internal wiki page is invisible to both coding agents and new engineers. Convert it to a checked-in Markdown file so it becomes actionable context for every automated reviewer and human contributor going forward.

Common Pitfalls

  • Treating vendor tools as a prerequisite for security coverage: organizations often delay implementing any security automation while waiting for budget approval, procurement cycles, or vendor evaluation to complete. The "code is free" framing directly challenges this — the practical alternative (a few dozen lines of YAML and a text file) is available immediately and can be deployed faster than most vendor contracts can be signed.
  • Assuming security knowledge is the bottleneck: teams frequently frame their security gaps as a knowledge problem and invest in training, certifications, or consulting engagements. McMillan's diagnosis is the opposite — the knowledge already exists inside the team; the bottleneck is the engineering capacity to enforce it consistently. Misdiagnosing the constraint leads to solutions (more expertise) that don't address the actual failure mode (not enough hands to apply what's already known).

Living Threat Models: Encoding Security Assumptions Directly Into the Codebase

The Problem With Traditional Threat Models

Every security team has done the exercise: a room full of engineers, a whiteboard, a sprint blocked off for threat modeling. A document gets produced, tagged with the date, and uploaded somewhere. Then it becomes stale immediately. Within weeks, the codebase has evolved and the threat model reflects a product that no longer quite exists. No one reads it. The coding agents certainly can’t read it — it’s locked in a Slack thread or a Google Doc that’s outside the universe of code, Markdown, and scripts they operate within.

Ryan Lopopolo put it plainly: a threat model trapped in a document management system “is not something that I am able to build durable product with.” The problem is not that threat models lack value — it’s that they’re stored in the wrong place and validated at the wrong cadence.

Threat Models as Checked-In Artifacts

The solution is straightforward: generate the threat model, check it into the repository, and treat it like code. Modern LLMs are capable of deducing a reasonable threat model directly from the existing codebase. You don’t need a multi-day workshop to produce the first version — ask a coding agent to produce one from what’s already there, review it with your security partners, and commit it.

Once the threat model lives in the repo, it becomes visible to every human engineer and every coding agent working on that product. It’s no longer a point-in-time artifact — it’s a living document that can be referenced, validated, and updated continuously.

Key steps to implement this:

  • Ask a coding agent (e.g., OpenAI Codex[1]) to draft a threat model from your existing codebase. These models have been trained on a broad body of software security material and can identify relevant threat surfaces without being explicitly briefed on every detail.
  • Review the draft with your security partners and iterate. The agent can interview you — treating the security engineer as a tool to invoke and extract expertise from.
  • Check the threat model into the repository as a Markdown file (e.g., THREAT_MODEL.md, alongside security.md). It now lives where the agents and engineers operate.

Continuous Validation on Every PR

Checking in the threat model is only half the work. The second half is making sure it stays accurate as the codebase changes — and doing that without requiring a human to manually review every pull request.

The approach Ryan described is approximately 40 lines of GitHub Actions[2] YAML:

  1. On every PR, an agent is invoked with a directive: “You are responsible for ensuring the threat model of this codebase is current and up to date, and that this PR does not invalidate any of the assumptions that are in it.”
  2. The agent reads the checked-in threat model and compares it against the proposed changes.
  3. If the PR materially invalidates a documented assumption — introducing a new data flow, changing authentication boundaries, adding a new external dependency — the agent flags it.

Tokens are cheap. Running this check on every PR costs a small fraction of what a human security review costs and provides coverage that no human review cadence can match at scale.
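As a rough illustration of the steps above, the agent's input can be assembled from three pieces of text: the directive, the checked-in threat model, and the PR diff. The sketch below assumes a hypothetical helper named build_review_prompt; the section labels and the closing instruction are illustrative choices, and only the directive wording comes from the talk.

```python
# Directive as quoted in the talk.
DIRECTIVE = (
    "You are responsible for ensuring the threat model of this codebase is "
    "current and up to date, and that this PR does not invalidate any of the "
    "assumptions that are in it."
)


def build_review_prompt(threat_model: str, pr_diff: str) -> str:
    """Combine the directive, the checked-in threat model, and the PR diff."""
    return "\n\n".join([
        DIRECTIVE,
        # Section labels below are illustrative, not from the talk.
        "Threat model (THREAT_MODEL.md):\n" + threat_model.strip(),
        "Proposed changes (unified diff):\n" + pr_diff.strip(),
        "Flag new data flows, changed authentication boundaries, and new "
        "external dependencies that the threat model does not account for.",
    ])
```

Keeping prompt assembly in a pure function makes it easy to inspect exactly what the agent sees for a given PR.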

Routing Ambiguity to the Right Human

Not every threat model delta is clear-cut. For cases where the agent cannot confidently assess whether a change is safe, the agent should escalate — but precisely, not generically. Ryan described granting the agent access to the GitHub CLI[3] (gh) in CI, enabling it to:

  • Tag the relevant human expert when a material change invalidates a previously risk-accepted assumption.
  • Explain clearly why they’ve been brought into scope: what assumption is at risk, what the PR changes, and what specific question requires human judgment.

This is a fundamentally different model from “send a security alert to the security team.” The agent treats the human in the loop as a tool to invoke with a well-formed query — the same way the security engineer previously functioned as a tool for the product team. The output is durable: the human’s resolution gets encoded back into the threat model or into a guardrail, not lost in a Slack thread.
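A precise escalation of this kind can be sketched as a formatted comment plus a single gh invocation. In the sketch below, the function names and the reviewer handle are hypothetical; gh pr comment with --body is the GitHub CLI subcommand for commenting on a pull request, and the code assumes gh is installed and authenticated (for example through a GH_TOKEN environment variable).

```python
import subprocess
from typing import List


def build_escalation_comment(reviewer_handle: str, assumption: str,
                             change_summary: str, question: str) -> str:
    """Format the escalation: who is tagged, what is at risk, what to decide."""
    return (
        f"@{reviewer_handle} this PR may invalidate a documented assumption.\n\n"
        f"Assumption at risk: {assumption}\n"
        f"What the PR changes: {change_summary}\n"
        f"Question for you: {question}\n"
    )


def gh_comment_command(pr_number: int, body: str) -> List[str]:
    """Build the argv for `gh pr comment` so it can be logged before running."""
    return ["gh", "pr", "comment", str(pr_number), "--body", body]


def post_escalation(pr_number: int, body: str) -> None:
    """Post the comment. Assumes gh is installed and authenticated."""
    subprocess.run(gh_comment_command(pr_number, body), check=True)
```

Passing the body as a separate argv element, rather than interpolating it into a shell string, avoids quoting problems when the agent's text contains special characters.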

Why This Changes the Economics of Threat Modeling

Under the traditional model, the cost of threat modeling scales with the number of reviews and the size of the security team. Under this model, the marginal cost of validating a new PR against the threat model is roughly the cost of the tokens consumed — a few cents at most. The threat model itself is refined continuously as the codebase evolves, rather than becoming increasingly inaccurate between annual review cycles.

Ryan summarized the core insight: “The agents crave text. Figure out ways to give them text and inject it into all the other agents that you have working on the codebase.” A checked-in threat model is exactly that — durable text that every agent and human operating on the codebase can consume, validate against, and improve.

Threat Model as a Required PR Review Step: 40 Lines of GitHub Actions YAML

Proof of Concept

  1. Generate the initial threat model from the existing codebase. Rather than scheduling a workshop, ask a coding agent (e.g., OpenAI Codex[1]) to deduce the threat model directly from the code as it currently exists. Prompt: “Analyze this codebase and produce a threat model covering trust boundaries, data flows, assets, and attacker objectives.” The model has been trained on a broad body of software and security material and can produce a credible first draft in minutes.

  2. Check the threat model into the repository as a first-class artifact. Save the output as THREAT_MODEL.md (or equivalent) in the root of the repository alongside README.md, security.md, and other governance documents. This placement is deliberate: coding agents and human engineers operate in the universe of code, Markdown, and scripts inside the repo. A threat model locked in a Slack thread or a Google Doc is inaccessible to both. Once it is in the repo, every agent and every engineer onboarding to the codebase can read it.

  3. Author the GitHub Actions workflow (~40 lines of YAML). Create .github/workflows/threat-model-review.yml. The workflow:
    • Triggers on pull_request events (opened, synchronize, reopened).
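    • Checks out the repository so the agent has access to the full diff and the threat model file. With actions/checkout, the default shallow clone does not include the base branch, so fetch enough history (for example, fetch-depth: 0) for the diff in the next step to resolve.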
    • Invokes the LLM agent non-interactively, passing: (a) the contents of THREAT_MODEL.md, and (b) the PR diff (git diff origin/main...HEAD).
    • The agent prompt: “You are responsible for ensuring the threat model of this codebase is current and up to date and that this PR does not invalidate any of the assumptions that are in it. Review the diff and determine whether any changes conflict with, extend, or invalidate the documented threat model.”
    • Parses the agent’s structured output (a JSON schema with fields: passes, judgment, comments).
    • Fails the CI check (exits non-zero) if passes is false, surfacing the agent’s comments directly in the PR status check.
  4. Grant the agent access to the GitHub CLI for human escalation. The gh CLI is preinstalled on GitHub-hosted runners; pass a token to the step through the GH_TOKEN environment variable, and give the job the pull-requests: write permission so the agent can comment on the PR. Extend the agent prompt: “If a material change has occurred that invalidates a risk that has already been accepted and documented, use gh pr comment to tag the responsible security engineer for review. Explain precisely which threat model assumption is at risk and what ambiguity you need them to resolve.” This implements a human-in-the-loop pattern where the agent acts as a first-pass reviewer and escalates only on genuine ambiguity.

  5. Make the check required in branch protection rules. In the GitHub repository settings under Branch protection rules, add threat-model-review as a required status check for the target branch (e.g., main). PRs cannot be merged until the check passes. This elevates the threat model from advisory documentation to an enforced engineering gate — identical in weight to passing unit tests or lint.

  6. Keep the threat model current through the same PR process. When the agent flags that a PR materially changes the threat model, the resolution is not to override the check but to update THREAT_MODEL.md as part of the PR. The agent re-runs on the updated diff and, if the new documentation accurately reflects the change, passes. Over time the threat model self-updates with each significant architectural change rather than being a point-in-time artifact that immediately goes stale.

  7. Validate the cost profile. The OpenAI team reported allocating roughly 50% of its token budget to refinement, which includes CI-side reviewer agents like this one, and still considers the workflow cost-effective relative to engineering time. Tokens are cheap compared to the alternative: a security engineer manually reviewing every PR, or a quarterly audit discovering accumulated drift. The workflow requires no vendor contract, no SDK integration beyond the model’s API, and no proprietary platform. It needs only a YAML file, a model endpoint, and an output schema.
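The script that such a workflow calls could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the team's code: the agent is injected as a plain callable (prompt in, JSON text out) because the actual model invocation depends on your environment, the function names are hypothetical, and the passes and comments field names follow step 3 above. The sketch fails closed, so an unreadable verdict blocks the PR.

```python
import json
import subprocess
from pathlib import Path
from typing import Callable, Tuple


def load_threat_model(repo_root: Path) -> str:
    """Read the checked-in threat model from the repository root."""
    return (repo_root / "THREAT_MODEL.md").read_text(encoding="utf-8")


def collect_diff(base: str = "origin/main") -> str:
    """Return the PR diff; the base ref must already be fetched locally."""
    result = subprocess.run(
        ["git", "diff", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def review_pr(threat_model: str, diff: str,
              agent: Callable[[str], str]) -> Tuple[int, str]:
    """Run one review and return (exit_code, comments); 0 passes, 1 fails."""
    prompt = f"Threat model:\n{threat_model}\n\nPR diff:\n{diff}"
    try:
        verdict = json.loads(agent(prompt))
        passes = verdict["passes"]
        comments = str(verdict.get("comments", ""))
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        # Fail closed: an unreadable verdict should not let the PR through.
        return 1, f"Could not parse agent output: {exc}"
    return (0 if passes is True else 1), comments
```

In CI, the entry point would call load_threat_model and collect_diff, pass the results to review_pr, print the comments, and exit with the returned code so the required status check reflects the verdict.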

Actionable Takeaways

  • Generate your initial threat model by asking a coding agent to deduce it from the existing codebase, then validate it with your security team. Check it into the repository as a Markdown file so it is accessible to both human engineers and coding agents operating in the codebase.
  • Add a GitHub Actions workflow (~40 lines of YAML) that invokes an LLM on every PR with the directive to validate the changes against the checked-in threat model. Flag any PR that materially invalidates a documented security assumption before it can be merged.
  • Grant the CI agent access to the GitHub CLI so it can tag the appropriate human expert — with precise context about what assumption is at risk and what question needs resolution — rather than issuing vague security alerts.

Common Pitfalls

  • Storing the threat model in a Slack thread, Google Doc, or any system outside the code repository. Coding agents cannot read documents stored outside their file system context, and human engineers rarely revisit documents stored in separate systems. The threat model must live in the repo to be operationally useful.
  • Treating the threat model as a one-time deliverable rather than a living artifact. A threat model that is generated once and never validated against subsequent PRs becomes misleading — it documents a past state of the system, not the current one. Continuous validation in CI is what keeps it authoritative.

Building LLM-Powered Security Reviewers Inside Your CI/CD Pipeline

From Security Intent to Executable Guardrails

The core insight driving CI/CD security automation is deceptively simple: if you can write down what good security looks like, you can have an LLM enforce it on every pull request automatically. Paul McMillan and Ryan Lopopolo at OpenAI operationalized this by building a security.md file that captures their organization’s secure coding principles, then running a non-interactive coding agent against it on every PR — all wired together with approximately 40 lines of GitHub Actions YAML.

The mechanical steps are straightforward:

  1. Write a security.md — a plain-text file checked into the repository that encodes your organization’s secure coding standards, threat model assumptions, and the classes of vulnerabilities you care most about eliminating.
  2. Create a CI job that runs a coding agent (such as OpenAI Codex[1]) non-interactively against the incoming diff, providing the security.md as context.
  3. Define an output schema with three fields: judgment (pass or fail), comments (an explanation of any findings), and passed (a boolean that deterministic CI code can gate on).
  4. Parse the agent’s JSON output and surface it in your CI framework the same way you would any other test result — zero means pass, non-zero means fail.

No vendor contracts. No complex integrations. No platform team required. As Ryan put it directly: “This is like 40 lines of YAML in GitHub Actions right now.”

[Figure: CI/CD LLM security reviewer architecture, showing PR flow through GitHub Actions, LLM agent review, human escalation, and a parallel dependency scanner pipeline]

Writing a security.md That Actually Works

The leverage here is disproportionate to the effort. Ryan described a concrete example: adding two sentences to their security.md, centered on the principle “secure code comes from secure interfaces that are impossible to misuse,” produced an immediate, measurable uplift. With that principle in place, the coding agents reviewing PRs began to prefer high-level file system primitives over manually stitched-together calls to realpath and clean, eliminating an entire class of path traversal vulnerabilities without any additional tooling.

This is the key mental model: you can compress entire vulnerability classes into a paragraph of text, then have an agent review every PR against those guardrails and tell engineers exactly where they deviated.
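To make "an interface that is impossible to misuse" concrete, here is a small Python sketch of rooted file access. It is an illustration of the idea rather than the primitive the team uses: the RootedDir class is hypothetical, and pathlib's resolve() plays the role that realpath and clean play in the talk's example. Callers hand over a relative path and never assemble or validate paths themselves.

```python
from pathlib import Path


class RootedDir:
    """File access confined to one directory tree."""

    def __init__(self, root: str) -> None:
        # Resolve once so every later comparison uses a canonical root.
        self._root = Path(root).resolve()

    def resolve(self, relative: str) -> Path:
        """Return the canonical path for `relative`, or raise if it escapes."""
        candidate = (self._root / relative).resolve()
        if candidate != self._root and self._root not in candidate.parents:
            raise PermissionError(f"{relative!r} escapes {self._root}")
        return candidate

    def read_text(self, relative: str) -> str:
        """Read a file that is guaranteed to live under the root."""
        return self.resolve(relative).read_text(encoding="utf-8")
```

Because the containment check lives inside the only code path that produces a usable path, a caller cannot forget it. This sketch checks the path at resolution time only; a symlink swapped in afterwards is outside what it guards against.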

Effective security.md content includes:

  • High-level secure interface principles — e.g., “prefer APIs that are impossible to call incorrectly over those requiring caller discipline”
  • Banned patterns — explicit statements of what not to do, drawn from past findings, AppSec review comments, and historical PR feedback
  • Threat model anchors — the key assumptions from your living threat model that every PR must not invalidate
  • Escalation signals — conditions under which the agent should tag a human instead of making an autonomous judgment

The security.md becomes a durable institutional memory. Every Slack message a security engineer has ever sent to flag a bad pattern, every PR comment identifying an anti-pattern, every AppSec review finding — these are all inputs that belong in this file. Ryan’s framing: “Every Slack message that you have ever sent to let an engineering team know that they’re not operating in the way that you need them to, just reply to it at Codex. Make it so.”

Wiring the LLM Reviewer Into CI

The CI integration pattern Paul described has four components:

1. A security policy file (the input)

A text file — security.md or a dedicated security_checks.txt — that lists the things you want reviewed. This is the only configuration surface. Changes to security policy are just text file commits.

2. A non-interactive agent invocation

Run the coding agent (Codex in their case) non-interactively against the incoming diff with the security file as context. The agent reads both the changed code and the policy, then produces a structured judgment. Paul’s description: “a tool call that returns zero or one, and some text about why.”

3. A structured output schema

The agent emits JSON (or YAML) with at minimum:

  • judgment: pass or fail
  • comments: human-readable explanation of any findings
  • passed: boolean

This schema is what makes the output parseable by deterministic CI code rather than requiring another LLM to interpret the result.
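A deterministic parser for that schema might look like the following sketch. The Verdict dataclass and parse_verdict function are hypothetical names; the three fields are the ones listed above. The strictness is an assumption made for this example, including the cross-check that judgment and passed agree.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    judgment: str   # "pass" or "fail"
    comments: str   # human-readable explanation
    passed: bool    # machine-readable gate


def parse_verdict(raw: str) -> Verdict:
    """Validate the agent's JSON strictly; raise ValueError on any deviation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"agent output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("agent output must be a JSON object")
    judgment = data.get("judgment")
    comments = data.get("comments")
    passed = data.get("passed")
    if judgment not in ("pass", "fail"):
        raise ValueError("judgment must be 'pass' or 'fail'")
    if not isinstance(comments, str):
        raise ValueError("comments must be a string")
    if not isinstance(passed, bool):
        raise ValueError("passed must be a boolean")
    if passed != (judgment == "pass"):
        raise ValueError("judgment and passed disagree")
    return Verdict(judgment, comments, passed)
```

Rejecting malformed or self-contradictory output up front means the CI gate never has to guess what the agent meant.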

4. CI framework integration

Parse the output and feed it back into the CI framework exactly like any other test. The result appears in the PR check list alongside unit tests and linting. If the agent flags an issue, the PR is blocked until it is addressed — or until a human with appropriate authority overrides it.

Routing Material Findings to the Right Human

Not every finding warrants a hard block. Some security-relevant changes are intentional, risk-accepted, or require judgment that exceeds the agent’s authority. The solution is giving the CI agent access to the GitHub CLI[3] (gh) so it can tag expert humans directly.

The workflow Ryan described:

  1. The agent detects that a PR makes a material change to something documented in the threat model or security.md as a known risk area.
  2. Instead of failing the build blindly, the agent posts a PR comment tagging the relevant security engineer.
  3. The comment includes: what changed, why it triggered the flag, what the agent’s assessment is, and what specific ambiguity it needs the human to resolve.

This is the key operational shift: treat humans as tools that agents invoke when the problem exceeds their authority. The agent does not escalate with a vague alert — it escalates with a precise, context-rich question. The human’s job is to answer that question, not to re-derive the context from scratch.

Ryan described this as a core operating principle for his team: “One of the core ways that I’ve changed how I operate here is to treat the humans as tools to empower the agents and other engineers in the codebase in order to make forward progress.”

Converting Feedback Loops Into Permanent Guardrails

One of the highest-leverage practices the talk described is treating every piece of feedback — PR comments, test failures, lint errors, AppSec review findings — not as one-time corrections but as signals about missing guardrails.

The process:

  1. A security concern surfaces (Slack message, PR comment, review finding, test failure).
  2. Instead of only patching the immediate issue, step back and ask: what process failure allowed this code to be produced?
  3. Fix the point issue.
  4. Add an executable guardrail — a custom lint, a new check in security.md, a new agent review rule — that statically prevents this class of mistake from being made again.

Ryan’s framing: “I say, what are the process failures that led to the code being permitted to be produced in this way? Fix the point issues and then add executable guardrails to the codebase that statically disallow these mistakes from being made in the first place.”

These bespoke, fit-for-purpose lints do not need to be production-grade, open-source-ready tools. They are internal, single-purpose scripts written in an afternoon. The bar is: does it catch the class of mistake we care about? If yes, check it in and run it in CI. “Cheap bespoke fit-for-purpose lints that will never see the light of day out of your enterprise are free things that you can vibe up.”
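As an example of how small such a lint can be, the sketch below uses Python's standard ast module to flag direct calls to low-level path functions. The banned list is a hypothetical policy chosen for illustration, using os.path.realpath and os.path.normpath as Python analogues of the realpath and clean calls mentioned in the talk.

```python
import ast
from typing import List, Tuple

# Hypothetical policy: these calls belong only inside an approved wrapper.
BANNED_CALLS = {"os.path.realpath", "os.path.normpath"}


def dotted_name(node: ast.AST) -> str:
    """Rebuild 'a.b.c' from nested Attribute/Name nodes; '' if not that shape."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return ""


def find_banned_calls(source: str) -> List[Tuple[int, str]]:
    """Return (line_number, call_name) for every banned call in the source."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in BANNED_CALLS:
                findings.append((node.lineno, name))
    return sorted(findings)
```

The lint only matches fully qualified calls, so an aliased import such as from os.path import realpath would slip past it. That is an acceptable gap for a single-purpose internal check, and one a follow-up guardrail can close.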

Preventing Slop: Encoding Quality Standards Directly

Ryan introduced the concept of “not permitting slop” as a specific, achievable goal for agentic systems. The mechanism: write down what good looks like, explicitly, and have the agent refuse to produce output that falls short of it.

This means the security.md is not just a list of things to check — it is a positive statement of what correct, secure code looks like. The agent is then instructed to enforce that standard, not merely to find violations of a negative checklist.

The practical result is that the security.md and associated documentation become, in effect, prompts injected into every agent and human operating on the codebase. When a new engineer onboards, they read it. When an agent generates a PR, it reads it. When a reviewer evaluates a diff, the context is already present. The expertise that previously lived in a few senior engineers’ heads is now continuously available to everyone and everything touching the codebase.

Token Budget and Practical Cost

A common objection is cost. Ryan addressed this directly in the Q&A: their team allocates roughly 50% of their token budget to code production and 50% to refinement — which includes all the CI reviewers, scanner agents, and feedback loops. The conclusion: “relative to engineering time, tokens are cheap.”

The security review CI job described — a non-interactive agent run on each PR against a security.md — consumes a small, predictable number of tokens per PR. At current API pricing, this is measured in cents, not dollars, per pull request. The ROI calculation is straightforward: what is the cost of one security finding that reached production versus the cumulative cost of all PR-level LLM review calls in a year?
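The ROI question above is simple arithmetic, sketched here in Python. The talk gives no token counts or prices, so every input is a parameter you would fill in from your own usage data and your provider's price list; the function names are hypothetical.

```python
def review_cost_usd(input_tokens: int, output_tokens: int,
                    usd_per_million_input: float,
                    usd_per_million_output: float) -> float:
    """Cost of one review call from token counts and per-million-token prices."""
    input_cost = (input_tokens / 1_000_000) * usd_per_million_input
    output_cost = (output_tokens / 1_000_000) * usd_per_million_output
    return input_cost + output_cost


def annual_review_cost_usd(prs_per_year: int, cost_per_pr_usd: float) -> float:
    """Cumulative yearly cost if every PR triggers exactly one review call."""
    return prs_per_year * cost_per_pr_usd


def breaks_even(annual_cost_usd: float, incident_cost_usd: float,
                incidents_prevented: float) -> bool:
    """True when the avoided incident cost covers the yearly review spend."""
    return incident_cost_usd * incidents_prevented >= annual_cost_usd
```

The comparison is deliberately crude: it ignores retries, multiple reviewers per PR, and the engineering time spent triaging findings, all of which you may want to add for a real budget.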

security.md Guardrail: Eliminating Path Traversal Vulnerabilities With Two Sentences

Proof of Concept

  1. Identify the vulnerability class to eliminate. The team recognized that path traversal vulnerabilities were recurring because engineers (and coding agents) were stitching together low-level filesystem calls — realpath, clean, and similar primitives — manually. Each combination was a potential misuse opportunity that could allow directory escape.

  2. Write the guardrail as a principle, not a rule. Rather than enumerating every forbidden pattern, Ryan added two sentences to security.md that encode the underlying design philosophy: “Secure code comes from secure interfaces that are impossible to misuse.” This framing tells the agent why certain patterns are dangerous and what good looks like at an architectural level.

  3. Wire the security.md into the CI LLM reviewer. The security.md file — already part of the repository — is fed to the LLM security reviewer agent that runs on every PR. The agent reads the guardrail, understands that it prohibits unsafe interface composition, and evaluates each diff against that principle.

  4. Observe the behavioral shift in code generation. Once the guardrail was present, coding agents stopped producing raw realpath/clean chains and instead generated calls to high-level primitives that perform rooted filesystem access with safe, opinionated APIs. The interface itself becomes impossible to misuse because the unsafe composition is never expressed in the code at all.

  5. Validate the guardrail catches violations in PRs. When a PR is opened that contains a path-handling pattern inconsistent with the guardrail — such as a manually assembled path traversal check — the LLM reviewer flags it with a comment explaining which principle is violated and what the correct approach is. Engineers and agents receive precise, actionable feedback rather than a vague security warning.

  6. Generalize the pattern to other vulnerability classes. The same two-sentence structure can be replicated for any vulnerability class. The technique is: encode the desired design property as a principle in security.md, run an LLM agent against every PR to check compliance, and let the agent surface violations. An entire class of vulnerabilities — SQL injection, command injection, SSRF — can each be compressed into a paragraph of text rather than requiring a dedicated static analysis tool or vendor integration.

  7. Compound the effect with agent-to-human escalation. For material violations, the CI agent can tag the responsible security engineer via the GitHub CLI[3] (gh), providing the specific diff, the guardrail it violated, and a plain-language explanation of the risk. This keeps a human in the loop without requiring the human to review every PR — the agent acts as a triage layer, escalating only when the guardrail is breached.

Actionable Takeaways

  • Create a security.md file checked into your repository that encodes your organization's secure coding principles in plain language — including interface design rules, banned patterns, and threat model assumptions. Start with the most recent AppSec review findings or the last three security-related PR comments your team received, distill them into principles, and check the file in. Wire a non-interactive Codex or equivalent agent to review every PR against it using ~40 lines of GitHub Actions YAML.
  • Define a structured output schema (judgment, comments, passed boolean) for your LLM security reviewer so its output is machine-parseable by deterministic CI code. This schema is what allows you to block PRs programmatically and surface findings alongside unit test results, rather than depending on humans to read agent prose output and decide what to do.
  • Grant your CI agent access to the GitHub CLI and instruct it to tag the relevant human expert — with a precise, context-rich question — when a PR makes a material change that exceeds the agent's authority to approve autonomously. This replaces vague security alerts with specific escalations and lets you scale automated review without losing human oversight on high-risk changes.
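
The structured-output takeaway can be sketched as a small deterministic gate. The three field names (judgment, comments, passed) come from the talk; the JSON encoding, the `Verdict` class, and the choice to fail closed on malformed output are assumptions made for illustration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    judgment: str
    comments: list[str]
    passed: bool


def parse_verdict(raw: str) -> Verdict:
    """Parse reviewer output strictly; malformed output counts as a failure."""
    try:
        data = json.loads(raw)
        judgment = data["judgment"]
        comments = data["comments"]
        passed = data["passed"]
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        return Verdict(f"unparseable reviewer output: {exc}", [], False)
    comments_ok = isinstance(comments, list) and all(
        isinstance(c, str) for c in comments
    )
    if not (isinstance(judgment, str) and isinstance(passed, bool) and comments_ok):
        return Verdict("reviewer output has wrong field types", [], False)
    return Verdict(judgment, comments, passed)


def exit_code(verdict: Verdict) -> int:
    """Map the verdict to a process exit code that CI can block on."""
    return 0 if verdict.passed else 1
```

A CI step would pass the agent's final message to `parse_verdict` and call `sys.exit(exit_code(verdict))`. Failing closed means a truncated or prose-only response blocks the PR instead of silently passing it.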

Common Pitfalls

  • Treating AppSec review findings as one-time fixes rather than signals about missing guardrails. The pattern described is to fix the immediate issue AND add an executable guardrail (a lint, a new security.md rule, a new agent check) that prevents the same class of mistake from recurring. Skipping the second step means the same vulnerability class will re-emerge as new engineers and agents produce new code without the benefit of that hard-won institutional knowledge.
  • Writing a security.md that is too abstract or too narrow to drive consistent LLM behavior. The file needs to encode positive principles ("secure interfaces that are impossible to misuse") rather than just a list of prohibited patterns. Overly abstract principles produce inconsistent enforcement; overly specific rules fail to generalize to new code patterns. The two-sentence example from the talk — which eliminated an entire path traversal vulnerability class — illustrates the right level of abstraction: a principle about interface design, not a rule about a specific function call.

AI-Assisted Dependency Scanning and Supply Chain Hardening

Why Supply Chain Hardening Keeps Slipping Below the Line

AI-assisted dependency scanning addresses one of the most persistent gaps in enterprise security programs: supply chain hardening that everyone agrees is high value but that never gets prioritized. With roughly 1,500 entries in the lock file of the application discussed in the talk, auditing direct dependencies by hand is already unrealistic, and auditing transitive dependencies is essentially impossible without automation. The OpenAI team demonstrated that a 15-minute conversation with a security engineer and a few hours of agent work can close that gap with a scanner that keeps running daily.

The core insight is that supply chain risk isn’t a knowledge problem. Teams already know the best practices: keep package managers up to date, restrict post-install script execution in CI, monitor upstream repository health and author reputation. The obstacle is time — specifically, the engineering hours required to apply those principles consistently across a large, evolving dependency tree.

Turning a Point-in-Time Observation Into a Durable Guardrail

Ryan Lopopolo described a concrete starting point: a PR in the open-source Codex[1] repository updated PNPM[4] to fix security vulnerabilities and hardened the package installer policy to forbid post-install scripts from running. Rather than simply noting the change, the approach was to capture the underlying principles and apply them to their own codebase.

The process took 15 minutes:

  1. Copy the diff from the upstream PR.
  2. Prompt Codex: “Apply the same set of principles to our codebase.”
  3. Review the resulting diff, which applied identical hardening — updated PNPM version, disabled post-install scripts — to the internal project.

But the more important step was articulating why those changes mattered. The durable principles extracted from that single PR were:

  • Package managers are a high-risk surface — keep them as current as possible.
  • Arbitrary code execution during package install must be blocked, particularly in CI environments where the blast radius is largest.
  • Dependencies that require post-install scripts should trigger a human-in-the-loop review before being permitted.

Writing those principles into a security.md or equivalent checked-in document means every subsequent agent — dependency update reviewers, PR scanners, onboarding automations — inherits that context. When a Dependabot PR attempts to update the PNPM workspace file, an agent reviewing it will not blindly accept the change to get the build passing. It will check whether the update preserves the documented installer policy.
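
A documented policy can also be backed by a cheap deterministic check that runs next to the agent review. The sketch below assumes the installer policy is expressed as `ignore-scripts=true` in an `.npmrc`-style file; the talk does not say how the policy was actually encoded, and the function names are hypothetical.

```python
def parse_npmrc(text: str) -> dict[str, str]:
    """Parse key=value lines, skipping blanks and # or ; comments."""
    settings: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#;" or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def policy_regressions(before: str, after: str, key: str = "ignore-scripts") -> list[str]:
    """Return human-readable problems if `after` weakens the installer policy."""
    old, new = parse_npmrc(before), parse_npmrc(after)
    problems = []
    if old.get(key) == "true" and new.get(key) != "true":
        problems.append(f"{key} was 'true' and is now {new.get(key)!r}")
    return problems
```

Run against the base and head versions of the config file in a dependency-update PR, a non-empty result is a signal to stop and escalate rather than merge to get the build green.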

Building a Bespoke Dependency Scanner in an Afternoon

The supply chain hardening story continued with a more ambitious project: a custom dependency scanner built in approximately four hours. The starting point was a 15-minute coffee conversation between Ryan and Paul McMillan (security engineer) about the sprawl in their lock file and what principles should govern how to reason about it.

Paul’s challenge: could Codex produce a sniff test of all dependencies — a set of risk profiles identifying packages to consider dropping, in-housing, or upgrading? Could the model reliably gather reputation signals for upstream packages?

The result was a scanner that:

  • Runs daily in CI via a scheduled workflow
  • Forks 16 parallel agents to spider through the codebase concurrently
  • Profiles every direct dependency and its upstream transitives
  • Surfaces reputation signals: recent repository activity, author reputation, maintenance status
  • Publishes a structured report to an internal hosting platform each day
  • Outputs SARIF reports compatible with GitHub Advanced Security[5]

Those SARIF reports then feed into a second layer of automation: agents that pick up the findings and can propose dependency upgrades, flag packages for removal, or suggest migrating to alternative frameworks that don’t carry the same risk profile.
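
For readers unfamiliar with SARIF, the following sketch shows roughly what converting scanner findings into a SARIF 2.1.0 document involves. The input dictionary keys (`rule_id`, `level`, `message`, `file`, `line`) and the tool name are hypothetical; the talk does not describe the scanner's internal finding format or which SARIF fields it populates.

```python
import json

SARIF_SCHEMA = "https://json.schemastore.org/sarif-2.1.0.json"


def to_sarif(findings: list[dict], tool_name: str = "dependency-scanner") -> dict:
    """Build a minimal SARIF 2.1.0 log with one run from a list of findings."""
    results = []
    for f in findings:
        results.append({
            "ruleId": f["rule_id"],
            "level": f["level"],  # SARIF levels: none, note, warning, error
            "message": {"text": f["message"]},
            "locations": [{
                "physicalLocation": {
                    "artifactLocation": {"uri": f["file"]},
                    "region": {"startLine": f["line"]},
                }
            }],
        })
    rule_ids = sorted({f["rule_id"] for f in findings})
    return {
        "$schema": SARIF_SCHEMA,
        "version": "2.1.0",
        "runs": [{
            "tool": {"driver": {"name": tool_name,
                                "rules": [{"id": r} for r in rule_ids]}},
            "results": results,
        }],
    }
```

Serializing the result with `json.dumps` yields a file that SARIF consumers such as GitHub code scanning can ingest. Pointing each finding at the lock file line where the package appears is one plausible choice of location, not something the talk specifies.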

Iterating on the Scanner: From Point Instances to General Principles

An important refinement cycle illustrates how to get more value from agent-produced guardrails. After the initial scanner was deployed and producing human-legible artifacts, Paul and Ryan continued iterating asynchronously in a Slack thread.

Paul’s feedback: the agent appeared to be operating on specific, point-instance rules rather than general principles. It was responding to individual guardrails rather than reasoning from a broader framework of what supply chain risk actually means.

The fix was to reframe the scanner’s prompt around more general principles — what the agent should use as its reference frame — rather than a checklist of specific checks. The conversation happened casually in Slack; Ryan copied the thread into Codex using the ChatGPT Slack integration, asked clarifying questions until the principles were clear, then let Codex update the scanner’s behavior accordingly.

This iteration loop produced a materially better scanner without significant engineering investment. The refinement cost was a Slack conversation and a Codex run.

How Documentation Stacks Across the Entire System

One of the most significant compounding benefits of this approach is how documented supply chain principles propagate through the entire agent ecosystem. When the rationale for hardening package installer policies is written down — specifically, that it prevents arbitrary code execution in CI — that context becomes available to every other agent operating in the codebase.

The dependency scanner’s findings, the PR reviewer’s guardrails, and the onboarding documentation all reference the same underlying principles. An agent reviewing a routine Dependabot PR will not treat it as an isolated update; it will evaluate whether the change preserves the documented policies around installer scripts and package manager versions.

License scanning and depth analysis fall out of the same infrastructure for free once the general framework is established. The initial investment in encoding supply chain principles as checked-in text pays dividends across every downstream agent and human that touches the codebase.
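
Depth analysis in particular is a small graph computation once the dependency tree is available. The sketch below assumes the lock file has already been parsed into a mapping from each package to the packages it depends on; that representation and the function name are illustrative, not taken from the talk.

```python
from collections import deque


def dependency_depths(graph: dict[str, list[str]], direct: list[str]) -> dict[str, int]:
    """Breadth-first walk: shortest distance of every package from the project.

    Direct dependencies have depth 1, their dependencies depth 2, and so on.
    """
    depths = {name: 1 for name in direct}
    queue = deque(direct)
    while queue:
        pkg = queue.popleft()
        for child in graph.get(pkg, []):
            if child not in depths:  # first visit is the shortest path in BFS
                depths[child] = depths[pkg] + 1
                queue.append(child)
    return depths
```

Because visited packages are skipped, cycles in the graph terminate normally. Packages with a large depth are ones the team never chose directly, which is one reasonable input to a risk profile.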

Supply Chain Hardening From a Slack Thread: Applying a Codex PR to Your Own Codebase in 15 Minutes

Proof of Concept

  1. Identify the source signal. Ryan was following the Codex team’s open-source repository activity. A contributor named Matthew submitted a PR that did two things: upgraded PNPM to a recent version patching known security vulnerabilities, and hardened the package installer policy to forbid post-install scripts from running. This PR was visible in a Slack channel Ryan monitored.

  2. Extract the diff. Ryan copy-pasted the diff from Matthew’s PR directly from the Slack thread or the repository. No additional tooling or context extraction was needed — the raw diff contained enough signal.

  3. Invoke Codex with the diff as context. Ryan passed the diff to Codex with a simple natural-language instruction: apply the same set of principles to our codebase. Codex interpreted the intent of the changes, not just their literal content, and adapted them to the target codebase’s structure within 15 minutes.

  4. Receive and review the generated diff. Codex produced a diff applying equivalent hardening to the target project — updating the relevant PNPM version and configuring the installer policy to block arbitrary post-install script execution during CI builds.

  5. Distill the durable principles behind the change. Rather than treating this as a point-in-time fix, Ryan identified the underlying security principles:
    • Package managers are a high-risk attack surface; keep them as up to date as possible.
    • Arbitrary code execution during the package install process — especially in CI — must be blocked by policy.
    • For dependencies that require post-install scripts to function, a human reviewer must explicitly approve whether execution is safe.

  6. Document the principles in the codebase. These three principles were written down in the project’s security documentation (e.g., security.md or an equivalent guardrails file). This made the rationale legible to onboarding engineers and to any coding agents that subsequently operate on the codebase.

  7. Leverage the documented context in downstream agent workflows. Because the hardening rationale was now encoded in the codebase, agents running automated dependency update reviews had access to it. When Codex encountered a PNPM workspace file update, it understood that blindly bumping the version to pass a build is not acceptable — it had to honor the installer policy constraint. This prevented regressions from automated dependency updates undoing the hardening.

  8. Generalize the pattern. The same workflow applies to any supply chain guardrail: spot a relevant fix anywhere (a public PR, a Slack security advisory, a CVE writeup), extract the diff or the principle, tell Codex to apply it to your codebase, and immediately document why — so every future agent and engineer benefits from the decision.

4-Hour Custom Dependency Scanner: 16 Parallel Agents Profiling 1,500 Packages Daily

Proof of Concept

  1. Identify the problem scope. The codebase in question is an Electron application with approximately 1,500 entries in the lock file. Manually auditing direct dependencies is already impractical; auditing all transitive dependencies individually is impossible at human scale. Supply chain hardening had remained perpetually below the prioritization line.

  2. Extract principles via a brief expert interview. Ryan, the product engineer, had a 15-minute coffee conversation with a security engineer (Paul) to extract general principles: not a detailed spec, but a set of high-level policies. The core question posed was: “Can you get Codex to do a sniff test of all the dependencies in your codebase to give you a set of risk profiles? Some dependencies you should consider dropping, in-housing, or upgrading. Can you get the model to reliably gather reputation signals for your upstream?”

  3. Generate the initial point-in-time snapshot. Using Codex as the coding agent, the engineer directed it to spider through the codebase and evaluate every direct dependency. Codex gathered signals for each upstream package, including recent repository activity, author reputation, and a general risk posture assessment. This initial pass produced a legible artifact in roughly 4 hours of background terminal time during a normal afternoon.

  4. Harden the scanner into a daily CI job. Rather than treating the initial snapshot as a one-off audit, the engineer converted it into a durable recurring job. The scanner was wired into CI and configured to fork 16 parallel agents on each run, distributing the work across all direct dependencies and their upstream transitives simultaneously.

  5. Define the per-dependency profile structure. The scanner produces a structured risk profile for every direct dependency in the lock file and for its upstream transitive dependencies. Each profile covers: (a) an assessment of the upstream repository’s health, including recent commit activity, issue tracker state, and maintainer responsiveness, (b) reputation signals about the package authors, and (c) an overall risk posture rating for the dependency.

  6. Publish findings to an internal hosting platform. The daily scanner output is published to an internal hosting platform, producing human-readable reports. A separate layer of code generates SARIF-formatted reports from those findings.

  7. Feed SARIF reports into GitHub Advanced Security[5]. The SARIF output is pushed into GitHub Advanced Security, which acts as the broker. Other agents in the system — dependency upgrade agents, removal agents, in-housing evaluation agents — can then pick up these findings and act on them: upgrading dependencies, removing them, or migrating to alternative frameworks that do not carry the same risk profile.

  8. Iterate cheaply via asynchronous Slack feedback. After the initial version was live, Paul provided additional feedback asynchronously in a Slack thread. The feedback indicated the agent seemed to be applying specific point-instance rules rather than general principles. The engineer copy-pasted the Slack thread directly into Codex (using the ChatGPT Slack integration to spider the thread) and asked it to refine the scanner’s framing toward more general operating principles rather than specific guardrails. Codex produced an updated diff approximately 30 minutes later.

  9. Stack additional scanners on top for free. Once the dependency risk profiling infrastructure was in place, license scanning and depth scanning fell out of the same pattern with minimal additional work. Each new scanner type reuses the same parallel-agent architecture, the same SARIF pipeline, and the same GitHub Advanced Security broker — compounding the value of the initial 4-hour investment.

Note: The transcript describes the outcome, the architecture, and the iterative refinement process in concrete terms, but does not expose the actual Codex prompts, the specific SARIF schema used, the internal hosting platform configuration, or the exact structure of the per-dependency profile fields. The steps above reflect the full extent of the technical detail provided.
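
With that caveat in mind, the fan-out in steps 4 and 5 can be sketched in a few lines. The worker count of 16 comes from the talk; the `RiskProfile` fields, the use of a thread pool, and `stub_agent` are assumptions standing in for details the talk does not expose.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RiskProfile:
    package: str
    repo_health: str
    author_reputation: str
    risk: str  # illustrative scale, e.g. "low", "medium", "high"


def profile_all(
    packages: list[str],
    profile_one: Callable[[str], RiskProfile],
    workers: int = 16,
) -> list[RiskProfile]:
    """Profile every package using up to `workers` concurrent agent calls."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, keeping reports stable.
        return list(pool.map(profile_one, packages))


def stub_agent(package: str) -> RiskProfile:
    """Placeholder for a real agent call; returns a fixed, made-up profile."""
    return RiskProfile(package, "unknown", "unknown", "medium")
```

In a real job, `profile_one` would invoke the coding agent for a single package and parse its structured answer. Threads are a reasonable fit here because each call would spend most of its time waiting on network I/O.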

Actionable Takeaways

  • Start supply chain hardening by finding one upstream PR or security advisory that applies to your package manager, then prompt a coding agent to apply the same principles to your codebase. The 15-minute effort produces a concrete diff and, more importantly, a written set of principles you can check into your repository as durable documentation.
  • Build a bespoke dependency scanner by interviewing your security team about risk signals they care about — reputation, maintenance activity, post-install script requirements — then instruct a coding agent to implement a daily CI job that profiles your direct and transitive dependencies against those criteria and outputs SARIF reports to GitHub Advanced Security.
  • After your scanner produces its first report, iterate on the prompts to shift agent reasoning from point-instance rules to general principles. Copy relevant Slack feedback threads into the coding agent and ask it to update the scanner's frame of reference; this produces significantly better signal without additional engineering investment.

Common Pitfalls

  • Treating supply chain hardening as a one-time audit rather than a continuous process. A point-in-time scan goes stale immediately as dependencies update. The OpenAI approach runs the scanner daily in CI precisely because the dependency landscape changes continuously — a scan from last quarter tells you nothing about what Dependabot merged this morning.
  • Allowing agents to blindly accept dependency update PRs to get builds passing. Without the documented context explaining why package installer policies are restricted, an agent will optimize for the immediate goal (green CI) rather than the actual security principle (no arbitrary code execution during install). The fix is to write the rationale into checked-in documentation so it is visible to every agent that reviews those PRs.

Human-in-the-Loop Governance and Agentic Security at Scale

Treating Humans as Tools — Structured Escalation Over Ad Hoc Review

One of the most important mindset shifts in agentic security at scale is inverting the traditional relationship between humans and automated systems. In Ryan and Paul’s model, humans are not supervisors who periodically check in on agents — they are callable tools that agents invoke when the task exceeds their autonomous authority. The agent identifies a material change, formulates a precise question, tags the relevant expert via the GitHub CLI[3] (gh), and passes exactly the context needed to resolve the ambiguity. The human’s response is then durably encoded back into the codebase, feeding future agents and engineers alike.
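
As a rough illustration of "calling a human", the sketch below builds a `gh pr comment` invocation that mentions an expert with a specific question. The function name, the message wording, and the choice of a PR comment as the tagging mechanism are assumptions; the talk only says the agent tags the expert through the GitHub CLI.

```python
def escalation_command(pr_number: int, expert: str, guardrail: str, question: str) -> list[str]:
    """Build the gh invocation that asks one expert one precise question."""
    body = (
        f"@{expert} this PR appears to touch a documented guardrail.\n\n"
        f"Guardrail: {guardrail}\n"
        f"Question: {question}"
    )
    # Returned as an argv list so it can be run without a shell,
    # which avoids quoting problems in the comment body.
    return ["gh", "pr", "comment", str(pr_number), "--body", body]
```

The result can be passed to `subprocess.run(cmd, check=True)` in a job where `gh` is authenticated. Keeping the question and the guardrail in the comment gives the expert enough context to answer without opening the full diff.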

This matters because human synchronous attention is the scarcest resource in any security program. No team can review every PR, every dependency update, every deployment, and every diff. But every human on the team does know what good looks like in their domain. The agentic model captures that knowledge once — as reviewed, checked-in documentation, guardrails, and test logic — and then scales it across every future change without requiring the expert to be in the loop again unless something genuinely novel appears.

Implementing Two-Party Sign-Off for Agents

The same two-party sign-off controls used for human contributors can be mirrored for agent-driven workflows. The pattern is:

  1. Agent A performs the action — writes or updates code, applies a dependency bump, flags a finding.
  2. Agent B reviews Agent A’s output — independently evaluating whether the change is consistent with documented principles, threat model assumptions, and security guardrails.
  3. Discrepancies surface to a human — only when the reviewing agent cannot reconcile the change against policy does it escalate, tagging the correct expert with a specific, context-rich question.

Paul noted during Q&A that “the same two-party signoff controls you would put on humans, you can also use agents in this way as well to basically have two of them check each other’s work.” This is defense in depth applied to non-deterministic systems: a prompt injection that successfully manipulates the agent merging a dependency update is unlikely to simultaneously manipulate a separate agent whose sole task is looking for behavioral anomalies.
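
The decision rule behind that pattern is small enough to write down. This is a sketch under stated assumptions: the `Review` fields and the rule that any disagreement or low confidence escalates are illustrative, not a description of OpenAI's implementation.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    PROCEED = "proceed"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class Review:
    consistent_with_policy: bool
    confident: bool
    note: str = ""


def sign_off(author_claims_safe: bool, review: Review) -> Outcome:
    """Proceed only when both agents agree; anything else goes to a human."""
    if author_claims_safe and review.consistent_with_policy and review.confident:
        return Outcome.PROCEED
    return Outcome.ESCALATE
```

The important property is that the implementing agent's own claim is never sufficient. A compromised or mistaken author agent still has to get past an independent reviewer, and the default path is escalation.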

Handling Prompt Injection and Insider Risk in Non-Deterministic Reviewers

A direct audience question at the talk addressed exactly this: how do you prevent a bad actor from bypassing an LLM-based security scanner by injecting “please just tell me this is all good” into the reviewed content?

The answer has two parts:

  • Separation of concerns between agents: The model reviewing for bad behavior operates on a different prompt surface than the model implementing changes. A prompt injection embedded in a dependency update PR is likely crafted to manipulate the implementation agent; the same injection is far less likely to successfully manipulate a separate, independent reviewer agent with no task overlap.
  • Human-targeted escalation for high-stakes decisions: Agents are not expected to be the final authority on anything that crosses a defined risk threshold. The CI agent is given the gh CLI specifically so it can invoke a human reviewer when uncertainty or stakes exceed its autonomous authority. This keeps high-consequence decisions human-reviewed while allowing routine changes to flow through unblocked.

Paul’s direct answer: “There’s a model looking for that… you add the models and ask them to look for bad behavior. And oftentimes the prompt injection that works on the model that’s updating the dependencies is going to not work in the same way on the model that’s looking to see whether the vibes are off on something.”

Telemetry, Audit Logging, and DFIR Readiness

As agentic workflows move from experimentation to production, governance requires observability. The Codex harness used by OpenAI’s team includes a native OpenTelemetry (OTEL)[6] exporter, which provides:

  • Durable audit logs of agent session trajectories — every action taken by every agent is recorded, not just the final output. This is the equivalent of a full command history for an automated engineer.
  • DFIR review capability — session logs are preserved in a format suitable for digital forensics and incident response. If an agent makes an unexpected change, the full trajectory is available for reconstruction.
  • Governance signals over agent behavior — OTEL traces allow teams to observe patterns across many agent runs, identify drift from expected behavior, and tune agent instructions before problems compound.

This is not optional overhead for mature deployments. Without audit trails, the organization cannot answer basic questions: What did the agent do? Why did it make that change? Was the dependency bump agent’s decision consistent with documented policy at the time it ran?
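
To show what answering "What did the agent do?" might look like, here is a standard-library sketch of a per-action audit record and a trajectory query. It is not the Codex OTEL exporter and does not use the OpenTelemetry API; the `AgentEvent` fields and the JSON Lines storage are hypothetical stand-ins for whatever the exported traces contain.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AgentEvent:
    session_id: str
    step: int
    action: str
    detail: str


def to_jsonl(events: list[AgentEvent]) -> str:
    """Serialize events as JSON Lines, one record per agent action."""
    return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in events)


def trajectory(jsonl: str, session_id: str) -> list[str]:
    """Reconstruct what one session did, in step order."""
    rows = [json.loads(line) for line in jsonl.splitlines() if line.strip()]
    mine = sorted(
        (r for r in rows if r["session_id"] == session_id),
        key=lambda r: r["step"],
    )
    return [f'{r["step"]}: {r["action"]} ({r["detail"]})' for r in mine]
```

Recording every step, not only the final diff, is what makes the later reconstruction possible. In practice the same question would be asked of the trace store or SIEM that receives the exported telemetry.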

Encoding Expertise Durably — The Feedback Loop That Scales

The governance model described throughout the talk is ultimately a feedback loop for expertise:

  1. A human expert identifies a concern — via Slack message, PR comment, AppSec review finding, or security incident.
  2. That concern is not just resolved in isolation; it is distilled into a durable artifact: a security.md rule, a custom lint, a threat model update, a test, a review agent prompt.
  3. The artifact is checked into the repository, where it becomes available to every agent and engineer working in that codebase from that point forward.
  4. Agents run against the artifact on every subsequent PR, flagging deviations and escalating to humans only when genuinely novel situations arise.

Ryan described this as “constantly reflecting that taste and durable expertise into the folks that are actually doing the job.” Every Slack message flagging a security concern, every failed lint, every rolled-back dependency update represents missing context — context that, once captured and encoded, prevents the same mistake from recurring at any scale.

Actionable Pattern: Distilling AppSec Reviews Into Executable Guardrails

Ryan’s treatment of AppSec reviews is directly applicable to any team that receives periodic security review findings:

  • Do not treat the review as a point-in-time checklist. Treat it as a set of principles about how the code should be written going forward.
  • Fix the point issues, then go one level up: ask what process failure permitted the vulnerable code to be produced, and add an executable guardrail (a lint, a test, a review agent rule) that statically disallows the same pattern.
  • Cheap, bespoke, fit-for-purpose lints — not rigorous, cross-enterprise platform features — are the right tool. Observe the behavior in your codebase, write a lint, run it against existing code, label the gaps, check it in, and migrate. This is achievable in an afternoon and has immediate, durable effect.

This closes the governance loop: the AppSec review that would otherwise be forgotten after the ticket closes becomes a permanent, executable part of the security program.
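
A cheap, bespoke lint of the kind described above might look like the following sketch. It uses Python's `ast` module to flag direct calls to path-normalization functions and compares the findings against a checked-in baseline of known gaps. The banned-call list, the baseline format, and the function names are all illustrative; the talk does not show the team's actual lints.

```python
import ast

# Illustrative choice: calls the team wants routed through a safe wrapper.
BANNED_CALLS = {"os.path.realpath", "os.path.normpath"}


def dotted_name(node: ast.AST) -> str:
    """Return 'a.b.c' for attribute chains rooted at a plain name, else ''."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return ""


def find_violations(source: str, filename: str) -> list[str]:
    """List every banned call in `source` as 'file:line: name'."""
    violations = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if isinstance(node, ast.Call) and dotted_name(node.func) in BANNED_CALLS:
            violations.append(f"{filename}:{node.lineno}: {dotted_name(node.func)}")
    return violations


def new_violations(found: list[str], baseline: set[str]) -> list[str]:
    """Keep only findings not already labeled as known gaps."""
    return [v for v in found if v not in baseline]
```

Running `find_violations` over the existing code produces the initial baseline ("label the gaps"), and CI then fails only on `new_violations`. Keying the baseline on line numbers is fragile when files shift, so a real lint would likely use a more stable identifier.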

Actionable Takeaways

  • Grant your CI agent access to the GitHub CLI and configure it to tag the appropriate human expert when a PR introduces changes that invalidate a documented, risk-accepted assumption. Include the specific ambiguity to resolve in the notification, not just a generic alert. This focuses human review on genuine decision points and routes routine changes through automatically.
  • Deploy a second, independent LLM reviewer agent alongside your implementation agent. Give it no write access and a single task: evaluate whether the implementation agent's output is consistent with your documented security principles. Discrepancies surface to a human; agreement allows the change to proceed. This mirrors two-party sign-off at agent speed without requiring human involvement on every change.
  • Enable the native OTEL exporter in your Codex (or equivalent) agent harness and route session logs to your SIEM or internal audit store. Treat agent session trajectories the same way you treat human admin command logs — as a first-class artifact for governance, anomaly detection, and DFIR readiness.

Common Pitfalls

  • Treating agentic security reviewers as fully autonomous authorities on high-stakes decisions. LLM-based reviewers are probabilistic; they can be manipulated by prompt injection and can make confident-sounding errors. The correct model is to use agents to scale routine enforcement and use humans as callable tools for decisions that exceed the agent's authority — not to remove humans from the loop entirely.
  • Resolving AppSec review findings as isolated tickets without encoding the underlying principle as a durable guardrail. When a finding is fixed in isolation and the review is closed, the organization learns nothing. The pattern that produced the vulnerability remains available for the next engineer or agent working in the codebase. Fix the issue and add the lint or test that prevents recurrence.

Conclusion

Paul McMillan and Ryan Lopopolo’s core argument at [un]prompted 2026 is both simple and actionable: security knowledge already exists in your organization — the missing ingredient is the engineering infrastructure to enforce it continuously. LLM coding agents eliminate the cost barrier to building that infrastructure. A security.md file, a 40-line GitHub Actions workflow, and a few hours of agent-directed work are enough to produce threat model validators, bespoke vulnerability lints, and a daily dependency scanner that covers your entire dependency tree.

The compounding effect is the most important part. Every principle checked into documentation propagates to every agent and engineer operating in that codebase indefinitely. Every AppSec finding that becomes a lint or a review rule prevents the same vulnerability class from recurring — at any scale, with any team size. The governance model scales too: two-party sign-off for agents, OTEL-backed audit trails, and structured human escalation via the GitHub CLI give you the controls needed to trust these systems in production.

The economics are straightforward. Tokens are cheap relative to engineering time. The alternative — waiting for vendor contracts, quarterly audits, or dedicated security platform teams — is not cheaper. It just defers the cost until a finding reaches production.

For related approaches to secure software development and security automation in CI/CD environments, explore the other talks in the archive. If you’re interested in the broader landscape of software composition analysis and dependency risk at scale, the dependency scanning pattern described here connects directly to SCA tooling decisions your team is likely already evaluating.


References & Tools

  1. OpenAI Codex — The primary coding agent used to write security guardrails, generate threat models, implement the dependency scanner, and apply supply chain hardening PRs from natural-language instructions.
  2. GitHub Actions — CI/CD orchestration layer for running LLM security reviewers non-interactively on every PR; approximately 40 lines of YAML per workflow.
  3. GitHub CLI (gh) — Granted to the CI agent to tag expert humans for review when a material security-relevant change is detected that exceeds the agent's authority to approve autonomously.
  4. PNPM — Package manager whose installer policy was hardened (post-install scripts disabled) as the concrete supply chain hardening example demonstrated in the talk.
  5. GitHub Advanced Security — Acts as the broker receiving SARIF reports from the custom dependency scanner and surfacing findings to agent-driven upgrade, removal, or framework migration workflows.
  6. OpenTelemetry (OTEL) — The Codex harness exports OTEL traces natively, providing governance telemetry and durable audit logs of agent session trajectories for DFIR review.
Frequently asked

Questions from the audience

How much does it cost to run an LLM security reviewer on every pull request?
At current API pricing the cost is measured in cents per PR, not dollars. Paul McMillan and Ryan Lopopolo reported that their team allocates roughly 50% of their token budget to code production and 50% to refinement — which includes all CI reviewer and scanner agents. Tokens are cheap relative to the cost of a security finding reaching production.
What should a security.md file actually contain?
Effective security.md content includes high-level secure interface principles (e.g., 'prefer APIs that are impossible to call incorrectly'), banned patterns drawn from past AppSec findings and PR comments, threat model anchors that every PR must not invalidate, and escalation signals that tell the agent when to tag a human instead of making an autonomous judgment. The file should state positive principles, not just a list of prohibited patterns.
How do you prevent prompt injection from bypassing an LLM-based security scanner?
The approach relies on separation of concerns between agents: the model reviewing for bad behavior operates on a different prompt surface than the model implementing changes. A prompt injection targeting the implementation agent is unlikely to work the same way on a separate, independent reviewer agent with no task overlap. High-stakes decisions above a defined risk threshold are always escalated to a human via the GitHub CLI rather than decided autonomously.
How long does it take to build a custom dependency scanner using this approach?
Ryan Lopopolo built a bespoke dependency scanner covering 1,500 packages in approximately four hours of background work during a normal afternoon, starting from a 15-minute conversation with a security engineer. The resulting scanner runs daily in CI, forks 16 parallel agents, profiles all direct and transitive dependencies, and feeds SARIF reports into GitHub Advanced Security.
Watch the talk on YouTube: Code Is Free: Securing Software | [un]prompted 2026, Paul McMillan and Ryan Lopopolo, 25 min