
A machine learning model correctly classifies 99% of inputs — but an adversary has found the 1% that causes it to confidently misclassify, and 72–85% of such perturbation attacks evade conventional security controls entirely. AI security threat modeling exists precisely to surface these failure modes before attackers do, yet most security teams apply traditional frameworks that were never designed for probabilistic systems.
This post breaks down Dr. Nitish Uplavikar’s Project Guardrail[1] — an open-source framework synthesizing 429 AI/ML threats into a tiered questionnaire system cited by OWASP, NIST, and OECD — giving you a structured approach to risk-ranking any AI/ML deployment.
Key Takeaways
- You'll learn how adversarial perturbation attacks silently manipulate AI model outputs — and why 72–85% of such attacks go undetected by conventional security controls.
- You'll be able to identify why traditional security testing standards fall short for AI systems and what a structured, risk-ranked approach to AI security looks like in practice.
- You'll be able to apply the Guardrail framework's three-tier questionnaire (Baseline, Continuous Learning, User Data Interaction) to systematically surface unique security threats in any AI/ML application you're securing.
Adversarial Attacks on AI Systems: How Perturbation Exploits ML Models
AI security threat modeling begins with understanding what makes machine learning systems fundamentally different to attack. Unlike traditional software, where vulnerabilities arise from implementation bugs, AI/ML models can be exploited through their inputs alone — with no code modification required. This is the adversarial perturbation attack: the deliberate injection of carefully calculated, imperceptibly small noise into an input that causes the model to produce a wildly incorrect output.
The implications are severe. A human reviewing the perturbed input sees nothing unusual. The modification is visually indistinguishable from the original. Yet the model’s classification is completely wrong — and in safety-critical deployments, that means real-world harm.
Cyber-Physical Systems: When Misclassification Has Physical Consequences
Deep neural networks are now embedded in cyber-physical systems — autonomous vehicles, traffic analysis systems, and industrial controllers — where misclassification doesn’t just produce bad data, it causes physical events. The talk highlights how adversaries can add perturbation noise to road sign or pedestrian detection inputs to induce systematic misclassification in self-driving vehicles.
The scale of the problem is striking: 72 to 85% of these adversarial perturbation attacks go undetected. This isn’t a marginal edge case. The majority of attacks succeed silently, without triggering any existing monitoring or alerting. For security engineers used to working with systems where attack artifacts leave detectable traces, this detection gap is a fundamental shift in threat posture.
PoC: Adversarial Perturbation Attack on Autonomous Vehicle Perception Systems
Adversaries inject imperceptibly small noise into the visual inputs of deep neural networks powering autonomous vehicle perception — causing systematic misclassification of road signs, pedestrians, and obstacles while remaining undetectable to human observers, with 72–85% of such attacks going undetected by existing controls.
Proof of Concept Steps:
- Identify the target model — a deep neural network in a cyber-physical system classifying road signals and obstacles in real time.
- Establish the threat surface — the model ingests raw sensor data from the physical environment; perturbations can be introduced physically (adversarial patches on road signs) or digitally (intercepting sensor feeds).
- Craft the adversarial perturbation — compute a minimal noise vector that maximizes model classification error while remaining imperceptible (bounded by a small L-infinity norm).
- Apply to input — add the noise pixel-by-pixel. The result is visually indistinguishable from the original to human observers.
- Observe misclassification — the model confidently outputs an incorrect label (e.g., stop sign misidentified as speed limit sign; pedestrian not detected).
- Physical consequence propagates — the vehicle’s control system acts on the incorrect classification, causing incorrect braking, signal interpretation, or evasive action.
- Detection gap confirmed — monitoring systems designed to flag hardware failures or out-of-bounds values produce no alert; the input is within normal value ranges.
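The crafting step above can be sketched with the fast gradient sign method (FGSM), one well-known way to build an L-infinity-bounded perturbation. This is a minimal illustration under stated assumptions, not the method shown in the talk: the eight-feature linear classifier, its weights, the input, and the epsilon value are all hypothetical stand-ins for a real perception model.

```python
import math

def logit(weights, bias, x):
    """Raw score of a linear binary classifier; positive means class 1."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def predict(weights, bias, x):
    return 1 if logit(weights, bias, x) > 0 else 0

def fgsm_perturb(weights, bias, x, true_label, epsilon):
    """One fast-gradient-sign step with an L-infinity budget of epsilon."""
    # For logistic loss, dL/dx_i = (p - y) * w_i, where p = P(class 1 | x).
    p = 1.0 / (1.0 + math.exp(-logit(weights, bias, x)))
    grad = [(p - true_label) * w for w in weights]
    # Shift each feature by epsilon in the direction that raises the loss,
    # then clip so the result is still a valid input in [0, 1].
    signs = [(g > 0) - (g < 0) for g in grad]
    return [min(1.0, max(0.0, xi + epsilon * s)) for xi, s in zip(x, signs)]

# Hypothetical toy model and input (8 "pixels" scaled to [0, 1]).
WEIGHTS = [0.9, -0.7, 0.8, -0.6, 0.5, -0.9, 0.7, -0.8]
BIAS = 0.0
clean = [0.55, 0.45, 0.55, 0.45, 0.55, 0.45, 0.55, 0.45]

adversarial = fgsm_perturb(WEIGHTS, BIAS, clean, true_label=1, epsilon=0.05)
largest_shift = max(abs(a - c) for a, c in zip(adversarial, clean))

print(predict(WEIGHTS, BIAS, clean), predict(WEIGHTS, BIAS, adversarial), largest_shift)
```

Every feature stays inside the normal [0, 1] range and moves by at most epsilon, which is why a range-based or out-of-bounds monitor has nothing to flag even though the predicted class changes.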
PoC: Plane-to-Frog Misclassification via Adversarial Noise Injection
A supervised learning image classifier trained on clean data confidently misclassifies a plane image as a frog when adversarial noise is injected — with no access to model weights, no training data modification, and no software vulnerability required.
Proof of Concept Steps:
- Establish baseline — a supervised learning classifier is trained on a clean dataset and correctly classifies images (plane → “plane”).
- Select target image — take a legitimate, correctly classified image of a plane.
- Generate adversarial noise — compute a perturbation that maximizes loss for the correct class while keeping pixel shifts imperceptible. This is not random noise; it is optimized against the model’s decision boundary.
- Construct adversarial example — add the noise to the original image. A human observer still sees a plane.
- Submit to model — pass the perturbed image to the trained classifier.
- Observe misclassification — the model outputs “frog” with high confidence.
- Key insight — no modification to code, weights, or training data was needed. The attack executes through the model’s normal inference interface, making it viable against any model reachable via API.
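To make the "no access to weights" point concrete, here is a rough sketch of a query-only attack: a greedy coordinate search that reads nothing except the confidence scores returned by an inference call. It assumes the API exposes per-class confidences, which not every deployment does. The two-class toy model, the `query` function, and the numbers are hypothetical, and the toy model flips with far lower confidence than the example described in the talk.

```python
import math

CLASSES = ["plane", "frog"]

# Hidden parameters of a hypothetical toy model. The attack below never reads them.
_HIDDEN_WEIGHTS = {
    "plane": [0.6, -0.4, 0.5, -0.3, 0.4, -0.5],
    "frog": [-0.5, 0.6, -0.4, 0.5, -0.3, 0.4],
}

def query(x):
    """Black-box inference call: returns a class -> probability mapping."""
    logits = {c: sum(w * xi for w, xi in zip(_HIDDEN_WEIGHTS[c], x)) for c in CLASSES}
    top = max(logits.values())
    exps = {c: math.exp(v - top) for c, v in logits.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

def predict(x):
    probs = query(x)
    return max(probs, key=probs.get)

def black_box_attack(x, true_class, epsilon):
    """Greedy coordinate search; each feature moves at most epsilon from its original value."""
    adv = list(x)
    for i in range(len(x)):
        best_value, best_conf = adv[i], query(adv)[true_class]
        for candidate in (x[i] - epsilon, x[i] + epsilon):
            candidate = min(1.0, max(0.0, candidate))
            trial = adv[:i] + [candidate] + adv[i + 1:]
            conf = query(trial)[true_class]
            if conf < best_conf:  # keep whichever shift hurts the true class most
                best_value, best_conf = candidate, conf
        adv[i] = best_value
    return adv

clean = [0.55, 0.45, 0.55, 0.45, 0.55, 0.45]
adversarial = black_box_attack(clean, true_class="plane", epsilon=0.06)
print(predict(clean), "->", predict(adversarial))
```

The search issues three queries per feature, so the cost scales with input size. Practical black-box attacks use far more query-efficient strategies, but the interface they rely on is the same.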
Why This Demands a New Security Standard
Traditional application security assumes the code is the attack surface. Patch the bug, mitigate the CVE. Adversarial ML attacks invalidate this assumption entirely. The code is correct. The model is functioning as trained. The attack vector is the model’s learned decision boundary — which cannot be patched in the traditional sense.
Actionable Takeaways
- When threat modeling any AI/ML application that processes external inputs (images, text, sensor data), explicitly add adversarial perturbation to your attack tree. Document what share of the model's inputs comes from untrusted sources and what the consequence of a misclassification would be.
- Do not rely on human review as a detection control for adversarial inputs — the defining characteristic of these attacks is that the perturbation is imperceptible. Detection must be model-level (e.g., input preprocessing defenses, adversarial training, or separate anomaly detection on input distributions).
- For AI systems embedded in cyber-physical contexts (robotics, autonomous vehicles, industrial control), classify adversarial misclassification as a safety risk, not merely a data quality risk. This changes the risk tier and the verification controls required.
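As one illustration of a model-level detection control, the sketch below applies feature squeezing: it reduces the input's precision and flags the input if the model's prediction moves sharply as a result. It is a simplified example under stated assumptions. The toy model, the number of quantization levels, and the threshold are hypothetical, and an attacker who knows the defense can adapt to it.

```python
import math

def predict_proba(x):
    """Hypothetical toy model: probability of class 1 from a fixed linear scorer."""
    weights = [4.0, -4.0, 4.0, -4.0]
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def squeeze(x, levels=5):
    """Snap each feature in [0, 1] to one of `levels` evenly spaced values."""
    step = 1.0 / (levels - 1)
    return [round(xi / step) * step for xi in x]

def looks_adversarial(x, threshold=0.2):
    """Flag inputs whose prediction shifts sharply once fine-grained noise is squeezed out."""
    return abs(predict_proba(x) - predict_proba(squeeze(x))) > threshold

clean = [0.75, 0.5, 0.5, 0.5]
# The same input after a small, targeted shift of 0.1 per feature.
perturbed = [0.65, 0.6, 0.4, 0.6]

print(looks_adversarial(clean), looks_adversarial(perturbed))
```

The check runs inside the inference path and needs no human in the loop, which is the property the takeaway above asks for. In practice it would sit alongside adversarial training and input-distribution monitoring rather than replace them.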
Common Pitfalls
- Assuming that if an input "looks normal" to a human reviewer it is safe to process. Adversarial perturbations are specifically engineered to be invisible to human inspection while maximally disrupting model behavior — human review is not a compensating control.
- Treating AI security as a subset of traditional software security. Applying only standard SAST, DAST, or pentesting techniques to an AI/ML system will miss the entire adversarial input attack surface, producing a false sense of coverage.
Why AI Security Testing Standards Are Insufficient
One of the most important — and underappreciated — realities for security engineers working with AI systems is this: passing a security standard does not mean an AI application is secure. The testing frameworks and standards that exist for AI are structurally deficient in three distinct ways.
Three Structural Failures of Current AI Security Standards
1. Standards are not comprehensive. A security standard designed for one class of AI application may not transfer to another. The threat surface of a natural language processing model differs significantly from that of a computer vision model or a reinforcement learning agent. Standards built around one application type create false assurance when applied to another. There is no universal AI security baseline that covers all model architectures, training paradigms, and deployment contexts.
2. Tests are not exhaustive. For traditional software, exhaustive testing is the target — covering all code paths and input conditions. For AI systems, exhaustive testing of the input space is mathematically impossible. The input space of a neural network is continuous and high-dimensional. AI security tests are always sampling a finite subset of an infinite attack surface, and any standard built around test suites inherits this fundamental limitation.
3. Standards lack constituent testing. In traditional systems, you can test individual components in isolation: unit test a function, fuzz a parser. AI systems resist this decomposition. The emergent behavior of a neural network arises from the interaction of millions or billions of weights; you cannot meaningfully test a model sub-component in isolation. Most AI security testing happens at the system or abstract level, meaning component-level vulnerabilities can go undetected even when system-level tests pass.
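A quick back-of-the-envelope calculation shows why exhaustive testing is out of reach even when inputs are discretized. The sketch below counts the decimal digits in the number of distinct 8-bit images of a given size. The 32×32 RGB size is an illustrative assumption, not a figure from the talk.

```python
import math

def input_space_digits(height, width, channels, levels=256):
    """Number of decimal digits in levels ** (height * width * channels)."""
    n_values = height * width * channels
    return math.floor(n_values * math.log10(levels)) + 1

# A small 32x32 RGB image has 3,072 values, each with 256 possible settings.
print(input_space_digits(32, 32, 3))
```

The count of distinct small images is a number with thousands of digits, so any finite test suite covers a vanishing fraction of the space.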
The Asymmetry That Makes This Dangerous
The consequence: the bar to defend is high, but the bar to attack is low. LLMs have made attack scaling dramatically cheaper — adversaries generate, vary, and deploy attacks with minimal effort. Meanwhile, defenders work against a framework that is acknowledged to be incomplete, non-exhaustive, and abstract. This asymmetry should directly inform how you prioritize AI-specific controls over compliance checkboxes.
Actionable Takeaways
- When evaluating an AI vendor's security posture, do not treat standard compliance (SOC 2, ISO 27001, or any AI-specific standard) as evidence that the AI model itself is secure. Ask specifically what adversarial testing has been conducted against the model, not just the surrounding infrastructure.
- Design AI security programs with explicit acknowledgment that no test suite can be exhaustive. Build in continuous monitoring of model behavior in production — behavioral drift, confidence score anomalies, and unusual output distributions are runtime signals that complement static test coverage.
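A confidence-score monitor of the kind described above might look like the following sketch. It compares the mean confidence of a recent window against a baseline window and raises an alert when the gap exceeds a z-score threshold. The window contents, the threshold, and the function name are hypothetical.

```python
from statistics import mean, stdev

def confidence_alert(baseline, recent, z_threshold=3.0):
    """True when the recent mean confidence deviates from baseline by more than z_threshold standard errors."""
    mu, sigma = mean(baseline), stdev(baseline)
    standard_error = sigma / len(recent) ** 0.5
    if standard_error == 0:
        return mean(recent) != mu
    return abs(mean(recent) - mu) / standard_error > z_threshold

# Hypothetical top-class confidences logged during a validation period.
baseline = [0.91, 0.93, 0.90, 0.94, 0.92, 0.89, 0.95, 0.92]
normal_window = [0.92, 0.90, 0.93, 0.91]
suspicious_window = [0.71, 0.65, 0.70, 0.68]

print(confidence_alert(baseline, normal_window), confidence_alert(baseline, suspicious_window))
```

A sudden drop in average confidence is only one signal. Shifts in the predicted class distribution or in input statistics can be tracked with the same windowed comparison.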
Common Pitfalls
- Treating AI security compliance as equivalent to AI security assurance. Compliance demonstrates that known checklist items were addressed; it says nothing about unknown attack vectors, model-specific vulnerabilities, or the adversarial robustness of the model itself.
Offensive vs Defensive ML Security: Two Problem Categories Every Engineer Must Know
When building an application security threat model for any system that intersects with AI, you need to reason across two distinct problem categories. Conflating them leads to threat models that are either over-scoped or miss critical attack vectors entirely.
Category 1: Offensive ML — using machine learning as a weapon against other systems. Category 2: Attacking ML — targeting AI/ML models themselves as the victim.
Category 1: Offensive ML — AI as the Attack Tool
In the offensive ML category, the AI system is in the adversary’s hands. Two concrete examples from the talk:
Faster password guessing: ML models trained on leaked credential datasets generate statistically informed password guesses that outperform traditional dictionary attacks. The model learns the probability distribution of real passwords — increasing attack throughput without proportional effort.
Sandbox environment detection: Security vendors use sandboxes to analyze malware in isolation. Adversaries train ML models to fingerprint sandbox environments by detecting hardware signatures, timing characteristics, and behavioral artifacts that distinguish sandboxes from genuine user machines. When the malware detects a sandbox, it suppresses its malicious behavior — passing dynamic analysis clean while activating its payload only on real endpoints.
PoC: ML-Powered Sandbox Evasion: Malware Environment Detection
Adversaries embed ML models into malware to classify the execution environment as sandbox or real endpoint — suppressing malicious behavior in sandboxes and activating payloads only on genuine targets, causing dynamic analysis to return false-negative results.
Proof of Concept Steps:
- Identify the control being evaded — sandbox-based dynamic analysis, where behavior (network calls, file ops, registry writes) is observed and flagged if malicious.
- Build the fingerprinting model — train an ML classifier on environmental features that differ between sandboxes and real machines: CPU/RAM footprint, timing overhead from emulation, presence of sandbox-specific process names and registry keys, absence of user activity signals (mouse movement, browser history).
- Embed in malware — compress and obfuscate the trained model; bundle it as a pre-execution check inside the malware payload.
- Runtime classification — at execution, the malware collects environmental features and runs them through the embedded model. Output: sandbox or real environment.
- Conditional activation — if sandbox detected: execute benign-looking operations only; if real environment: activate malicious payload (exfiltration, encryption, lateral movement).
- Result — sandbox analysis returns clean; the malware is cleared and may execute on production endpoints.
Category 2: Attacking ML — The AI Model as the Target
Here, the AI/ML model itself is the victim. The canonical example is the ProofPoint CVE (CVE-2019-20634), one of the first CVEs issued specifically for an ML application.
PoC: ProofPoint CVE — Model Extraction Attack Against AI Phishing Detection
Adversaries successfully extracted a functional approximation of ProofPoint’s proprietary AI phishing detection model through black-box query access, then used the extracted model to craft adversarial emails that bypassed the original detection — demonstrating that AI security controls are attackable via their normal operational interface.
Proof of Concept Steps:
- Identify the target — ProofPoint’s AI-based email security solution, classifying emails as phishing, spam, or benign. Proprietary model; no direct access to weights or training data.
- Establish black-box access — the adversary only needs the operational interface: the ability to submit emails and observe classification outputs.
- Probe systematically — submit a large volume of varied email inputs; record input/output pairs. Vary sender domains, content structure, link patterns, and keywords to map decision boundaries.
- Build a surrogate model (model extraction) — use the collected input/output pairs as a training dataset to train a copycat model that approximates the original’s decision logic.
- Craft adversarial bypass inputs — with the surrogate under full control and gradients accessible, generate phishing emails specifically engineered to be classified as benign by both the surrogate and the original.
- Bypass the production control — adversarially crafted phishing emails pass the live ProofPoint filter as benign.
- CVE significance — formalized as a CVE, establishing that AI models used as security controls are CVE-worthy attack surfaces and that model extraction via black-box API is a production-grade technique.
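The probe-and-replicate steps above can be illustrated with a toy surrogate-training loop. This is not ProofPoint's model or the actual attack: the target is a hypothetical two-feature linear rule standing in for a proprietary classifier, and the surrogate is a plain perceptron trained only on recorded query results.

```python
def target_classifier(x):
    """Stand-in for the proprietary model. The attacker sees only the returned label."""
    return 1 if 2.0 * x[0] + 1.0 * x[1] - 2.6 > 0 else 0

def train_surrogate(queries, labels, max_epochs=20000):
    """Fit a perceptron to the recorded (input, label) pairs."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, label in zip(queries, labels):
            y = 1 if label == 1 else -1
            if y * (w[0] * x[0] + w[1] * x[1] + b) <= 0:
                w[0] += y * x[0]
                w[1] += y * x[1]
                b += y
                mistakes += 1
        if mistakes == 0:  # surrogate reproduces every recorded label
            break
    return w, b

def surrogate_predict(model, x):
    w, b = model
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

def agreement(model, points):
    """Fraction of points on which surrogate and target return the same label."""
    hits = sum(surrogate_predict(model, p) == target_classifier(p) for p in points)
    return hits / len(points)

# Probe: 81 systematic queries on a grid, recording only inputs and output labels.
grid = [i * 0.25 for i in range(9)]
queries = [(a, c) for a in grid for c in grid]
labels = [target_classifier(q) for q in queries]
surrogate = train_surrogate(queries, labels)

# Evaluate on inputs the surrogate never saw during training.
fresh = [(a + 0.125, c + 0.125) for a in grid[:-1] for c in grid[:-1]]
print(agreement(surrogate, queries), agreement(surrogate, fresh))
```

Once the surrogate tracks the target closely, the attacker can search for bypass inputs offline, against a model they fully control, and only send the final candidates to the production system. That is why query volume and probing patterns are worth monitoring at the API layer.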
Scoping Implications for Security Engineers
- Defending an AI application: threat model must include Category 2 — adversarial inputs, model extraction, data poisoning, evasion.
- Defending non-AI applications where adversaries use AI: threat model must account for Category 1 — AI-enhanced phishing, AI-powered credential attacks, sandbox evasion.
- Both may apply simultaneously in complex environments.
Actionable Takeaways
- Explicitly label each AI/ML security concern in your threat model as either "Offensive ML" (adversary using AI as a weapon) or "Attacking ML" (adversary targeting your AI model). This prevents scope confusion and ensures defenses are matched to the correct threat category.
- For any AI system used as a security control (spam filters, fraud detection, anomaly detection, UEBA), add model extraction and evasion to the threat model. An AI system used to enforce security policy is an especially high-value target for Category 2 attacks — successfully evading it directly achieves the adversary's goal.
- Reassess sandbox-based malware analysis as a standalone control. ML-powered sandbox evasion is a demonstrated technique; supplement dynamic analysis with behavioral controls that are harder to fingerprint (network-level telemetry, long-duration observation, hardware-level monitoring).
Common Pitfalls
- Scoping an AI security assessment only around the infrastructure hosting the model (network controls, authentication, encryption at rest) while ignoring the model's behavior as an attack surface. Infrastructure security and ML model security are complementary but non-interchangeable — a model can be exploited through normal API usage with no infrastructure breach.
- Assuming that because a model is accessed only via an API (not directly downloaded), model extraction attacks are not feasible. Model extraction via black-box query access is well-documented and was the mechanism behind the ProofPoint CVE — API access alone is sufficient.
A Risk-Based Framework for Securing AI Applications End-to-End
Given the unique challenges of AI security (incomplete standards, the impossibility of exhaustive testing, and a rapidly expanding threat landscape), organizations need a structured, repeatable process. The talk outlines a four-phase framework for systematically securing AI/ML applications.
Phase 1: Identify — Inventory Your AI Applications
You cannot secure what you haven’t identified. The Identify phase requires enumerating every application in your environment that uses AI or ML components — including third-party integrations and vendor-supplied AI features that may not be obviously labeled as AI. This inventory is the prerequisite for everything that follows.
Phase 2: Risk Rank — Prioritize by Consequence
Not all AI applications carry equal risk. Apply triage logic: evaluate each application against its potential impact if compromised and prioritize accordingly. A practical heuristic from the talk: applications that process PII receive higher priority than those handling lower-sensitivity data. Risk ranking prevents the failure mode of applying uniform security effort across all AI applications regardless of actual exposure.
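A first-pass triage along these lines can be as simple as the sketch below. The record fields, the application names, and the score weights are hypothetical. The only element taken from the talk is that PII processing raises priority.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AIApplication:
    name: str
    processes_pii: bool
    takes_untrusted_input: bool

def risk_score(app):
    """Hypothetical additive score; PII handling is weighted highest."""
    return 2 * int(app.processes_pii) + int(app.takes_untrusted_input)

def risk_rank(inventory):
    """Highest-risk applications first; ties keep their inventory order."""
    return sorted(inventory, key=risk_score, reverse=True)

inventory = [
    AIApplication("internal-doc-search", processes_pii=False, takes_untrusted_input=False),
    AIApplication("support-chatbot", processes_pii=True, takes_untrusted_input=True),
    AIApplication("image-tagger", processes_pii=False, takes_untrusted_input=True),
]

for app in risk_rank(inventory):
    print(risk_score(app), app.name)
```

A real program would use more criteria and an agreed scoring scheme. The point is that the ranking is explicit and reproducible rather than ad hoc.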
Phase 3: Find Threats — AI-Specific Threat Modeling
This is where standard threat modeling methodologies (STRIDE, PASTA, attack trees) reach their limits — they were not designed to surface AI-specific threats. Phase 3 requires threat modeling that accounts for the ML attack surface: data pipeline, training process, model artifact, inference inputs, and system infrastructure. Project Guardrail[1] operates here.
Phase 4: Verify — Active Testing and Red Teaming
Translate the threat model output into active verification using:
- Adversarial testing — deliberately probing the model with crafted inputs designed to induce misclassification or evasion
- Static code analysis — reviewing ML codebases (training scripts, data preprocessing, inference pipelines) for implementation vulnerabilities
- Red teaming — structured adversarial exercises targeting the AI system’s behavior, not just surrounding infrastructure
Build vs Buy: Security Implications
The framework also surfaces a strategically important decision. Building your own AI application means full visibility and control over security architecture — and full accountability. Using a third-party vendor shifts risk management responsibility to the vendor but reduces visibility. Mitigation requires vendor oversight processes, security requirements in contracts, and mechanisms to validate vendor AI security practices. Neither approach is inherently superior, but the security implications of each path must be explicitly considered.
Actionable Takeaways
- Run an AI application inventory before starting any AI security program. Include third-party SaaS products with embedded AI features (e.g., AI-powered email filtering, AI-based fraud detection in payment platforms). Shadow AI adoption is common and creates untracked risk.
- Apply PII processing as a first-pass risk ranking criterion for your AI application inventory. Any AI model that ingests, trains on, or produces PII outputs should be in the top tier for threat modeling and adversarial testing.
- For vendor-supplied AI applications, formalize the security review as part of the vendor onboarding process. Define minimum requirements for AI-specific security controls (model validation, adversarial robustness, data handling) in vendor contracts rather than relying on generic security questionnaires.
Common Pitfalls
- Skipping the Identify phase and jumping directly to threat modeling a subset of known AI applications. Organizations routinely underestimate the extent of AI adoption in their environment — particularly through third-party integrations and shadow AI use by business units. An incomplete inventory produces an incomplete threat model.
- Treating traditional security tools (SAST, DAST, penetration testing) as sufficient for the Verify phase of an AI application. They do not cover adversarial ML testing of the model's behavior — a dedicated AI red teaming or adversarial testing capability is required.
AI Threat Modeling with Project Guardrail: Methodology and Questionnaire System
Project Guardrail[1] is an open-source AI/ML security threat modeling framework developed at Comcast. Its core function is to systematically surface security threats unique to AI applications through a structured questionnaire system — giving security architects a reproducible, evidence-based tool for the Find Threats phase of the AI security lifecycle. It is referenced by the OWASP Top 10 for LLM Applications[2], the NIST AI Risk Management Framework Playbook[3], and the OECD Catalog of Tools and Metrics for Trustworthy AI[4].
How the Threat Library Was Built
Guardrail’s credibility rests on its methodology — a systematic literature review, not a first-principles list.
Step 1 — Literature Review: Surveyed digital libraries, academic sources, and reports from credible organizations for existing AI/ML threat catalogs.
Step 2 — Inclusion Criteria Filtering: Excluded libraries that did not address privacy and security content, and excluded those covering threats at a national or societal level rather than the application level. This yielded 14 curated source libraries.
Step 3 — Unified Threat Extraction: Extracted all unique threats across the 14 sources, producing an initial pool of 429 threats. Duplicates removed; unique threats retained.
Step 4 — Threat-to-Question Conversion: Each unique threat was converted into a questionnaire item a security architect can pose during a threat modeling engagement — transforming a research artifact into a practitioner tool.
Step 5 — Expert Validation: 10 expert interviews conducted to validate questions for reliability and appropriateness, then refined for clarity.
Step 6 — Categorization: Questions organized by application type and by where within the application the threat manifests (data, model, artifact/output, system infrastructure).
The final questionnaire contains approximately 47 questions — a distillation of 429 raw threats through systematic filtering, conversion, and expert review.
The Three-Tier Questionnaire Structure
Guardrail structures its questionnaire into three progressive, additive tiers.
Tier 1: Baseline (Required for all AI/ML applications)
Applies to any AI/ML system regardless of architecture or deployment context. Questions are subdivided by threat location:
- Data modules — training data, pipelines, preprocessing (data poisoning, supply chain attacks on datasets)
- Model — architecture, training process, learned weights (backdoor injection, model inversion)
- Artifact/Output — inference output layer (adversarial evasion, output manipulation)
- System Infrastructure — hosting environment and operational infrastructure
Tier 2: Continuous Learning (Applied if the model updates incrementally)
If the application uses online learning, updating weights based on production data, additional questions apply. Continuous learning systems have a materially larger attack surface: an adversary can influence production model behavior by manipulating the data the model learns from in real time.
Tier 3: User Data Interaction (Applied if the application processes user data or PII)
If the application processes PII or user-generated data, a third tier applies. The speaker identified this tier as producing the most vulnerability-revealing findings, specifically the question chain starting with “What PII are you processing?”, which opens structured follow-on questions about how data flows through the model, whether it influences training, and what controls govern its use.
How to Apply Guardrail in Practice
- Identify the application type — Does it use continuous learning? Does it process user data?
- Select applicable tiers — always start with Baseline; add Continuous Learning and/or User Data Interaction as appropriate.
- Work through the questionnaire — answer each question for the application under review.
- Map answers to threats — gaps or concerning design patterns map back to the underlying threat library.
- Feed findings into the Verify phase — questionnaire findings become test cases for adversarial testing and red teaming.
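The tier-selection step reduces to two yes/no questions, which the sketch below encodes. The function name and parameters are illustrative and are not part of the Guardrail repository.

```python
def select_tiers(continuous_learning, processes_user_data):
    """Return the questionnaire tiers that apply; Baseline is always included."""
    tiers = ["Baseline"]
    if continuous_learning:
        tiers.append("Continuous Learning")
    if processes_user_data:
        tiers.append("User Data Interaction")
    return tiers

# Example: a static model that ingests customer records.
print(select_tiers(continuous_learning=False, processes_user_data=True))
```

Because the tiers are additive, an application that both learns online and handles user data gets all three sets of questions.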
Current Status and Roadmap
At the time of the talk, Guardrail covered the manual questionnaire component. The planned next phase is adversarial security testing automation: systematic cataloging of AI adversarial threats, surveying available testing tools, identifying related CVEs, and building automated testing capabilities. The project has been presented at the Grace Hopper Conference and LASCON, and the underlying research has been peer-reviewed.
Actionable Takeaways
- Use Guardrail's three-tier structure as the organizing framework for your next AI/ML threat modeling engagement. Start with the Baseline questions for every AI system. Layer in Continuous Learning questions if the model updates in production. Add User Data Interaction questions if PII is involved. This tiered approach ensures proportionate coverage without over-scoping.
- Prioritize the User Data Interaction questionnaire tier for any AI application that ingests user data. The speaker identified this tier as producing the most vulnerability-revealing findings — specifically the chain of questions around what PII is processed, how it flows through the model, and whether it influences training.
- Fork or reference the Guardrail GitHub repository as a living input to your AI security program. Its citations in OWASP, NIST, and OECD documentation suggest that updates to the Guardrail threat library are likely to track the evolving AI threat landscape.
Common Pitfalls
- Applying only the Baseline tier to a continuous learning application. Online learning systems have a fundamentally different attack surface — real-time data poisoning is possible in ways it is not for static models. Skipping Tier 2 produces a systematically incomplete threat model.
- Using Guardrail's questionnaire output as the final deliverable rather than as input to adversarial testing. The questionnaire identifies threat exposure; it does not verify exploitability. Threat findings must be validated through active testing to distinguish theoretical risks from confirmed vulnerabilities.
Conclusion
The threat landscape for AI/ML systems is fundamentally different from traditional software security — and the gap between attacker capability and defender readiness is widening. Adversarial perturbation attacks succeed silently against the majority of targets, model extraction requires only API access, and ML-powered evasion is already being used to bypass security controls in production. Conventional testing standards and compliance frameworks are structurally insufficient to address these risks.
Project Guardrail[1] offers a practical, evidence-based starting point: 47 questionnaire items derived from 429 unique AI threats across 14 curated research libraries, organized into tiers that match the actual risk profile of your application. Its recognition by OWASP[2], NIST[3], and OECD[4] validates the taxonomy against the most widely adopted AI security and governance frameworks in use today.
For further reading on the topics covered here, explore our coverage of application security fundamentals, threat modeling methodologies, and adversarial machine learning.
References & Tools
- [1] Project Guardrail: open-source AI/ML security threat modeling framework by Comcast; provides a tiered questionnaire system (Baseline, Continuous Learning, User Data Interaction) derived from 14 curated research libraries and 429 unique AI threats.
- [2] OWASP Top 10 for LLM Applications: industry-standard threat list for large language model applications; references Guardrail as a resource.
- [3] NIST AI Risk Management Framework (AI RMF) Playbook: US federal framework for managing AI risks; cites Guardrail as an applicable tool.
- [4] OECD Catalog of Tools and Metrics for Trustworthy AI: international catalog of AI governance tools; includes Guardrail.