
Most enterprises evaluating AI security vendors today are trapped in a feedback loop: six vendor data sheets arrive, POCs run in parallel with no clear eval criteria, then stall when someone asks whether AWS Bedrock Guardrails already covers everything — and three months later the customer chatbot has shipped anyway. This is the AI security vendor evaluation problem in its most recognizable form, and it compounds as four new AI security startups emerge from stealth every six weeks.
This post breaks down how one practitioner built an agentic system to analyze over 3,000 vendor claims across 80 AI security vendors, map capabilities to a synthesized risk taxonomy, and produce a risk-assessment wizard that generates vendor shortlists in minutes. Security engineers and architects responsible for GenAI program governance will find a repeatable framework for structured evaluation that replaces gut-feel procurement with evidence-backed decisions.
Key Takeaways
- You'll learn how to cut through AI security vendor marketing by mapping claims to an evidence-backed risk taxonomy synthesized from OWASP LLM Top 10, NIST AI RMF, and MITRE ATLAS — so you can recommend solutions with documented confidence scores rather than marketing slicks.
- You'll be able to identify which inherent risks activate for a GenAI system based on its architecture patterns (RAG, tool-calling, agentic autonomy level, deployment layer) using a structured wizard approach — replacing months-long POC cycles with a 5-minute risk profile.
- Apply this to avoid the AI governance feedback loop — the endless cycle of vendor demos, stalled POCs, and deferred decisions — by establishing clear evaluation criteria and a reusable vendor capability matrix before engaging the market.
The AI Governance Procurement Feedback Loop
AI security vendor evaluation in enterprise environments is broken by design. The procurement cycle that most organizations follow was built for a slower-moving market — and it was never fast enough to keep pace with a landscape where four new AI security startups emerge from stealth every six weeks and established vendors bolt on GenAI security claims overnight.
The feedback loop begins the moment a client calls asking for AI security recommendations. Before any evaluation criteria exist, the conversation stalls on the wrong question: who’s good? — rather than what do we need to protect, and why? Without anchoring the discussion to use cases, risk tolerance, and the architecture being secured, vendor selection becomes guesswork dressed up as due diligence.
How the Loop Propagates
Once a minimal requirements conversation happens, the standard enterprise cycle unfolds in a predictable sequence — and predictably stalls:
- Six vendor data sheets arrive. Marketing language dominates: “swarms of autonomous agents,” “broadest and most comprehensive,” “99% efficiency at sub-30 millisecond latency,” “AI-powered threat detection to automatically block all adversaries.” None of this maps to a control, a risk, or an architecture decision.
- A risk register gets built by hand. Teams reach for OWASP LLM Top 10 or NIST AI RMF[1] and hand-jam requirements into GRC tools via spreadsheets. There is no purpose-built AI security risk framework that most teams trust, so every organization reinvents this step in isolation.
- Four vendors are invited to demos; two get POC invites. POCs run in parallel, but with no clear evaluation criteria. What does “better” mean? What capability gap are we measuring? These questions rarely have written answers before the clock starts.
- The POCs stall. Someone asks: “We’re building on AWS Bedrock[2]. Doesn’t it already have Guardrails that cover most of this?” Then: “Our DLP vendor says they can do this too.” Then: “There are open-source tools. Why not build it in-house?” Each competing answer restarts the requirements conversation without resolving it.
- Three months pass. The decision is deferred. A consulting firm is brought in to build an AI security strategy from scratch. The evaluation process resets to step one.
- The customer chatbot ships anyway. While the procurement loop continues, the GenAI system goes live without the security controls that were supposed to be in place before deployment.
Why This Keeps Happening
The loop is not a failure of intent — it is a structural consequence of three converging conditions:
- Market velocity outpaces evaluation bandwidth. The AI security vendor landscape is growing faster than any team can track, making vendor selection feel perpetually premature.
- No shared risk taxonomy. Without a consistent framework that maps GenAI system architectures to specific risks and required controls, every vendor demo is a comparison of incomparable claims.
- Platform ambiguity creates infinite scope creep. Every enterprise already has Bedrock, a DLP tool, an endpoint agent, and a network gateway, and each vendor in the market claims to do something those platforms already do. Without a capability matrix that maps what existing investments actually cover, every new vendor demo reanimates the “do we even need this?” question.
The Cost of Staying in the Loop
The most concrete cost is not the wasted procurement time — it is that GenAI systems ship into production without security controls while the evaluation process continues. The customer chatbot that goes live during a stalled POC cycle is not a hypothetical. It is the default outcome when the structural dysfunction goes unaddressed.
For security engineers, recognizing this loop as a structural problem — not a vendor selection problem — is the prerequisite for solving it. The solution is not a better vendor comparison spreadsheet. It is a reusable risk taxonomy, a structured requirements wizard, and an evidence-backed capability matrix that can be run before the first vendor demo rather than after the third failed POC.
Actionable Takeaways
- Before engaging any AI security vendor, anchor the conversation to three explicit inputs: the GenAI system's architecture patterns (deployment layer, autonomy level, data classification), the organization's risk tolerance profile (incident avoidance vs. regulatory reporting priority), and the existing vendor investments that may already cover some controls. Without these inputs documented, every vendor demo will restart the requirements conversation rather than advance it.
- Establish written POC evaluation criteria — mapped to specific risk categories and capability controls — before inviting any vendor to a proof of concept. Ambiguous POC criteria are the primary mechanism that stalls evaluations and triggers the "do we even need this?" loop. Clear criteria also make it possible to identify which capabilities existing platforms (Bedrock Guardrails, DLP, endpoint agents) already cover, eliminating scope creep before it starts.
- Treat the AI security vendor landscape as a database to be maintained, not a market to be surveyed at point-in-time. With four new entrants every six weeks and established vendors adding bolt-on capabilities continuously, any evaluation built on a static vendor list degrades rapidly. A living capability matrix — even a simple one — prevents the team from restarting vendor research from scratch each procurement cycle.
Common Pitfalls
- Leading with "who's good?" instead of "what do we need?" collapses the entire procurement process into a vendor popularity contest. Without use-case and architecture inputs driving vendor selection, requirements are reverse-engineered from vendor capabilities rather than the other way around — producing a security program shaped by what vendors sell rather than what the organization's GenAI systems actually need.
- Running parallel POCs with no clear evaluation criteria creates a false sense of rigor while guaranteeing a stalled decision. The appearance of a structured process (two vendors, simultaneous POCs) masks the absence of measurable outcomes. When no criteria exist, any competing claim — from the platform team, the DLP vendor, the open-source advocate — can derail the evaluation because there is no defined standard against which to assess it.
Building a Synthesized AI Risk Taxonomy
The foundation of any defensible AI security vendor evaluation program is a shared, consistent language for risk. Without it, vendor claims become incomparable — one vendor’s “prompt injection protection” may map to a completely different threat model than another’s. To solve this, the framework described in this talk is built on a synthesized AI risk taxonomy that draws from three authoritative sources: OWASP LLM Top 10[3], NIST AI Risk Management Framework (AI RMF)[1], and MITRE ATLAS[4]. The taxonomy is published openly on GitHub and serves as the single classification layer against which all vendor capabilities are evaluated.
Why No Single Framework Is Sufficient
Each of the three source frameworks addresses a distinct slice of the AI risk taxonomy problem, but none covers the full picture on its own:
- OWASP LLM Top 10 focuses specifically on LLM-layer vulnerabilities: prompt injection, insecure output handling, sensitive information disclosure, model denial of service, and related attack surfaces. It is highly actionable for application security teams building or consuming LLM-based systems, but its scope is bounded to the model and application interaction layer. It does not address adversarial ML attack techniques at the infrastructure level, nor does it provide the governance and lifecycle structure needed for enterprise risk management.
- NIST AI Risk Management Framework (AI RMF) takes a governance and lifecycle perspective: it organizes risk management across the AI system lifecycle around four core functions (Govern, Map, Measure, Manage) and is well-suited for GRC teams building policy and control frameworks. However, it is deliberately technology-agnostic and does not enumerate the specific attack techniques or vulnerability classes that practitioners need to evaluate vendor capabilities against.
- MITRE ATLAS fills the adversarial ML gap. It catalogs attacker tactics, techniques, and procedures (TTPs) targeting AI systems, such as model evasion, data poisoning, model inversion, and supply chain attacks on ML pipelines. It is the most technically granular of the three for offensive AI security research, but it lacks the governance structure of NIST and the application-layer specificity of OWASP.
The synthesis aims to cover the breadth of today’s AI security risks by combining the LLM vulnerability taxonomy (OWASP), the governance and risk management structure (NIST), and the adversarial technique catalog (MITRE). The result is a unified risk classification that security engineers can use regardless of whether they are evaluating an application-layer guardrails product, a network-layer AI gateway, or a governance and audit tooling platform.
How the Taxonomy Enables Capability-to-Risk Mapping
Once the taxonomy is in place, every vendor capability can be mapped to one or more risk categories within it. Rather than comparing marketing claims directly against each other — which is a comparison of language, not substance — each claim is normalized to a taxonomy node.
For example, the risk category “sensitive data in AI inputs” maps directly to the capability need for input DLP and redaction. A vendor claiming to provide data loss prevention for LLM pipelines has a claim that can be evaluated against this specific taxonomy node. The system then surfaces which vendors map to that node, at which implementation layer (application via SDK/API, or infrastructure via network choke point or endpoint agent), and with what confidence score based on the evidence gathered.
This is demonstrated in the talk using the data disclosure risk category as a live example. Drilling into that node in the tool reveals:
- The specific risk: sensitive data exposure through AI inputs
- The required capability: input DLP and redaction
- Guidance for a custom in-house implementation
- A list of application-layer vendors (SDK/API deployment model) claiming this capability
- A larger list of infrastructure-layer vendors (network gateway or endpoint agent deployment model) claiming this capability
The same mapping logic applies to the CrowdStrike Falcon AI vendor card shown in the demo. For the data disclosure risk category, the tool shows that CrowdStrike has one mapped capability, deployable at both the application and infrastructure layers, with specific evidence sourced by the research agent.
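The node-level lookup described above can be sketched in a few lines. This is a minimal illustration, not the tool's actual data model: the `CapabilityClaim` fields, the node identifiers, and the vendor names are all hypothetical placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CapabilityClaim:
    """One vendor claim after normalization to a taxonomy node (hypothetical schema)."""
    vendor: str
    risk_node: str
    capability: str
    layers: Tuple[str, ...]  # "application" and/or "infrastructure"
    confidence: int          # 1-5 evidence-backed score


def vendors_for_node(claims, risk_node):
    """Group vendors mapped to one taxonomy node by implementation layer."""
    by_layer = defaultdict(list)
    for claim in claims:
        if claim.risk_node != risk_node:
            continue
        for layer in claim.layers:
            by_layer[layer].append((claim.vendor, claim.confidence))
    # Highest confidence first, then alphabetical, within each layer.
    return {layer: sorted(pairs, key=lambda p: (-p[1], p[0]))
            for layer, pairs in by_layer.items()}


# Placeholder data for illustration only.
CLAIMS = [
    CapabilityClaim("Vendor A", "sensitive-data-in-ai-inputs",
                    "input DLP and redaction", ("application",), 5),
    CapabilityClaim("Vendor B", "sensitive-data-in-ai-inputs",
                    "input DLP and redaction", ("application", "infrastructure"), 3),
    CapabilityClaim("Vendor C", "prompt-injection",
                    "prompt filtering", ("infrastructure",), 4),
]

result = vendors_for_node(CLAIMS, "sensitive-data-in-ai-inputs")
```

Because every claim carries its node and layers, the same query answers both "who covers this risk?" and "at which layer can it be deployed?".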
Structure of the Synthesized Taxonomy
The taxonomy organizes risks hierarchically, with each node representing a distinct risk category that a GenAI system may be exposed to. Risk categories activate dynamically based on a system’s architecture inputs — data sensitivity, deployment model, level of autonomy, use of RAG, tool calling, and agentic behaviors.
The taxonomy is designed to be implementation-layer-aware. Many AI security products exist in two distinct deployment models:
- Application layer: capabilities delivered through SDKs and APIs, integrated directly into the application development stack. These give development teams fine-grained control but require integration effort.
- Infrastructure layer: capabilities deployed centrally at network choke points (gateways, proxies) or on endpoints via managed agents. These provide broader coverage without per-application integration but operate at a coarser granularity.
The taxonomy preserves this distinction because the evaluation question is not just “does this vendor cover this risk?” but also “at which layer can it be deployed, given our architecture?” For teams running on AWS Bedrock, native Guardrails capabilities cover a subset of application-layer risks — and the taxonomy makes those boundaries explicit, preventing the common pattern of POCs stalling when someone asks “doesn’t Bedrock already do this?”
Ongoing Taxonomy Maintenance
The taxonomy is not static. As new AI security vendors emerge — roughly four per six weeks at the time of this talk — and as existing cybersecurity platforms bolt on GenAI security features, new capabilities and risk categories appear. The agentic workflow that re-researches vendors nightly via GitHub Actions[5] also feeds into taxonomy coverage reviews, ensuring that newly observed vendor capabilities can be mapped to existing nodes or trigger additions to the taxonomy as the threat landscape evolves.
As of the talk, the taxonomy has been used to normalize and compare over 3,000 vendor claims across nearly 80 vendors. The consistency it provides is what makes that scale of analysis tractable — without a shared classification layer, each new vendor would require a bespoke evaluation effort.
Actionable Takeaways
- Before engaging any AI security vendor, map your GenAI system's risk surface to a unified taxonomy derived from OWASP LLM Top 10, NIST AI RMF, and MITRE ATLAS. This gives you a stable, framework-anchored classification layer that makes vendor capabilities directly comparable and prevents POC scope from drifting based on individual vendor framing.
- Structure your vendor evaluation by implementation layer — distinguish between application-layer capabilities (SDK/API integration) and infrastructure-layer capabilities (network gateway or endpoint agent). Many vendors offer only one deployment model, and your architecture may constrain which is viable. Making this explicit before demos prevents late-stage blockers.
- Use the taxonomy to expose coverage gaps in your existing vendor stack before going to market. For each relevant risk category, check whether AWS Bedrock Guardrails, your existing DLP, or your CNAPP platform already maps to it — then scope your vendor evaluation only to the genuine gaps. This eliminates the "don't we already have this?" question that derails POCs.
Common Pitfalls
- Relying on a single framework — OWASP, NIST, or MITRE alone — creates blind spots. Teams that build risk registers using only OWASP LLM Top 10 miss adversarial ML attack techniques cataloged in MITRE ATLAS; teams using only NIST AI RMF lack the specific vulnerability classes needed to evaluate technical vendor claims. The synthesis is necessary precisely because no single framework is complete.
- Treating taxonomy-to-capability mapping as a one-time exercise leads to stale evaluations. The AI security vendor landscape is growing at roughly four new entrants per six weeks, with existing platforms continuously adding GenAI-specific features. A taxonomy that is not maintained against new vendor capabilities will produce increasingly unreliable shortlists and miss emerging coverage options.
Agentic Vendor Claim Analysis at Scale
The Problem: 3,000+ Claims, 80+ Vendors, Zero Objective Evidence
The AI security vendor evaluation problem is fundamentally a data quality problem. When a VAR or internal security team tries to make purchasing recommendations, the raw material they work with is marketing slicks, data sheets, and demo environments — all controlled by the vendor being evaluated. AI security vendor claims like “99% efficiency at sub-30 millisecond latency,” “swarms of autonomous agents,” or “broadest and most comprehensive coverage” appear across every vendor deck without substantiating evidence. In December 2024, the team at Consortium began with over 2,000 such claims spanning 62 vendors. Six weeks later, that corpus had grown to more than 3,000 claims across nearly 80 vendors — because four new startups emerge from stealth every six weeks claiming to do AI security, alongside traditional cybersecurity vendors bolting on AI security features to existing platforms.
Manually QC-ing this volume is not feasible. The agentic approach outlined in the talk replaces that manual burden with a structured, automated pipeline that applies the same research methodology to every claim, at every vendor, on a continuous basis.
Architecture: Research Agent + QC Agent in an Agentic Loop
The system was built using Claude Code[6] (initially on Claude 3.5 Sonnet via AWS Bedrock, now on Claude 4.5+) as the orchestration layer. The architecture is intentionally simple:
- Research Agent: Tasked with finding evidence that either supports or refutes a vendor’s stated capability claim. Evidence sources include GitHub repositories (code samples, implementation examples, changelogs), API documentation (feature specifications, parameter definitions, integration patterns), and user forums (community discussions, support tickets, real-world usage reports).
- QC Agent: Reviews the research agent’s findings for quality, coherence, and relevance before the result is written to the vendor capability database.
- System prompts and skills files: Refined over several months of iteration. These define the research methodology, the evidence taxonomy, and the confidence scoring rubric applied consistently across all vendor evaluations.
The two agents operate within a single Claude Code session, with no external orchestration framework and no complex DAG runner. The agentic loop itself is the orchestration.
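The control flow of a research-then-QC loop can be sketched as follows. This is a simplified illustration under stated assumptions: the two agents are stand-in Python callables rather than real Claude Code invocations, and the retry limit and the human-review queue are choices made for this sketch, not details confirmed by the talk.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Finding:
    vendor: str
    claim: str
    evidence: List[str] = field(default_factory=list)  # artifact citations
    approved: bool = False


def run_loop(claims, research, qc, max_attempts=2):
    """Research each claim, gate it through QC, and queue rejects for a human.

    `research(vendor, claim) -> Finding` and `qc(finding) -> bool` are
    placeholders for the two agents.
    """
    accepted, needs_human = [], []
    for vendor, claim in claims:
        finding = Finding(vendor, claim)
        for _ in range(max_attempts):
            finding = research(vendor, claim)
            if qc(finding):
                finding.approved = True
                break
        (accepted if finding.approved else needs_human).append(finding)
    return accepted, needs_human


# Stub agents so the sketch runs end to end.
def stub_research(vendor, claim):
    evidence = ["example-org/sdk:src/redact.py"] if "DLP" in claim else []
    return Finding(vendor, claim, evidence)


def stub_qc(finding):
    return bool(finding.evidence)


accepted, needs_human = run_loop(
    [("Vendor A", "input DLP and redaction"), ("Vendor B", "swarms of agents")],
    stub_research,
    stub_qc,
)
```

Only findings that pass QC would be written to the capability database; everything else lands in a queue for manual review.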
Confidence Scoring: From Marketing Claim to Evidence-Backed Rating
The core output of the research loop is a confidence score assigned to each vendor claim. The scoring scale ties directly to the quality and type of evidence discovered:
- 5 out of 5 (highest confidence): The agent extracted working code samples from the vendor’s GitHub repository demonstrating how the claimed capability is actually instrumented: not merely described in a README, but implemented in callable code with identifiable patterns matching the claim.
- Lower confidence ratings correspond to progressively weaker evidence: API documentation that describes the capability without implementation examples, community forum posts from users who report the feature working, marketing documentation without corroborating technical artifacts, or an absence of findable evidence entirely.
This confidence score is what distinguishes the system from a basic web scraper or keyword search. The agent is not just checking whether the vendor mentions a capability — it is grading how well the claim is substantiated by verifiable technical artifacts. A vendor claiming “input DLP and redaction at sub-30ms latency” gets a different score depending on whether the agent can point to an instrumented code path in a public repository versus a bullet point on a product page.
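A rubric of this shape can be encoded as an ordered enumeration. In the sketch below, only the top anchor (working code samples score 5) comes from the talk; the numeric values assigned to the weaker evidence types are an assumption that simply follows the order in which they are listed above.

```python
from enum import IntEnum


class Evidence(IntEnum):
    """Evidence types ordered from weakest to strongest (values 1-4 are assumed)."""
    NONE_FOUND = 1
    MARKETING_ONLY = 2
    FORUM_REPORT = 3
    API_DOCS = 4
    WORKING_CODE = 5


def confidence_score(evidence_found):
    """Score a claim by the strongest evidence type found for it."""
    return int(max(evidence_found, default=Evidence.NONE_FOUND))


score_with_code = confidence_score([Evidence.MARKETING_ONLY, Evidence.WORKING_CODE])
score_docs_only = confidence_score([Evidence.API_DOCS, Evidence.FORUM_REPORT])
score_nothing = confidence_score([])
```

Scoring on the strongest artifact, rather than on the number of mentions, keeps a heavily marketed claim from outranking a sparsely documented but demonstrably implemented one.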
Mapping Claims to the Risk Taxonomy
After confidence scoring, each validated capability is mapped to the AI risk taxonomy described in the previous section. In the presenter’s words, the three source frameworks “have really good components, but they don’t tell the whole picture.” The synthesis is what enables the system to map a vendor’s data disclosure capability to a risk category that spans the LLM attack surface, the infrastructure deployment context, and the governance accountability chain simultaneously. This is what makes capability-to-risk matching tractable at scale: every vendor claim, once scored, points to one or more taxonomy nodes, and those nodes are what the downstream wizard queries when generating vendor recommendations.
Handling Hallucinations and Context Drift
The talk is candid about a critical limitation: even after months of prompt refinement, hallucination and context issues remain a real operational concern. The system is described as working “not flawlessly,” and human-in-the-loop review is still required to catch the errors the agentic loop produces. This is an important signal for any team building similar evaluation tooling.
The practical implication is that the confidence scoring rubric itself serves as a partial mitigation for hallucination. By requiring the agent to cite specific evidence artifacts (a file path in a GitHub repo, a specific API endpoint in documentation, a forum thread URL), the system forces the model to ground its assessments in retrievable artifacts rather than generating plausible-sounding but unverifiable conclusions. Claims that cannot be grounded in a specific artifact receive a low confidence score rather than a hallucinated high one. The QC agent adds a second pass, but the architectural design of requiring artifact-level evidence citation is the primary control.
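One way to enforce artifact-level citation is a format gate applied before a score is accepted. The sketch below is an assumption-laden illustration: the `owner/repo:path` citation format and the floor value are hypothetical, and the check only validates that a citation looks like a concrete artifact, not that the artifact exists or says what the agent claims.

```python
import re

# Citation shapes treated as "concrete artifacts" in this sketch (both assumed).
ARTIFACT_PATTERNS = (
    re.compile(r"^https?://\S+$"),         # docs page or forum thread URL
    re.compile(r"^[\w.-]+/[\w.-]+:\S+$"),  # hypothetical "owner/repo:path/to/file"
)


def is_grounded(citation):
    """Return True if the citation matches a known artifact shape."""
    text = citation.strip()
    return any(pattern.match(text) for pattern in ARTIFACT_PATTERNS)


def grounded_score(proposed_score, citations, floor=1):
    """Keep the proposed score only if at least one citation is a concrete artifact."""
    if any(is_grounded(c) for c in citations):
        return proposed_score
    return floor


ungrounded = grounded_score(5, ["the vendor clearly supports this"])
repo_backed = grounded_score(5, ["example-org/sdk:src/redact.py"])
url_backed = grounded_score(4, ["https://example.com/forum/thread/123"])
```

A gate like this still leaves the job of opening the cited artifact to the QC agent or a human reviewer, which is consistent with the talk's point that human review remains necessary.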
Continuous Freshness: GitHub Actions Nightly Re-Research
The AI security market changes too fast for a one-time research pass. The system addresses staleness through a GitHub Actions workflow that triggers nightly, running for approximately 12 hours and re-analyzing a rotating subset of vendors. The goal is for every vendor in the database to be re-analyzed at least once per month. A separate automated workflow handles new entrant detection — identifying newly emerged AI security startups or traditional cybersecurity vendors adding AI security capabilities to their platforms.
This continuous re-research loop is what makes the vendor capability database a living system rather than a static artifact. For teams building similar tooling, the architecture pattern is worth noting: the expensive per-vendor research pass runs incrementally on a schedule, keeping costs bounded while ensuring the recommendation engine never operates on stale capability data.
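The rotation arithmetic is simple: with roughly 80 vendors and a 30-day target, each nightly run needs to cover about three vendors. A stalest-first selection, sketched below, is one way to meet that target; the function name, the data shape, and the selection policy are assumptions for illustration, not the workflow the talk describes.

```python
import math
from datetime import date, timedelta


def nightly_batch(last_researched, cycle_days=30):
    """Pick the stalest vendors so every vendor is revisited within cycle_days.

    `last_researched` maps vendor name -> date of its last research pass.
    """
    batch_size = math.ceil(len(last_researched) / cycle_days)
    stalest_first = sorted(last_researched, key=lambda v: (last_researched[v], v))
    return stalest_first[:batch_size]


# Simulate 30 nightly runs over 80 placeholder vendors that all start out stale.
start = date(2025, 1, 1)
vendors = {f"vendor-{i:02d}": start - timedelta(days=31) for i in range(80)}
first_batch = nightly_batch(vendors)
for night in range(30):
    for name in nightly_batch(vendors):
        vendors[name] = start + timedelta(days=night)
```

In a scheduled CI job, the batch would be computed at the start of each run and the timestamps written back to the database once each vendor's research pass completes.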
CrowdStrike Falcon AI Data Disclosure Capability Evidence Review
Proof of Concept
- Navigate to the vendor capability matrix within the Adjuster IQ tool, open the risk taxonomy view, and filter to the “data disclosure / sensitive data in AI inputs” risk category, one of the core risk classes in the synthesized OWASP LLM Top 10 / NIST AI RMF / MITRE ATLAS taxonomy.
- Open the CrowdStrike Falcon AI vendor card (via PNG acquisition). The tool displays the vendor mapped to the data disclosure risk category, showing a single capability claimed for this risk area.
- Review the implementation layer mapping: CrowdStrike’s data disclosure capability operates at both the application layer and the infrastructure layer. It is deployable as an application-integrated control (SDK/API) and as a centrally managed infrastructure control (network choke point or endpoint agent).
- Inspect the agent-gathered evidence: the research agent’s output is surfaced in the tool. This evidence was gathered autonomously by the Claude Code-based agentic loop, which searched GitHub repos, API documentation, and user forums to find substantiating evidence for the vendor’s data disclosure capability claim. The presenter noted that evidence is still subject to significant human QC due to hallucinations and context issues inherent in the agentic research process.
- Assess the confidence score: the system assigns confidence ratings (1–5 scale) based on the quality and type of evidence found. A score of 5 indicates the agent extracted working code samples from GitHub repos showing how the capability is instrumented. The CrowdStrike data disclosure capability was reviewed, but the specific confidence score was not verbalized in the demo.
- Note the coverage context: even large platform vendors like CrowdStrike cover only approximately 50% of the AI security risks most relevant to a given GenAI system. A data disclosure capability at the application and infrastructure layers addresses part of the risk surface, but a complete AI security program requires additional controls from other vendors or in-house builds for the remaining gaps.
Actionable Takeaways
- When building vendor evaluation tooling, require the research agent to cite specific artifact-level evidence (a file path, API endpoint, or forum thread URL) for each capability claim rather than allowing the model to generate plausible but unverifiable assessments. This forces confidence scores to reflect actual evidence quality and significantly reduces hallucination risk in the output.
- Design the confidence scoring rubric around evidence type rather than evidence presence: a working code sample extracted from a public GitHub repo demonstrating capability instrumentation should score materially higher than an API doc description of the same feature, which should score higher than a marketing bullet point. Make these distinctions explicit in the scoring criteria passed to your research agent.
- Automate re-research on a rolling schedule using GitHub Actions or equivalent CI/CD tooling to prevent vendor capability data from going stale. In a market where four new AI security vendors emerge every six weeks, a database that is not continuously refreshed will produce increasingly unreliable recommendations within weeks of initial population.
Common Pitfalls
- Treating the absence of a hallucination as evidence of accuracy. The agentic research loop in this system still produces hallucinations and context drift after months of refinement — human QC remains required. Teams that deploy similar systems and assume the confidence score alone is sufficient without spot-check review of the underlying evidence citations will propagate vendor evaluation errors into procurement recommendations.
- Building the vendor research pipeline without a stable risk taxonomy to map into. Without a pre-defined taxonomy, the research agent's outputs are disconnected claims that cannot be systematically compared across vendors or used to identify coverage gaps. The taxonomy must be designed first; the agentic research layer maps into it, not the other way around.
The GenAI Risk Assessment Wizard
Once the risk taxonomy is mapped and vendor capabilities are scored against evidence, the challenge shifts from data quality to decision speed. The GenAI risk assessment wizard is the interface that collapses a months-long security evaluation into a structured 5-minute session — producing a prioritized vendor shortlist with implementation guidance before the next architecture review board meeting.
The wizard is driven entirely by architecture inputs gathered during the kind of use-case discovery conversation that typically precedes any security engagement. Instead of sending six vendor data sheets and hoping the team figures it out, the system takes those discovery answers and activates a risk profile dynamically.
Step 1 — Data Classification and Access Context
The first set of wizard inputs concerns data sensitivity and who touches it. For the Adjuster IQ demo system — an insurance claims adjuster running on AWS Bedrock with Claude 3.5 Sonnet — the inputs were:
- Data classification: Internal confidential (insurance claims data)
- Access model: Employees accessing the system over the internet
These two answers alone begin activating risks. Internal confidential data transiting public networks immediately surfaces data-in-transit controls, session management requirements, and data leakage prevention as relevant risk categories. The wizard renders this activation visually in real time — as each selection is made, the risk profile updates to surface associated inherent risks pulled from the underlying taxonomy.
Step 2 — Deployment Layer and Cloud Platform
The wizard then collects deployment environment and cloud platform specifics:
- Cloud platform: AWS Bedrock (managed PaaS)
- Autonomy level of the system: Creates new records (as opposed to read-only queries or fully autonomous multi-step actions)
The autonomy level selection is particularly consequential. Choosing “making autonomous actions” versus “creates new records” versus “read-only” produces a meaningfully different inherent risk profile. Agentic AI systems face a broader attack surface — including prompt injection paths that could trigger unintended record creation or data exfiltration — while read-only systems have a more constrained risk footprint.
This is the kind of distinction that typically gets lost in generic “AI security” evaluations that treat all GenAI systems as equivalent. The wizard encodes it explicitly, so the resulting vendor recommendations are calibrated to what the system actually does — not to a generic AI threat model.
Step 3 — Architecture Patterns and Technical Implementation
The third phase captures technical architecture patterns — the choices that surface at architecture review boards and trigger security assessments:
- RAG and traditional databases in use: Yes (Amazon Kendra[7] for retrieval)
- Orchestration approach: Direct API calls with tool calling
- Compute model: Serverless
- User interfaces: Chat interface, tool calling, document analysis
Each of these activates specific risk subcategories. RAG pipelines introduce data poisoning and retrieval manipulation risks. Tool-calling expands the attack surface to include prompt injection attacks that manipulate tool invocations. Document analysis surfaces unstructured data ingestion risks, particularly relevant for an insurance claims workflow processing uploaded claim documents.
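The activation logic in Steps 1 through 3 amounts to a set of rules from architecture inputs to risk categories. The sketch below encodes only the activations mentioned in the text; the profile keys and risk identifiers are hypothetical names, and the real wizard presumably draws on a much larger rule set from the taxonomy.

```python
def activate_risks(profile):
    """Return the risk categories activated by a system's architecture inputs."""
    risks = set()
    # Confidential data reached over the internet (Step 1).
    if (profile.get("data_classification") == "internal_confidential"
            and profile.get("internet_access")):
        risks |= {"data-in-transit", "session-management", "data-leakage"}
    # Anything beyond read-only widens the blast radius of injection (Step 2).
    if profile.get("autonomy") in {"creates_records", "autonomous_actions"}:
        risks.add("unintended-record-creation")
    # Architecture patterns (Step 3).
    if profile.get("uses_rag"):
        risks |= {"data-poisoning", "retrieval-manipulation"}
    if profile.get("tool_calling"):
        risks.add("tool-invocation-prompt-injection")
    if profile.get("document_analysis"):
        risks.add("unstructured-data-ingestion")
    return risks


# Inputs loosely modeled on the Adjuster IQ demo answers.
adjuster_iq = {
    "data_classification": "internal_confidential",
    "internet_access": True,
    "autonomy": "creates_records",
    "uses_rag": True,
    "tool_calling": True,
    "document_analysis": True,
}
active = activate_risks(adjuster_iq)
read_only = activate_risks({"autonomy": "read_only"})
```

Keeping the rules declarative makes it straightforward to re-run the profile whenever an architecture answer changes, which is what lets the risk view update as each selection is made.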
Step 4 — Existing Vendor Stack and Preferences
Before generating recommendations, the wizard collects existing security investments and organizational preferences. This is where the system addresses the “we’re already in CrowdStrike and Zscaler[8] — what can they do for us?” question that derails most vendor evaluations:
- Existing platforms: AWS Bedrock (native), CrowdStrike[9], Zscaler
- Buy vs. build preference: Prefer to buy
- Implementation layer preference: Infrastructure (centrally deployed, rather than SDK/API per-app)
- Vendor selection criterion: Maximize capability coverage (platform-oriented)
The system queries the vendor capability matrix against these preferences and returns two key outputs:
Existing coverage: Out of the activated risk profile, 8 capabilities are already covered by current vendor relationships. CrowdStrike Falcon AI covers data disclosure controls at the infrastructure and endpoint layers. Zscaler covers network-layer GenAI traffic inspection. AWS Bedrock Guardrails handle a subset of prompt injection and content filtering requirements. These are surfaced first, as the primary filter rather than an afterthought, so the team avoids procuring capabilities they already own.
Identified gaps: 15 capability gaps remain after accounting for existing coverage. These are presented at the critical and high risk levels from the taxonomy, grouped by risk category.
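The split between existing coverage and gaps is a set difference over capability identifiers. The following sketch assumes hypothetical capability names and a simplified vendor-to-capability mapping that loosely mirrors the demo description; it is not the tool's actual query.

```python
def split_coverage(required, existing_stack):
    """Split required capabilities into those already covered and remaining gaps.

    `existing_stack` maps an owned platform -> set of capabilities it provides.
    """
    covered = {}
    for vendor, capabilities in existing_stack.items():
        for capability in sorted(capabilities & required):
            covered.setdefault(capability, []).append(vendor)
    gaps = required - set(covered)
    return covered, gaps


# Hypothetical capability identifiers for illustration.
required = {
    "input-dlp", "genai-traffic-inspection", "prompt-injection-filtering",
    "content-filtering", "model-inventory", "output-validation",
}
existing_stack = {
    "CrowdStrike": {"input-dlp"},
    "Zscaler": {"genai-traffic-inspection"},
    "Bedrock Guardrails": {"prompt-injection-filtering", "content-filtering"},
}
covered, gaps = split_coverage(required, existing_stack)
```

Computing coverage before gaps is what puts owned capabilities first in the output, so only the residual set goes to market.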
Step 5 — Gap Triage and Build vs. Buy Decisions
The gap triage phase is where the wizard shifts from assessment to decision support. For each identified gap, the team can make a discrete disposition:
- Buy: Procure a commercial solution from the recommended vendor shortlist
- Build: Implement the control in-house using open-source tooling or internal workflow
- Accept: Formally accept the risk (with documentation)
The talk demonstrates this explicitly with governance-related capabilities: “for some of the governance things, you know, we don’t need to buy a solution. That’s, you know, we’re perfectly capable of building an internal workflow in-house.” These build decisions are recorded as part of the output, so GRC teams have a documented rationale for each gap disposition, not just a vendor shortlist.
This triage step is important because it prevents the wizard from becoming a pure vendor-selection tool. Some AI security controls — model cards, AI use policy documentation, internal prompt review processes — are organizational and procedural, not technical products. The wizard accommodates both.
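A gap disposition record could look something like the following. This is a minimal sketch under stated assumptions: the `Disposition` enum, the `GapDecision` dataclass, and the example capabilities are hypothetical, and the rule that every decision carries a rationale is an interpretation of the documentation requirement above.

```python
from dataclasses import dataclass
from enum import Enum


class Disposition(Enum):
    BUY = "buy"
    BUILD = "build"
    ACCEPT = "accept"


@dataclass(frozen=True)
class GapDecision:
    capability: str
    disposition: Disposition
    rationale: str

    def __post_init__(self):
        # Every disposition needs a documented rationale for the risk register.
        if not self.rationale.strip():
            raise ValueError(f"rationale required for {self.capability!r}")


def buy_list(decisions):
    """Capabilities that still need a commercial vendor."""
    return [d.capability for d in decisions if d.disposition is Disposition.BUY]


# Hypothetical triage results for two gaps.
decisions = [
    GapDecision("AI use policy documentation", Disposition.BUILD,
                "Internal workflow is sufficient"),
    GapDecision("runtime prompt injection detection", Disposition.BUY,
                "No in-house detection capability"),
]
print(buy_list(decisions))
```

Only the buy decisions flow on to vendor shortlisting; build and accept decisions stay in the record for GRC review.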
Step 6 — Vendor Shortlist and Implementation Guidance
After gap triage, the wizard produces a vendor shortlist filtered to the remaining buy decisions. For the Adjuster IQ assessment, this narrowed to five vendors covering the remaining capability gaps. The output is structured by capability, showing which vendors satisfy which controls — so the team can evaluate overlap and make consolidation decisions deliberately.
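A capability-first view of the shortlist can be sketched by grouping vendor claims under each capability. The vendor names, claim records, confidence values, and the 0.5 threshold below are all hypothetical placeholders; the source does not specify how confidence scores are represented or filtered.

```python
from collections import defaultdict

# Hypothetical claim records: (vendor, capability, confidence 0.0-1.0).
claims = [
    ("Vendor A", "runtime prompt injection detection", 0.9),
    ("Vendor B", "runtime prompt injection detection", 0.6),
    ("Vendor B", "retrieval manipulation detection", 0.8),
    ("Vendor C", "retrieval manipulation detection", 0.4),
]


def shortlist_by_capability(claims, buy_capabilities, min_confidence=0.5):
    """Group sufficiently confident vendor claims under each buy capability."""
    by_capability = defaultdict(list)
    for vendor, capability, confidence in claims:
        if capability in buy_capabilities and confidence >= min_confidence:
            by_capability[capability].append((vendor, confidence))
    # Highest-confidence vendor first within each capability.
    return {
        capability: sorted(vendors, key=lambda vc: vc[1], reverse=True)
        for capability, vendors in by_capability.items()
    }


buy = {"runtime prompt injection detection", "retrieval manipulation detection"}
for capability, vendors in shortlist_by_capability(claims, buy).items():
    print(capability, "->", vendors)
```

In this toy data, Vendor B appears under both capabilities, which is exactly the kind of overlap a team would weigh when deciding whether to consolidate.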
The final output page is designed to be printed and distributed directly to multiple teams with different roles:
For GRC teams: Risk register entries with taxonomy-mapped capability requirements, formal gap dispositions (buy/build/accept), and the evidence basis for each recommendation. This replaces the hand-jammed spreadsheet approach described in the procurement loop problem.
For cloud engineering teams: Configuration guidance for AWS Bedrock Guardrails and other infrastructure-layer controls already in the vendor stack. The system generates implementation notes for each native capability, for example what Bedrock Guardrails can and cannot address for the specific risk profile activated by the wizard.
For security architecture and engineering teams: Vendor evaluation criteria derived from the activated risk profile. Because the shortlist is generated from capability-to-risk mappings with confidence scores, evaluation criteria are already defined before the first vendor demo — eliminating the “no clear eval criteria” problem that stalls POC cycles.
For dev teams choosing to build: Open-source implementation options are surfaced for each build-decision capability. Rather than a blank mandate to “build something for prompt injection,” developers receive references to specific libraries, SDK capabilities, and implementation patterns relevant to the architecture patterns already captured in the wizard.
Adjuster IQ GenAI System Risk Assessment — Live 5-Minute Wizard Walkthrough
Proof of Concept
- Define the target system architecture: Adjuster IQ is a GenAI-powered insurance claims adjuster. Claims hit a queue, the system analyzes them, assigns complexity scores, identifies the right specialist routing, produces fraud indicators, and presents results to a human adjuster. Key architecture patterns: Claude 3.5 Sonnet via AWS Bedrock, Amazon Kendra for RAG, direct API calls for orchestration, tool calling enabled, session-scoped memory.
- Open the risk assessment wizard: The wizard presents a structured series of use-case and architecture questions modeled on what a practitioner would ask during requirements discovery. The risk profile on-screen activates dynamically as each question is answered.
- Answer use-case questions (inherent risk profile phase):
  - What does this system do? Claims adjuster: ingests, analyzes, scores, and routes insurance claims.
  - What data does it use? Internal confidential data.
  - Who uses it and how? Employees accessing over the internet.
  - Does it make autonomous actions? Selecting “autonomous actions” vs. “create new record only” produces a markedly different inherent risk profile; the demo selects a moderate autonomy level.
  - What is the highest level of autonomy in the dev toolchain? Interactive coding assistants (not autonomous coding agents), which still extends the attack surface through the development pipeline.
- Answer architecture pattern questions (design-level risk phase):
  - Managed AI platform: AWS Bedrock.
  - Retrieval pattern: RAG with traditional databases (Amazon Kendra).
  - API integration: direct API calls.
  - Compute: serverless.
  - Interface types: chat interface, tool calling, document analysis; each selection visibly shifts the risk profile displayed.
- Apply existing vendor stack filters: The wizard asks which security platforms are already deployed. The demo selects AWS Bedrock (native Guardrails), CrowdStrike, and Zscaler. The system immediately maps these to capabilities they can already satisfy against the activated risk profile, reducing the gap list before any new vendor evaluation begins.
- Set procurement preferences:
  - Buy vs. build preference: buy.
  - Implementation layer preference: infrastructure (network/endpoint), not application SDK layer.
  - Platform vs. point solution: platform-first, maximize capability coverage.
- Review the output (gap analysis):
  - The system reports that existing vendor relationships (AWS Bedrock Guardrails + CrowdStrike + Zscaler) cover 8 capabilities against the activated risk profile.
  - 15 capability gaps remain at critical and high risk levels.
  - Each gap is individually reviewable; governance-class gaps can be flagged as “build” (internal workflow) rather than “buy,” reducing the commercial vendor shortlist further.
- Generate the vendor shortlist: After gap disposition, the system produces a shortlist of 5 vendors capable of covering the remaining gaps, mapped by capability. A combo view shows which vendor combinations maximize coverage most efficiently.
- Export the summary output: The final page is a printable summary containing: risk profile with activated risk categories; existing vendor coverage map; remaining gaps with build/buy disposition; vendor shortlist with per-capability confidence; implementation guidance scoped to GRC, cloud engineering, security architecture, and dev teams, including AWS Bedrock Guardrails configuration guidance and open-source alternatives for in-house builds.
- Key outcome: The entire assessment — from blank wizard to actionable vendor shortlist with implementation guidance — completes in under five minutes, demonstrating that structured use-case-driven questioning replaces the months-long POC cycle that typically ends with the GenAI system shipping before evaluation is complete.
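The walkthrough does not say how the combo view ranks vendor combinations. One plausible approach, offered here purely as an assumption, is a greedy set-cover heuristic: repeatedly pick the vendor that closes the most remaining gaps. The function name, vendor names, and capability labels below are hypothetical, and greedy selection is an approximation that does not always find the smallest possible combination.

```python
def greedy_vendor_combo(gaps, vendor_capabilities):
    """Pick vendors one at a time, each covering the most uncovered gaps."""
    remaining = set(gaps)
    chosen = []
    while remaining and vendor_capabilities:
        # Sort names so ties break deterministically (alphabetical order).
        best = max(
            sorted(vendor_capabilities),
            key=lambda v: len(vendor_capabilities[v] & remaining),
        )
        newly_covered = vendor_capabilities[best] & remaining
        if not newly_covered:
            break  # no vendor covers what is left
        chosen.append(best)
        remaining -= newly_covered
    return chosen, remaining


# Hypothetical gaps and vendor coverage.
gaps = {"cap-1", "cap-2", "cap-3", "cap-4"}
vendors = {
    "Vendor A": {"cap-1", "cap-2", "cap-3"},
    "Vendor B": {"cap-3", "cap-4"},
    "Vendor C": {"cap-4"},
}

combo, uncovered = greedy_vendor_combo(gaps, vendors)
print(combo, uncovered)
```

Any capability left in the second return value is one no shortlisted vendor claims to cover, which makes it a candidate for a build or accept disposition.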
Why This Approach Breaks the Procurement Loop
The governance feedback loop described at the start of this post stalls for a specific structural reason: evaluation criteria are undefined when vendor selection begins. Teams end up in the loop of “maybe AWS Guardrails covers this” and “maybe our DLP can handle it” and “maybe we should just build it” precisely because no one has mapped the specific system’s risk profile to a capability requirement list before engaging the market.
The wizard inverts this sequence. Risk profile definition happens first, before any vendor contact. Existing coverage is audited against that profile. Gaps are triaged with explicit buy/build/accept dispositions. Only then does vendor engagement begin — with a documented shortlist, defined evaluation criteria, and a risk register ready for GRC sign-off.
Actionable Takeaways
- Before engaging AI security vendors, run your GenAI system through a structured architecture input session covering data classification, deployment layer (managed PaaS vs. self-hosted), autonomy level (read-only vs. autonomous action), and technical patterns (RAG, tool-calling, document analysis). These four inputs alone generate a specific inherent risk profile that scopes your evaluation before the first demo.
- Audit existing vendor stack coverage against your activated risk profile before issuing RFPs or POC invitations. Platforms like AWS Bedrock Guardrails, CrowdStrike, and Zscaler are actively adding GenAI security capabilities — mapping their current coverage to your risk requirements frequently eliminates 30–50% of perceived gaps without new procurement.
- Produce explicit build/buy/accept dispositions for each identified capability gap, and document them before engaging the market. GRC teams need gap disposition rationale for risk register entries; engineering teams need it to prioritize in-house development work; procurement teams need it to avoid sourcing capabilities the organization intends to build internally.
Common Pitfalls
- Treating all GenAI systems as equivalent in AI security evaluations. A read-only document Q&A system and an autonomous multi-step claims processing agent have fundamentally different risk profiles — applying the same vendor shortlist to both will either over-engineer controls for the simpler system or leave critical gaps in the autonomous one. Autonomy level and tool-calling capabilities must be explicit wizard inputs, not assumed defaults.
- Beginning vendor evaluation before defining evaluation criteria. Running parallel POCs without a capability requirement list derived from the system's risk profile is the proximate cause of the procurement feedback loop — POCs stall because the team cannot score them against anything specific. The wizard approach requires risk profile definition to precede, not follow, vendor engagement.
Platform vs. Point Solution Trade-offs in AI Security
The Consolidation Question Every Practitioner Faces
One of the most common questions during AI security vendor evaluation is whether to consolidate around an existing platform or buy best-in-class point solutions. The capability matrix reveals a harder truth: in today’s AI security landscape, no platform vendor currently covers more than roughly 50% of the AI-specific risks most relevant to your system.
This was surfaced directly in the Q&A: even the largest platform players — Palo Alto Networks, CrowdStrike — provide coverage for only a subset of the risk taxonomy. Wiz, which leads in cloud security, has a small handful of AI security capabilities and is “very far from that category for now.” The implication is structural: the AI security market is still too fragmented for any single vendor to close all gaps.
What the Capability Matrix Reveals
The vendor capability matrix — built by mapping over 3,000 vendor claims to the synthesized OWASP/NIST/MITRE risk taxonomy — makes this gap visible rather than intuitive. When the Adjuster IQ demo system was run through the risk assessment wizard with existing investments in AWS Bedrock, CrowdStrike, and Zscaler, the output was concrete:
- 8 capabilities were already coverable through existing strategic vendor relationships
- 15 gaps remained at the critical and high risk levels
- Some of those gaps were addressable by building internal workflows rather than buying
This is what makes the platform-vs-point-solution decision defensible: instead of a gut-feel consolidation preference, you have a documented gap count, a shortlist of vendors addressing specific capabilities, and a build/buy decision per gap.
Using Platform Preference as a Filter, Not a Strategy
The wizard’s architecture supports a “platform-first” filter — you can declare existing strategic investments (e.g., CrowdStrike, Zscaler, AWS Bedrock) and the system surfaces what those platforms already cover before presenting new vendor recommendations. This is the right sequence: exhaust existing platform coverage first, then address residual gaps with point solutions.
The practical implication for security architects is to treat platform consolidation as a filter pass, not a complete strategy. Platform investments reduce procurement overhead and integration complexity for the capabilities they do cover — but they do not eliminate the need for additional tooling. In AI security specifically, the gap is large enough that a hybrid approach (platform + targeted point solutions for high-risk capabilities) is currently the only realistic model.
Actionable Takeaways
- Run the capability gap count before committing to a platform strategy. If your existing platforms cover 8 of the 23 capabilities in your activated risk profile, you need a plan for the remaining 15, not an assumption that the platform roadmap will close them.
- Separate "build" from "buy" decisions per gap. Governance workflows and internal policy controls may be buildable in-house, while runtime inference protection or prompt injection detection may require a commercial solution. The wizard's per-gap build/buy toggle makes this explicit.
- Treat platform preference as a procurement filter, not an evaluation shortcut. Declare existing investments early, extract their coverage, then evaluate the residual gaps on evidence — not on vendor roadmap promises or "AI-powered" marketing language.
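The gap count from the Adjuster IQ demo translates into a coverage share with simple arithmetic. The helper below is a hypothetical sketch; only the inputs (8 covered, 15 gaps) come from the demo.

```python
def coverage_summary(covered, gaps):
    """Return total capabilities and the covered/gap shares as fractions."""
    total = covered + gaps
    if total == 0:
        raise ValueError("risk profile has no activated capabilities")
    return {
        "total": total,
        "covered_share": covered / total,
        "gap_share": gaps / total,
    }


# Demo figures: 8 capabilities covered, 15 gaps remaining.
summary = coverage_summary(covered=8, gaps=15)
print(f"{summary['covered_share']:.0%} covered, {summary['gap_share']:.0%} open")
# prints "35% covered, 65% open"
```

Roughly a third of the activated capabilities are covered by existing platforms in this example, which is consistent with the observation that no platform vendor covers more than about half of the relevant risks.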
Common Pitfalls
- Assuming platform vendors will close AI security gaps on their roadmap. The capability matrix shows that even major platform players like CrowdStrike and Zscaler cover less than half of relevant AI risk categories today — deferring procurement decisions based on future roadmap commitments leaves critical gaps unaddressed while GenAI systems ship.
- Skipping the gap count and treating platform consolidation as a complete AI security strategy. Without mapping existing platform capabilities to a structured risk taxonomy, teams have no visibility into what risks remain exposed — leading to the same stalled procurement loop the wizard is designed to break.
Conclusion
The AI security procurement problem is not a vendor selection problem; it is a structural problem caused by starting vendor engagement before defining risk requirements. The framework described in this post attacks the root cause: build a synthesized risk taxonomy first, score vendor claims against evidence at scale with an agentic research loop, and run a structured wizard that generates a documented shortlist with build/buy dispositions before any POC begins.
The practical result in the Adjuster IQ demo was a complete vendor evaluation — from blank wizard to printable shortlist with implementation guidance for four separate teams — in under five minutes. That is not a shortcut. It is the payoff from doing the architectural reasoning upfront rather than discovering it through stalled POC cycles.
For security engineers building AI governance programs, the transferable pattern is clear: a synthesized risk taxonomy anchored to OWASP LLM Top 10, NIST AI RMF, and MITRE ATLAS; evidence-backed confidence scoring for vendor claims; and a structured wizard that activates a risk profile from architecture inputs before any vendor contact. The specific implementation (Claude Code, GitHub Actions, a particular UI) is secondary. The methodology is what breaks the loop.
For deeper context on building enterprise AI security programs, the related topics of agentic AI risk and RAG security controls are worth exploring alongside this evaluation framework — the risks those architectures introduce are exactly what the wizard is designed to surface.
References & Tools
[1] NIST AI Risk Management Framework (AI RMF): Governance and lifecycle risk management structure for AI systems, organized around the Govern, Map, Measure, and Manage functions.
[2] AWS Bedrock: Managed AI platform with native Guardrails; the deployment environment for the Adjuster IQ demo system.
[3] OWASP LLM Top 10: LLM-specific vulnerability classification covering prompt injection, insecure output handling, sensitive data disclosure, and related application-layer attack surfaces.
[4] MITRE ATLAS: Adversarial ML attack technique catalog covering model evasion, data poisoning, model inversion, and ML supply chain attacks.
[5] GitHub Actions: CI/CD automation platform used to trigger the nightly vendor re-research workflow.
[6] Claude Code (Anthropic): Agentic coding environment used to build and orchestrate the research agent and QC agent loop for vendor claim analysis.
[7] Amazon Kendra: Managed enterprise search service used for RAG in the Adjuster IQ demo system.
[8] Zscaler: Enterprise security platform covering network-layer GenAI traffic inspection; evaluated as an existing infrastructure-layer investment in the Adjuster IQ wizard demo.
[9] CrowdStrike Falcon AI: AI security platform (via the Pangea acquisition) evaluated for data disclosure capability at application and infrastructure layers in the vendor card demo.