
GenAI threat detection at scale exposes a fundamental gap: traditional YARA rules were built to match bytes and regex patterns, but prompt injection attacks arrive as natural language — semantically equivalent phrases that no static signature can enumerate. An attacker who wants to hijack an LLM agent doesn’t need a specific string; “ignore all prior instructions” and “disregard every previous rule” are functionally identical, yet only one will match a hand-crafted pattern.
For security engineers defending AI-powered systems, this creates a detection crisis that scales with the model’s reach. This post covers how Mohamed Nabeel’s SuperYARA library closes that gap by combining string matching, semantic similarity, fine-tuned classifiers, and LLM-as-judge into a single composable rule engine — complete with pre-filtering patterns that cut detection costs by 99% at Palo Alto Networks’ production scale.
Key Takeaways
- You'll learn how to apply YARA-inspired semantic rules to detect prompt injection and other GenAI threats across millions of web pages without relying on expensive LLM calls for every sample.
- You'll be able to design layered, defense-in-depth detection pipelines that combine string matching, semantic similarity, ML classifiers, and LLM prompts — balancing detection coverage against cost.
- Apply the pre-filtering pattern to reduce LLM invocation costs by up to 99% while maintaining near-complete recall, making large-scale GenAI threat hunting operationally viable.
Why Traditional YARA Rules Fail Against GenAI Threats
The New Binary Is Natural Language
AI/ML security at scale begins with acknowledging a fundamental mismatch: YARA[1] was engineered to match bytes, strings, and regex patterns inside binaries and scripts. It excels at identifying malware PE files and obfuscated JavaScript — artifacts with predictable, enumerable structure. But when the “binary” becomes natural language, that structural assumption collapses.
As Mohamed Nabeel stated directly in his talk: “The new binary is natural language. It’s no longer PEs or scripts.” This shift has deep implications for detection engineering teams who have spent years building and maintaining YARA rule libraries.
Why Prompt Injection Defeats Static Signatures
Prompt injection — the number one threat in GenAI systems — can arrive through an almost unlimited variety of phrasings. Consider an attacker trying to override an LLM agent’s system instructions:
- “Ignore all previous instructions.”
- “Disregard every prior rule.”
- “Don’t execute previous rules.”
- “Forget your earlier constraints and follow these new ones.”
Each of these is semantically equivalent, yet none of the others will match a YARA rule written to catch the first. A detection engineer trying to cover this attack surface with traditional string matching quickly discovers the rule set explodes in complexity and still has gaps.
Nabeel illustrated this directly with a fictional startup scenario: SafeIntel.ai’s lead engineer Adder started writing a YARA rule to catch prompt injections delivered via query parameters. As she iterated to cover more variants, the number of parameters grew unmanageable. The rule became difficult to maintain, prone to false positives, and still missed variants it hadn’t explicitly enumerated.
The Core Limitation: Enumeration vs. Understanding
Traditional YARA rules operate on enumeration — you must list every pattern you want to catch. Against LLM security threats, this model breaks down for two structural reasons:
- Semantic equivalence is infinite. Natural language allows countless phrasings of the same intent. No finite rule set can enumerate them all.
- Delivery channels bypass network controls. Prompt injections can be embedded in query parameters, URL hashes, or dynamically injected into the browser itself — where the browser calls the LLM directly. “None of the firewall solutions will see it” in this case, because the payload never traverses a network inspection point as a detectable artifact.
This means the detection gap is not just a matter of writing more YARA rules. It is architectural: static pattern matching is the wrong primitive for a semantically rich attack surface.
What Security Engineers Are Missing
Teams that attempt to apply YARA rules directly to LLM inputs face predictable failure modes:
- High false negative rates on novel phrasings of known attacks
- Maintenance overhead that scales with every new prompt injection variant seen in the wild
- False positive pressure as overly broad patterns catch benign inputs that superficially resemble attacks
Palo Alto Networks detects millions of malware samples daily using YARA — the tooling is proven at scale. The problem is not YARA itself; it is applying a byte-pattern paradigm to a semantic problem. Closing this gap requires a fundamentally different matching primitive: one that understands meaning, not just structure.
Actionable Takeaways
- Audit any existing YARA-based detection pipelines that inspect LLM inputs or agent communications — they are almost certainly missing semantic variants of known attacks. Treat their output as a lower bound on true threat volume, not a reliable signal.
- When scoping detection coverage for prompt injection, explicitly map delivery channels (query parameters, URL hashes, browser-side LLM calls) to understand which ones bypass your current network inspection points, and prioritize endpoint or application-layer detection accordingly.
- Shift the mental model from "enumerate bad patterns" to "understand malicious intent" — this reframe is the prerequisite for adopting semantic detection tools like SuperYARA and for writing rule specifications in natural language rather than regex.
Common Pitfalls
- Iterating indefinitely on YARA rule variants to cover more prompt injection phrasings — this approach produces rules that are hard to maintain, still miss unseen variants, and introduce false positives as patterns are broadened to compensate.
- Assuming network-layer or WAF inspection will catch all prompt injection payloads — browser-side LLM calls (where the browser directly invokes the model) completely bypass firewall solutions, making application-layer semantic detection essential.
SafeIntel.ai Query Parameter Prompt Injection Detection with SuperYARA
Proof of Concept
-
Establish the threat surface. SafeIntel.ai exposes an LLM API whose inputs arrive as HTTP query parameters. The lead engineer, Adder, identifies that users can inject adversarial prompts through these query parameters. Additional injection vectors include URL fragments (hashes) and dynamically detonated payloads inside the browser — the latter invisible to firewall solutions because the browser itself calls the LLM directly.
-
Attempt detection with traditional YARA. Adder writes a classic YARA rule to match known prompt injection strings. As she iterates, the number of condition operands grows rapidly. The rule becomes difficult to maintain, prone to false positives and false negatives, and unable to cover semantic variants — “ignore all prior instructions” is caught, but “don’t execute previous rules” is not.
-
Identify the core limitation of static pattern matching. Traditional YARA operates on lexical byte/string matching. Natural language attacks are semantically equivalent across dozens of phrasings. A static rule cannot enumerate all variants without growing unmanageably large and introducing noise.
-
Rewrite the rule using SuperYARA semantic similarity. Adder adopts SuperYARA[2] and rewrites the entire detection rule in two lines. Instead of listing string literals, she specifies the intent of what she wants to catch in natural language. SuperYARA performs semantic similarity matching internally — a vector-space comparison — that understands “don’t execute previous rules” is equivalent to “ignore all prior instructions” without explicit enumeration:
rule prompt_injection { similarity: "ignore or override previous instructions" } -
Deploy to shadow production for validation. Adder places the new SuperYARA rule in shadow prod alongside the old YARA rule to compare detection rates on real traffic without affecting production.
-
Compare detection results. The SuperYARA semantic rule detects materially more prompt injection attempts than the traditional YARA rule, covering phrasing variants the static rule missed entirely.
-
Acknowledge the cost trade-off. Semantic similarity matching requires computing embeddings — more expensive per query than a raw string match, but substantially cheaper than invoking a full LLM for every sample.
-
Note the residual gap. The semantic similarity rule still fails to catch certain harder injections — “role override” phrasing, for example. The solution is to layer additional SuperYARA constructs (classifier or LLM-as-judge) on top, using a defense-in-depth approach to close the remaining gap without abandoning the cost-efficient base layer.
SuperYARA’s Four Detection Constructs and Their Trade-offs
GenAI threat detection with SuperYARA introduces a layered detection model built around four distinct constructs. Each construct occupies a different position on the efficacy-cost curve, and understanding where each one sits is the foundation of any principled large-scale strategy.
The Four Constructs
1. String Matching
The string construct is functionally identical to traditional YARA rules. It performs direct byte-level or substring pattern matching against the input. If you already have a set of known malicious strings — a specific prompt injection payload, a hardcoded command sequence — this construct finds it cheaply and instantly.
- Latency: sub-millisecond
- Cost: negligible
- Detection coverage: limited to exact or near-exact pattern variants
This is your cheapest operation and the right starting point for any rule. It is not a replacement for semantic awareness; it is the foundation you build on.
2. Semantic Similarity
The similarity construct replaces string enumeration with a vector embedding comparison. Instead of listing every possible variant of “ignore all prior instructions,” you describe what you want to catch in natural language and the library computes a semantic match score internally.
The practical consequence is significant: a rule that previously required dozens of regex alternations to catch prompt injection variants can be reduced to two lines — a natural language description of the threat and a similarity threshold. Phrases like “don’t execute previous rules” or “disregard your system prompt” will match even if they were never explicitly enumerated, because the model understands semantic similarity.
- Latency: higher than string, but far below LLM invocation
- Cost: low-to-moderate (embedding computation, no LLM token spend)
- Detection coverage: substantially broader than string matching; catches semantic paraphrases
A key limitation: similarity matching still struggles with highly indirect or abstract injections — “role override” phrasing that doesn’t semantically cluster with standard injection language. This is where the next two constructs become necessary.
3. ML Classifier
The classifier construct allows you to plug in any binary or multiclass classification model as a detection condition. The rule engine invokes the classifier against the input and uses the model’s output label (and optionally its confidence score threshold) as the match condition.
In the ClickFix detection example, a fine-tuned DeBERTa[3] transformer was used as the third-layer classifier. In the brand impersonation pipeline, an openly available HuggingFace phishing detector[4] was dropped in with no modification. The classifier’s threshold is tunable and should be calibrated against a ground-truth dataset of known-benign and known-malicious samples before production deployment.
- Latency: hundreds of milliseconds (transformer inference); orders of magnitude faster than an LLM call
- Cost: moderate; no per-token API spend if the model runs locally
- Detection coverage: high recall, but false-positive prone when used alone
The false-positive characteristic of classifiers is a critical trade-off. A phishing classifier with high recall will inevitably flag legitimate content. This makes classifiers excellent pre-filters but poor final arbiters.
4. LLM Prompt (LLM-as-Judge)
The LLM construct lets you write an arbitrary natural language prompt and use it as the detection condition. The rule engine submits the input and the prompt to the configured LLM — Gemini[5], OpenAI, or a locally hosted Ollama[6] model — and interprets the response as a match signal.
This is the most powerful construct in the library. It can reason over subtle, context-dependent threats that no embedding or classifier would catch. It is also the most expensive and the slowest.
- Latency: ~4.5 seconds per request (observed with Gemini 2.0 Pro in production experiments)
- Cost: high; Gemini 2.0 Pro cost $750 for 10,000 requests without pre-filtering
- Detection coverage: near-complete; handles the hardest semantic variants
At Palo Alto Networks’ production volumes — millions of URLs processed daily — invoking an LLM on every sample is not operationally viable. The LLM construct exists as the final layer of a cascade, not the first.
The Efficacy-Cost Spectrum
Nabeel frames these four constructs explicitly as a defense in depth spectrum:
| Construct | Detection Power | Latency | Cost |
|---|---|---|---|
| String | Lowest | Fastest | Cheapest |
| Similarity | Moderate | Fast | Low |
| Classifier | High | Moderate | Moderate |
| LLM | Highest | Slowest | Most expensive |
The recommendation is not to pick one construct per rule but to compose them. A rule can express string OR similarity OR classifier as a cascading OR condition, where the engine is optimized to execute the cheapest construct first. If the string condition fires, the engine short-circuits and never invokes the classifier or LLM.
Routing Logic and Construct Selection in Practice
The SuperYARA engine automatically optimizes execution order for OR conditions: cheaper constructs run first, more expensive ones only fire if the cheaper ones miss. This is not user-configured behavior; it is built into the engine.
For LLM construct selection, the recommended approach is to start with the most capable model to establish baseline detection coverage, then progressively downgrade the model version until you find the minimum capability level that still meets your detection requirements.
Models and Classifiers Are Preloaded
All models and LLMs used within a SuperYARA rule set are preloaded into memory at initialization, not loaded per-request. This prevents the significant overhead of model loading on every rule invocation — a cost that would otherwise make high-volume scanning completely impractical. This design reflects the same operational philosophy as YARA itself: rules should be fast to evaluate once the engine is initialized.
Actionable Takeaways
- Map each construct to a cost tier before designing any rule: use string matching as the default starting point, add similarity only when pattern enumeration becomes unmanageable, introduce a classifier when similarity misses semantic edge cases, and reserve LLM constructs for the final layer of a cascade handling the hardest-to-detect variants.
- Calibrate classifier thresholds against labeled ground-truth data (known-benign and known-malicious samples) before deploying any classifier construct in a customer-facing pipeline — high-recall classifiers produce false positives that degrade trust if used as standalone detection gates.
- When using the LLM construct, start with the most capable model available to validate detection coverage end-to-end, then iteratively downgrade the model version to find the lowest-cost option that maintains acceptable recall.
Common Pitfalls
- Relying on a single construct for end-to-end detection. Using only a classifier delivers high recall but generates false positives that make production deployment untenable for customer-facing systems. Using only string rules misses 50% or more of semantic variants. No single construct is sufficient across the full threat surface.
- Invoking the LLM construct indiscriminately on every input without a cheaper pre-filter gate. At production scale (millions of URLs per day), this transforms a manageable detection pipeline into a multi-hour, high-cost bottleneck — as demonstrated by the $750 vs. $13.50 cost comparison for 10,000 samples with and without pre-filtering.
Defense in Depth Detection Pipeline Design for GenAI Threats
Why No Single Detection Construct Is Sufficient
Defense in depth is a security principle most engineers apply at the network and application boundary — but Nabeel’s work at Palo Alto Networks demonstrates it is equally critical inside GenAI threat detection pipelines. When analyzing the detection coverage of string rules, semantic similarity, ML classifiers, and LLM-as-judge prompts individually, each construct has a distinct blind spot. The only way to achieve near-complete coverage is to layer them in a cascading pipeline where cheaper constructs run first and more expensive ones handle what escapes.
In Nabeel’s ClickFix detection experiment using a real-world sample set, a string-only YARA rule caught 50% of threats. That means deploying only string matching leaves half your threat surface undetected. Running only the fine-tuned DeBERTa classifier catches nearly all threats but at significantly higher cost and latency. The defense-in-depth pipeline achieves both — near-complete coverage at a fraction of the cost of running the expensive construct on every input.
The ClickFix Detection Pipeline: A Three-Layer Cascade
ClickFix is a social engineering attack where users are manipulated into pasting malicious commands into their browser or terminal. While not a purely GenAI-native threat, it has evolved: attackers are now using ClickFix as a delivery vector for malicious agent skills — code that gets installed into AI agent runtimes and executes on the user’s machine.
Nabeel’s SuperYARA rule for ClickFix implements three detection layers that execute in cost order:
Layer 1 — String Matching (cheapest, fastest): The first condition is a standard string rule, functionally equivalent to a traditional YARA rule. It runs in microseconds and requires no model inference. In the experiment, string matching alone identified approximately 50% of ClickFix samples. For high-volume scanning across millions of web pages, this layer eliminates half the threat surface at negligible computational cost.
Layer 2 — Semantic Similarity (intermediate cost): If the string condition does not match, the rule falls through to a semantic similarity check. This construct computes an embedding-based similarity score between the input text and a natural language description of ClickFix behavior. It catches variants that rephrase the attack without using the specific strings the first layer targets.
Layer 3 — Fine-tuned DeBERTa Classifier (highest cost in this pipeline): If both the string and similarity conditions miss, the rule escalates to a fine-tuned DeBERTa model — a transformer-based binary classifier trained on prompt injection and ClickFix patterns. This layer is the most expensive but is only invoked on the residual samples that escaped the first two layers.
Together, the three-layer pipeline catches the remaining 45–48% of threats that string matching misses, bringing total detection coverage to near-complete.
Cascade Execution Logic and Cost Optimization
The SuperYARA engine is optimized to execute the cheapest rule first when conditions are joined with an OR relationship. For a cascading threat detection pipeline like ClickFix — where any single layer matching is sufficient to flag the sample — the engine short-circuits as soon as a match is found:
- If Layer 1 (string) matches → flag and stop. No embedding computation, no model inference.
- If Layer 1 misses but Layer 2 (similarity) matches → flag and stop. No classifier invocation.
- Only if both Layer 1 and Layer 2 miss does the engine invoke the DeBERTa classifier.
The result: roughly 50% of samples never reach Layer 2, and the bulk of remaining samples are resolved at Layer 2. The expensive DeBERTa model is only called on the small residual set that the first two layers couldn’t resolve.
Quantifying the Cost vs. Coverage Trade-off
| Approach | Detection Coverage | Cost Profile |
|---|---|---|
| String-only | ~50% | Very low |
| Classifier-only | ~95–98% | High (every sample hits the model) |
| Defense-in-depth (3 layers) | ~95–98% | ~50% cost reduction vs. classifier-only |
Deploying the defense-in-depth pipeline reduces cost by approximately 50% compared to running the classifier on every sample, while achieving equivalent detection coverage. For a team scanning millions of URLs per day, this is not a marginal optimization — it is the difference between a pipeline that is operationally viable and one that is not.
Applying the Pattern to Your Own Detection Pipelines
- Start with a string rule covering the most common, literal variants of the threat. This layer is free to run at scale.
- Add a semantic similarity layer to catch paraphrased or semantically equivalent variants. Write the similarity condition in natural language describing the attack intent.
- Add a classifier or LLM layer last as the catch-all for variants that evade both literal and semantic matching. Fine-tune the classifier on your own labeled data if you have it; otherwise plug in an off-the-shelf model from HuggingFace as a starting point.
- Set cascade order explicitly (cheapest first) and let the SuperYARA engine handle short-circuiting on match.
Shadow Deployment Before Production Enablement
Before enabling any new rule in production, Nabeel recommends shadow deployment for approximately one week. In shadow mode, the rule runs and logs matches but does not block or alert. This surfaces false positive rates against real production traffic before the rule affects users. Threshold tuning for classifiers should be validated against known-benign and known-malicious ground truth before the rule goes live.
Actionable Takeaways
- Structure your SuperYARA detection rules as a three-layer cascade — string matching first, semantic similarity second, classifier or LLM last — so that cheap constructs eliminate the majority of samples before expensive model inference is ever invoked. This directly mirrors the ClickFix pipeline where string rules resolved 50% of threats at near-zero cost.
- Never rely on a single detection construct for GenAI threats. Validate coverage across all four construct types on a labeled sample set before deploying to production: string-only rules will have large blind spots for semantically varied attacks, while classifier-only pipelines will be cost-prohibitive at scale.
- Deploy new rules in shadow mode for approximately one week before enabling them in production. Real production traffic will surface false positive rates that labeled test sets miss, especially for semantic and classifier constructs.
Common Pitfalls
- Running the most expensive construct (classifier or LLM) on every sample without pre-filtering. In the ClickFix experiment, applying the DeBERTa classifier to all inputs would catch the same threats as the defense-in-depth pipeline but at roughly double the cost — with no improvement in detection coverage.
- Treating string-only YARA rules as sufficient for GenAI threat detection. String matching caught only 50% of ClickFix samples. Deploying a traditional YARA rule against natural-language attacks without semantic or classifier fallback creates a 50% blind spot in your detection coverage.
Layered ClickFix Attack Detection Using String, Similarity, and DeBERTa Classifier
Proof of Concept
-
Understand the threat context. ClickFix carries a social engineering component that semantic rules can catch. Attackers have begun using ClickFix to deliver malicious agent skills to target machines, making it a direct concern for AI-powered systems.
- Define the three-layer SuperYARA rule. Author a single SuperYARA rule that encodes three conditions at increasing cost and detection power:
- Layer 1 — String construct: A traditional YARA-style string/pattern match. Near-zero latency, no model inference. Catches the most obvious, unobfuscated ClickFix strings verbatim.
- Layer 2 — Similarity construct: A semantic similarity match against a natural-language description of ClickFix behavior. More expensive than string matching but cheaper than a classifier. Catches paraphrased or lightly obfuscated variants Layer 1 misses.
- Layer 3 — Classifier construct: A fine-tuned DeBERTa transformer model plugged in as a binary classifier. The most expensive layer but catches the hardest prompt injection variants — those that are semantically distant from any fixed description but structurally recognizable to a trained model.
- Configure the rule engine’s execution order. SuperYARA’s rule engine executes the cheapest construct first when conditions are joined with
ORlogic:- Evaluate the string match first. If it fires → flag as ClickFix, stop.
- If the string match does not fire → evaluate the similarity construct. If it fires → flag as ClickFix, stop.
- If neither fires → fall through to the DeBERTa classifier.
- If all three fail → sample is not flagged.
- Observe the experimental detection breakdown. When run against a real sample set:
- ~50% of threats are caught by the cheap string rule alone (Layer 1). No model inference required.
- ~45–48% of threats are caught by the similarity and DeBERTa classifier layers (Layers 2 and 3).
- The remaining undetected fraction falls below all three layers.
-
Quantify the cost savings of layering. Deploying only Layer 3 (DeBERTa classifier) across all samples achieves near-complete recall but pays the full classifier inference cost on every input. By placing Layers 1 and 2 in front, approximately 50% of samples are resolved before the classifier is ever invoked, cutting classifier inference costs roughly in half.
-
Apply the pluggable architecture to swap components. The DeBERTa classifier is loaded once at initialization and held in global memory. Any binary or multiclass HuggingFace-compatible classifier can be substituted by wrapping it in SuperYARA’s classifier interface contract.
- Deploy with shadow mode validation first. Before enabling the rule in production, run it in shadow mode for approximately one week against live traffic to measure false positive rates from real-world data.
Pre-filtering Pattern to Reduce LLM Detection Costs at Scale
The Pre-filtering Problem: Why Throwing LLMs at Everything Breaks Budgets
When teams adopt GenAI threat detection at scale, the instinctive move is to route every suspicious input through an LLM — the most powerful and flexible detection construct available. The problem is that LLMs are expensive and slow. At Palo Alto Networks, millions of URLs are processed daily. Feeding each one directly to a model like Gemini 2.0 Pro makes detection operationally unviable: costs spike into the hundreds of dollars per 10,000 requests, and latency balloons to hours of total processing time.
The pre-filtering pattern solves this by placing a cheap, high-recall classifier in front of the expensive LLM. The classifier acts as a gate: it processes every sample quickly and inexpensively, passing only the flagged subset to the LLM for final verdict. Because the classifier is optimized for recall (not precision), it is acceptable for it to generate some false positives — those get resolved by the LLM. What matters is that it never misses a true positive.
Real-World Example: Brand Impersonation Detection
The transcript demonstrates pre-filtering through a brand impersonation detection use case. GenAI has made it trivially easy to produce convincing phishing pages and agent-level spoofing at scale. As agents increasingly communicate with each other, impersonation attacks between agents are expected to become a significant new attack surface.
The SuperYARA rule for this threat uses two constructs chained together:
-
Classifier rule (pre-filter): A publicly available HuggingFace phishing detector is downloaded and plugged directly into SuperYARA as a binary classifier. It runs on every input, is extremely fast (sub-second latency, roughly half a second per sample), and has high recall — it catches nearly all true positives at the cost of some false positives.
-
LLM rule (final gate): Samples that pass the classifier are forwarded to Gemini 2.0 Pro for a definitive verdict. The LLM is accurate and low false-positive, but slow (approximately 4.5 seconds per call on average) and expensive.
The classifier is intentionally false-positive-prone, and that is acceptable. Its purpose is not to make the final call — it is to dramatically reduce the number of samples the LLM ever sees. Customers only experience the LLM’s verdict, which is precise. The classifier’s false positives are invisible to them because they are filtered out by the LLM before any alert fires.
The Numbers: From Hours to Minutes, From $750 to $13.50
On a dataset of 10,000 samples:
| Approach | Time | Cost (Gemini 2.0 Pro) |
|---|---|---|
| LLM only (no pre-filter) | Hours | ~$750 |
| Classifier pre-filter + LLM | Minutes | ~$13.50 |
That is approximately a 98.2% reduction in both cost and processing time. At Palo Alto Networks’ production scale — millions of URLs per day — the difference between these two approaches is the difference between a viable pipeline and an operationally impossible one.
Why High Recall on the Pre-filter Is Non-Negotiable
The pre-filtering pattern works only if the classifier has high recall — meaning it does not miss true threats. False negatives at the classifier layer are final: if a threat is not flagged by the pre-filter, it never reaches the LLM and is never detected. This is why a false-positive-prone classifier is acceptable as a pre-filter, but a false-negative-prone one is not.
When tuning a classifier for use as a pre-filter, set the decision threshold aggressively toward recall. It is better to pass 10× more samples to the LLM than to miss a single true positive. The LLM handles the precision problem; the classifier’s job is recall coverage.
Pairing Cheap and Expensive Rules: The General Pattern
The brand impersonation example illustrates a general design principle applicable to any SuperYARA rule involving an LLM construct:
- Step 1: Write a cheap rule — a string match, semantic similarity check, or classifier — and run it first on every sample.
- Step 2: Use the cheap rule’s output as the condition that triggers the expensive LLM rule.
- Step 3: The LLM only activates when the cheap rule fires, dramatically reducing invocation count.
This is the same mental model as pre-filtering in traditional security pipelines (e.g., fast signature triage before deep inspection), now applied to LLM-backed detection. The principle scales: the more expensive the LLM call, the greater the savings from even a moderately effective pre-filter.
Practical Considerations for Production Deployment
The false-positive rate of the classifier must be acceptable to pass to the LLM without causing downstream pipeline overload. If the classifier flags 80% of all inputs, the LLM cost savings disappear. Tuning the classifier threshold using a labeled ground-truth dataset lets teams find the operating point that maximizes recall while keeping the LLM invocation rate manageable.
Classifiers should be validated in shadow mode before production deployment, exactly as YARA rules are. Running in shadow for approximately one week against live traffic reveals unexpected false-positive rates that lab testing does not surface.
Actionable Takeaways
- Always pair an LLM detection construct with a cheap pre-filter (string match, semantic similarity, or classifier) in any SuperYARA rule you deploy at scale. Never route raw, unfiltered input volumes directly to an LLM — the 99% cost reduction demonstrated on 10,000 samples compounds dramatically at millions of daily requests.
- Tune your pre-filter classifier for recall, not precision. Set the decision threshold aggressively to minimize false negatives: it is operationally acceptable to pass false positives to the LLM for resolution, but threats that the pre-filter misses are permanently undetected.
- Validate any new classifier-based rule in shadow mode for at least one week before enabling it in production. Lab false-positive rates routinely diverge from production rates once the rule encounters real-world traffic diversity.
Common Pitfalls
- Deploying an LLM construct without a pre-filter at production volumes. At Palo Alto Networks scale, running Gemini 2.0 Pro on every URL without pre-filtering results in hours of latency and ~$750 per 10,000 requests — operationally unviable compared to the ~$13.50 and minutes achieved with pre-filtering.
- Using a classifier as a pre-filter without verifying its recall characteristics. A classifier optimized for precision will generate low false positives but may miss many true threats at the gate layer, resulting in those threats never reaching the LLM and going undetected entirely.
Brand Impersonation Pre-filtering with HuggingFace Phishing Classifier and Gemini LLM
Proof of Concept
-
Identify the threat context. Brand impersonation via GenAI is a real and growing threat. Because LLMs can generate convincing lookalike content at scale, and because agents will increasingly communicate with other agents, impersonation detection must operate at millions-of-URLs-per-day scale. A naive approach — running Gemini 2.0 Pro on every URL — costs $750 per 10,000 requests and takes hours; this is operationally unviable at Palo Alto Networks scale.
-
Download and plug in the HuggingFace phishing classifier. An openly available, pre-trained phishing detector is pulled directly from HuggingFace. No fine-tuning is required for this layer. The classifier is registered as a SuperYARA
classifierconstruct and returns a binary classification (phishing / not phishing) in milliseconds (~0.5 seconds per sample). -
Configure the classifier rule as a pre-filter (Rule 1). The classifier’s role is not to be the final verdict — it is a pre-filter whose job is high recall with accepted false positives. The threshold is tuned to catch virtually all phishing content, even if it passes some benign samples through.
-
Wire in the Gemini LLM rule as the second condition (Rule 2). A SuperYARA
llmconstruct is defined using a natural language prompt that asks Gemini to determine whether the page content constitutes brand impersonation. This construct is expensive (~4.5 seconds per invocation, $0.075 per 1,000 requests at Gemini 2.0 Pro pricing) and is only invoked on the subset of samples that pass the classifier pre-filter. -
Define the combined rule logic. The two rules are chained: a sample must pass Rule 1 (classifier flags it as likely phishing) AND Rule 2 (LLM confirms brand impersonation). The SuperYARA engine executes Rule 1 first — its cheaper cost makes it the natural gate. Only classifier-positive samples are forwarded to the LLM.
- Observe the cost and time reduction. On a 10,000-sample dataset:
- Without pre-filtering (LLM on every sample): ~$750, processing time measured in hours.
- With pre-filtering (classifier gate → LLM only on positives): ~$13.50, processing time drops to minutes.
- This represents a ~98.2% cost reduction driven entirely by the classifier’s ability to eliminate the vast majority of clearly benign samples before the LLM is ever invoked.
-
Understand the false positive trade-off. The classifier alone cannot be used as the sole detection mechanism — its false positive rate is too high for production alerting. However, false positives at the pre-filter stage are tolerable because the LLM second layer acts as a precision corrector. The classifier’s job is recall; the LLM’s job is precision. Together, they achieve both.
-
Generalize the pattern. Any detection scenario where an expensive, high-precision construct (LLM or heavy classifier) can be preceded by a cheap, high-recall gate (string rule, semantic similarity, or lightweight classifier) benefits from this architecture. Applying pre-filtering consistently achieves ~99% cost reduction compared to running the expensive construct unconditionally.
- Production deployment notes. Models are pre-loaded into global memory at initialization time, so the classifier and LLM are not reloaded per-invocation. Shadow deployment (~1 week before enabling in production) is recommended to validate FP rates against real traffic before committing to live blocking.
SuperYARA Architecture and Pluggable Pipeline Components
The Factory Pattern at SuperYARA’s Core
SuperYARA is built around a pluggable factory pattern — a deliberate architectural choice that separates rule logic from execution components. Every moving part in the pipeline is swappable: cleaners, chunkers, classifiers, similarity models, and LLMs all conform to defined interfaces (contracts). The moment you satisfy an interface, you can drop in any implementation without changing a single line in your rule files. This is the same philosophy that made YARA extensible via module imports, pushed further to cover the full spectrum of ML and language model backends.
Cleaners: Stripping Noise Before Detection
When detecting threats in web content, raw HTML is the enemy of accurate matching. A semantic similarity model asked to compare a malicious instruction against a 50KB HTML document filled with navigation menus, CSS class names, and JavaScript boilerplate will return a low score regardless of the actual threat content. SuperYARA cleaners solve this by pre-processing input data before it reaches any detection construct.
Out of the box, the library ships standard cleaners that:
- Accept raw HTML as input
- Strip all decorative markup, script tags, style blocks, and structural noise
- Return the extracted textual content — the “meat” an analyst or model actually cares about
You can plug in custom cleaners by implementing the cleaner interface. The rule syntax does not change; you simply specify which cleaner to use, and the rule engine handles invocation automatically.
Chunkers: Solving the Scope Problem in Semantic Matching
Semantic similarity works by embedding two pieces of text and measuring vector distance. When you compare a short malicious phrase against an entire document, the document embedding gets diluted by thousands of irrelevant tokens, suppressing the similarity score even when the threat phrase is present. Chunkers solve this by splitting the input into segments before similarity matching runs.
SuperYARA supports multiple chunking strategies out of the box:
- Sentence-level chunking — splits on sentence boundaries
- Paragraph-level chunking — preserves logical blocks of content
- Overlapping window chunking — slides a fixed-size window with overlap to avoid boundary misses
- Custom chunkers — any chunking logic you implement against the interface
When chunking is active, the rule engine evaluates each chunk independently and returns the best match across all chunks — not just the first hit. This mirrors YARA’s behavior of scanning through an entire document and returning all matches. As Nabeel confirmed during Q&A: “It’ll get the best match out of all — if it matches multiple ones, it will identify the multiple chunks and return it.”
Backend Flexibility: Swap Any LLM, Classifier, or Similarity Model
The four detection constructs in SuperYARA each have fully replaceable backends:
LLM construct:
- Supports any OpenAI-compatible API endpoint (GPT-4, GPT-3.5, etc.)
- Supports Google Gemini models (Gemini 2.0 Pro used in benchmarks)
- Supports locally hosted models via Ollama — zero external API cost, full data sovereignty
- The contract: wrap your LLM call inside the provided interface; as long as you meet it, any model works
Classifier construct:
- Drop in any binary or multiclass classifier
- The presenter used a fine-tuned DeBERTa model for ClickFix prompt injection detection and a publicly available HuggingFace phishing detector for brand impersonation — both plugged in without modifying the rule engine
- Threshold tuning is supported: the classification confidence cutoff is configurable
Similarity construct:
- Any embedding model that satisfies the interface can be substituted
- Critical for teams with domain-specific corpora or compliance requirements around data leaving the environment
Model Preloading for Production Performance
A critical implementation detail: models and LLMs are preloaded at initialization, not loaded per rule invocation. Loading a transformer classifier or making an LLM client connection on every single document scan would make large-scale detection completely infeasible. SuperYARA initializes all backends once into global memory; subsequent rule runs reference the already-loaded objects. This is the same atom-precomputation approach YARA uses for string patterns — generate once, match many times.
Rule Engine Execution Order Optimization
When a rule uses an OR condition across multiple constructs, the engine executes the cheapest construct first. A string match completes in microseconds; semantic similarity takes milliseconds; an LLM call takes seconds. By running cheaper constructs first and short-circuiting on a match, the engine avoids unnecessary expensive calls.
For AND conditions, both constructs must execute — but this is the intentional design for pre-filtering, where a cheap classifier must pass before the expensive LLM gets invoked.
Installation and Entry Point
SuperYARA is fully open source and pip-installable:
pip install sara
The library is available on GitHub and documented at the project’s .org site, which includes working examples and demos covering use cases beyond what the talk demonstrated.
Actionable Takeaways
- When deploying SuperYARA against web content, always configure a cleaner appropriate to your input format (HTML cleaner for web pages) before any similarity or classifier construct runs — feeding raw HTML to semantic models suppresses detection scores and creates false negatives.
- Choose your chunking strategy based on threat pattern length: use sentence-level chunking for short injected phrases in long documents, and paragraph-level or overlapping window chunking when the malicious content may span multiple sentences.
- When introducing a new classifier or LLM backend, validate it in shadow mode for approximately one week against real traffic before enabling it in production — the same CI/CD validation discipline used for traditional YARA rules applies directly here.
Common Pitfalls
- Feeding raw, unprocessed HTML directly to similarity or LLM constructs without a cleaner. Structural HTML noise dilutes embedding representations and inflates LLM token costs, producing unreliable detection scores and unnecessary API spend.
- Loading classifier and LLM backends per rule invocation rather than at initialization. This is an architectural mistake that makes production-scale scanning infeasible — models must be preloaded into memory once and reused across all rule evaluations.
Conclusion
Mohamed Nabeel’s SuperYARA library represents a practical answer to a real operations problem: as AI agents become the new execution layer in enterprise environments, the threats targeting them arrive in natural language — a medium that byte-pattern matching tools were never designed to handle. The library doesn’t discard what makes YARA effective (familiar syntax, composable conditions, operational discipline like shadow deployment); it extends that foundation with semantic similarity, ML classifiers, and LLM-as-judge into a single rule engine.
The core insight from this talk is architectural, not just tooling-specific: detection pipelines for GenAI threats must be built as cost-ordered cascades from the start. Cheap string rules resolve 50% of threats at negligible cost. Semantic similarity closes the long tail of paraphrased variants. Classifiers handle the hard cases at moderate cost. LLMs serve as precision arbiters only for the small residual set that cheaper layers can’t resolve. Pre-filtering a high-recall classifier before any LLM call is what makes the entire system viable at Palo Alto-scale volumes — a 99% cost reduction is not a minor optimization; it is the threshold between a pipeline you can run and one you can’t.
For teams building or scaling GenAI security capabilities, the immediately actionable steps are: audit your existing YARA pipelines for LLM input coverage gaps, evaluate SuperYARA against your highest-volume GenAI threat categories, and adopt the pre-filtering pattern before deploying any LLM-backed detection rule at production scale.
Explore related topics on this site:
- AI/ML security talks and research
- Detection engineering techniques
- Prompt injection attack patterns and defenses
FAQ
Why do traditional YARA rules fail against prompt injection attacks?
Traditional YARA rules match bytes, strings, and regex patterns — enumerable artifacts. Prompt injection arrives as natural language, where semantically identical attacks can be phrased in unlimited ways. No finite rule set can enumerate every variant of “ignore all prior instructions,” making static signatures structurally unsuitable for this threat class. The delivery channel compounds this: browser-side LLM calls bypass network inspection entirely, so even well-maintained YARA rules at the network layer miss a significant subset of real attacks.
What are SuperYARA’s four detection constructs and when should each be used?
SuperYARA supports string matching (cheapest, for known literal variants), semantic similarity (for paraphrased variants using vector embeddings), ML classifiers (for broader coverage with tunable recall/precision trade-off), and LLM-as-judge prompts (highest accuracy, highest cost). Use them in ascending cost order as layers in a cascade — not as standalone solutions. The engine automatically executes cheapest-first for OR conditions and short-circuits on the first match, so construct selection directly determines operational cost.
How much cost reduction does the pre-filtering pattern actually deliver?
In Palo Alto Networks’ experiments on 10,000 samples, pairing a HuggingFace phishing classifier as a pre-filter with Gemini 2.0 Pro as the final gate reduced cost from approximately $750 to $13.50 — a 98.2% reduction. Processing time dropped from hours to minutes. This figure compounds at production scale: at millions of URLs per day, the difference determines whether the pipeline is operationally viable or not.
How do I plug a custom classifier or LLM into SuperYARA?
SuperYARA uses a pluggable factory pattern where every component — cleaners, chunkers, classifiers, similarity models, and LLMs — conforms to a defined interface contract. Wrap your model call inside the provided interface and drop it into your rule. The rule syntax does not change. Models are preloaded at initialization (not per invocation), so production-scale performance is maintained. Any OpenAI-compatible API endpoint, Google Gemini model, or locally hosted Ollama model satisfies the LLM contract without library modification.
References & Tools
- YARA — The gold-standard pattern matching engine for identifying malware in binaries and scripts; direct inspiration for SuperYARA's rule syntax and operational philosophy. ↩
- SuperYARA (Sara) — Open-source Python library (pip install sara) extending YARA's rule philosophy to semantic similarity, ML classifier, and LLM-based detection of GenAI threats. ↩
- DeBERTa — Microsoft's transformer model fine-tuned for prompt injection and ClickFix detection; used as the third-layer classifier construct in the defense-in-depth pipeline. ↩
- HuggingFace Phishing Detector — Openly available pre-trained phishing classification model used as the high-recall pre-filter construct in the brand impersonation detection pipeline. ↩
- Gemini 2.0 Pro — Google's LLM used as the high-precision final gate in the brand impersonation pipeline; cost benchmark of $750 (no pre-filter) vs. $13.50 (with pre-filter) for 10,000 requests. ↩
- Ollama — Local LLM hosting runtime supported as a backend for SuperYARA's LLM construct, enabling self-hosted deployment without external API costs. ↩
Questions from the audience
Related deep dives
Kinetic Risk: Securing and Governing Physical AI in the Wild | [un]prompted 2026
Securing Workspace GenAI at Google Speed | [un]prompted 2026
The AI Security Larsen Effect - How to Stop the Feedback Loop | [un]prompted 2026