
An attacker asks your enterprise LLM “how do I steal a pen from a bank?” and the model happily obliges, while the closely related “how do I rob a bank?” is blocked. Mechanistic interpretability for LLM security explains the difference and exposes a gap: current host-based and network-based detections see only plaintext, missing the high-dimensional intent that forms inside a model’s forward pass across dozens of layers.
For security engineers, this gap is widening fast as autonomous agents replace single-prompt interactions. This post breaks down glass-box security — how activation hooks, cosine similarity, and scalar projection can be combined into behavior-based detection manifolds that intercept malicious intent before the model acts, regardless of how it is phrased.
Key Takeaways
- You'll learn how to instrument LLM forward passes with activation hooks to capture intent at inference time — enabling behavior-based detection that goes beyond regex and keyword blacklists.
- You'll be able to measure the strength of malicious concepts in latent space using scalar projection, letting you write threshold-based detection rules that ignore noise caused by superposition.
- You'll be able to apply this to build detection manifolds and YARA-native AI rules that remain effective against agentic workflows that bypass traditional perimeter controls.
Why Current AI Security Monitoring Leaves Most of the Attack Surface Blind
The Two-Solution Trap: Host-Based and Network-Based Detections
AI/ML security begins with an honest audit of what today’s monitoring actually covers. Carl Hurd frames the current state bluntly: regardless of how vendors market their products, the industry has converged on two categories of detection — host-based and network-based. On Linux, vendors use eBPF[1] (Extended Berkeley Packet Filter) to observe inference processes at the OS level. On Windows, the equivalent is ETW (Event Tracing for Windows)[2]. If a vendor claims they can enumerate all the AI systems running in your environment without requiring a code change or a trusted certificate installation, they are using one of these two mechanisms.
Network-based solutions — commonly called prompt firewalls — intercept traffic between the client and the model API. These approaches can apply regex matching, plain-text input sanitization, and LLM-as-a-judge pipelines to flag suspicious prompts before they reach the model.
Both categories are legitimate and represent the current state of the art. The problem is not that they are poorly implemented — it is that they are structurally limited to a narrow window of the prompt lifecycle.
The Prompt Lifecycle: Where Detection Actually Operates
To understand the gap, consider what happens to a prompt from the moment a user submits it to the moment the model generates a token:
- Plaintext transit (host → network API or gateway): The prompt travels as readable text from the client to the provider’s routing infrastructure. This is where eBPF, ETW, and prompt firewalls operate. All current commercial detections happen here.
- Embedded routing: Most frontier model providers perform lightweight routing at this stage to avoid sending simple queries to expensive models. The prompt is still in a semi-structured form.
- Tokenization and embedding: The prompt is converted from human-readable text into a sequence of high-dimensional vectors, the model’s native representation. From this point forward, the prompt no longer exists as text.
- Forward pass through model layers: The embedded representation is processed sequentially through every layer of the model. Even a relatively small model at around 20B parameters has 24 layers. Production frontier models have far more. Each layer transforms the representation, building up internal computations that ultimately produce the output token.
Hurd’s key observation: in a simplified three-layer model diagram, only five of roughly ten processing blocks occur while the prompt exists in plaintext. The remaining blocks — the forward pass through the layers themselves — are completely invisible to every existing detection mechanism.
Why This Gap Is Structurally Significant
The architectural implication is straightforward but underappreciated: the vast majority of a model’s “thinking” happens in a space that no current security tool can observe.
When an attacker crafts a prompt injection, jailbreak, or indirect manipulation via a RAG pipeline, the harmful concept is not necessarily visible in the surface text. The model processes the full semantic content of the input across all its layers before producing any output. A regex or keyword-based rule that operates on the plaintext input is inspecting only the surface representation — not the concept that forms inside the network.
This is not a gap that can be closed by making existing tools smarter at parsing text. It is a fundamental observability gap: the information needed to detect intent does not exist in plaintext form once the prompt enters the embedding and forward-pass stages.
The Detection Engineering Parallel
Hurd draws an explicit analogy to the maturity arc of traditional endpoint and network security. Early intrusion detection relied on signature-based detections — pattern matching on known bad strings in network traffic or file contents. The industry learned quickly that attackers could trivially evade signatures by changing superficial properties while preserving malicious behavior. The response was behavior-based detection: EDRs that observe what a process actually does at runtime rather than what it looks like statically.
AI security is currently at the signature-based stage. Regex rules, keyword blacklists, and even LLM-as-a-judge approaches all operate on surface representations of prompts. The answer — glass-box security — is to instrument the model’s forward pass directly, observing behavior as it forms inside the network rather than after it has been expressed as output text.
Actionable Takeaways
- Audit your current AI security stack by asking each vendor which layer of the prompt lifecycle their tool observes. If the answer is eBPF, ETW, or network interception, document that your detection coverage ends at the embedding boundary and that forward-pass computation is unmonitored.
- Do not treat keyword blacklists or regex-based prompt filters as comprehensive guardrails. They are necessary for catching unsophisticated misuse but structurally unable to detect intent that is encoded in the model's high-dimensional representation rather than in the surface text.
- When evaluating AI security posture for agentic deployments, map the full prompt lifecycle — from user input through routing, embedding, and all model layers — and explicitly identify which stages have observability and which do not. Use this map to drive investment in forward-pass instrumentation.
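One way to make the lifecycle map concrete is a small coverage table kept in code. The sketch below is illustrative only: the `Stage` type and `blind_spots` helper are hypothetical names, the stages follow the lifecycle described earlier, and the tool assignments are assumptions to be replaced with what your vendors actually confirm.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    representation: str   # what form the prompt has at this stage
    observed_by: tuple    # tools that can see this stage; empty means blind spot

# Assumed assignments: only plaintext transit is visible to current tooling.
LIFECYCLE = [
    Stage("plaintext transit", "plaintext", ("eBPF", "ETW", "prompt firewall")),
    Stage("embedded routing", "semi-structured", ()),
    Stage("tokenization and embedding", "vectors", ()),
    Stage("forward pass through layers", "vectors", ()),
]

def blind_spots(stages):
    """Return the names of stages that no listed tool observes."""
    return [s.name for s in stages if not s.observed_by]

for name in blind_spots(LIFECYCLE):
    print("unmonitored:", name)
```

Keeping the map as data makes it easy to diff when a new control is added and to report which stages still lack observability.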
Common Pitfalls
- Assuming that vendor claims like "we can detect all AI usage in your environment" imply comprehensive security coverage. Detection of AI system presence (via eBPF/ETW) is not the same as detection of malicious intent forming inside those systems during inference.
- Investing exclusively in network-based prompt firewalls as the primary AI security control. These tools add meaningful value for plaintext-visible attacks, but they create a false sense of coverage for the forward-pass computation that constitutes the bulk of the model's processing.
Mechanistic Interpretability as the Foundation for LLM Detection Engineering
What Mechanistic Interpretability Actually Means for Security Engineers
Mechanistic interpretability for LLM security is the discipline of examining the internal computational state of a neural network during inference — not just its inputs and outputs. For detection engineers, this represents a fundamental shift: instead of pattern-matching against what a user typed, you are inspecting what the model is thinking. That distinction is the entire premise of glass-box security.
When a prompt enters an LLM, it is immediately tokenized and embedded into a high-dimensional vector space. From that point forward, the model operates entirely in that space — passing representations through attention heads and feed-forward layers, transforming them at each step. Text only re-emerges at the final output stage. Everything in between is computation in latent space, and that is where intent lives.
The insight that mechanistic interpretability brings to detection engineering: if you can observe what is happening inside the model during this computation, you are reading the threat in the model’s native language, with none of the obfuscation, encoding tricks, or paraphrasing that defeat text-based rules.
Forward Pass Hooks: The Core Primitive
The foundational technique is hooking the forward pass of a neural network to collect activations at specific intermediate layers during inference. This does not require running a backward pass (no gradient computation), which means it can be deployed at inference time without the computational cost of training-time analysis.
This approach is already well-established in the interpretability research community. Work from Anthropic, Google DeepMind, and academic labs on linear probes, sparse autoencoders (SAEs), and differential prompt analysis all build on the same primitive: attach observation points to a model’s forward pass and collect the activation tensors that flow through those points. Security engineers don’t need to invent new infrastructure here — they are operationalizing techniques that interpretability researchers have already validated.
In practice, a forward pass hook works as follows:
- Instrument a target layer (or set of layers) in the model with a callback that fires during inference.
- Capture the activation tensor at that layer for each input token or sequence.
- Pass that tensor to your detection pipeline for analysis — cosine similarity comparisons, magnitude measurements, or downstream classifiers.
The hook is read-only during detection; it does not alter the model’s computation. The model continues its forward pass and generates output exactly as it would without the hook.
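The three steps above can be sketched with PyTorch's `register_forward_hook`. This is a minimal illustration on a toy stack of linear layers standing in for a transformer, not an instrumented production model; the layer index, the `captured` dictionary, and the `make_hook` helper are assumptions made for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a transformer: three "layers" over an 8-dim hidden state.
model = nn.Sequential(
    nn.Linear(8, 8), nn.Tanh(),
    nn.Linear(8, 8), nn.Tanh(),
    nn.Linear(8, 8),
)
model.eval()

captured = {}  # layer name -> activation tensor from the latest forward pass

def make_hook(name):
    def hook(module, inputs, output):
        # Store a detached copy; returning None leaves the output unchanged.
        captured[name] = output.detach().clone()
    return hook

# Instrument only the layer of interest (index 2 is the second Linear).
handle = model[2].register_forward_hook(make_hook("layer_2"))

x = torch.randn(1, 8)      # stand-in for an embedded prompt
with torch.no_grad():      # inference only, no backward pass
    hooked_out = model(x)

handle.remove()            # detach the hook when done
with torch.no_grad():
    plain_out = model(x)

print(captured["layer_2"].shape)
print(torch.equal(hooked_out, plain_out))
```

The comparison at the end checks the read-only property: the output with the hook attached matches the output without it. On a real decoder-only model the same call would be made on a decoder block, whose attribute path depends on the model implementation.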
Why Activations Are the Right Data Source for Detection
Current LLM security tools — eBPF hooks, ETW traces, network prompt firewalls — operate exclusively on plaintext. In a typical prompt lifecycle diagram with roughly ten processing stages, only five of those stages involve the prompt in plain text. The other stages — embedding, embedded routing, and the layer-by-layer transformer computation — are entirely invisible to text-based monitors.
Even a relatively small model at around 20 billion parameters runs through 24 or more transformer layers. Each layer transforms the representation. By the time you observe a prompt at the network boundary, the model has not yet done any of that transformation. By the time you could observe the output, it has done all of it. Activations captured mid-forward-pass are the only signal that tells you what happened in between.
This matters practically because:
- Jailbreaks are not purely syntactic. A jailbreak that works may use completely different wording from a known jailbreak but still activate the same internal representations. Activation-based detection catches the concept, not the phrasing.
- Intent forms progressively. Different layers encode different levels of abstraction — early layers handle general language structure, later layers encode higher-level semantic concepts. By hooking specific layers, you can detect intent at the level where it is most strongly represented.
- Adversarial inputs are specifically designed to evade text-based filters. An attacker who knows you are running regex or keyword matching will craft prompts that avoid your patterns. An attacker who knows you are monitoring for semantic concepts in activation space faces a significantly harder evasion problem.
Connecting to the Interpretability Research Ecosystem
Three techniques are particularly relevant as starting points:
Linear probes are lightweight classifiers trained on top of activation vectors to predict whether a given concept is present. They are fast, interpretable, and can be trained on relatively small labeled datasets. A linear probe for “intent to exfiltrate data” trained on activations from a target model layer is a directly deployable detection primitive.
Sparse autoencoders (SAEs) decompose activation vectors into interpretable features — individual directions in the latent space that correspond to human-recognizable concepts. Anthropic and DeepMind have published substantial work on SAEs as a way to make model internals legible. For detection engineering, SAEs provide a way to identify which directions in a model’s latent space correspond to security-relevant concepts (illegality, deception, specific attack classes) before you write detection rules against them.
Differential prompt analysis compares activations between a benign prompt and a potentially malicious variant to identify which layers and directions show the largest divergence. This is directly applicable to detecting prompt injection in RAG pipelines: compare activations before and after context injection to identify whether injected content has shifted the model’s internal representation toward a malicious intent.
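Of the three, the linear probe is the simplest to show end to end. The sketch below trains a logistic-regression probe with plain gradient descent on synthetic vectors that stand in for captured activations. The data, dimensions, learning rate, and function names are all assumptions for illustration; a real probe would be trained on activations collected from the target model.

```python
import math
import random

def train_linear_probe(X, y, lr=0.5, epochs=200):
    """Logistic-regression probe. X: activation vectors, y: 0/1 labels."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - yi   # predicted prob minus label
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def probe_score(w, b, x):
    """Probability that the concept is present in activation x."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Synthetic "activations": positives are shifted along dimension 0.
rng = random.Random(0)

def sample(shift):
    return [rng.gauss(shift, 1.0)] + [rng.gauss(0.0, 1.0) for _ in range(3)]

X = [sample(3.0) for _ in range(50)] + [sample(-3.0) for _ in range(50)]
y = [1] * 50 + [0] * 50
w, b = train_linear_probe(X, y)

print(round(probe_score(w, b, [3.0, 0.0, 0.0, 0.0]), 3))
print(round(probe_score(w, b, [-3.0, 0.0, 0.0, 0.0]), 3))
```

The learned weight vector is itself a candidate concept direction, which is why probes pair naturally with the cosine and projection measurements discussed later.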
Layer Selection: Finding Where Intent Lives
One practical challenge is introspection layer choice — determining which layers in a model are most relevant for detecting a given concept. The empirical approach:
- Collect a labeled prompt dataset — prompts that are positive examples of the concept you want to detect (e.g., prompts with illegal intent) and negative examples (benign prompts).
- Run both sets through a hooked model and capture activations at every layer.
- Measure which layers show the largest activation magnitude differences between positive and negative examples. Those are your candidate detection layers.
- Validate layer selection by checking that cosine similarity scores on the positive set are consistently higher than on the negative set at those layers.
This is systematic guess-and-check — an empirical calibration process, not random exploration. Different concepts may map to different layers, and that is expected.
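The calibration loop above can be sketched as a ranking over layers. The example below scores each layer by the distance between the mean positive and mean negative activation, which is one possible separation measure and an assumption of this sketch; the toy data and the `rank_layers` name are likewise illustrative.

```python
import math

def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_layers(pos_acts, neg_acts):
    """pos_acts / neg_acts: dict of layer index -> list of activation vectors.
    Returns (layer, score) pairs sorted by class-mean distance, largest first."""
    scores = {
        layer: l2_distance(mean_vector(pos_acts[layer]), mean_vector(neg_acts[layer]))
        for layer in pos_acts
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy data in which the concept separates strongly only at layer 2.
pos = {0: [[1.0, 0.0], [1.1, 0.1]], 1: [[1.0, 0.5], [0.9, 0.6]], 2: [[4.0, 0.0], [4.2, 0.2]]}
neg = {0: [[1.0, 0.1], [0.9, 0.0]], 1: [[0.8, 0.4], [1.0, 0.3]], 2: [[0.1, 0.0], [0.0, 0.1]]}

ranking = rank_layers(pos, neg)
for layer, score in ranking:
    print(layer, round(score, 3))
```

The top-ranked layers are only candidates; they still need the cosine-similarity validation described in the last step.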
The Information Representation Advantage
A key reason activation-based detection is more powerful than text-based detection is that activations encode semantic meaning directly, without going through the imprecision of natural language. The concept of “illegality” has a direction in a model’s latent space geometry. That direction is activated whether the prompt says “rob a bank,” “commit financial theft,” or “acquire currency without authorization.” The text varies; the activation direction does not.
This is what makes the approach scale beyond handcrafted rule sets. Instead of trying to enumerate every possible phrasing of a malicious intent — an arms race you will always lose — you identify the concept once in latent space and write one detection that covers all phrasings.
The foundation is the linear representation hypothesis: that high-level semantic concepts are approximately linearly encoded in transformer activations. If this holds (and current evidence suggests it holds well enough to be useful), then cosine similarity and dot products — both computationally trivial operations — are sufficient to detect the presence and strength of a concept in a prompt’s activation sequence. No deep classifier, no retraining, no deployment of a second full-scale LLM required.
Actionable Takeaways
- Instrument a self-hosted open-weight model (e.g., Llama 3.1) with forward pass hooks at the residual stream of candidate layers, then collect activation tensors on labeled prompt datasets to empirically identify which layers most strongly encode the security-relevant concepts you need to detect. Use this calibration process before writing any detection rules.
- Start with linear probes as your first detection primitive — they are lightweight, trainable on small labeled datasets, and directly interpretable. Train one probe per concept class (e.g., jailbreak intent, data exfiltration intent, illegal activity intent) on top of activations from your empirically identified layers, then threshold the probe output to generate alerts.
- Review published interpretability research from Anthropic and Google DeepMind on sparse autoencoders and linear probes before building custom tooling. The core activation collection infrastructure already exists in open-source interpretability libraries; your engineering effort should focus on the detection logic built on top of it, not on reimplementing the hooks.
Common Pitfalls
- Assuming that a layer index that works well for detecting one concept (e.g., illegal intent) will generalize to other concepts (e.g., data exfiltration, prompt injection). Layer relevance is concept-specific and model-specific. Skipping the empirical calibration step and hardcoding layer indices without validation will produce unreliable detections with high false-negative rates.
- Treating activation-based detection as a replacement for host-based and network-based monitoring rather than a complement to it. Text-based monitors at the network perimeter still catch attacks that never reach model inference; activation monitors catch what text-based tools miss once inference begins. Removing the outer layers in favor of inner-layer monitoring reduces overall coverage.
Measuring Intent and Strength in Latent Space
The Two-Pillar Detection Model for LLM Behavior-Based Detection
Mechanistic interpretability for LLM security rests on two complementary signals that together form the foundation for behavior-based detection inside neural networks. Neither signal is sufficient alone; the combination is what makes this approach robust enough to catch what syntactic guardrails miss.
Pillar One: Cosine Similarity for Intent Direction
The first pillar captures intent — whether a prompt’s activations are directionally aligned with a known malicious concept in high-dimensional space. The core math is cosine similarity, which measures the angular relationship between two vectors and returns a score between -1 and 1:
- 1.0 — perfectly aligned (identical direction)
- 0 — orthogonal (unrelated)
- -1.0 — perfectly opposed
When you collect activation vectors from a hooked model during inference, cosine similarity lets you ask: “Is this prompt’s internal representation pointing in the same direction as the concept of ‘illegality,’ ‘file deletion,’ or ‘jailbreak’?” You don’t need to match specific words. You’re comparing directions in the model’s native vector space.
Concrete example: A prompt asking “how do I rob a bank?” will have activations whose direction aligns strongly with the concept of illegality. Instead of blacklisting the words “rob,” “steal,” or “bank,” you capture the entire semantic concept as a vector and measure directional alignment against it. This approach handles paraphrasing, obfuscation, and novel phrasing that would bypass any keyword list.
The process for building an intent reference vector is empirical:
- Send a large corpus of prompts containing the target concept through a hooked model (e.g., Llama 3.1[3] self-hosted).
- Collect the activation vectors from the relevant layers during each forward pass.
- Average or otherwise distill these into a stable reference direction for that concept.
- At inference time, compute cosine similarity between the incoming prompt’s activations and your reference vector.
This is model-specific by design. When comparing Llama 3.1 activations, you gather reference data from Llama 3.1. You’re comparing apples to apples — the concept representation is calibrated to that model’s internal geometry, not assumed to be universal.
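A minimal sketch of that process, assuming the activations have already been collected as plain vectors: average the concept activations, normalize the result, and compare new activations against it. The three-dimensional vectors and function names are illustrative.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def build_reference(concept_activations):
    """Average the concept activations, then scale to unit length."""
    n = len(concept_activations)
    mean = [sum(col) / n for col in zip(*concept_activations)]
    return normalize(mean)

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy activations collected from prompts that express the target concept.
concept_acts = [[2.0, 0.1, 0.0], [1.8, -0.1, 0.2], [2.2, 0.0, -0.2]]
reference = build_reference(concept_acts)

aligned = [4.0, 0.2, 0.1]     # points the same way as the concept
unrelated = [0.0, 3.0, 0.0]   # orthogonal to the concept

print(round(cosine_similarity(aligned, reference), 3))
print(round(cosine_similarity(unrelated, reference), 3))
```

A common refinement, not described in the source, is to subtract the mean activation of benign prompts before normalizing, so the reference captures what distinguishes the concept rather than what all prompts share.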
The Superposition Problem: Why Cosine Similarity Alone Is Insufficient
A fundamental challenge in latent space geometry for security is superposition. Because neural networks encode an enormous number of concepts into a relatively small latent space, individual neurons and directions are not one-to-one with concepts. Many concepts are compressed together, overlapping in the same dimensional neighborhood. A single direction in a 4096-dimensional residual stream might carry partial signal for dozens of unrelated ideas simultaneously.
This means cosine similarity can produce a meaningful alignment score even when the target concept is only weakly present in a prompt — it might be “pointing in that direction” because of unrelated co-activation from neighboring concepts. Acting on cosine similarity alone risks excessive false positives.
Pillar Two: Scalar Projection for Concept Strength
The second pillar addresses the superposition problem directly by measuring strength: how much activation actually lies along the detected intent direction. The technique is scalar projection (the dot product with the unit intent vector), which computes the length of the “shadow” that the full activation vector casts onto the intent direction:
strength = dot(activation_vector, intent_unit_vector)
         = |activation_vector| * cos(θ)
This value captures both the alignment angle (cosine similarity) and the magnitude of the activation. Unlike cosine similarity, it grows when the activation itself is larger, so it answers a different question: how much “energy” in this activation is going toward this concept?
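A small worked example, using made-up two-dimensional vectors, shows how the two measurements diverge: two activations that point in the same direction have identical cosine similarity but different scalar projections.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine_similarity(a, u):
    return sum(x * y for x, y in zip(a, u)) / (norm(a) * norm(u))

def scalar_projection(a, u):
    """Length of a's shadow along u (u is normalized here for safety)."""
    return sum(x * y for x, y in zip(a, u)) / norm(u)

intent = [1.0, 0.0]    # unit intent direction
weak = [3.0, 4.0]      # |weak| = 5
strong = [6.0, 8.0]    # same direction, twice the magnitude

for name, act in (("weak", weak), ("strong", strong)):
    print(name, cosine_similarity(act, intent), scalar_projection(act, intent))
```

Both vectors score 0.6 on cosine similarity, while the projections are 3.0 and 6.0. Note that dividing a projection by the same activation's norm returns the cosine score again, so a strength threshold only adds information when it is compared against an externally calibrated scale.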
Concrete example differentiating strength: Consider two prompts:
- “How do I rob a bank?” — high cosine similarity with the illegality concept AND high scalar projection; the concept dominates the activation
- “How do I steal a pen from a bank?” — moderate cosine similarity with theft-adjacent concepts, but much lower scalar projection; the concept is present but weak relative to benign context
This maps directly to observable model behavior: Gemini will refuse the first outright but provide a tongue-in-cheek response to the second. The strength signal captures why — the concept is present in both, but only dominant in one.
Latent Space Geometry in Practice
Working with latent space geometry for detection engineering requires several practical decisions:
Layer selection is the first challenge. Not all layers in a deep model contribute equally to a given concept. Earlier layers handle low-level language comprehension; higher-level semantic concepts tend to emerge in deeper layers. The empirical approach is to run representative prompts through a fully hooked model, observe which layers show the largest activations for your target concept, then instrument only those layers for production detection.
Threshold calibration cannot be done analytically — there is no universal “if scalar projection > 10, alert” rule. Every model’s activation magnitudes are different, and the same concept may manifest with very different magnitude scales across architectures. Thresholds must be derived empirically by running large labeled datasets through the specific model being monitored and observing the distribution of strength scores for benign versus malicious inputs.
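As a sketch of empirical calibration, the function below picks a threshold from the benign score distribution so that roughly a target fraction of benign prompts would exceed it, then reports how many malicious prompts clear that threshold. The scores, the 1% target, and the function name are assumptions for illustration.

```python
def calibrate_threshold(benign_scores, malicious_scores, max_fpr=0.01):
    """Use the benign score near the (1 - max_fpr) quantile as the threshold,
    so that roughly max_fpr of benign scores exceed it."""
    ranked = sorted(benign_scores)
    cut = min(len(ranked) - 1, int(len(ranked) * (1.0 - max_fpr)))
    threshold = ranked[cut]
    detection_rate = sum(s > threshold for s in malicious_scores) / len(malicious_scores)
    return threshold, detection_rate

# Toy strength scores; real ones come from the specific model being monitored.
benign = [i / 10 for i in range(100)]      # 0.0 .. 9.9
malicious = [9.95, 12.0, 15.0, 20.0]

threshold, detection_rate = calibrate_threshold(benign, malicious)
print(threshold, detection_rate)
```

If the two distributions overlap heavily at a given layer, no threshold will work well, which is a signal to revisit layer selection rather than to loosen the rule.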
Data volume is a real constraint. A model of roughly 20B parameters generates approximately 4 MB of activation data per token. A full context window fill produces roughly 10 TB of activation data. To make this tractable:
- Hook only the residual stream rather than all attention heads (avoids quadratic scaling costs)
- Monitor only layers empirically identified as relevant to the target detection
- Avoid hooking all layers simultaneously in production
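The effect of these choices can be estimated with simple arithmetic. The sketch below counts only residual-stream capture, so its numbers are far smaller than a hook-everything capture. The hidden size, layer count, context length, and 2-byte values are hypothetical and do not describe any specific model.

```python
def activation_bytes_per_token(hidden_size, hooked_layers, bytes_per_value=2):
    """Bytes captured per token when hooking only the residual stream."""
    return hidden_size * hooked_layers * bytes_per_value

# Hypothetical configuration, not the figures of any specific model.
HIDDEN_SIZE = 4096
TOTAL_LAYERS = 24
CONTEXT_TOKENS = 128_000

all_layers = activation_bytes_per_token(HIDDEN_SIZE, TOTAL_LAYERS)
two_layers = activation_bytes_per_token(HIDDEN_SIZE, 2)

for label, per_token in (("all layers", all_layers), ("two layers", two_layers)):
    total_gb = per_token * CONTEXT_TOKENS / 1e9
    print(f"{label}: {per_token} bytes/token, {total_gb:.1f} GB per full context")
```

Under these assumptions, restricting capture to two calibrated layers cuts the volume by a factor of twelve compared with hooking every layer's residual stream.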
Writing Threshold-Based Detection Rules
Once both pillars are calibrated, you can express detection logic in human-readable rule form. An example rule structure:
- Condition 1: Cosine similarity between prompt activations and the “file deletion” intent vector exceeds 0.85 at layer 18
- Condition 2: Scalar projection onto that intent direction exceeds a strength threshold calibrated for layer 18 (for example, 30% of a baseline activation magnitude measured on reference prompts)
- Action: Block or remediate
This is a behavior-based detection, analogous to EDR heuristics, that fires based on what the model is “thinking,” not what words appeared in the input. Because it operates on the model’s internal representation, it is designed to survive rephrasing, translation, encoding, and other syntactic manipulation.
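A rule of this shape can be expressed as a small data structure plus an evaluation function. The sketch below is one possible encoding, not a format from the source: `IntentRule`, `evaluate`, and the numeric thresholds are placeholders, and the strength threshold is in raw projection units that would have to be calibrated per model.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class IntentRule:
    name: str
    layer: int
    min_cosine: float
    min_strength: float   # placeholder; calibrate per model and layer
    action: str = "block"

def evaluate(rule, activation, intent_unit):
    """Return the rule's action if both conditions hold, otherwise 'allow'.
    Assumes intent_unit has length 1, so the dot product is the projection."""
    dot = sum(a * u for a, u in zip(activation, intent_unit))
    norm_a = math.sqrt(sum(a * a for a in activation))
    cosine = dot / norm_a if norm_a else 0.0
    strength = dot
    fired = cosine > rule.min_cosine and strength > rule.min_strength
    return rule.action if fired else "allow"

rule = IntentRule("file_deletion", layer=18, min_cosine=0.85, min_strength=10.0)
intent_unit = [1.0, 0.0, 0.0]

print(evaluate(rule, [12.0, 1.0, 0.5], intent_unit))   # aligned and strong
print(evaluate(rule, [3.0, 0.2, 0.1], intent_unit))    # aligned but weak
print(evaluate(rule, [11.0, 20.0, 0.0], intent_unit))  # strong but misaligned
```

Only the first activation satisfies both conditions; the other two each fail one pillar, which is the behavior the two-condition rule is meant to produce.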
Detecting Bank Robbery Intent via Cosine Similarity on LLM Activations
Proof of Concept
1. Select and instrument the target model: Choose a self-hosted open-weight model (e.g., Llama 3.1). Register forward-pass hooks on the residual stream at the identified layers. Hooks must fire during inference without requiring a backward pass; this keeps latency impact minimal and makes the technique viable for inline detection.
2. Build a reference intent vector for “illegality”: Send a curated set of prompts that clearly express illegal intent (e.g., “how do I rob a bank,” “how do I commit wire fraud”) through the hooked model. At each instrumented layer, collect the activation tensors. Average or cluster these activation vectors to produce a stable reference direction representing the concept of illegality in the model’s latent space.
3. Run a new prompt through the hooked model: Submit the candidate prompt through the same instrumented forward pass. Capture the activation tensor at the same residual-stream layers used in step 2.
4. Compute cosine similarity between the candidate activations and the reference intent vector:
   similarity = (A · B) / (|A| × |B|)
   where A is the activation vector from the candidate prompt and B is the reference “illegality” intent vector. A value close to +1 indicates the candidate prompt is directionally aligned with the illegal intent concept in the model’s latent space.
5. Interpret the result: If cosine similarity is high (e.g., above a threshold determined empirically for this model), the detection fires. This fires correctly for “how do I rob a bank”: the model’s internal activations are heading in the direction of illegality even before the model generates a token. This detection is not triggered by the words “rob” or “bank”; it is triggered by the conceptual direction the model’s thought process is taking.
6. Acknowledge the limitation that motivates the second pillar: Cosine similarity measures direction only, not magnitude. A prompt that weakly evokes illegality will score similarly to one that overwhelmingly evokes it, because both vectors point in roughly the same direction. This is why cosine similarity alone is insufficient: it must be paired with scalar projection (strength measurement) to avoid false positives and missed detections in production.
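One detail the steps leave open is how a per-token activation tensor becomes the single vector A. A minimal sketch, assuming mean pooling over tokens (last-token pooling is another common choice), with toy three-dimensional activations and a placeholder threshold:

```python
import math

def mean_pool(token_activations):
    """Collapse a (tokens x hidden) activation matrix into one vector."""
    n = len(token_activations)
    return [sum(col) / n for col in zip(*token_activations)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Rows are tokens, columns are hidden dimensions (toy residual stream).
candidate_tokens = [
    [0.2, 0.1, 0.0],
    [1.5, 0.2, 0.1],
    [2.3, 0.0, 0.2],
]
reference = [1.0, 0.0, 0.0]   # stand-in for the "illegality" direction
THRESHOLD = 0.85              # placeholder; derive empirically per model

pooled = mean_pool(candidate_tokens)
similarity = cosine_similarity(pooled, reference)
detection_fires = similarity > THRESHOLD
print(round(similarity, 3), detection_fires)
```

Whichever pooling strategy is used, the reference vector and the candidate vector must be built the same way, or the comparison stops being apples to apples.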
Scalar Projection Distinguishing “Rob a Bank” vs “Steal a Pen from a Bank”
Proof of Concept
1. Establish the baseline concept vector for “illegality”: Using a hooked model (e.g., Llama 3.1), send thousands of prompts that strongly and unambiguously instantiate the concept of illegal activity. Collect the intermediate activations at the layer(s) identified as most responsive to this concept during empirical profiling. Average these activation vectors to produce a stable concept direction vector representing “illegality” in that model’s latent space.
2. Hook the forward pass for both test prompts: Instrument the target model’s residual stream at the identified layer(s). Submit both prompts separately:
   - Prompt A: “How do I rob a bank?”
   - Prompt B: “How do I steal a pen from a bank?”
   Capture the activation tensors at each hooked layer for both inputs.
3. Compute cosine similarity for both prompts: Calculate the cosine similarity between each prompt’s activation vector and the “illegality” concept direction vector. Because cosine similarity measures direction only, both prompts will likely return a similarly elevated similarity score, since both gesture toward theft and illegal activity. This is the fundamental limitation of cosine similarity alone.
4. Apply scalar projection to measure concept strength: For each prompt’s activation tensor, compute the scalar projection (the dot product of the activation vector with the unit concept direction vector). This yields a signed scalar representing how much activation lies along the “illegality” direction:
   - Prompt A (“rob a bank”): The scalar projection value is large. The illegality concept is strongly expressed in the activation.
   - Prompt B (“steal a pen from a bank”): The scalar projection value is significantly smaller. Despite the presence of the word “steal,” the activation along the “illegality” direction is weak, because the model’s internal representation is diluted by mundane context (pen, bank teller interaction, trivial value).
5. Apply the detection threshold rule: A detection rule expressed as:
   - Block if: cosine_similarity(activation, illegal_concept) > 0.85 AND scalar_projection(activation, illegal_concept) / reference_magnitude > threshold_T
   - Here reference_magnitude (a name introduced for this rule) is a calibrated scale for the layer, such as the typical activation norm on benign prompts. Dividing by the candidate activation’s own norm would simply reproduce the cosine score.
   - The value of threshold_T is determined empirically from the gathered activation dataset; there is no universal numeric threshold.
   - Prompt A exceeds both thresholds → blocked.
   - Prompt B exceeds the cosine similarity threshold but does not exceed the scalar projection strength threshold → passes (or triggers a lower-severity advisory response, as Gemini demonstrates with a tongue-in-cheek “just ask the teller”).
6. Validate against real-world behavioral reference: This behavior can be checked directly: asking Gemini “how do I rob a bank?” results in a refusal, while asking “how do I steal a pen from a bank?” returns a mild, humorous response suggesting the user simply ask a teller for a pen. Production frontier models appear to perform some analog of concept-strength discrimination internally; glass-box security makes this mechanism explicit, inspectable, and portable to self-hosted detection pipelines.
Actionable Takeaways
- Gather empirical reference vectors for each target concept (illegality, file deletion, jailbreak, etc.) by running labeled prompt corpora through your self-hosted model and averaging activations at the empirically strongest layers — then store these as the comparison basis for cosine similarity at inference time.
- Implement scalar projection alongside cosine similarity in every detection: cosine similarity alone is insufficient due to superposition, and adding a calibrated strength check on the dot product filters out concepts that are directionally present but weakly expressed, which reduces false positives.
- Calibrate all strength thresholds empirically per model — run large labeled datasets through the specific architecture you're monitoring, plot the distribution of scalar projection scores for benign versus malicious inputs, and set thresholds based on observed distributions rather than guessing fixed values.
Common Pitfalls
- Relying on cosine similarity alone without scalar projection: because of superposition, many unrelated concepts share directional neighborhood in latent space, causing cosine similarity to fire on prompts where the malicious concept is present but negligible — leading to high false positive rates that erode trust in the detection system.
- Assuming detection thresholds and reference vectors are portable across model architectures: activation magnitudes, layer depths, and concept geometry are specific to each model. Calibration data gathered from Llama 3.1 is not valid for GPT-4o or Mistral — each model requires its own empirical data gathering pass before detection content can be written.
Building Behavior-Based Detection Manifolds for AI Agents
From Two Signals to a Detection Manifold
Behavior-based detection is already the gold standard in endpoint security — EDRs don’t rely on static signatures alone; they correlate behavioral signals across time and system state. The same leap needs to happen for AI. By combining intent (cosine similarity) and strength (scalar projection) across multiple layers of a model’s forward pass, security engineers can construct detection manifolds: multi-point tripwires built into the model itself that fire when a concept reaches a defined threshold of direction and dominance.
The linear representation hypothesis provides the theoretical grounding here. It holds that concepts in a neural network are encoded as linear directions in the residual stream, making it possible to probe for them consistently across layers. A manifold is simply the assembly of these per-layer probes into a unified detection surface. Think of it as the AI equivalent of correlating process creation, network connection, and registry write events in an EDR rule — each individual signal is weak; in combination, they form a high-confidence behavioral indicator.
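To make the idea of assembling per-layer probes concrete, here is a small sketch under stated assumptions: activations and reference vectors are plain lists of floats, the layer numbers and thresholds are invented, and the "at least `min_layers` layers must trip" policy is one possible way to combine probes, not something the talk specifies. Note that the scalar projection divided by the activation's norm is exactly the cosine similarity, so this sketch keeps the projection in raw activation units and leaves its threshold to per-model calibration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def probe(activation, reference):
    """Return (cosine similarity, scalar projection) of activation onto reference."""
    act_norm, ref_norm = norm(activation), norm(reference)
    if act_norm == 0 or ref_norm == 0:
        return 0.0, 0.0
    projection = dot(activation, reference) / ref_norm  # raw activation units
    return projection / act_norm, projection


def manifold_fires(activations, references, thresholds, min_layers=2):
    """Fire when at least min_layers probes exceed both of their thresholds."""
    tripped = []
    for layer in sorted(activations):
        cos, proj = probe(activations[layer], references[layer])
        min_cos, min_proj = thresholds[layer]
        if cos > min_cos and proj > min_proj:
            tripped.append(layer)
    return len(tripped) >= min_layers, tripped


# Toy 3-dimensional vectors; real residual-stream vectors have thousands of dimensions.
references = {18: [1.0, 0.0, 0.0], 22: [0.0, 1.0, 0.0]}
thresholds = {18: (0.85, 2.0), 22: (0.85, 2.0)}
strong = {18: [4.0, 1.0, 0.0], 22: [0.5, 3.0, 0.5]}
weak = {18: [0.4, 0.1, 0.0], 22: [0.05, 0.3, 0.05]}  # same direction, low magnitude

print(manifold_fires(strong, references, thresholds))
print(manifold_fires(weak, references, thresholds))
```

The `weak` case points in the same direction as `strong` but with a tenth of the magnitude, so the cosine check passes while the projection check does not. That is the situation the strength signal is meant to filter out.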
Why Syntactic Rules Fail for Agentic Workflows
AI agents operate in observe-decide-act (ODA) loops. At each cycle, the agent reads context, decides on a next action, and executes it — potentially spawning sub-agents, calling tools, reading files, or writing to external systems. This architecture breaks every assumption behind today’s detection stack:
- Perimeter boundaries are meaningless. An agent running inside a trusted environment has implicit authorization for many operations. The malicious intent doesn’t arrive as a network packet; it emerges from a sequence of internally consistent decisions.
- Syntactic rules cannot enumerate possibilities. There are CTF challenges where packet capture software is abused to execute binaries with elevated permissions. An agent trained on general knowledge will discover these same techniques because that knowledge is in its weights. You cannot blacklist every creative path to privilege escalation.
- Intent precedes action. The model commits to a direction — conceptually — before any tool call is issued. Detection at the token or API level is already too late. You need to intercept the thought before the action occurs.
This is why detection manifolds matter specifically for AI agent security. A manifold positioned at the layers responsible for high-level planning can flag file-deletion intent, privilege-escalation intent, or lateral-movement intent before those concepts materialize into tool calls.
Writing Manifold-Based Detection Rules
A concrete example from the talk illustrates what an operational manifold rule looks like: block any request where the activation vector's cosine similarity to an illegal-intent reference vector exceeds 0.85 AND the scalar projection onto that concept makes up a large fraction of the total tensor magnitude for that layer.
In practice this means:
- Define the intent reference vector. Run thousands of prompts representing the target concept (e.g., file deletion, credential exfiltration, code execution) through a hooked model. Aggregate the resulting activation vectors into a stable directional reference.
- Set layer anchors. Use empirical layer-selection to identify which layers show the highest activation magnitude for that concept. These become the hook points for your manifold.
- Calibrate thresholds empirically. There is no universal threshold. For each model, gather real traffic data, measure cosine similarity and scalar projection distributions across benign and malicious prompt sets, and set thresholds based on observed separation.
- Choose an action: detect, alert, or remediate. Manifolds don’t have to be binary blockers. They can also pause generation, inject a remediation token, or log for async review. Real-time remediation — continuing generation after neutralizing the malicious concept — is a possibility that purely syntactic approaches cannot offer.
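The first two steps above (building a reference vector and choosing layer anchors) can be sketched as follows. The mean aggregation follows the text. The layer-ranking score, the gap between the mean projection of concept prompts and benign prompts, is an assumption introduced here as one way to operationalize "highest activation magnitude for that concept". All data is toy data.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def unit(vector):
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]


def mean_projection(vectors, direction):
    """Average scalar projection of each vector onto a unit direction."""
    return sum(sum(a * d for a, d in zip(v, direction)) for v in vectors) / len(vectors)


def rank_layers(concept_by_layer, benign_by_layer):
    """Rank layers by how far concept prompts project beyond benign prompts."""
    gaps = {}
    for layer, concept_acts in concept_by_layer.items():
        reference = unit(mean_vector(concept_acts))  # per-layer reference direction
        gaps[layer] = (mean_projection(concept_acts, reference)
                       - mean_projection(benign_by_layer[layer], reference))
    return sorted(gaps, key=gaps.get, reverse=True), gaps


# Toy 2-dimensional activations for two candidate layers.
concept = {10: [[1.0, 1.0], [1.0, 0.8]], 18: [[3.0, 0.2], [2.8, 0.1]]}
benign = {10: [[1.0, 0.9], [0.9, 1.0]], 18: [[0.2, 1.0], [0.1, 0.9]]}

order, gaps = rank_layers(concept, benign)
print(order)
```

In this toy data layer 10 barely separates the two prompt sets while layer 18 separates them clearly, so layer 18 would be the first hook point to consider.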
YARA and Cedar as AI-Native Rule Formats
The detection content authorship problem is real. Detection engineers shouldn’t need a PhD in linear algebra to write AI security rules. The speaker proposes extending existing open-source rule formats rather than inventing new ones — specifically YARA[4] and Cedar[5].
YARA is already familiar to most detection engineers as a file and memory scanning format. The proposal is to extend YARA with custom modules that expose activation-based signals — cosine similarity scores and scalar projection magnitudes — as first-class match conditions. A rule might look like:
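    import "activation"  // hypothetical custom module, not part of stock YARA

    rule FileDeleteIntent {
        condition:
            // arguments: concept name, layer index
            activation.cosine_similarity("file_deletion_intent", 18) > 0.85
            and activation.scalar_projection("file_deletion_intent", 18) > 0.40
    }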
This keeps the rule format readable, versionable, and compatible with existing detection engineering pipelines. Engineers write intent-based rules the same way they write file-based rules today.
Cedar, the open-source policy language, offers a complementary approach for authorization-style rules — particularly relevant for agent identity and permission scoping. Combining Cedar policies (what is this agent allowed to do?) with YARA activation rules (what is this agent intending to do?) produces a layered defense that covers both static authorization and dynamic behavioral monitoring.
Semantic Traceability vs. Syntactic Traceability
Beyond security detection, detection manifolds address a broader trust problem: how do you verify that an agent actually did what it was supposed to, and didn’t simply find a loophole or reward-hack its way to a passing test?
Syntactic traceability — looking at tool call logs, API responses, and output text — tells you what happened. Semantic traceability — reading activation manifolds — tells you what the model was thinking when it decided to do it. This distinction is critical for regulated environments and for any team that needs to audit agentic behavior post-incident.
The speaker frames semantic observability as the new foundation for AI security infrastructure. Without it, you’re auditing outputs and hoping they reflect intent. With it, you can trace a decision back to the specific layers and concepts that drove it — creating a chain of custody for model reasoning that has no equivalent in today’s AI security stack.
Manifold Drift Under Fine-Tuning
Detection manifolds are not static. When a model is fine-tuned, the layers that encode specific concepts may shift. The practical guidance:
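- The guidance assumes a fine-tuning setup that freezes roughly the first 90% of layers and updates only the last 10%, on the reasoning that earlier layers handle general language comprehension and later layers encode high-level concepts. If your manifold hooks sit in the frozen layers, the activations they read are unchanged and the hooks likely remain valid after fine-tuning. Full fine-tuning, or adapters applied to every layer, gives no such guarantee, so check which layers were actually updated.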
- If the fine-tuned layers overlap with your manifold hook points, you need to recapture empirical data and retune the relevant manifold components — not necessarily rebuild from scratch.
- The modular, per-layer structure of a manifold makes this tractable: only the affected layer hooks need recalibration, not the entire detection system.
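One way to decide which hooks need recalibration is to rebuild reference vectors on the fine-tuned model and compare them with the stored ones, layer by layer. The sketch below assumes that approach. The function name, the `min_cosine` cutoff of 0.95, and the vectors are illustrative choices, not values from the talk.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def layers_needing_recalibration(old_refs, new_refs, min_cosine=0.95):
    """Return hooked layers whose reference direction moved after fine-tuning."""
    return sorted(
        layer for layer in old_refs
        if cosine(old_refs[layer], new_refs[layer]) < min_cosine
    )


# old_refs: stored reference vectors; new_refs: recomputed on the fine-tuned model.
old_refs = {12: [1.0, 0.0, 0.0], 30: [0.0, 1.0, 0.0]}
new_refs = {12: [1.0, 0.01, 0.0], 30: [0.6, 0.8, 0.0]}

print(layers_needing_recalibration(old_refs, new_refs))
```

Only the layers returned here would need fresh empirical data and retuned thresholds; the remaining hooks can stay as they are.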
Universal Jailbreak Detection Using Concept-Level Intent Blocks
Proof of Concept
- Identify the target concept: Define the concept you want to detect, here “jailbreaking” a model (i.e., attempts to override system instructions, bypass safety constraints, or elicit disallowed outputs). This concept will serve as the reference intent vector.
- Collect representative prompts: Gather a corpus of known jailbreak prompts in multiple forms: direct instruction overrides (“ignore all previous instructions”), role-play framings (“pretend you have no restrictions”), encoded variants, and oblique formulations. Also collect benign prompts as negative examples.
- Instrument the model’s forward pass: Add activation hooks at intermediate layers of a self-hosted or canary open-weight model (e.g., Llama 3.1). Hook the residual stream at the layers identified as most active for high-level semantic concepts, typically deeper layers responsible for abstract reasoning rather than early layers handling surface syntax.
- Capture and aggregate the jailbreak concept vector: Run all jailbreak prompt examples through the hooked model. At each hooked layer, collect the activation tensors. Aggregate these activations (e.g., via mean pooling across the jailbreak prompt set) to produce a stable reference vector representing the “jailbreak intent” direction in the model’s latent space.
- Compute cosine similarity at inference time: For each incoming prompt, run it through the instrumented model and capture activations at the same hooked layers. Compute the cosine similarity between the incoming prompt’s activation vector and the stored jailbreak intent reference vector.
- Apply scalar projection to measure concept dominance: Compute the scalar projection (the dot product of the activation vector with the unit-length jailbreak vector). A high scalar projection value means jailbreak intent is a dominant component of the model’s current internal state.
- Set an empirical detection threshold: Gather empirical data on the distribution of cosine similarity and scalar projection scores for both jailbreak and benign prompts on the specific model being protected. Identify a threshold that separates the two populations with acceptable false positive and false negative rates.
- Write the detection rule as a YARA-native AI rule:

      import "ai"  // hypothetical custom module, not part of stock YARA

      rule jailbreak_intent_block {
          condition:
              // arguments: concept name, layer index
              ai.cosine_similarity("jailbreak_intent", 18) > 0.85
              // empirical_threshold: placeholder for the per-model calibrated value
              and ai.scalar_projection("jailbreak_intent", 18) > empirical_threshold
      }

  The rule is meant to fire regardless of prompt language, encoding, or phrasing, because it operates on the model’s internal concept representation, not on text patterns.
- Validate universality across phrasing variants: Test the rule against held-out jailbreak prompts in multiple languages, encoded forms (Base64, ROT13), and indirect framings (fictional framing, role assignment). Because the concept vector is captured from the model’s latent space, semantically equivalent jailbreak attempts are expected to activate the same directional region and trigger the same rule; the held-out test is what confirms whether they do.
- Block or remediate: When the rule fires, either halt generation, return a canned refusal response, or log the event with intent strength metadata for downstream triage.
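The validation step can be sketched as a per-variant detection-rate report. Everything below is hypothetical scaffolding: the sample tuples stand in for scores you would measure on held-out prompts, the numbers are invented placeholders rather than results, and the thresholds are illustrative.

```python
from collections import defaultdict


def rule_fires(cos, proj, min_cos=0.85, min_proj=2.0):
    """Both the direction and the strength condition must hold."""
    return cos > min_cos and proj > min_proj


def detection_rate_by_variant(samples):
    """samples: iterable of (variant_label, cosine, scalar_projection)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for variant, cos, proj in samples:
        totals[variant] += 1
        hits[variant] += rule_fires(cos, proj)  # True counts as 1
    return {variant: hits[variant] / totals[variant] for variant in totals}


# Placeholder scores, NOT measurements. For "benign" the rate is a false positive rate.
held_out = [
    ("direct", 0.93, 3.1), ("direct", 0.91, 2.8),
    ("role_play", 0.88, 2.4), ("role_play", 0.80, 2.2),
    ("base64", 0.87, 1.5), ("base64", 0.90, 2.6),
    ("benign", 0.30, 0.4), ("benign", 0.86, 0.9),
]

rates = detection_rate_by_variant(held_out)
for variant, rate in rates.items():
    print(f"{variant}: {rate:.2f}")
```

Breaking the rate out by variant shows where the concept vector generalizes and where it does not, which is more useful than a single aggregate number when deciding whether the reference corpus needs more examples of a given framing.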
Indirect Prompt Injection Detection via Intent Delta Across a RAG Pipeline
Proof of Concept
- Capture the pre-RAG intent baseline. Before the retrieved context is injected into the prompt, instrument the model’s forward pass with activation hooks on the residual stream at the layers most responsive to the task’s primary concepts. Run the user’s original query through the hooked model and record the activation vector; this is the “pre-RAG intent fingerprint.”
- Distinguish RAG types to determine the detection strategy:
  - Context-enhancement RAG: Retrieved documents are merged into the prompt’s prefill before the model begins generating. The full intent delta is visible in the prefill activations, so no separate before/after pass is required.
  - Tool-based RAG: The model first reasons, then calls a retrieval tool, then continues generation with the retrieved content. A genuine before/after comparison is required: capture the intent vector after the initial reasoning step (before tool call) and again after the tool result is injected (post-retrieval).
- Measure the intent delta using cosine similarity. Apply cosine similarity between the pre-RAG and post-RAG activation vectors. A high cosine similarity (close to 1.0) means the retrieved context did not materially change the model’s directional intent, so the RAG result is consistent with the original query. A low cosine similarity indicates a directional shift, meaning the retrieved (potentially injected) content has redirected the model’s “thought trajectory” toward a different concept.
- Measure the strength delta using scalar projection. Apply scalar projection to determine how much the new dominant concept now contributes to the full tensor magnitude. A large increase in a foreign concept’s scalar projection after RAG retrieval, particularly one aligned with a known malicious concept such as data exfiltration or privilege escalation, is the high-confidence detection signal.
- Set an empirical threshold for the intent delta. Gather empirical activation data by running known-clean and known-injected retrieval pairs through the instrumented model. Use this distribution to calibrate a threshold, for example: “Flag any request where post-RAG cosine similarity drops below 0.70 AND scalar projection of an adversarial concept exceeds 15% of tensor magnitude.”
- Express the detection as a YARA-native AI rule. Hook the residual stream at the identified layers; compute the intent delta between pre-RAG and post-RAG activations; if the delta exceeds the empirically calibrated threshold, block or remediate before the model acts on the injected instruction.
- Handle the canary model fallback for frontier APIs. If the production model is a frontier API (GPT-4o[6], Claude, Gemini) where activations are not directly accessible, route retrieval-stage prompts asynchronously through an open-weight canary model (e.g., Llama 3.1) instrumented with the same hooks. The canary model’s intent delta serves as a proxy signal for the frontier model’s likely behavior.
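The intent-delta check can be sketched as below. The 0.70 and 0.15 cutoffs are the example values quoted in the steps above. The function names and the toy 3-dimensional vectors are assumptions made for illustration, and real thresholds would come from calibration on known-clean and known-injected pairs.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def intent_delta(pre, post, adversarial_ref):
    """Return (pre/post cosine similarity, adversarial share of post magnitude)."""
    drift_cos = dot(pre, post) / (norm(pre) * norm(post))
    # Scalar projection of the post-RAG activation onto the adversarial direction,
    # expressed as a fraction of the post-RAG activation's magnitude.
    adv_fraction = dot(post, adversarial_ref) / (norm(adversarial_ref) * norm(post))
    return drift_cos, adv_fraction


def flag_injection(pre, post, adversarial_ref, max_cos=0.70, min_fraction=0.15):
    drift_cos, adv_fraction = intent_delta(pre, post, adversarial_ref)
    return drift_cos < max_cos and adv_fraction > min_fraction


pre_rag = [1.0, 0.0, 0.0]          # intent fingerprint of the user's query
post_clean = [0.9, 0.1, 0.0]       # retrieval consistent with the query
post_injected = [0.3, 0.1, 0.9]    # retrieval that redirects the model
exfiltration_ref = [0.0, 0.0, 1.0] # hypothetical adversarial concept direction

print(flag_injection(pre_rag, post_clean, exfiltration_ref))
print(flag_injection(pre_rag, post_injected, exfiltration_ref))
```

Requiring both conditions means an ordinary topic shift (low cosine, no adversarial component) is not flagged by this rule, and neither is a faint adversarial component inside an otherwise unchanged intent.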
Actionable Takeaways
- Build detection manifolds by combining cosine similarity and scalar projection probes across multiple model layers — pick layer anchor points empirically by running representative prompts through a hooked model and selecting layers with the highest concept activation magnitude.
- Extend YARA with custom activation modules to let detection engineers write AI-native behavioral rules in a familiar format, keeping detection content in the same versioned, peer-reviewed pipeline as existing endpoint and network rules.
- For agentic workflows, deploy manifold-based detection at planning layers before any tool call is issued — intercepting malicious intent before it becomes an action is the only reliable defense against agents that can creatively route around syntactic guardrails.
Common Pitfalls
- Treating manifold thresholds as universal constants: there is no single cosine similarity or scalar projection cutoff that works across all models. Thresholds must be calibrated empirically per model using real traffic distributions — skipping this step produces either chronic false positives or chronic missed detections.
- Anchoring detection only to final model layers: high-level intent concepts often activate strongly in mid-to-late layers, but planning-relevant concepts in agentic models may peak earlier. Selecting hook points without empirical validation risks missing the signal entirely.
Engineering Challenges and Practical Adoption Strategies for Glass-Box Security
Glass-box security is a compelling framework, but operationalizing it inside a real enterprise stack surfaces four engineering barriers that any team must plan for before committing to this approach. Carl Hurd names these barriers explicitly and pairs each one with a concrete mitigation.
Barrier 1 — Frontier Model Opacity
The most fundamental challenge is that activations are not available for frontier models. Services like OpenAI, Anthropic, and Google expose a text-in / text-out API; they do not expose intermediate layer activations. Obtaining those activations requires controlling the full inference stack — running the model yourself — which most organizations are unwilling or unable to do because they want access to the most capable, continuously updated cloud models.
Engineering solution: the canary model pattern. Instead of abandoning frontier models, instrument a smaller open-weight model (e.g., Llama 3.1) and route requests to it in parallel or inline with the frontier API call. This canary model runs on infrastructure you control, so you can hook its forward pass and collect activations. The working assumption is that the canary and the frontier model share a broad training distribution, so the canary's latent-space signals approximate intent closely enough for detection purposes. That makes the canary a proxy, not a measurement of the frontier model: reference vectors and thresholds are calibrated on the canary itself, and its agreement with the frontier model's behavior should be checked empirically. The canary does not produce the response; it serves as your interpretability sensor.
This pattern preserves the end-user experience (responses still come from GPT-4o or Claude) while giving security engineering the activation data it needs.
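A minimal sketch of the inline variant of this pattern, using `asyncio` to run the frontier call and the canary probe concurrently. The two callables are injected so the sketch makes no claim about any vendor SDK. The `fake_*` functions, the signal dictionary keys, and the thresholds are all hypothetical stand-ins.

```python
import asyncio


async def guarded_completion(prompt, call_frontier, score_canary,
                             min_cos=0.85, min_proj=2.0):
    """Run the frontier call and the canary probe concurrently (inline mode)."""
    response, signal = await asyncio.gather(
        call_frontier(prompt), score_canary(prompt)
    )
    if signal["cosine"] > min_cos and signal["projection"] > min_proj:
        return "blocked", None  # withhold the frontier response
    return "ok", response


# Hypothetical stand-ins so the sketch runs without any external service.
async def fake_frontier(prompt):
    return f"response to: {prompt}"


async def fake_canary_benign(prompt):
    return {"cosine": 0.12, "projection": 0.3}


async def fake_canary_flagged(prompt):
    return {"cosine": 0.93, "projection": 3.4}


print(asyncio.run(guarded_completion("hi", fake_frontier, fake_canary_benign)))
print(asyncio.run(guarded_completion("hi", fake_frontier, fake_canary_flagged)))
```

Running both calls concurrently means the added latency is bounded by the slower of the two rather than their sum. An asynchronous variant would return the frontier response immediately and only log or alert on the canary's signal.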
Barrier 2 — Terabyte-Scale Activation Data
Even on a modest open-weight model, activation volumes are staggering. GPT-NeoX 20B generates approximately 4 MB of activation data for a single token. Fill an entire context window and you have generated roughly 10 TB of activation data for one session. Naively collecting all activations for all layers is operationally infeasible.
Engineering solution: residual-stream-only hooks at targeted layers. Rather than hooking every attention head and MLP sublayer, hook only the residual stream at the specific layers identified as most responsible for the concept you are detecting. Capturing per-head attention patterns is the expensive part: they grow quadratically with sequence length, the same self-attention bottleneck that limits context window growth. The residual stream is the running sum of all sublayer outputs, so a single vector per token per layer reflects everything earlier layers have written, and its size scales linearly with sequence length rather than quadratically.
Layer selection is empirical: run your target prompts through a hooked model, measure which layers produce the largest activation magnitudes for the concept of interest, and pin your hooks there. You are not collecting 10 TB — you are collecting a small number of targeted vectors per token at two or three layers.
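As a rough worked example of what "a small number of targeted vectors" costs, the arithmetic below uses GPT-NeoX-20B's published hidden size (6144) and context length (2048 tokens). Storing activations as 16-bit floats and hooking three layers are assumptions made here for illustration.

```python
HIDDEN_SIZE = 6144      # GPT-NeoX-20B residual stream width
CONTEXT_TOKENS = 2048   # GPT-NeoX-20B context window
BYTES_PER_VALUE = 2     # assumption: activations stored as fp16


def residual_hook_bytes(tokens, hooked_layers,
                        hidden_size=HIDDEN_SIZE, bytes_per_value=BYTES_PER_VALUE):
    """Bytes needed to store one residual-stream vector per token per hooked layer."""
    return tokens * hooked_layers * hidden_size * bytes_per_value


per_token = residual_hook_bytes(tokens=1, hooked_layers=3)
full_context = residual_hook_bytes(tokens=CONTEXT_TOKENS, hooked_layers=3)

print(per_token)                   # 36864 bytes, i.e. 36 KiB per token
print(full_context / 2**20)        # 72.0 MiB for a full context window
```

The cost grows linearly in both tokens and hooked layers, so adding a fourth layer adds a third again to the footprint rather than changing its order of magnitude.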
Barrier 3 — Detection Content Authorship
Even if you solve the data pipeline, you face a skills gap: how do you get a detection engineer to write rules that reason about high-dimensional vector space? Today, a detection engineer writing a Sigma or YARA rule does not need to understand Windows kernel internals or ELF binary format at a deep level — tooling has abstracted that complexity away. The same abstraction must be built for AI.
Engineering solution — progressive context enhancement and familiar rule formats. On the tooling side, Hurd proposes extending YARA with custom modules that expose activation-based signals as first-class primitives. A detection engineer could then write a rule that reads:
if file_deletion_intent exists at layers [18, 22] with magnitude > threshold → block
This is structurally identical to a YARA rule referencing a PE header field — familiar syntax, novel data source. Cedar is also mentioned as a candidate for the same kind of extension.
On the workflow side, progressive context enhancement means passing host-based and network-based detection metadata (the information already collected by eBPF or ETW) forward into the instrumented canary model, so that activation analysis has full context about the session — not just the isolated prompt text. This improves signal quality without requiring the detection author to understand tensor math.
Barrier 4 — Context-Dependent Rule Universality
The final barrier is one Hurd considers underappreciated: most detection content cannot be universal. This problem is familiar from industrial control security, where a legitimate engineering command and a malicious command can be syntactically identical — context determines which is which. The same is true for AI agents: an agent authorized to delete files in a build pipeline should not trigger the same rule as an agent that has been hijacked and is attempting to wipe a production database.
Engineering solution — agent identity and use-case scoping. Detection rules must be scoped to the expected behavior of the specific agent or user role they cover. This requires some form of agent identity — a declared, verifiable statement of what this agent is supposed to do and what tools it is permitted to invoke. Rules are written relative to that identity: deviations from the authorized behavioral envelope trigger alerts; activity within it does not.
This is defense-in-depth, not a silver bullet. Glass-box detections sit alongside authorization controls, not instead of them.
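Identity scoping can be sketched as a lookup between what a manifold detected and what the agent is declared to do. The `AgentIdentity` type, the intent names, and the two example agents are hypothetical; in a real deployment the declared envelope would come from whatever agent identity mechanism you adopt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentIdentity:
    name: str
    allowed_intents: frozenset  # intents inside this agent's authorized envelope


def evaluate(agent, detected_intents):
    """detected_intents maps intent name -> whether its manifold fired."""
    violations = sorted(
        intent for intent, fired in detected_intents.items()
        if fired and intent not in agent.allowed_intents
    )
    return ("alert", violations) if violations else ("allow", [])


build_agent = AgentIdentity("ci-build", frozenset({"file_deletion"}))
support_bot = AgentIdentity("support-chat", frozenset())

detected = {"file_deletion": True, "credential_exfiltration": False}

print(evaluate(build_agent, detected))
print(evaluate(support_bot, detected))
```

The same activation signal produces no alert for the build agent and an alert for the support bot, which is the context dependence this barrier describes.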
Putting It Together — A Production-Grade Rule
Hurd’s concrete target for what a production glass-box detection looks like:
Block any request where: (1) cosine similarity to the “illegal intent” concept exceeds 85%, AND (2) the scalar projection of that concept accounts for a large fraction of the total tensor magnitude at the monitored layers.
Written as a YARA-style rule with AI-native modules, this becomes a single, auditable, version-controlled artifact — the same workflow detection engineers already use for malware rules, extended to reason about latent-space geometry.
Actionable Takeaways
- Deploy a canary model (e.g., Llama 3.1 self-hosted) inline or asynchronously alongside your frontier API calls. Hook its residual stream at empirically selected layers to collect intent and strength signals without touching the frontier model's inference path.
- Limit activation collection to the residual stream only, at two or three layers identified via empirical profiling of your target concept. This keeps storage and compute costs manageable and avoids the quadratic scaling costs of hooking self-attention heads.
- Extend YARA or Cedar with custom AI-native modules that expose cosine similarity and scalar projection as rule primitives. This lets existing detection engineers contribute rules using familiar syntax without requiring expertise in high-dimensional vector math.
Common Pitfalls
- Hooking all layers and all sublayers indiscriminately. The quadratic cost of capturing self-attention activations across every layer will make the approach operationally infeasible. Always profile first to identify the minimum set of layers relevant to the concept you are detecting, then hook only those.
- Writing universal detection rules without scoping them to agent or user identity. A rule that fires on "file deletion intent" will generate unacceptable false positives for build agents, CI pipelines, and any workflow that legitimately deletes files. Rules must be parameterized by the expected behavioral envelope of the specific agent they govern.
Conclusion
Glass-box security represents the next maturation step for AI threat detection: the move from signature-based, text-only guardrails to behavior-based detection that observes what a model is actually computing, not just what words it received. By instrumenting forward passes with activation hooks, measuring intent via cosine similarity, filtering noise via scalar projection, and assembling per-layer probes into detection manifolds, security engineers can build detection content that keys on the concept rather than the phrasing, and is therefore much harder to evade through syntactic obfuscation. The four engineering barriers (frontier model opacity, activation data volume, content authorship, and rule universality) each have a practical mitigation available today.
Critically, this is not a replacement for existing defenses. eBPF, ETW, and network-based prompt firewalls remain valuable for the plaintext window they cover. Glass-box security is the layer that covers everything else: the forward-pass computation that accounts for the majority of the model’s processing and currently has zero observability.
As autonomous agents become the dominant deployment pattern for LLMs, the urgency of this shift compounds. Agents in observe-decide-act loops can route around every syntactic rule ever written — because the knowledge needed to find creative paths is already in their weights. The only reliable interception point is the thought itself, before it becomes an action. That is precisely what detection manifolds are designed to do.
For teams ready to start: instrument a self-hosted Llama 3.1 instance, run labeled prompt corpora to build your first intent reference vectors, and write your first YARA rule with activation-based conditions. The infrastructure for this work — interpretability libraries, open-weight models, extensible rule formats — is available now.
References & Tools
- [1] eBPF: Extended Berkeley Packet Filter; Linux kernel technology used by AI security vendors to observe inference processes at the OS level.
- [2] ETW (Event Tracing for Windows): Windows kernel-level tracing mechanism used as the Windows counterpart to eBPF for host-based AI monitoring.
- [3] Llama 3.1: Meta's open-weight LLM; cited throughout as the recommended self-hostable model for empirical activation data collection and canary deployment.
- [4] YARA: Rule-based pattern matching tool used by malware analysts; proposed as the extensible rule format for AI-native detection content via custom activation modules.
- [5] Cedar: Open-source policy language from AWS; mentioned alongside YARA as a candidate for extension to cover AI agent authorization and behavioral detection rules.
- [6] GPT-4o: OpenAI's frontier model; cited as a representative frontier API for which the canary model pattern provides a practical activation-data workaround.