![Rob T. Lee presenting talk - Rob T. Lee - SIFT-FIND EVIL! I Gave Claude Code R00t on DFIR SIFT Workstation | [un]prompted 2026 at unprompted 2026](https://thecyberarchive.com/assets/teasers/ai-dfir-14-minutes-claude-code-sift-workstation-rob-t-lee.webp)
AI-assisted digital forensics and incident response (DFIR) just crossed a threshold that should reframe how every blue team operates: a full day of manual DFIR work (AMCache analysis, prefetch, event logs, memory forensics, timeline reconstruction) completed in 14 minutes 27 seconds. Rob T. Lee demonstrated at [un]prompted 2026 that giving Claude Code root access to the SIFT Workstation, guided by a carefully crafted CLAUDE.md orchestrator and tool-specific skills files, produces MITRE ATT&CK-mapped reports from a single natural-language command, with 100% accuracy measured against the ground truth of a compromise he staged himself.
For security engineers, this isn’t a novelty — it’s a forcing function. Anthropic’s own research documented adversaries using Claude Code to accelerate offensive operations from days to seconds, meaning defenders who still rely on hands-on-keyboard forensics are operating at a structural speed disadvantage. This post breaks down the exact architecture Rob built, how it handles errors and context degradation, and what the community hackathon challenge means for enterprise-grade AI DFIR.
Key Takeaways
- You'll learn how to integrate Claude Code with the SIFT Workstation using a CLAUDE.md prime directive and skills.md files, enabling fully autonomous AI-assisted digital forensics that compresses a full day of manual DFIR work into under 15 minutes.
- You'll be able to identify and mitigate context rot in long-running agentic DFIR sessions by building self-correction directives into your CLAUDE.md orchestrator and understanding when to restart sessions to maintain analysis accuracy.
- Apply this to defend at the speed of AI-powered attackers: by matching offensive AI automation with defensive AI automation, your IR team can scope compromises, hunt persistence, and generate comprehensive MITRE ATT&CK-mapped reports before attackers pivot.
The Speed Gap: Why AI-Assisted Incident Response Is Now Mandatory
The Asymmetry That Is Reshaping Incident Response
AI-assisted digital forensics incident response is no longer a research curiosity — it is a direct counter to a documented threat. Two days before Rob T. Lee delivered this talk at [un]prompted 2026, Anthropic published its own research detailing how adversaries were using Claude Code[1] on offensive operations. The finding was stark: AI-accelerated offensive tooling compresses attack timelines from days — sometimes weeks — down to seconds and minutes. As Rob put it, “if you’ve started playing around with this or started taking a look at the speed of what is documented by Anthropic on what they observed, it is mind-numbing.”
The implication for defenders is direct. If attackers can execute, pivot, and establish persistence at machine speed, incident response teams relying on manual, hands-on-keyboard forensics are operating under a structural speed disadvantage that no amount of analyst skill can fully compensate for.
The Manual DFIR Baseline: What You’re Up Against
To understand why the gap matters, consider what a standard DFIR investigation actually requires on a single compromised system. A thorough analysis covers:
- AMCache analysis — identifying program execution artifacts
- Prefetch analysis — correlating execution evidence
- Event log analysis — reconstructing authentication, process, and network events
- Timeline reconstruction — merging artifacts into a coherent chronology using tools like Plaso/log2timeline[2]
- Memory forensics — identifying injected code, malicious processes, and live network connections
- Final report generation — executive summary, MITRE ATT&CK mapping, remediation recommendations
Rob polled the audience at [un]prompted on a simple question: how long does it take your team to go from first touching a disk image to delivering a final report? The answers were instructive — three days, two to three days, sometimes a week. Even optimistic estimates landed around a full day.
This is the manual baseline that AI-accelerated offensive operations are now exploiting. Attackers operating with AI tooling can move through an environment in minutes. Defenders reconstructing what happened still need days.
The Threat Parity Argument
Rob’s framing is deliberately strategic: “the idea is matching AI speed with AI speed.” This is not an argument for eliminating human judgment from DFIR — it is an argument for removing the manual execution bottleneck from the parts of the workflow that do not require it.
The parallel with offensive AI use is not accidental. Anthropic’s own threat research documented adversaries combining Claude Code with operational toolsets and “letting it go” — the same conceptual architecture Rob applied defensively. His experiment with Claude Code on the SIFT Workstation[3] was motivated precisely by that offensive use case: if attackers can automate their kill chain with AI coding agents, defenders need an equivalent capability on the other side.
The result of his proof of concept — a full day of manual DFIR work completed in 14 minutes 27 seconds — is not presented as a product or a finished solution. It is presented as evidence that the speed gap is closeable, and that the tools to close it are already available in the open-source ecosystem.
What This Means for Security Engineers
The strategic framing Rob establishes sets the context for every technical decision in the architecture that follows. The goal is not automation for its own sake. The goal is defensive speed parity with AI-powered attackers — the ability to scope a compromise, hunt for persistence, track lateral movement, and generate a court-quality forensic report before an attacker has time to pivot to the next system.
Understanding this framing is essential before evaluating the technical choices: why root access was granted to Claude Code, why the CLAUDE.md orchestrator enforces deterministic tool paths, and why the community hackathon is framed around enterprise trustworthiness rather than raw capability. Speed without reliability creates a different kind of risk. The architecture Rob built is a direct response to both problems simultaneously.
Actionable Takeaways
- Benchmark your current DFIR cycle time — from first image acquisition to final report delivery — and treat it as a gap metric against AI-accelerated offensive timelines. If your team takes 1–3 days per system, that is the window attackers have to move freely in your environment before you know the scope of the compromise.
- Read Anthropic's published research on AI-accelerated offensive operations before evaluating any AI DFIR tooling. Understanding the offensive use case — specifically the compression of multi-day attack sequences to minutes — is prerequisite context for making sound architectural decisions on the defensive side.
- Frame AI-assisted DFIR adoption internally as a parity requirement, not an efficiency gain. The argument that lands with security leadership is not "this saves analyst time" — it is "offensive teams are already operating at this speed, and we are not."
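The gap metric from the first takeaway is simple arithmetic. Here is a minimal sketch, assuming an 8-hour analyst working day (an illustrative assumption, not a figure from the talk) and the 14 minutes 27 seconds reported for the C drive demo:

```python
from datetime import timedelta

# Wall-clock time reported in the talk for the C drive analysis.
AI_RUN = timedelta(minutes=14, seconds=27)

# Assumption for illustration only: one analyst "day" is 8 working hours.
WORKING_DAY = timedelta(hours=8)


def speed_gap(manual_days: float, ai_run: timedelta = AI_RUN) -> float:
    """Return how many times longer the manual cycle is than the AI run."""
    manual = WORKING_DAY * manual_days
    return manual / ai_run


if __name__ == "__main__":
    for days in (1, 2, 3):
        ratio = speed_gap(days)
        print(f"{days} working day(s) of manual DFIR is roughly {ratio:.0f}x the AI run")
```

Substituting your own measured cycle time for `manual_days` turns the benchmark into a single number you can track over time.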
Common Pitfalls
- Treating AI DFIR tooling as a replacement for analyst expertise. Rob is explicit that the reports Claude Code generates require a formal understanding of digital forensics to interpret correctly — "only someone who knows digital forensics and incident response can comprehend the report." Deploying AI DFIR without trained analysts reviewing output is a misapplication of the technology.
- Underestimating the speed of AI-accelerated offensive operations because the published Anthropic research feels abstract. The concrete anchor Rob provides is that tasks taking days-to-weeks manually now take seconds-to-minutes with AI tooling. Security teams that dismiss this as theoretical are making a planning assumption that is already falsified by documented adversary behavior.
Architecture of AI-Assisted DFIR: CLAUDE.md, Skills Files, and SIFT Integration
The CLAUDE.md Prime Directive: Orchestrating Deterministic AI Behavior
The foundation of AI-assisted digital forensics incident response in Rob’s architecture is a single Markdown file: CLAUDE.md. This file acts as the prime directive — the orchestrator that governs every decision Claude Code makes during a forensic engagement. Understanding its role is essential before deploying any Claude Code DFIR automation setup.
The CLAUDE.md file does several critical things simultaneously:
- Hard-codes tool paths: Rather than letting Claude Code guess where tools live on the SIFT Workstation, CLAUDE.md explicitly defines the path to each binary. This eliminates ambiguity and prevents the agent from hallucinating incorrect tool invocations or wasting context cycles searching the filesystem.
- Sets behavioral directives: Rob’s configuration includes the explicit instruction: “Run end-to-end. Don’t ask for directions. Just move through the overall aspect of what I’m trying to get it to do.” This single directive transforms Claude Code from an interactive assistant into an autonomous analysis engine.
- Defines the mission context: The file gives Claude Code a clear understanding of what a DFIR engagement looks like — what success means, what a final report should contain, and what the scope of analysis covers.
- Establishes self-correction on error: A key directive instructs Claude Code that if it receives an error, it should begin self-correction autonomously rather than stopping and waiting for human input.
Think of CLAUDE.md as the difference between hiring a contractor who asks a question before every action versus one who has a clear brief, knows the tools, and delivers results. The file is what makes the difference between an interactive chatbot and an autonomous agentic AI incident response system.
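To make those roles concrete, here is a minimal sketch that renders an orchestrator file from a mission statement, a dictionary of tool paths, and a list of directives. The function name, section headings, and layout are illustrative assumptions; the talk does not publish the actual contents of Rob's CLAUDE.md.

```python
def render_claude_md(mission: str, tool_paths: dict[str, str], directives: list[str]) -> str:
    """Build the text of a hypothetical CLAUDE.md orchestrator file."""
    lines = ["# Mission", mission, ""]
    lines.append("# Tool paths (use exactly as written; do not search the filesystem)")
    for name, path in sorted(tool_paths.items()):
        lines.append(f"- {name}: {path}")
    lines += ["", "# Behavioral directives"]
    lines += [f"- {directive}" for directive in directives]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = render_claude_md(
        mission="Find evil and produce a comprehensive forensic report.",
        # Example paths only; confirm the real locations on your workstation.
        tool_paths={
            "volatility": "/usr/bin/vol.py",
            "log2timeline": "/opt/plaso/bin/log2timeline.py",
        },
        directives=[
            "Run end-to-end. Do not ask for directions.",
            "If you receive an error, perform self-correction.",
        ],
    )
    print(text)
```

Generating the file from data rather than editing it by hand keeps the tool paths in one reviewable place.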
Skills Files: Teaching Claude Code the SIFT Toolset
The SIFT Workstation contains an enormous breadth of DFIR tools — AMCache parsers, prefetch analyzers, event log processors, memory forensics frameworks, timeline builders like Plaso/log2timeline, and more. Claude Code has no native knowledge of how these tools accept parameters, what their output formats look like, or how to chain them together meaningfully.
This is where skills.md files solve a critical problem. Each per-capability Markdown file provides:
- Goals: What this tool is designed to accomplish
- How it works: A plain-language description of the tool’s function within a DFIR workflow
- Execution guidance: How to invoke the tool, which flags matter, and what outputs to expect
Rob is explicit that these are not full MCP servers[4] — they are lightweight Markdown reference files. The distinction matters: instead of building a complex protocol layer, you give Claude Code enough structured context to route correctly to each tool when a natural language request maps to that capability. When you say “do memory analysis,” Claude Code reads the relevant skills file, understands the appropriate tool, constructs the correct command, and executes it.
This architecture also serves as a context rot mitigation strategy. By providing a concrete, structured reference path at the start of the session, you reduce the number of exploratory reasoning steps Claude Code needs to take mid-engagement — preserving context window budget for actual analysis.
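A skills file with the three parts described above could be generated from a small data structure. This is a sketch under assumptions: the `Skill` class, its field names, and the Markdown headings are hypothetical, and the invocation string is a placeholder shape rather than a verified command line.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    """One per-capability reference file (hypothetical structure)."""
    name: str
    goals: str
    how_it_works: str
    execution: list[str] = field(default_factory=list)

    def to_markdown(self) -> str:
        parts = [
            f"# Skill: {self.name}",
            "",
            "## Goals",
            self.goals,
            "",
            "## How it works",
            self.how_it_works,
            "",
            "## Execution guidance",
        ]
        parts += [f"- {step}" for step in self.execution]
        return "\n".join(parts) + "\n"


if __name__ == "__main__":
    memory = Skill(
        name="memory-analysis",
        goals="Identify injected code, malicious processes, and network connections.",
        how_it_works="Parses a memory image and lists processes, regions, and sockets.",
        execution=[
            "Invoke as: /usr/bin/vol.py -f <memory image> <plugin>  (placeholder shape)",
            "Save every plugin's output to its own file for later review.",
        ],
    )
    print(memory.to_markdown())
```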
How Claude Code Bootstrapped Its Own Tool Knowledge
One of the most operationally significant aspects of Rob’s setup is that he did not manually author every skills file. Instead, he used Claude Code itself as the authoring agent. The bootstrap process worked as follows:
- Provide Claude Code with a list of all tools on the SIFT Workstation
- Instruct it to run the man page for each tool (`man <toolname>`) and extract the command-line flags, purpose, and usage patterns
- If no man page exists, instruct it to run the tool directly and observe the output to infer the CLI interface
- Search online documentation for any supplementary information not captured by man pages
- Red team each tool: verify it actually runs accurately before including it in a skills file
- Write the skills.md file for each tool based on the accumulated knowledge
This entire process — from a blank slate to a fully populated skills library — took approximately 90 minutes of wall-clock time. Rob ran this rebuild the day before the conference to verify the approach still worked from scratch. The implication for engineers is significant: the configuration overhead for deploying SIFT Workstation AI integration is measured in hours, not weeks.
The bootstrap also demonstrates a meta-capability of agentic AI incident response: Claude Code can reason about its own knowledge gaps, identify what it needs to learn, and acquire that knowledge autonomously. This is the same agentic loop that makes the tool powerful for forensics — applied to the problem of configuring the tool itself.
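In the talk this fallback order was given to the agent in natural language. The same order can be sketched as ordinary code, which is useful if you want to pre-collect documentation before the agent starts. The function names and the `"web-lookup-needed"` marker are hypothetical, and the red-team step is not covered here.

```python
import subprocess
from typing import Callable, Optional, Tuple


def run_capture(cmd: list[str], require_success: bool = True) -> Optional[str]:
    """Run a command; return its combined output, or None if unusable."""
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, errors="replace", timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return None  # binary missing, not executable, or hung
    if require_success and proc.returncode != 0:
        return None
    output = (proc.stdout + proc.stderr).strip()
    return output or None


def gather_docs(
    tool: str, run: Callable[..., Optional[str]] = run_capture
) -> Tuple[str, str]:
    """Return (source, text) using the order: man page, --help, web lookup."""
    man_text = run(["man", tool])
    if man_text:
        return "man", man_text
    # Many tools exit non-zero when printing help, so accept any output here.
    help_text = run([tool, "--help"], require_success=False)
    if help_text:
        return "help", help_text
    # Nothing available locally: flag the tool for an online documentation lookup.
    return "web-lookup-needed", ""
```

Passing the runner as a parameter keeps the fallback logic testable without executing real binaries.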
The Single Natural Language Command: How Analysis Is Invoked
Once CLAUDE.md and the skills files are in place, the entire DFIR engagement runs from a single natural language prompt. Rob demonstrated this live:
find evil and write a comprehensive report in PDF
That is the complete human input for a full forensic investigation covering disk images, memory images, prefetch data, event logs, AMCache, and timeline reconstruction. Claude Code interprets this natural language instruction against the context in CLAUDE.md, maps the work to the appropriate skills files, selects the correct SIFT tools, executes them in sequence, and synthesizes the results into a structured report.
The workflow Claude Code follows autonomously:
- Enumerate available evidence: Identify all disk images and memory images in the working directory
- Route to appropriate tools: Match each evidence type to the corresponding SIFT Workstation tool via skills files
- Execute tool chains: Run AMCache analysis, prefetch analysis, event log parsing, memory forensics (via Volatility[5]), and timeline reconstruction via Plaso/log2timeline
- Correlate findings: Cross-reference artifacts across tool outputs to identify attacker behavior patterns
- Map to MITRE ATT&CK: Overlay findings onto the ATT&CK framework automatically — Rob did not explicitly request this; it happened because the agent inferred it was part of a comprehensive forensic report
- Generate the final report: Produce a structured PDF covering executive summary, attack chain, malware inventory, persistence mechanisms, lateral movement indicators, network IOCs, and remediation recommendations
The MITRE ATT&CK overlay is worth highlighting specifically: Rob did not instruct Claude Code to produce ATT&CK mappings. The agent inferred — based on its understanding of what a professional forensic report looks like — that ATT&CK context was expected. This is an example of the system exceeding the explicit prompt because the surrounding context (CLAUDE.md, skills files, and the agent’s general knowledge of DFIR reporting standards) primed it to produce a more complete output.
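Mechanically, an ATT&CK overlay amounts to tagging each finding with a technique ID. The sketch below assumes findings have already been classified into coarse categories. The category labels and the `Finding` type are hypothetical; the technique IDs come from the public ATT&CK Enterprise matrix and correspond to behaviors mentioned in the demo reports, such as WMI-based execution and a masquerading binary.

```python
from dataclasses import dataclass

# Keys are hypothetical category labels; values are ATT&CK Enterprise techniques.
ATTACK_MAP = {
    "wmi_execution": ("T1047", "Windows Management Instrumentation"),
    "masquerading": ("T1036", "Masquerading"),
    "process_injection": ("T1055", "Process Injection"),
    "powershell": ("T1059.001", "Command and Scripting Interpreter: PowerShell"),
}


@dataclass(frozen=True)
class Finding:
    category: str
    evidence: str


def overlay(findings: list[Finding]) -> list[dict[str, str]]:
    """Attach a technique ID to each finding; unknown categories go to review."""
    rows = []
    for finding in findings:
        technique_id, technique = ATTACK_MAP.get(
            finding.category, ("UNMAPPED", "Needs analyst review")
        )
        rows.append(
            {
                "technique_id": technique_id,
                "technique": technique,
                "evidence": finding.evidence,
            }
        )
    return rows


if __name__ == "__main__":
    for row in overlay([Finding("masquerading", "p.exe in a temp directory")]):
        print(row)
```

Routing unmapped categories to an explicit review bucket, instead of guessing, keeps the analyst in the loop for anything the table does not cover.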
The Role of MCP and Why Rob Chose a Lighter Path
The Model Context Protocol (MCP) is the more structured architectural option for exposing tool parameters, outputs, and failure modes to Claude Code. Rob references it as the target architecture for the community hackathon’s Track 1 (forensic MCP engineering), acknowledging it as a more robust long-term solution.
However, his proof-of-concept deliberately chose the lighter skills.md approach for a specific reason: speed of iteration. Building proper MCP servers for every SIFT tool would require significantly more engineering effort: defining schemas, handling protocol-level communication, and managing server lifecycle. For a proof-of-concept meant to demonstrate feasibility in a constrained timeframe, Markdown skills files provided most of the benefit at a fraction of the implementation cost.
This architectural decision also has a practical lesson for engineers evaluating adoption: you can start with skills files today and migrate to MCP servers as your deployment matures. The CLAUDE.md orchestrator remains unchanged; only the tool knowledge layer is upgraded.
CLAUDE.md Bootstrap: Teaching Claude Code SIFT Tool Capabilities via Man Pages and Self-Testing
Proof of Concept
- Install Claude Code on the SIFT Workstation with root access. The SIFT Workstation is an 18-year-old SANS Institute open-source DFIR tool suite containing AMCache analyzers, prefetch parsers, event log processors, memory forensics frameworks (e.g., Volatility), and timeline tools (e.g., Plaso/log2timeline). Giving Claude Code root is a prerequisite: it needs to execute tools, read output, and write configuration files.
- Issue a natural-language bootstrap instruction to Claude Code. Rob’s instruction was roughly: “Go through all the tools on the SIFT Workstation. Look at the man pages. If there’s no man page, run the tool and inspect the command-line flags. If there’s still no documentation, look it up online. Then red-team each tool: make sure it actually runs accurately.” No scripts, no explicit enumeration: Claude Code used its own agentic reasoning to discover and enumerate the toolset.
- Claude Code iterates tool-by-tool through the SIFT toolset. For each tool:
  - It runs `man <tool>` and parses the output to extract purpose, flags, expected inputs, and output formats.
  - If no man page exists, it runs the tool directly (e.g., `<tool> --help` or bare invocation) and captures the CLI flag listing.
  - For tools with neither man page nor self-documenting flags, it performs a web lookup using available documentation, README files, or public guides.
  - It then red-teams each tool: runs it against a test input to confirm it executes correctly and produces parseable output.
- Claude Code authors the CLAUDE.md prime directive. Based on the tool knowledge acquired in the previous step, it writes a CLAUDE.md file with these key directives embedded:
  - Hard-coded tool paths: Eliminates guessing. Claude Code is told exactly where each binary lives (e.g., `/usr/bin/vol.py`, `/opt/plaso/bin/log2timeline.py`). This enforces deterministic execution rather than PATH-dependent resolution.
  - Behavioral rules: “Run end-to-end without asking for directions.” This suppresses the interactive confirmation prompts that would otherwise pause the analysis pipeline.
  - Self-correction on error: “If you receive an error, perform self-correction.” This directive tells Claude Code to diagnose failures autonomously, retry with corrected parameters, and continue rather than halting.
  - Mission context: The overall objective (find evil, produce a comprehensive MITRE ATT&CK-mapped report) is stated explicitly.
- Claude Code authors per-capability skills.md files. Each skills.md contains:
  - The goal of the forensic task
  - Which SIFT tool to invoke
  - The specific command-line flags and input/output patterns
  - A “path to success”: the logical sequence of steps to complete the task
- Validate the bootstrap with a live analysis run. After approximately 90 minutes of self-training, Rob issued a single natural-language command: “Find evil and write a comprehensive report in PDF.” Claude Code, guided entirely by the CLAUDE.md prime directive and skills.md files, autonomously selected tools, executed them in sequence, correlated outputs across AMCache, prefetch, event logs, and memory, and produced a MITRE ATT&CK-mapped forensic report. The memory image demo completed in 18 minutes; the C drive analysis of the Stark Research Labs scenario completed in 14 minutes 27 seconds.
- Limitations of this approach: The bootstrap process is described at a high level — the natural-language instructions Rob gave Claude Code, the general tool-discovery approach (man pages → run tool → web lookup → red-team), and the structural output (CLAUDE.md + skills.md files). The exact content of the CLAUDE.md and skills.md files, the full enumerated list of SIFT tools covered, and the precise self-correction logic embedded are not publicly detailed in the talk itself. Rob notes he is releasing these files as resources alongside the hackathon launch.
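The self-correction directive is written in natural language, but the behavior it asks for resembles a bounded retry loop. The sketch below shows that loop under stated assumptions: `execute` and `propose_fix` are hypothetical callbacks, with `propose_fix` standing in for the agent's reasoning about an error message. It does not represent the logic in Rob's files, which the talk does not detail.

```python
from typing import Callable, List, Optional, Tuple

Attempt = Tuple[str, bool, str]  # (command, succeeded, output)


def run_with_self_correction(
    command: str,
    execute: Callable[[str], Tuple[bool, str]],
    propose_fix: Callable[[str, str], Optional[str]],
    max_attempts: int = 3,
) -> Tuple[bool, List[Attempt]]:
    """Retry a failing command with revised versions, keeping a full history."""
    history: List[Attempt] = []
    current = command
    for _ in range(max_attempts):
        ok, output = execute(current)
        history.append((current, ok, output))
        if ok:
            return True, history
        revised = propose_fix(current, output)
        # Stop if no fix is offered or the fix repeats the failing command.
        if revised is None or revised == current:
            break
        current = revised
    return False, history
```

Capping the attempts and recording every command tried preserves the audit trail even when correction fails.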
Actionable Takeaways
- Build your CLAUDE.md with three non-negotiable directives before any forensic run: hard-coded tool paths (eliminate guessing), a run-end-to-end behavioral rule (prevent interactive pauses), and a self-correction-on-error instruction (maintain autonomous operation). These three directives are the difference between an assistant and an autonomous analysis engine.
- Bootstrap your skills files using Claude Code itself: provide it with your tool list, instruct it to read man pages, run each tool to verify flags, check online docs for gaps, and write the skills.md files. Budget 90 minutes for a full SIFT Workstation bootstrap. This investment is fully reusable across all future engagements.
- Start with lightweight Markdown skills files rather than full MCP servers if you are evaluating this approach for the first time. Skills files are faster to produce, easier to iterate on, and provide sufficient context for Claude Code to route correctly to SIFT tools. Migrate to MCP engineering once the approach proves out in your environment.
Common Pitfalls
- Omitting the "run end-to-end without asking for directions" directive from CLAUDE.md. Without this, Claude Code will pause for confirmation at each decision point — destroying the speed advantage and requiring constant human presence. The entire value of the architecture depends on autonomous execution once the investigation is launched.
- Relying on Claude Code to discover tool paths dynamically rather than hard-coding them in CLAUDE.md. Dynamic discovery wastes context window budget on filesystem exploration, increases the risk of the agent invoking a wrong binary, and adds latency to every tool invocation. Deterministic paths are not a limitation — they are the architecture's reliability guarantee.
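Hard-coded paths only help if they are correct. A small pre-flight check, sketched here with a hypothetical `check_tool_paths` helper, can confirm that every path you plan to write into CLAUDE.md exists and is executable before an engagement starts:

```python
import os


def check_tool_paths(tool_paths: dict[str, str]) -> dict[str, str]:
    """Return {tool: problem} for each path that is missing or not executable."""
    problems = {}
    for name, path in tool_paths.items():
        if not os.path.isfile(path):
            problems[name] = f"missing: {path}"
        elif not os.access(path, os.X_OK):
            problems[name] = f"not executable: {path}"
    return problems


if __name__ == "__main__":
    # Example paths only; substitute the locations on your own workstation.
    candidates = {
        "volatility": "/usr/bin/vol.py",
        "log2timeline": "/opt/plaso/bin/log2timeline.py",
    }
    for tool, problem in check_tool_paths(candidates).items():
        print(f"{tool}: {problem}")
```

An empty result means every listed binary is present and executable, so the orchestrator file can be trusted as written.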
Memory Forensics and Disk Image Analysis: What the AI Actually Found
Two Live Demonstrations: Memory Image and C Drive Analysis
The core proof behind AI-assisted digital forensics incident response is not theoretical — Rob T. Lee ran two live demonstrations at [un]prompted 2026 that produced independently verifiable results. Both are drawn directly from real DFIR artifacts: a memory image analyzed in approximately 18 minutes, and a full C drive image from the Stark Research Labs scenario analyzed in 14 minutes 27 seconds. Since Rob staged the compromise himself, he could authoritatively confirm 100% accuracy against known ground truth.
Demo 1 — Memory Image Analysis: 18 Minutes to Full Report
The memory forensics demonstration starts with a single natural-language prompt: “find evil in the memory image.” No tool flags, no explicit command sequence — just that instruction handed to Claude Code running on the SIFT Workstation.
What follows is a fully autonomous analysis run. Claude Code, guided by the CLAUDE.md prime directive and per-capability skills.md files, determines which memory forensics tools to invoke (including Volatility-class capabilities built into SIFT), selects the correct flags, executes the commands, and synthesizes the output into a structured report. The total elapsed wall-clock time is approximately 18 minutes — with the caveat that Rob had not yet suppressed the interactive confirmation prompts (“press yes to continue”), which introduced manual delay. With autonomous continuation enabled, the same workflow runs faster.
The report output includes:
- Threat actor context and attack chain reconstruction — the AI determined initial execution method (WMI-based) without being told to look for it
- Malicious binary identification: `p.exe` masquerading in a temp directory was flagged with full path and behavioral context
- System profile: OS, architecture, and running process inventory automatically derived
- Code injection analysis — injected code regions identified across process memory
- Network connections and C2 identification — active and historical connections mapped to command-and-control infrastructure
- Full malicious process tree — parent-child process relationships reconstructed
- Chronological activity timeline — Rob explicitly noted he did not instruct the AI to produce a timeline; it generated one because the CLAUDE.md context and tool familiarity made it the logical output
- False positive identification: the AI flagged `f-response` (a forensic imaging agent) as a suspicious process and correctly self-identified it as part of the legitimate forensic collection workflow
- Remediation recommendations: actionable steps generated without being prompted
The false positive call-out on f-response is notable: it demonstrates that the system applies genuine reasoning rather than simple pattern matching. It found something that looked anomalous, surfaced it, and provided enough context for an analyst to make the final determination. That is exactly how a competent junior analyst would behave.
Demo 2 — Stark Research Labs C Drive: 14 Minutes 27 Seconds, One Command
Proof of Concept
- Scenario setup (Stark Research Labs breach): The fictitious target organization is Stark Research Labs, a network familiar to SANS FOR508 students. The breach narrative: lead researcher Tim Dunan inadvertently signaled project completion on Twitter (“Our carbon lab just consumed 9 gigawatts of energy”), attracting adversary group Crimson Osprey (which combines the tactics of APT Hydra and APT Hammer, both fictional). The impact included critical infrastructure (the Exchange mail server and web server) forced offline, key administrative accounts compromised, and deep system persistence established. The incident responder’s objective: scope the compromise, hunt persistence mechanisms, track lateral movement, and assess exfiltration, all from a single C drive image.
- Pre-analysis configuration (90-minute one-time cost): Before issuing any analysis command, Rob spent approximately 90 minutes building the CLAUDE.md prime directive and per-tool skills.md files. The CLAUDE.md file acts as the orchestrator, hard-coding deterministic tool paths so Claude Code never has to guess binary locations, setting behavioral rules (run end-to-end without asking for user confirmation or direction), and embedding a self-correction directive so that on any tool error the agent retries automatically rather than stopping.
- Single-command invocation: With the C drive image mounted and the CLAUDE.md and skills.md files in place, Rob typed the only human input for the entire analysis session: `find evil in [image path] write comprehensive report in PDF`. No additional prompts, no interactive direction, no tool guidance. The agent interpreted this natural-language command, consulted its CLAUDE.md prime directive and relevant skills.md files, and began autonomous execution.
- Autonomous multi-tool forensic execution (16x speed playback shown): Claude Code autonomously executed the full DFIR toolkit available on the SIFT Workstation against the C drive image. The agent performed AMCache analysis to identify recently executed binaries, prefetch analysis to recover program execution history, Windows Event Log analysis for logon events and privilege escalation indicators, timeline reconstruction using Plaso/log2timeline to correlate artifacts into a unified chronological sequence, and file system artifact triage to identify suspicious binaries and staging directories. Every tool invocation, output file, and command log was preserved, providing a verifiable chain of evidence.
- Elapsed time (14 minutes 27 seconds): The complete analysis of the C drive image finished at 14 minutes 27 seconds wall clock time. Rob noted the only reason the memory image demo took 18 minutes was the need to manually press “continue” at agent checkpoints, a limitation he had already resolved by embedding a “run end-to-end, do not ask for directions” directive in CLAUDE.md.
- Report output (100% accuracy verified): The automatically generated PDF report contained an executive summary, a chronological attack timeline, a full malware inventory including a malicious binary (`p.exe`) masquerading in a temp directory, identified persistence mechanisms, PowerShell transcript evidence, network indicators of compromise, a MITRE ATT&CK technique overlay mapped to observed adversary behavior, lateral movement artifacts, and remediation recommendations with prioritized steps. Rob verified the report’s accuracy personally, having staged the compromise himself.
- Cross-system context acceleration: Once Claude Code processes one system in a multi-system engagement, it retains contextual knowledge of the threat actor’s TTPs: malware names, persistence patterns, and command-and-control indicators. Analysis of system 2, system 3, and beyond accelerates naturally as the agent is already primed to look for contextually related artifacts.
- Known limitation — context rot: The primary technical caveat Rob identified is context rot: as the agent processes more data within a single session, earlier context degrades in accuracy. For a single-system case like this demonstration, the practical mitigation is simply restarting the Claude Code session if the agent shows signs of drift.
Cross-System Context Carry-Over: Speed Compounds Across an Engagement
One operationally significant finding from the experiment is how the system behaves when working across multiple systems in the same engagement. Once Claude Code has analyzed system 1 and identified a malicious binary or intrusion pattern, that context persists within the session. When analysis moves to system 2, the AI begins looking for the same indicators contextually — it is not starting from zero.
Rob described this directly: “If you find malicious code on one system, it’s assuming it’s looking for something contextual related on system 2, system 3, system 4. So the speed naturally increases as you’re working through the different exercise.”
This means the 14-minute per-system figure is a ceiling, not a floor, for multi-system IR engagements. The more systems analyzed in a single session (up to the context limit), the faster subsequent analysis proceeds because the AI has already internalized the adversary’s TTPs, tooling, and naming conventions.
What “100% Accurate” Actually Means
The accuracy claim warrants precise framing. In Q&A, an audience member raised a fair question: could Claude Code be inferring findings by looking up case details online rather than deriving them purely from tool output? Rob’s response was that every command executed and every output file generated is logged, so analysts can verify the provenance of each finding. The detail density in the report (specific binary paths, registry key names, exact timestamps) strongly implies tool-derived output rather than inference from public knowledge. He nonetheless agreed the question is worth testing rigorously and suggested running the analysis in an air-gapped environment as a validation step.
The practical implication is that even if some degree of online inference is occurring, the command log provides a full audit trail. Engineers deploying this in a real IR engagement should review the command log as standard practice to confirm that every finding traces to a local artifact.
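That review can be partly mechanized. The sketch below flags any finding that does not trace back to a logged command. The record shapes (`LoggedCommand`, `Finding`, and their fields) are hypothetical: the talk states that commands and output files are logged but does not describe the log format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoggedCommand:
    command_id: str
    command: str
    output_file: str


@dataclass(frozen=True)
class Finding:
    summary: str
    command_ids: tuple[str, ...]  # log entries the finding claims as its source


def unverified_findings(
    findings: list[Finding], log: list[LoggedCommand]
) -> list[Finding]:
    """Return findings with no cited command, or citing one absent from the log."""
    known = {entry.command_id for entry in log}
    return [
        finding
        for finding in findings
        if not finding.command_ids or not set(finding.command_ids) <= known
    ]
```

Anything this check returns should be treated as unverified until an analyst links it to a local artifact.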
Actionable Takeaways
- Validate AI-generated forensic reports against the command execution log — every DFIR finding should trace to a specific tool invocation and its output file. If a finding cannot be linked to a logged command, treat it as unverified until confirmed by a second analyst.
- Run memory and disk image analysis in sequence within the same Claude Code session when working a multi-system engagement. Cross-system context carry-over means the AI will apply TTPs, binary names, and indicators found on system 1 to accelerate and focus analysis on subsequent systems — do not restart the session between hosts unless context rot forces it.
- Use the MITRE ATT&CK overlay output as an immediate triage handoff to threat intelligence teams. The AI generates ATT&CK mappings unsolicited when the CLAUDE.md context establishes professional report standards — this output is ready for consumption by detection engineers writing or tuning SIEM rules without additional manual mapping work.
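For the triage handoff in the last takeaway, one option is to group findings by technique before passing them to detection engineers. The sketch below is an assumption-heavy illustration: the findings are invented, and the output only borrows a few field names (`techniqueID`, `score`, `comment`) from the ATT&CK Navigator layer format. A real layer file needs additional fields, such as version information, so check the format specification before importing anything.

```python
import json
from collections import defaultdict

# Hypothetical findings as (ATT&CK technique ID, evidence note) pairs.
findings = [
    ("T1053.005", "Scheduled task created for persistence"),
    ("T1547.001", "Run key added under the user's registry hive"),
    ("T1053.005", "Second scheduled task found on the same host"),
]


def build_layer(name, findings):
    """Group evidence notes by technique into a layer-like dictionary."""
    notes = defaultdict(list)
    for technique_id, note in findings:
        notes[technique_id].append(note)
    return {
        "name": name,
        "domain": "enterprise-attack",
        "techniques": [
            # score counts how many findings support the technique
            {"techniqueID": tid, "score": len(items), "comment": "; ".join(items)}
            for tid, items in sorted(notes.items())
        ],
    }


layer = build_layer("host1-triage", findings)
print(json.dumps(layer, indent=2))
```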
Common Pitfalls
- Accepting the AI report as ground truth without reviewing the command log. The 100% accuracy claim could be validated because the analyst staged the compromise and personally knew every artifact planted. In a real engagement, the command log is the only objective verification mechanism; skipping that review means nothing distinguishes tool-derived findings from inference.
- Allowing interactive confirmation prompts to remain enabled during analysis. Rob's memory analysis took 18 minutes in part because he had to manually press "yes to continue" at each checkpoint. Leaving this behavior enabled in production wastes the core speed advantage; the CLAUDE.md prime directive should instruct the agent to run end-to-end without pausing for confirmation.
Context Rot in Agentic DFIR: Causes, Mitigations, and Enterprise Implications
What Is Context Rot and Why It Matters for AI-Assisted DFIR
Context rot is the progressive degradation of an LLM agent’s accuracy and coherence as its context window fills over the course of a long-running session. In agentic digital forensics incident response, this is not a theoretical concern — it is the primary limiting factor identified by Rob T. Lee after demonstrating full DFIR completion in under 15 minutes. As the AI processes more tool outputs, reads more log files, and accumulates evidence across multiple analysis passes, older context — including critical early findings — is effectively pushed out or deprioritized.
Rob was direct about the current state: “The more you repeat it and repeat it, repeat it, repeat it, over time it is going to slowly become more dumb just because of the fact how that you know functionality works.” This is not a model failure — it is a fundamental property of fixed-size context windows applied to unbounded forensic workloads.
Causes: Why DFIR Sessions Are Especially Vulnerable
Several characteristics of forensic analysis make context rot worse than in typical LLM use cases:
- Volume of tool output: Each SIFT tool invocation — AMCache parsing, prefetch analysis, event log processing, Volatility memory scans — produces verbose structured output. A single full-disk analysis can generate tens of thousands of tokens of tool results.
- Sequential reasoning dependency: DFIR analysis is inherently chained. A persistence mechanism found in hour one shapes what to look for in lateral movement analysis in hour two. If early findings degrade in context, downstream reasoning loses its anchor.
- Multi-system scaling: In enterprise IR, you are rarely analyzing one host. Searching for a known-malicious binary like p.exe across prefetch artifacts on 1,000 endpoints means the agent must carry forward threat context from system 1 to system 999. Rob acknowledged this explicitly: “that scale is going to create context rot pretty quickly.”
- Research-confirmed regression: An audience member at the talk noted that even with million-token context windows, accuracy degrades noticeably after the first turn, and compacting to ~100,000 tokens is often necessary to maintain coherent reasoning.
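To make the volume problem concrete, the following sketch tracks how quickly tool output consumes a context budget. Everything here is an assumption for illustration: the four-characters-per-token estimate is a rough heuristic rather than a real tokenizer, the 100,000-token limit echoes the compaction figure mentioned by the audience member, and the 50% compaction threshold is arbitrary.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token.
    return max(1, len(text) // 4)


class ContextBudget:
    """Running total of estimated tokens consumed by tool output."""

    def __init__(self, limit_tokens: int, compact_at: float = 0.5):
        self.limit_tokens = limit_tokens
        self.compact_at = compact_at  # fraction of the limit that triggers compaction
        self.used_tokens = 0

    def record(self, tool_output: str) -> None:
        self.used_tokens += estimate_tokens(tool_output)

    @property
    def fraction_used(self) -> float:
        return self.used_tokens / self.limit_tokens

    def should_compact(self) -> bool:
        return self.fraction_used >= self.compact_at


budget = ContextBudget(limit_tokens=100_000)
# Stand-ins for verbose tool output of different sizes.
outputs = ["x" * 80_000, "y" * 120_000, "z" * 40_000]
for output in outputs:
    budget.record(output)
    print(f"{budget.used_tokens} tokens used, compact now: {budget.should_compact()}")
```

With these stand-in sizes, the second tool output already pushes the session to the compaction threshold, which is the dynamic the list above describes.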
Mitigations Built Into the Current Architecture
The current architecture relies on three concrete mitigations, the first two of which live in the CLAUDE.md orchestrator and its skills files:
1. Self-correction directives
The CLAUDE.md prime directive includes an explicit instruction: if the agent receives an error, it should begin performing self-correction autonomously rather than halting and waiting for human input. This keeps the session moving and reduces dead-end states that consume context tokens without producing useful output. As Rob described it: “You could ask it and build into the contextual analysis and have something continually running behind the scenes that says if you receive an error I would like you to start performing self-correction.”
2. Skills files as external memory anchors
The per-tool skills.md files serve a dual purpose: they teach Claude Code how to invoke SIFT tools, and they act as a recoverable reference that the agent can re-consult if it loses track of how a particular tool works mid-session. By externalizing tool knowledge into Markdown files rather than relying on the model to hold it all in working context, Rob gives the agent a path back to correct behavior even after significant context accumulation.
3. Session restart for single-system cases
For standard single-endpoint forensics, the simplest and most reliable mitigation is also the most obvious: start a new session. Claude Code itself will signal when context is running low (“hey we ran out of context start a new chat”), and for single-image analysis this is an acceptable boundary. Rob confirmed this works well in practice: “on a single system like standard forensics, no problem, just restart.”
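The mitigations above could be written down as a small set of directives appended to the orchestrator file. The sketch below only shows one way to keep such directives in version-controlled code. The directive wording is hypothetical and is not the text of Rob's CLAUDE.md, and the section heading is an arbitrary choice.

```python
from pathlib import Path

# Hypothetical directive wording; not taken from the original CLAUDE.md.
DIRECTIVES = {
    "Error handling": "If a tool returns an error, diagnose it, correct the invocation, and retry before asking for help.",
    "Tool reference": "When unsure how to call a tool, re-read that tool's skills file before guessing parameters.",
    "Autonomy": "Run the analysis end to end without pausing for confirmation.",
    "Context limits": "When context runs low, write a summary of confirmed findings before the session ends.",
}


def render_directives(directives) -> str:
    """Render the directives as a Markdown section."""
    lines = ["## Operating directives", ""]
    for title, instruction in directives.items():
        lines.append(f"- **{title}:** {instruction}")
    return "\n".join(lines) + "\n"


def append_directives(path: Path, directives) -> None:
    """Append the section to the file unless it is already present."""
    existing = path.read_text(encoding="utf-8") if path.exists() else ""
    block = render_directives(directives)
    if block not in existing:  # avoid duplicate sections on re-runs
        separator = "\n" if existing else ""
        path.write_text(existing + separator + block, encoding="utf-8")


rendered = render_directives(DIRECTIVES)
print(rendered)
```

Calling `append_directives(Path("CLAUDE.md"), DIRECTIVES)` would add the section to an existing file once and leave it unchanged on later runs.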
The Enterprise Blocking Problem
Session restart is not a viable strategy at enterprise scale. Rob was explicit about where this becomes a blocker: “if we really want something enterprise-worthy, it’s going to have to be able to maintain its current data set across a longer period of time.”
Consider a realistic IR scenario: a suspected supply chain compromise affecting 500 endpoints. The AI agent identifies a persistence mechanism on host 1 — a specific registry key combined with a scheduled task and a DLL side-load chain. For the agent to meaningfully accelerate analysis of hosts 2 through 500, it needs to carry that threat signature forward as active working memory. Every new session restart resets that state, forcing either redundant re-analysis or human intervention to re-inject context.
The problem compounds when you layer in threat intelligence feeds, historical case data, and cross-system timeline correlation — all the capabilities Rob described as the natural evolution of this platform. Those use cases require persistent, coherent state management that current LLM context windows do not support natively.
Open Research Questions
Rob did not present a solution for enterprise-scale context rot; he framed it as the second track of the community hackathon precisely because it remains an open problem. The core questions are:
- How do you compress and serialize investigation state between sessions without losing forensically significant detail?
- Can structured memory stores (e.g., a graph of confirmed IOCs, ATT&CK technique mappings, and system relationships) replace in-context working memory for cross-system analysis?
- What is the minimum viable context payload that allows a restarted agent session to resume a multi-host investigation without starting from scratch?
These are not purely technical questions — they have direct implications for evidence integrity and chain of custody in environments where forensic output may be used in legal proceedings or regulatory responses.
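As one possible shape for the structured memory store raised in these questions, the sketch below keeps confirmed IOCs and technique mappings per host, serializes them to JSON, and records a SHA-256 digest so a later session can check that the state was not altered. The class, its fields, and the case identifier are hypothetical. This is a starting point for experimentation, not a solution to the open problem, and a digest alone does not establish chain of custody.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class InvestigationState:
    case_id: str
    ioc_hosts: dict = field(default_factory=dict)   # indicator -> list of hosts
    techniques: dict = field(default_factory=dict)  # ATT&CK ID -> list of hosts

    def confirm_ioc(self, indicator: str, host: str) -> None:
        hosts = self.ioc_hosts.setdefault(indicator, [])
        if host not in hosts:
            hosts.append(host)

    def map_technique(self, technique_id: str, host: str) -> None:
        hosts = self.techniques.setdefault(technique_id, [])
        if host not in hosts:
            hosts.append(host)

    def to_json(self) -> str:
        # sort_keys makes the serialization stable, so the digest is repeatable
        return json.dumps(asdict(self), sort_keys=True, indent=2)

    def digest(self) -> str:
        return hashlib.sha256(self.to_json().encode("utf-8")).hexdigest()

    @classmethod
    def from_json(cls, payload: str) -> "InvestigationState":
        return cls(**json.loads(payload))


state = InvestigationState(case_id="demo-case")
state.confirm_ioc("p.exe", "host1")
state.confirm_ioc("p.exe", "host2")
state.map_technique("T1053.005", "host1")

saved = state.to_json()        # would be written to an external file
saved_digest = state.digest()  # stored alongside it for later verification

restored = InvestigationState.from_json(saved)
print(restored.ioc_hosts["p.exe"])
print(restored.digest() == saved_digest)
```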
Practical Guidance for Engineers Evaluating Deployment
For security teams considering whether to deploy this capability in production today, the context rot limitation maps to clear deployment boundaries:
- Single-system triage: Current capability is production-viable. Session restarts are manageable and the 14-minute analysis window is well within a single context budget.
- Small-scale IR (2–10 systems): Viable with careful session management. Pre-load threat context from system 1 findings into the CLAUDE.md or a hand-off prompt before starting each subsequent session.
- Enterprise-scale IR (100+ systems): Not production-ready without additional architectural work — persistent memory, MCP server integration, or external state management. This is the hackathon’s Track 2 target.
Actionable Takeaways
- Add an explicit self-correction directive to your CLAUDE.md orchestrator file: instruct the agent to autonomously retry and recover from tool errors rather than halting. This reduces wasted context tokens on dead-end failure states and keeps long-running sessions productive.
- For single-system forensic triage, treat session restart as a first-class workflow step rather than a failure condition. Design your CLAUDE.md to produce a structured handoff summary (confirmed IOCs, timeline anchors, ATT&CK techniques mapped) that can be pasted as context into the next session, minimizing re-analysis cost.
- Before deploying AI-assisted DFIR at enterprise scale, prototype context state serialization: after each system analysis, have the agent write a compact structured summary of confirmed findings to an external file. Test whether injecting that summary at the start of the next session preserves enough reasoning continuity to accelerate cross-system analysis — this is the core architectural gap the community hackathon is targeting.
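The handoff summary in the last two takeaways can be prototyped with a few lines of code. The sketch below renders a summary dictionary into a compact note and stops adding lines once a character budget is reached. The summary keys, the section order, and the 2,000-character default are assumptions chosen for illustration.

```python
# (section title, key in the summary dict), in priority order
SECTIONS = [
    ("Confirmed IOCs", "iocs"),
    ("Timeline anchors", "timeline"),
    ("ATT&CK techniques", "techniques"),
]
TRUNCATION_MARKER = "[truncated: see the full summary file]"


def build_handoff(summary: dict, max_chars: int = 2000) -> str:
    """Render a handoff note, keeping earlier sections when space runs out."""
    header = f"Prior findings for case {summary['case_id']}:"
    candidates = []
    for title, key in SECTIONS:
        items = summary.get(key, [])
        if items:
            candidates.append(f"{title}:")
            candidates.extend(f"- {item}" for item in items)

    kept = [header]
    used = len(header)
    for line in candidates:
        cost = len(line) + 1  # +1 for the newline that joins the lines
        if used + cost > max_chars:
            # the marker itself is not counted against the budget
            kept.append(TRUNCATION_MARKER)
            break
        kept.append(line)
        used += cost
    return "\n".join(kept)


summary = {
    "case_id": "demo-case",
    "iocs": ["p.exe seen in prefetch on host1"],
    "timeline": ["First execution of p.exe on host1"],
    "techniques": ["T1053.005 Scheduled Task"],
}
handoff = build_handoff(summary)
print(handoff)
```

Comparing analysis speed on the next host with and without this note pasted into the new session is one way to test whether the summary preserves enough continuity.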
Common Pitfalls
- Assuming that a larger context window eliminates context rot. Research and practitioner experience (including audience feedback at the talk) indicate that accuracy degrades after the first turn even with million-token windows. Compacting context aggressively and restarting sessions are more reliable than relying on window size alone.
- Deploying multi-system agentic IR without a state persistence strategy and expecting the agent to maintain coherent threat context across all hosts. Without an external memory mechanism, each system analysis effectively starts blind to prior findings, negating the cross-system acceleration benefit Rob demonstrated where "it'll remember aspects of that initial intrusion" on subsequent systems.
Community Hackathon and the Path to Enterprise-Grade AI DFIR
The Asymmetric Advantage: Community vs. Adversarial Development
The proof of concept Rob demonstrated is compelling, but he was explicit that a proof of concept is not the same as an enterprise-grade capability. The gap between a weekend experiment and a tool that security teams can trust in production is exactly what the next phase targets. What Rob identified — and what the community hackathon is designed to exploit — is a structural asymmetry that defenders hold over adversaries.
Adversarial development teams, whether nation-state APT groups or criminal organizations, generally work in secret, operate with small teams, and don’t share tooling with rival groups. The open source community operates on entirely different principles: thousands of contributors, public code, shared iteration, and exponential acceleration once critical mass is reached.
Rob drew a direct analogy to the trajectory of a well-known AI assistant that was built over a single weekend in late November, gained traction in January, and then exploded in adoption as the developer community piled improvements into it simultaneously. The SIFT Workstation itself followed a similar path: what started 18 years ago as an educational project became the standard DFIR toolkit because the community collectively bet on it. The hypothesis is that the same dynamic can be applied to AI-assisted DFIR.
SANS Community Hackathon: Two Tracks, $22,000 in Prizes
Rob announced a formal SANS-sponsored hackathon running from April 1 through May 15, with a stated goal of producing a trustworthy, enterprise-deployable AI DFIR capability by June 1. The total prize pool stands at $22,000:
- First place: $10,000 — individual or team with the most significant development contribution
- Second place: $7,500
The hackathon is structured around two tracks that directly address the known gaps in the current proof of concept:
Track 1 — Forensic MCP Engineering
Claude Code does not natively understand the parameters, outputs, and failure modes of SIFT Workstation tools. Rob’s current implementation bridges this with skills.md Markdown files — a functional but limited approach. The goal of Track 1 is to take the proof of concept and accelerate it 10x to 100x by building proper forensic MCP (Model Context Protocol) servers. A well-engineered MCP server exposes tool capabilities in a structured, programmatic way that eliminates guessing, handles error modes deterministically, and scales across tool sets without inflating context.
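To illustrate what "structured, programmatic" exposure buys over a Markdown description, the sketch below declares a tool's parameters and rejects bad calls with deterministic errors. It uses only the standard library and is not the MCP SDK or the MCP wire protocol. The tool name, parameters, and error messages are hypothetical and do not describe any real SIFT tool's interface.

```python
from dataclasses import dataclass


class ToolInputError(ValueError):
    """Raised with a specific, predictable reason when arguments are invalid."""


@dataclass(frozen=True)
class Param:
    name: str
    type_: type
    required: bool = True
    choices: tuple = ()


@dataclass(frozen=True)
class ToolSpec:
    name: str
    description: str
    params: tuple

    def validate(self, args: dict) -> dict:
        """Return the checked arguments, or raise ToolInputError."""
        known = {p.name for p in self.params}
        unknown = sorted(set(args) - known)
        if unknown:
            raise ToolInputError(f"unknown parameter(s): {', '.join(unknown)}")
        clean = {}
        for p in self.params:
            if p.name not in args:
                if p.required:
                    raise ToolInputError(f"missing required parameter: {p.name}")
                continue
            value = args[p.name]
            if not isinstance(value, p.type_):
                raise ToolInputError(f"{p.name} must be {p.type_.__name__}")
            if p.choices and value not in p.choices:
                raise ToolInputError(f"{p.name} must be one of {p.choices}")
            clean[p.name] = value
        return clean


# Hypothetical tool description; the name and parameters are illustrative only.
PREFETCH_TOOL = ToolSpec(
    name="parse_prefetch",
    description="Parse prefetch files from a mounted image.",
    params=(
        Param("image_path", str),
        Param("output_format", str, required=False, choices=("csv", "json")),
    ),
)

print(PREFETCH_TOOL.validate({"image_path": "/cases/host1", "output_format": "csv"}))

try:
    PREFETCH_TOOL.validate({"image": "/cases/host1"})  # misspelled parameter
except ToolInputError as err:
    print(f"rejected: {err}")
```

The point of the design is that a wrong call fails immediately with a specific reason the agent can act on, instead of producing confusing tool output that consumes context.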
Track 2 — Context Rot Mitigation
The second track directly targets the primary known limitation discussed in the prior section. How do you maintain coherent analytical state across a long-running investigation that spans multiple systems, multiple tool passes, and potentially tens of thousands of tokens of intermediate output per system? Rob framed this as an open research problem: he knows session restarts are not a viable answer at enterprise scale, and he’s inviting the community to solve it.
Resources Available to Participants
Rob committed to providing all materials needed to replicate and extend the work:
- Disk images and memory images from the Stark Research Labs scenario used in the demos
- SIFT Workstation (open source, freely available)
- Exemplar case submissions — Steve Anson from SANS has built his own implementation that participants can review as a reference architecture for what a solid deployment looks like
- CLAUDE.md prime directive files and skills.md files from the proof of concept build
- A NotebookLM[6] resource that community members can query for guidance on replicating the setup
- A dedicated website (domain registered at time of talk, publishing imminent) containing the presentation, all resources, and a walkthrough of the architecture
Rob also indicated he was in contact with Anthropic to potentially provide API token grants to participants, lowering the cost barrier for development and testing at scale.
Why This Matters: Matching AI Speed on the Defensive Side
The strategic framing Rob returned to throughout the talk is that defenders are operating at a structural speed disadvantage as long as offensive teams use AI-assisted automation and defensive teams do not. Anthropic’s own published research documented adversarial use of Claude Code to compress attack timelines from days to seconds. A manual DFIR workflow that takes one to three days per system cannot keep pace with an adversary that pivots in minutes.
The proof of concept demonstrated 14 minutes 27 seconds for a full C drive analysis and 18 minutes for memory forensics. Those numbers already represent a significant compression. But the current implementation has known failure modes — context rot at scale, lack of formal MCP tooling, and the trust gap that comes with any proof-of-concept system. The hackathon is the mechanism for closing that gap with community velocity rather than waiting for a single team to do it in isolation.
The message is direct: the tools are open source, the resources are available, the problem is defined, and the prize pool is funded. The challenge to the community is whether a production-grade AI DFIR capability can be delivered in a single sprint.
Actionable Takeaways
- Register for the SANS hackathon (April 1 – May 15) and contribute to either Track 1 (forensic MCP engineering for SIFT tools) or Track 2 (context rot mitigation). Even partial contributions — a single well-engineered MCP server for one SIFT capability, or a documented context management pattern — advance the state of the field and accelerate the community toward a June 1 enterprise-deployable target.
- Use Steve Anson's exemplar implementation as a reference architecture before building from scratch. Rob explicitly flagged it as a strong example of what a solid backend looks like. Starting from a known-good implementation saves the time otherwise spent on first-principles architecture decisions and lets you focus on the specific improvement you're contributing.
- If you are evaluating whether to pilot AI-assisted DFIR in your organization, frame the current proof of concept accurately: it is a validated PoC with known limitations (context rot at scale, no formal MCP tooling), not a production system. Use the hackathon output — expected by June 1 — as the evaluation point for enterprise deployment decisions, and in the interim, test the existing setup on isolated single-system cases where session restarts are acceptable.
Common Pitfalls
- Treating the proof of concept as production-ready before the community hardening work is complete. Rob was explicit that an hour-and-a-half configuration producing impressive results is not the same as a system with the trust characteristics needed for enterprise incident response. Deploying the current implementation at scale — across hundreds of systems in a live IR engagement — will surface context rot failures and tool parameter guessing errors that the hackathon tracks are specifically designed to eliminate.
- Assuming adversarial teams face the same context rot and tooling limitations that currently constrain the defensive proof of concept. Offensive teams using Claude Code for automated operations are actively iterating on these same problems and are not waiting for the defensive community to catch up. The speed asymmetry is the core threat model, and it only narrows if the community treats this as an active engineering problem rather than a research curiosity.
Conclusion
Rob T. Lee’s demonstration at [un]prompted 2026 draws a clear line in the sand for security operations teams: the speed gap between AI-accelerated offensive operations and manual defensive DFIR is real, documented, and closeable. The architecture — a CLAUDE.md prime directive, per-tool skills.md files, and Claude Code with root access on the SIFT Workstation — is not conceptual. It ran a full forensic investigation of a C drive image, found a malicious binary, mapped persistence mechanisms, reconstructed lateral movement, and generated an ATT&CK-mapped PDF report in 14 minutes 27 seconds.
The honest caveat is that production readiness requires more. Context rot at scale, the absence of formal MCP tooling, and the trust gap of any proof-of-concept system are real constraints — not dismissible ones. The SANS community hackathon (April 1 – May 15, $22,000 prize pool) is the structured mechanism for closing those gaps with community velocity. Whether the June 1 enterprise-deployable target is hit depends on whether the defender community mobilizes with the same urgency that offensive teams are already operating with.
For engineers deciding what to do now: replicate the single-system capability, validate it against your own DFIR test cases, and treat the hackathon output as your enterprise deployment evaluation window. The tools are free, the methodology is documented, and the competitive pressure to move fast is as well documented as any threat intelligence report you’ve read this year.
Related talks worth reading alongside this one:
- Agentic AI security — how autonomous AI systems create new attack and defense surfaces
- Defensive security — the broader landscape of detection, response, and hardening methodologies
- Incident response — frameworks, tooling, and methodologies for scoping and containing compromises
References & Tools
- Claude Code: Anthropic's AI coding agent; given root access on the SIFT Workstation to serve as the agentic orchestrator for autonomous DFIR tool execution and report generation.
- Plaso / log2timeline: Timeline creation and super-timeline tool for correlating forensic artifacts (AMCache, prefetch, event logs) into a unified chronological sequence.
- SIFT Workstation: SANS Institute's open-source DFIR tool suite; the 18-year-old execution environment containing AMCache analyzers, prefetch parsers, event log processors, memory forensics frameworks, and timeline tools.
- Model Context Protocol (MCP): A structured server protocol for exposing tool parameters, outputs, and failure modes to AI agents in a programmatic and reliable way; identified as the target architecture for hackathon Track 1.
- Volatility: Open-source memory forensics framework present in the SIFT Workstation; used by Claude Code to identify malicious processes, code injection, and network connections in memory image analysis.
- NotebookLM: Google's AI-powered research assistant; Rob built a NotebookLM resource as a supplementary Q&A interface so community members can ask questions about replicating the SIFT + Claude Code setup.