The Cyber Archive

Hackuracy: Boosting AST Accuracy Through Hacking

Learn how a 10-month experiment quantified AST accuracy in application security testing — and why the best automated scanner scored just 36.9% F1.

Deep dive of a talk by
Andres Roldan
13 February 2026
4709 words
26 min read

Andres Roldan presenting talk - Hackuracy: Boosting AST Accuracy Through Hacking at OWASP Global AppSec USA 2024

The best automated vulnerability scanner in Fluid Attacks’ experiment detected less than 10% of an application's measured risk exposure, while manual penetration testing of the same application detected about 91%. Yet most organizations treat scanner output as ground truth. AST accuracy in application security testing is not a given; it must be measured, benchmarked, and understood before you trust it with your application security posture.

This post breaks down Fluid Attacks’ 10-month empirical study comparing manual pen testing against automated AST tools using precision, recall, and F-score metrics — and introduces CVSSF[2], an exponential scoring model that makes risk aggregation meaningful in ways raw CVSS[1] scores cannot.

Key Takeaways

  • You'll learn why hacking must be treated as a measurement tool — and how to evaluate its accuracy using precision, recall, and F-score metrics — so you can make data-driven decisions about your security posture.
  • You'll be able to identify the critical limitations of CVSS-based vulnerability comparisons and apply CVSSF (an exponential scoring model) to aggregate and track real risk exposure over time.
  • Apply this to choose the right security testing approach: manual penetration testing reached an 84% F1-score (about 71% recall by count) with zero false positives, while the best automated scanner reached only 36.9% F1. Understanding this tradeoff prevents false confidence in automated-only programs.

Hacking as a Measurement Tool: Defining Accuracy in Application Security Testing

Reframing Hacking as a Measurement Instrument

Most organizations treat penetration testing and vulnerability scanning as services they consume — something to check off a compliance requirement or satisfy a procurement checklist. But AST accuracy in application security testing depends on recognizing that hacking is fundamentally a measurement tool. And like any measurement tool, its value is only as good as its accuracy.

When a company hires a pen testing team or deploys an automated scanner, the core question being answered is: how secure — or insecure — is this application? That is a measurement problem. The output is a reading. And just as a miscalibrated thermometer gives you the wrong temperature, an inaccurate security testing program gives you the wrong picture of your risk.

This reframing has immediate practical consequences. If hacking is measurement, then the question becomes: what makes a measurement accurate?

The Two Axes of Hacking Accuracy

Accuracy in security testing breaks down into two distinct failure modes — and a tool or service must perform well on both to be considered reliable:

1. No false positives — don’t tell lies

A security tool should only report vulnerabilities that actually exist. If a scanner reports 50 vulnerabilities on a system that has 40, there are at least 10 false positives — 10 lies in the report. False positives waste remediation time, erode trust in the tool, and cause security teams to deprioritize real findings buried in noise.

2. No false negatives (no omissions) — report everything that’s true

A security tool must detect all (or most) of the real vulnerabilities present. If a system has 40 vulnerabilities and the report only surfaces 35, there are 5 omissions — 5 false negatives. Those missed findings remain in production, unpatched, silently accumulating risk.

The analogy from the talk is instructive: imagine a medical report on your moles. If you have 40 moles and the report says 50, there are lies — false positives. If the report says 35, there are omissions — false negatives. A good medical report (or a good security test) must find everything real and nothing fake.
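The two failure modes can be counted directly once a report is compared against a known ground truth. The sketch below is illustrative only: the finding identifiers and the `confusion_counts` helper are hypothetical, and the numbers mirror the 40/50/35 example above.

```python
def confusion_counts(reported: set[str], actual: set[str]) -> dict[str, int]:
    """Compare a report against a known ground truth."""
    return {
        "true_positives": len(reported & actual),
        "false_positives": len(reported - actual),   # lies
        "false_negatives": len(actual - reported),   # omissions
    }


# Hypothetical ground truth: 40 real findings, labelled real-0 .. real-39.
actual = {f"real-{i}" for i in range(40)}

# Report A lists all 40 real findings plus 10 that do not exist.
report_a = actual | {f"bogus-{i}" for i in range(10)}

# Report B lists only 35 of the real findings and nothing else.
report_b = {f"real-{i}" for i in range(35)}

counts_a = confusion_counts(report_a, actual)
counts_b = confusion_counts(report_b, actual)
print(counts_a)  # 40 true positives, 10 false positives, 0 false negatives
print(counts_b)  # 35 true positives, 0 false positives, 5 false negatives
```

In practice, matching a reported finding to a ground-truth entry is the hard part (same location, same weakness class); the set comparison assumes that matching has already been done.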

Why This Matters for Security Engineers

This two-axis model — false positives and false negatives — is not just academic. It directly determines whether your security testing program is giving you an accurate picture of your exposure. A tool heavy on false positives drowns your team in noise. A tool heavy on false negatives gives you dangerous false confidence.

Neither failure is acceptable in production security programs. The experiment described in this talk was designed precisely to quantify both failure modes across manual penetration testing and automated scanners — and the results expose a significant accuracy gap that most teams are unaware of.

Actionable Takeaways

  • Adopt the measurement-tool mindset when evaluating any security testing service or product: explicitly ask vendors for false positive rate and false negative rate (or recall/precision metrics) — not just "number of vulnerabilities found."
  • When reviewing security reports, distinguish between confirmed true positives (verified findings) and unverified detections. A high finding count with no verification process is a red flag for false positive inflation.
  • Build accuracy benchmarks into your security testing program by periodically comparing tool output against a manually established ground truth for a representative application or component.

Common Pitfalls

  • Treating vulnerability count as a proxy for security posture — a tool that finds 500 unverified findings is not better than one that finds 100 confirmed vulnerabilities. Volume without accuracy is noise.
  • Assuming zero false negatives from automated tools. Most automated scanners have significant omission rates — missing vulnerabilities that genuinely exist on the target system.

CVSSF: An Exponential Risk Scoring Model to Replace Linear CVSS Comparisons

The Problem with CVSS as a Risk Comparison Tool

CVSS[1] (Common Vulnerability Scoring System) is the industry standard for scoring vulnerability severity — and for good reason. It provides an objective, consistent basis for assessing individual vulnerabilities. But when organizations use CVSS to compare and aggregate vulnerabilities, two critical problems emerge that undermine sound risk decision-making.

Problem 1: Linear Addition of Non-Linear Risk

CVSS scores are not linearly additive. A CVSS 10 vulnerability is not equivalent to two CVSS 5 vulnerabilities. Risk does not work that way. A critical remote code execution flaw is categorically different in impact from two medium-severity misconfigurations — but CVSS arithmetic treats them as equivalent. Organizations that sum CVSS scores to calculate “total risk” are performing a mathematically invalid operation that produces a meaningless number.

The mole analogy from the talk captures this clearly: one malignant melanoma is not the same as 39 benign moles, regardless of what the count says.

Problem 2: Arbitrary Severity Ranges Create False Equivalences

CVSS uses four discrete severity bands: Low, Medium, High, and Critical (9.0+). But there is no meaningful rationale for treating a 9.0 and a 9.9 as the same “Critical”: on an exponential view of risk, a 9.9 is several times more dangerous than a 9.0, not marginally more. And instructing a team to “remediate all Criticals first” creates a perverse incentive: a CVSS 9.0 gets remediated immediately while a CVSS 8.9 waits, even though the severity difference is negligible. The range boundaries are arbitrary, yet they drive real remediation prioritization decisions.

Introducing CVSSF: Exponential Risk Scoring

To address these limitations, Fluid Attacks developed CVSSF[2] — a scoring model that uses the CVSS base score as an exponent to produce values that reflect the exponential, non-linear nature of real-world risk.

The core formula is:

CVSSF = P^(CVSS base score)

Where P is a proportionality constant — a business-defined variable that expresses how many vulnerabilities of a given severity are equivalent to one vulnerability one severity level higher. Fluid Attacks uses P = 4, meaning four CVSS 9 vulnerabilities are considered equivalent in risk to one CVSS 10 vulnerability.

Case Study: CVSSF Equivalence Table

Using the CVSSF formula (CVSSF = P^CVSS, where P=4), Fluid Attacks derived an equivalence table that concretely demonstrates the exponential gap in risk between vulnerability severity levels — showing that a single CVSS 10 vulnerability is equivalent to over 262,000 CVSS 1 vulnerabilities.

Formula and Calculation:

CVSSF = P^(CVSS base score)
where P = 4 (Fluid Attacks' chosen proportionality constant)

P = 4 means: four vulnerabilities at severity N are considered equivalent in risk to one vulnerability at severity N+1.

CVSS Score | CVSSF Value (P=4)
1  | 4^1 = 4
2  | 4^2 = 16
3  | 4^3 = 64
4  | 4^4 = 256
5  | 4^5 = 1,024
6  | 4^6 = 4,096
7  | 4^7 = 16,384
8  | 4^8 = 65,536
9  | 4^9 = 262,144
10 | 4^10 = 1,048,576

Equivalence relationships derived:

  • A CVSS 4 vulnerability = 16 CVSS 2 vulnerabilities in risk weight
  • A CVSS 4 vulnerability = 64 CVSS 1 vulnerabilities in risk weight
  • A CVSS 10 vulnerability = 262,144 CVSS 1 vulnerabilities in risk weight
  • A CVSS 10 vulnerability = 4 CVSS 9 vulnerabilities in risk weight (by definition of P=4)
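The table and equivalences above can be reproduced in a few lines. This is a minimal sketch of the simplified formula as stated in this post (P raised to the base score); the function names `cvssf` and `equivalent_count` are illustrative, not part of any Fluid Attacks tooling.

```python
def cvssf(score: float, p: float = 4) -> float:
    """Simplified CVSSF from the text: p raised to the CVSS base score."""
    return p ** score


def equivalent_count(high: float, low: float, p: float = 4) -> float:
    """Number of findings at `low` that weigh the same as one finding at `high`."""
    return cvssf(high, p) / cvssf(low, p)


# Rebuild the equivalence table for whole-number scores.
for score in range(1, 11):
    print(f"CVSS {score:>2} -> CVSSF {cvssf(score):>9,}")

# Equivalences listed above.
print(equivalent_count(4, 2))    # 16.0
print(equivalent_count(4, 1))    # 64.0
print(equivalent_count(10, 1))   # 262144.0
print(equivalent_count(10, 9))   # 4.0

# Fractional scores: the gap inside the Critical band versus across its lower edge.
print(round(equivalent_count(9.9, 9.0), 2))  # 3.48
print(round(equivalent_count(9.0, 8.9), 2))  # 1.15
```

The last two lines put numbers on the severity-band problem: with P = 4, a 9.9 weighs about 3.5 times a 9.0 even though both are "Critical", while a 9.0 weighs only about 15% more than an 8.9 that sits in a different band.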

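Aggregation in practice: Summing CVSSF values across a set of findings produces a single, meaningful exposure number. On the table's scale, closing a CVSS 10 finding drops the total by 1,048,576, while closing 50 CVSS 1 findings drops it by only 200. Remediation effectiveness becomes mathematically honest. (The experiment's reported total for all 1,221 vulnerabilities, 461,499, is smaller than a single CVSS 10 on this scale, so it was evidently computed with a rescaled variant of the formula, for example a constant offset in the exponent; rescaling of that kind leaves the ratios above unchanged.)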

Three Capabilities CVSSF Unlocks

1. Meaningful vulnerability comparison

CVSSF allows you to say with precision whether remediating vulnerability A has more impact than remediating vulnerability B — not just based on severity labels, but based on actual relative risk weight. A CVSS 9.9 and a CVSS 9.0 are no longer treated as identical.

2. Remediation prioritization

Because CVSSF values reflect real risk magnitude, teams can prioritize remediation by impact rather than by arbitrary severity bands. Solving a CVSS 10 vulnerability has an outsized impact on total CVSSF that eliminating dozens of low-severity findings cannot match.

3. Risk exposure aggregation and time-series tracking

Because CVSSF values are mathematically aggregatable, you can sum them across all vulnerabilities in an application to produce a total risk exposure score. This enables:

  • Point-in-time risk snapshots: What is the current risk exposure of this application?
  • Trend analysis: Is our risk exposure improving or worsening week-over-week?
  • Remediation effectiveness measurement: If we fixed 39 vulnerabilities but left the one CVSS 10 untouched, our CVSSF total barely moved — and now we know that.
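The remediation-effectiveness point can be checked with a small calculation. The backlog below is hypothetical (39 findings at CVSS 3 and one at CVSS 10), and the sketch uses the simplified P = 4 formula from this post.

```python
def cvssf(score: float, p: float = 4) -> float:
    """Simplified CVSSF from the text: p raised to the CVSS base score."""
    return p ** score


def total_exposure(scores: list[float]) -> float:
    """Aggregate exposure: the sum of per-finding CVSSF values."""
    return sum(cvssf(s) for s in scores)


# Hypothetical backlog: 39 low-severity findings and one critical.
lows = [3.0] * 39
critical = [10.0]

before = total_exposure(lows + critical)

# Sprint A closes all 39 lows; sprint B closes only the critical.
drop_a = (before - total_exposure(critical)) / before
drop_b = (before - total_exposure(lows)) / before

print(f"exposure before: {before:,.0f}")                         # 1,051,072
print(f"closing 39 lows removes {drop_a:.2%} of exposure")       # 0.24%
print(f"closing the critical removes {drop_b:.2%} of exposure")  # 99.76%
```

Tracked week over week, the same total gives the trend line described above; a count-based dashboard would have reported sprint A as 39 fixes and sprint B as one.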

In the experiment, the ground-truth vulnerability universe had a total CVSSF of 461,499. Manual pen testing detected roughly 420,000 of that exposure (about 91%). The best automated tool detected less than 10%. That gap is only visible because CVSSF allows aggregation.

Applying CVSSF in Your Organization

The P value (4 in Fluid Attacks’ case) is a business variable. Your organization can adjust it based on your risk tolerance and the specific equivalences that make sense for your context. The key insight is that some exponential model is vastly more accurate than treating CVSS scores as linearly additive.
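To see how much the choice of P matters, the sketch below compares three candidate values. Only P = 4 comes from the talk; 2 and 10 are arbitrary alternatives chosen for illustration.

```python
def cvssf(score: float, p: float) -> float:
    """Simplified CVSSF from the text: p raised to the CVSS base score."""
    return p ** score


# How many CVSS 5 findings equal one CVSS 10 finding under each P?
ratios = {p: cvssf(10, p) / cvssf(5, p) for p in (2, 4, 10)}

for p, ratio in ratios.items():
    print(f"P={p:>2}: one CVSS 10 = {ratio:,.0f} CVSS 5 findings")
# P= 2: one CVSS 10 = 32 CVSS 5 findings
# P= 4: one CVSS 10 = 1,024 CVSS 5 findings
# P=10: one CVSS 10 = 100,000 CVSS 5 findings
```

A larger P concentrates almost all exposure in the top few findings; a smaller P lets volume matter more. Whichever value is chosen, keeping it fixed is what makes totals comparable over time.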

Teams using CVSSF-style scoring can present executive stakeholders with a single number that actually tracks application risk over time — making security improvements visible and remediation decisions defensible.

Actionable Takeaways

  • Stop adding CVSS scores together or averaging them to represent "total risk." Instead, adopt an exponential aggregation model like CVSSF to produce a meaningful, additive risk exposure metric for your applications.
  • Use CVSSF (or an equivalent exponential model) to build a risk exposure dashboard that tracks your application's total vulnerability risk over time — making the impact of remediation sprints visible to engineering and business stakeholders.
  • When prioritizing remediation, use CVSSF values rather than CVSS severity bands. A single CVSS 9.5 vulnerability warrants more immediate attention than dozens of CVSS 5 findings — and CVSSF makes that prioritization mathematically explicit.

Common Pitfalls

  • Using CVSS severity ranges (Low/Medium/High/Critical) as the primary remediation prioritization mechanism — the range boundaries are arbitrary and cause teams to treat a CVSS 9.0 and 8.9 differently when there is negligible actual risk difference.
  • Measuring remediation effectiveness by "number of vulnerabilities closed" rather than by change in total risk exposure — closing 50 low-severity findings while leaving one critical untouched may produce near-zero improvement in actual risk.

Experiment Design and Methodology: Benchmarking Manual Pentesting Against Automated AST Tools

Experiment Goals and Scope

The central question driving this experiment was not “which scanner is best?” — tool-vs-tool comparisons are common and often vendor-influenced. The question was more fundamental: how does the entire category of automated AST tools compare against expert manual penetration testing when evaluated as measurement instruments?

This framing matters. The experiment was designed to assess accuracy at the category level, not to produce a product ranking. The automated tools used were anonymized in the results specifically to prevent the findings from being misread as a vendor comparison.

The experiment was ongoing at the time of the talk — 10 months in, conducted by Fluid Attacks’ internal research team, with results scoped to a single application under controlled conditions.

Target Application Selection

Selecting the right target application was critical to producing valid results. The requirements:

  • Realistic and modern architecture — to reflect real-world attack surfaces
  • Open source and vulnerable by default — to allow controlled, legal testing without liability
  • Representative technology stack — to limit scope while remaining meaningful

The chosen application uses a Model-View-Controller (MVC) architecture, is a single-page application (SPA) with a RESTful API backend, and is built on Node.js. The specific name was withheld, but it is described as one of the most commonly used vulnerable-by-default applications available publicly.

Establishing the Ground-Truth Vulnerability Universe

Case Study: Senior Hacker Baseline for Ground-Truth Detection

To create a valid experimental benchmark, Fluid Attacks assigned a senior hacker to exhaustively enumerate all vulnerabilities in the target application over approximately two months using every available testing method — producing a ground-truth universe of 1,221 vulnerabilities and a total CVSSF exposure of 461,499.

Before any comparison could happen, a definitive ground truth had to be established. A senior hacker — one of Fluid Attacks’ most experienced researchers — was assigned to exhaustively enumerate the application’s vulnerabilities using:

  • Manual penetration testing (authenticated and unauthenticated)
  • Static analysis / source code review
  • Automated vulnerability scanning (multiple tools)
  • Dependency and software composition analysis (SCA)

Duration: Approximately two months of continuous testing by a single senior researcher.

Output: 1,221 vulnerabilities across 105 vulnerability categories, with a total CVSSF risk exposure of 461,499. This universe became the denominator for all subsequent F-score calculations.

A Critical Finding: Authentication-Gated Vulnerabilities

Case Study: 73.2% of Findings Require Login — Impact on DAST Tool Accuracy

73.2% of all vulnerabilities — and approximately 90% of total CVSSF risk exposure — were only accessible after successful authentication, creating a structural blind spot for any DAST tool that cannot perform authenticated scanning.

One of the most consequential findings from the enumeration phase was the distribution by authentication requirement:

Access Context  | % of Vulnerabilities   | % of CVSSF Risk Exposure
Unauthenticated | ~26.8% (~327 findings) | ~10%
Authenticated   | 73.2% (~894 findings)  | ~90%

This is a critical constraint for automated DAST (Dynamic Application Security Testing) tools, many of which do not support authenticated scanning. For any unauthenticated DAST tool:

  • Maximum possible recall: ~26.8% (only the unauthenticated surface is reachable)
  • Maximum possible CVSSF risk detection: ~10% of total exposure
  • This is a hard architectural ceiling, not a tuning problem — no amount of scanner improvement overcomes the inability to authenticate
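The ceiling can be worked out from the figures above. The sketch below assumes a best case for an unauthenticated tool: it finds every reachable vulnerability and reports no false positives. Variable names are illustrative.

```python
TOTAL_VULNS = 1221   # ground-truth universe from the experiment
AUTH_SHARE = 0.732   # share of findings only reachable after login

# Recall ceiling: the unauthenticated share of the universe.
max_recall = 1 - AUTH_SHARE
reachable = round(TOTAL_VULNS * max_recall)

# Best-case F1: perfect precision combined with the recall ceiling.
max_f1 = 2 * 1.0 * max_recall / (1.0 + max_recall)

print(f"reachable findings: {reachable}")    # 327
print(f"recall ceiling:     {max_recall:.1%}")  # 26.8%
print(f"F1 ceiling:         {max_f1:.1%}")      # 42.3%
```

Even a flawless unauthenticated scanner would therefore top out at roughly 42% F1 by count on this application, below the 50% mark used later in this post.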

The Comparison Phase

With the ground truth established, a separate manual pen testing team (distinct from the enumeration hacker) was assigned to test the same application. Using a different team eliminated bias — they had no prior knowledge of the enumerated findings and approached the application fresh.

The manual pen testing team worked for 49 days — a significant operational downside compared to automated tools that complete in minutes.

In parallel, the suite of automated scanners ran against the same application. The scanners’ identities were anonymized, and a “best tool” representative was used for the primary comparative analysis.

Scope Constraints and Applicability

The experiment has explicit constraints that inform generalization:

  • Results are for one application — different technology stacks may produce different absolute numbers
  • The language and framework affect which tools perform better or worse
  • Scanner capabilities (authenticated vs. unauthenticated, SAST vs. DAST vs. SCA) directly constrain detection
  • Manual pen testing execution time (49 days) is a real operational constraint

These constraints do not invalidate the findings — they contextualize them. The magnitude of the accuracy gap found is large enough that the directional conclusion holds even if absolute numbers vary across targets.

Actionable Takeaways

  • When evaluating automated AST tools, explicitly test them against an application with known vulnerabilities (a vulnerable-by-default application or an internal test bed) and compare the tool's findings against your ground truth — not just against other tools.
  • Verify whether your DAST tools support authenticated scanning before deploying them. If they cannot authenticate, assume they are structurally blind to whatever sits behind login; in this experiment that was 73.2% of vulnerabilities and about 90% of CVSSF risk exposure.
  • Use the two-team methodology when establishing a ground truth for your own applications: have one team enumerate, then have a separate team test independently — this eliminates anchoring bias and gives you a cleaner signal on your testing program's coverage.

Common Pitfalls

  • Deploying DAST scanners without authentication configuration and treating the results as a representative picture of application risk. Unauthenticated scans cannot reach anything behind login, which in this experiment was the majority of vulnerabilities.
  • Conflating "the scanner found 300 issues" with "we have 300 vulnerabilities" — without a ground truth, you cannot distinguish true positives from false positives or know what the tool missed.

Measurement Theory Applied to Security Testing: Precision, Recall, and F-Score

Applying Measurement Theory to Security Testing

To rigorously evaluate false positives and false negatives in security testing, the experiment borrowed a framework from information retrieval and machine learning: F-score (also called F-measure). This framework transforms qualitative claims about tool accuracy into precise, comparable numbers — enabling apples-to-apples evaluation of very different testing approaches.

The Four Outcome Categories

Every vulnerability finding falls into one of four categories:

Category            | Definition
True Positive (TP)  | A vulnerability the tool reported that actually exists
False Positive (FP) | A vulnerability the tool reported that does NOT exist
True Negative (TN)  | A non-vulnerability the tool correctly did not flag
False Negative (FN) | A vulnerability that EXISTS but the tool did NOT report

A good security tool maximizes true positives while minimizing false positives and false negatives. The challenge: optimizing for one often trades off against the other.

Precision: Penalizing Lies

Precision measures how trustworthy a tool’s positive findings are:

Precision = TP / (TP + FP)

A tool with high precision tells few lies. If it reports 100 vulnerabilities and 95 are real, precision is 95%. If only 40 are real, precision is 40% — and your team wastes 60% of its remediation effort chasing ghosts.

Precision is the metric most relevant to developers and remediation teams. A false positive costs a developer investigation time — time that compounds across every false positive in every report. A tool with low precision erodes team trust until findings are routinely ignored.

Recall: Penalizing Omissions

Recall (also called sensitivity) measures how completely a tool surfaces real vulnerabilities:

Recall = TP / (TP + FN)

A tool with high recall misses few real vulnerabilities. If there are 100 real vulnerabilities and the tool finds 90, recall is 90%. If it only finds 30, recall is 30% — and 70 real vulnerabilities remain in production undetected and unpatched.

Recall is the metric most relevant to risk and security leadership. An undetected critical vulnerability is not just a missed finding — it is a live exposure that exists in production regardless of whether any scanner found it.
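Both formulas are one-liners. The sketch below reproduces the worked numbers from the two subsections above; the function names and the zero-denominator handling are choices made for this example.

```python
def precision(tp: int, fp: int) -> float:
    """Share of reported findings that are real: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of real vulnerabilities that were reported: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# 100 reported findings, 95 real versus 40 real.
print(precision(tp=95, fp=5))    # 0.95
print(precision(tp=40, fp=60))   # 0.4

# 100 real vulnerabilities, 90 found versus 30 found.
print(recall(tp=90, fn=10))      # 0.9
print(recall(tp=30, fn=70))      # 0.3
```

Note that true negatives appear in neither formula, which is convenient in security testing, where the set of "things that are not vulnerabilities" is not well defined.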

F-Score: Balancing Precision and Recall

The F-score (F-beta score)[3] combines precision and recall into a single metric using a weighting parameter beta (β):

F-beta = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall)

Three variants are used in the experiment:

F1 Score (β = 1): Equal weight to precision and recall

The most commonly cited variant. F1 treats false positives and false negatives as equally costly.

Interpretation benchmarks:

  • F1 > 80%: strong tool; detected most real vulnerabilities while keeping false positives low
  • F1 < 50%: failing band, described in the talk as “worse than random guessing”; findings cannot be trusted as representative

F0.5 Score (β = 0.5): Precision-weighted — developer use case

F0.5 penalizes false positives more harshly. Use this when selecting tools for CI/CD integration or developer-facing workflows — developers want a tool that, when it flags something, is almost certainly right.

F2 Score (β = 2): Recall-weighted — risk use case

F2 penalizes false negatives more harshly. Use this when evaluating coverage for risk assessment — missing a critical vulnerability is catastrophic, far worse than investigating a false positive.

In the experiment, F2 was evaluated specifically against CVSSF-weighted risk exposure — measuring not just whether vulnerabilities were found, but whether the high-severity, high-impact ones were found.
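The effect of β is easiest to see when two tools swap places depending on the variant. The sketch below implements the F-beta formula given above and applies it to two hypothetical scanners; their precision and recall values are invented for illustration and do not come from the experiment.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score: (1 + b^2) * P * R / (b^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical scanners as (precision, recall) pairs.
tools = {
    "quiet": (0.95, 0.30),  # rarely wrong, misses a lot
    "noisy": (0.45, 0.80),  # finds a lot, often wrong
}

scores = {
    name: {beta: f_beta(p, r, beta) for beta in (0.5, 1.0, 2.0)}
    for name, (p, r) in tools.items()
}

for name, by_beta in scores.items():
    summary = ", ".join(f"F{beta:g}={value:.1%}" for beta, value in by_beta.items())
    print(f"{name}: {summary}")
# quiet: F0.5=66.3%, F1=45.6%, F2=34.8%
# noisy: F0.5=49.3%, F1=57.6%, F2=69.2%
```

Under F0.5 the quiet tool wins, which suits a developer-facing pipeline; under F2 the noisy tool wins, which suits a coverage-oriented risk assessment. Neither ranking is wrong; they answer different questions.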

Reading the Scores

F-Score Range | Interpretation
> 80%  | Strong: accurate, reliable, fit for production use
50–80% | Acceptable in specific contexts; understand which axis is weak
< 50%  | Failing: the talk's “worse than random guessing” band, where output cannot be trusted as a representative security signal

The majority of automated scanners in the experiment scored below 50% on F1, the band the talk labels “worse than random guessing.” Note that F1 has no fixed chance baseline, and a low F1 can come from low recall, low precision, or both, so check the two components before deciding which failure mode is responsible.

Case Study: F-Score Comparison Across F1, F0.5, and F2 Metrics

By applying the F-score framework, Fluid Attacks produced a quantified, side-by-side accuracy comparison — revealing an F1-score gap of 84% vs 36.9% and a risk-exposure detection gap of 91% vs less than 10%.

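Approach            | True Positives | False Positives | False Negatives | CVSSF Detected
Manual pen testing  | 872            | 0               | 349             | ~420,000 (~91%)
Best automated tool | not reported   | not reported    | not reported    | <46,150 (<10%)

Per-finding counts for the best automated tool are not given. If its 36.9% F1 is count-based against the 1,221-finding universe, it implies at least roughly 276 true positives (recall of at least 22.6%), so the tool cannot have missed more than about 945 findings.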

Calculated scores:

Manual Precision  = 872 / (872 + 0) = 100%
Manual Recall     = 872 / 1,221 = ~71.4% (by count)
Manual F1         = 84%   ← reported; ~83.3% from the counts above; clears the 80% "strong tool" threshold
Manual F0.5       = 93%   ← precision-weighted; zero false positives
Manual F2         = 92%   ← risk-weighted; recall taken as CVSSF detected, ~420,000 / 461,499 = ~91%

Best Tool F1      = 36.9% ← below the talk's 50% threshold
Best Tool F2      = ~16%  ← risk-weighted; the tool detected under 10% of total CVSSF exposure
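The manual scores can be re-derived from the published counts as a sanity check. This sketch assumes that the risk-weighted F2 uses CVSSF detected over total CVSSF as its recall and that precision stays at 100% because no false positives were reported; both are assumptions made here, not details confirmed by the talk.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score: (1 + b^2) * P * R / (b^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Manual pen testing counts from the table above.
TP, FP, FN = 872, 0, 349
count_precision = TP / (TP + FP)
count_recall = TP / (TP + FN)

f1 = f_beta(count_precision, count_recall, beta=1.0)
f05 = f_beta(count_precision, count_recall, beta=0.5)

# Risk-weighted recall (assumption: CVSSF detected / CVSSF total).
risk_recall = 420_000 / 461_499
f2_risk = f_beta(1.0, risk_recall, beta=2.0)

print(f"F1   (count-based):   {f1:.1%}")       # 83.3%
print(f"F0.5 (count-based):   {f05:.1%}")      # 92.6%
print(f"F2   (risk-weighted): {f2_risk:.1%}")  # 92.7%
```

The results land within about a point of the reported 84%, 93%, and 92%; the small differences are consistent with rounding in the published counts, such as the approximate 420,000 figure.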

Actionable Takeaways

  • Use F0.5 score as your primary evaluation metric when selecting a security scanner for developer-facing workflows (CI/CD integration, pre-commit hooks, IDE plugins) — prioritize tools that minimize false positives even at some cost to coverage.
  • Use F2 score (weighted toward recall, evaluated against CVSSF-weighted risk exposure) when selecting or evaluating security testing services for risk assessment purposes — missing a critical vulnerability is far costlier than investigating a false positive.
  • Request F-score benchmarks (or the underlying precision and recall numbers) from any security tool vendor before procurement. If a vendor cannot provide these metrics against a known-vulnerable application, treat their accuracy claims as unverified.

Common Pitfalls

  • Using F1 score alone for all evaluation contexts — F1's equal weighting of precision and recall masks the tradeoffs that matter most for specific use cases (developer productivity vs. risk coverage).
  • Interpreting a high true-positive count in isolation without normalizing against the ground truth. A tool that finds 500 vulnerabilities when the real total is 1,000 has 50% recall, which sounds reasonable until you realize it missed half the application's real vulnerabilities, and possibly far more than half of its risk if the missed findings are the severe ones.

Experiment Results and Conclusions: The Accuracy Gap Between Manual and Automated Security Testing

Manual Penetration Testing Results

The second manual pen testing team — operating independently from the hacker who established the ground-truth universe — produced results that validate human-led testing as the highest-accuracy security measurement approach:

Metric                       | Manual Pen Testing
True Positives               | 872 confirmed vulnerabilities
False Positives              | 0
F1 Score                     | 84%
F0.5 Score                   | 93%
F2 Score (risk-weighted)     | 92%
CVSSF Risk Exposure Detected | 420,000 of 461,499 (91%)
Execution Time               | 49 days

The zero false positive result is significant: every vulnerability the manual team reported was verified as real. No remediation effort was wasted. The 84% F1 score clears the “strong tool” threshold, and the 92% F2 score against CVSSF-weighted risk means the team found the overwhelming majority of high-severity, high-impact vulnerabilities — not just the easy ones.

Best Automated Tool Results

Metric                       | Best Automated Tool
F1 Score                     | 36.9%
F2 Score (risk-weighted)     | ~16%
CVSSF Risk Exposure Detected | < 10% of total
Execution Time               | Minutes

An F1 score of 36.9% falls well below the 50% threshold the talk labels “worse than random guessing.” More critically, the tool detected less than 10% of total CVSSF risk exposure, meaning over 90% of the application’s real risk remained invisible to the scanner.

The false negatives of the best tool were staggering: the vulnerabilities it missed represented roughly the same total CVSSF exposure as everything manual pen testing found.

The 74.8% Finding: What Only Humans Found

74.8% of the vulnerabilities in the ground-truth universe were detected exclusively by manual penetration testing. No automated tool in the experiment was able to find these vulnerabilities. This is not a marginal edge case — it is the majority of the application’s vulnerability surface.

Any security program that relies exclusively on automated scanning risks being blind to roughly three-quarters of its application’s vulnerabilities, as the tools in this experiment were, and, given the CVSSF distribution, to the majority of its highest-severity risk.

The Speed vs. Accuracy Tradeoff

The one dimension where automated tools clearly win is execution time. A scanner completes in minutes. Manual pen testing in this experiment required 49 days. This is a real operational constraint.

The practical implication is not “automate everything” or “use only humans,” but to understand the tradeoff explicitly:

  • Automated tools — fast, cheap, continuous, useful for catching common, well-known vulnerability classes early in the SDLC. Low accuracy, high throughput.
  • Manual pen testing — slow, expensive, periodic, but captures the vulnerability surface that automated tools cannot: authenticated vulnerabilities, complex business logic flaws, and chained attack paths that require human creativity.

Four Conclusions for Security Engineers

  1. Manual intervention is still necessary to achieve high AST accuracy. No automated tool in a 10-month experiment came close to matching the coverage and risk detection of a skilled human team.

  2. The best automated tool achieved only 36.9% F1-score. Treat automated scan results as a starting point and triage layer — not a complete picture of your security posture.

  3. Automated tools captured less than 10% of the application's total risk exposure, against roughly 91% for manual pen testing. If your security program is built on automated scanning alone, assume you may be missing 90%+ of your application’s real risk exposure until a benchmark shows otherwise.

  4. Authentication-gated vulnerabilities accounted for 73.2% of findings and ~90% of risk. Any tool that cannot authenticate will structurally miss most of what matters.

Actionable Takeaways

  • Design your security testing program to include both automated scanning (for speed and CI/CD integration) and periodic manual penetration testing (for coverage), and communicate to stakeholders that the automated tools in this benchmark scored well below 50% F1, so an automated-only program should not be assumed to give full coverage.
  • When reporting security posture to leadership, use CVSSF-weighted risk exposure as your primary metric rather than vulnerability count — it provides a meaningful, aggregatable number that reflects the actual severity distribution of findings.
  • Benchmark your current security tools against a known-vulnerable application to establish your actual F1/F0.5/F2 baseline before making investment decisions about tooling or testing services.

Common Pitfalls

  • Presenting automated scan results to executives or compliance auditors as evidence of a complete security assessment. Without disclosing that the best automated tool in this controlled benchmark scored 36.9% F1, this misrepresents the real state of application security.
  • Optimizing your security program for speed (faster scans, more frequent automated runs) without addressing the structural recall gap — running a low-accuracy scanner more frequently does not improve coverage of the vulnerabilities it is architecturally incapable of finding.

Conclusion

The Hackuracy experiment is an empirical benchmark of application security testing accuracy, and within its scope (one application, tested under controlled conditions) its findings point clearly in one direction: automated tools, even the best one tested, operated far below the accuracy needed for reliable security measurement. An F1 score of 36.9% and less than 10% CVSSF risk exposure detection are not rounding errors to tune away; they reflect what the tested scanners could achieve against a realistic application where most vulnerabilities sit behind authentication.

The solution is not to abandon automation. It is to use automation honestly — as a fast, continuous first pass that catches the low-hanging fruit — while investing in expert manual penetration testing for the coverage, precision, and risk-weighted recall that automated tools cannot provide. And when measuring risk, replace CVSS arithmetic with an exponential model like CVSSF that reflects how risk actually compounds.



References & Tools

  1. CVSS (Common Vulnerability Scoring System) — Industry-standard framework for scoring vulnerability severity, maintained by FIRST.
  2. CVSSF (CVSS for Fluid Attacks) — Fluid Attacks' exponential risk scoring model using CVSS base scores as exponents to produce aggregatable, non-linear risk values.
  3. F-score / F-beta score — Statistical measurement framework from information retrieval and machine learning, applied here to quantify security testing accuracy.

Frequently asked

Questions from the audience

What is CVSSF and how does it differ from standard CVSS scoring?
CVSSF (CVSS for Fluid Attacks) uses the CVSS base score as an exponent: CVSSF = P^(CVSS score), where P=4 by default. Unlike summing raw CVSS scores, which treats severity as linear, CVSSF reflects the exponential nature of real risk: a CVSS 10 vulnerability equals 262,144 CVSS 1 vulnerabilities in CVSSF terms. Critically, CVSSF values are aggregatable, enabling a single total risk exposure score for an entire application.

Why did the best automated AST tool score below 50% on the F1 metric?
An F1 below 50% falls in the band the talk describes as worse than random guessing. The primary drivers were a high false negative rate (missing most vulnerabilities, especially authenticated ones) and a meaningful false positive rate. Since 73.2% of the application's vulnerabilities required authentication to discover, tools without authenticated scanning capability faced a structural ceiling they could not overcome regardless of tuning.

Which F-score variant should I use when selecting a security testing tool?
It depends on who you are. Developers integrating scanners into CI/CD pipelines should optimize for F0.5 (precision-weighted): it penalizes false positives most harshly, prioritizing tools that don't waste developer time on ghost findings. Security engineers and CISOs evaluating coverage should optimize for F2 (recall-weighted, scored against CVSSF risk exposure): it penalizes missing high-severity vulnerabilities, which is the costlier failure mode for risk programs.

Does this experiment mean I should stop using automated scanners altogether?
No. The conclusion is calibration, not elimination. Automated tools are fast, cheap, and enable continuous feedback in the SDLC. Their value is catching well-known vulnerability classes early and often. The problem is treating them as a complete security measurement. In this experiment, automated tools missed 74.8% of vulnerabilities and 90%+ of risk exposure. Use them as a triage layer and complement them with periodic expert manual penetration testing for full coverage.