
A server sanitizes HTML perfectly — then the browser parses the output and executes the XSS payload anyway. Server-side HTML sanitization vulnerabilities aren’t the result of poor implementation; they’re structurally inevitable, because HTML 4 parsers on the server see different content than HTML 5 browsers render on the client.
This post breaks down Yaniv Nizry’s OWASP AppSec USA 2024 research tracing a single Grav CMS bypass back to libxml2[1], exposing a cascade of parser differential vulnerabilities across libraries like HTMLPurifier[2] and TYPO3[3]. You’ll learn why the correct architectural fix is client-side sanitization using DOMPurify or the browser-native Sanitizer API[4].
Key Takeaways
- You'll understand why server-side HTML sanitization is structurally flawed due to parser differentials, round-trip parsing failures, and browser version inconsistencies — and why these gaps cannot be patched away.
- You'll be able to identify how a single vulnerability in a shared HTML parsing library (libxml2) can cascade across multiple PHP sanitizers and downstream applications, multiplying the blast radius of one bypass.
- You'll be able to apply the principle of sanitizing where content is rendered (on the victim's client side, using tools like DOMPurify or the emerging Sanitizer API) to eliminate an entire class of XSS bypass vectors.
How HTML Sanitizers Work and Why They Exist
Cross-site scripting (XSS) occurs when an attacker injects JavaScript into a web page that is subsequently rendered and executed in the context of another user’s browser. The canonical example is a comment field: a legitimate user posts plain text, but an attacker posts a malicious payload such as an <img> tag with an onerror event handler. If the server does not handle the input correctly, any victim who views that page will have arbitrary JavaScript executed in their browser session.
The simplest mitigation is HTML entity encoding — escaping characters like < and > so they are rendered as literal text. However, this approach is too restrictive for applications that genuinely need to support formatted content: headings, bullet lists, embedded images, and other HTML features. As the set of allowed features expands, so does the attack surface.
The Three-Step Sanitization Model
To allow rich HTML while blocking script execution, applications use HTML sanitizers. The core mechanism follows three deterministic steps:
- Parse the untrusted input string into a DOM tree. The raw string submitted by the user has no inherent structure. The sanitizer runs it through an HTML parser, producing a structured, traversable document object model: a tree of nodes representing elements, attributes, and text content. The sanitizer can now reason about the document semantically rather than treating it as opaque text.
- Iterate the DOM tree and filter disallowed nodes. The sanitizer walks every node and applies an allow-list or deny-list policy. An onerror event handler attribute is identified as script-executing and removed. Elements like <script> or <iframe> that are not on the approved list are stripped entirely. What remains is a tree that, according to the sanitizer’s policy, contains only permitted markup.
- Serialize the filtered tree back to an HTML string. The sanitized DOM object is converted back into a string and passed downstream: stored in a database and later served to other users, or returned directly in the HTTP response.
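The three steps above can be sketched in a few lines of Python. This is a toy illustration of the parse, filter, serialize model, not a sanitizer to deploy: the allow-lists and the names `Node`, `TreeBuilder`, `filter_tree`, and `sanitize` are invented for this example, and the standard library's `html.parser` is not a spec-compliant HTML 5 parser, which is exactly the kind of gap the rest of this post is about.

```python
from dataclasses import dataclass, field
from html import escape
from html.parser import HTMLParser

# Hypothetical policy for this sketch only.
ALLOWED_TAGS = {"p", "b", "i", "ul", "li", "h2", "img"}
ALLOWED_ATTRS = {"img": {"src", "alt"}}
VOID_TAGS = {"img", "br", "hr"}


@dataclass
class Node:
    tag: str  # "#root" for the container, "#text" for text nodes
    attrs: list = field(default_factory=list)
    children: list = field(default_factory=list)
    text: str = ""


class TreeBuilder(HTMLParser):
    """Step 1: parse the string into a (very simplified) tree."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].children.append(Node("#text", text=data))

    # handle_comment is not overridden, so comments are dropped entirely.


def filter_tree(node):
    """Step 2: drop disallowed elements (with their subtree) and attributes."""
    kept = []
    for child in node.children:
        if child.tag == "#text":
            kept.append(child)
        elif child.tag in ALLOWED_TAGS:
            allowed = ALLOWED_ATTRS.get(child.tag, set())
            child.attrs = [(k, v) for k, v in child.attrs if k in allowed]
            filter_tree(child)
            kept.append(child)
    node.children = kept


def serialize(node):
    """Step 3: turn the filtered tree back into an HTML string."""
    if node.tag == "#text":
        return escape(node.text, quote=False)
    inner = "".join(serialize(child) for child in node.children)
    if node.tag == "#root":
        return inner
    attrs = "".join(f' {k}="{escape(v or "", quote=True)}"' for k, v in node.attrs)
    if node.tag in VOID_TAGS:
        return f"<{node.tag}{attrs}>"
    return f"<{node.tag}{attrs}>{inner}</{node.tag}>"


def sanitize(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    filter_tree(builder.root)
    return serialize(builder.root)


dirty = '<p onclick="x">hi</p><script>alert(1)</script><img src="a.png" onerror="alert(1)">'
print(sanitize(dirty))  # <p>hi</p><img src="a.png">
```

The output is only as trustworthy as the assumption that whoever parses this string next builds the same tree that `TreeBuilder` did.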
Configurability and the Allow-List Model
Sanitizers are intentionally configurable because different applications have different requirements. A blogging platform may permit <h2>, <ul>, and <img> but prohibit <form> and <input>. A documentation tool may allow code blocks. This per-deployment configuration means the same sanitizer library may present a different attack surface depending on which features an application chooses to enable. Allowing more HTML features directly increases the number of potential bypass vectors — a critical consideration when examining how parser differentials are exploited.
Sanitizers Are Software and Can Be Bypassed
Sanitizers are pieces of software like any other — they can contain bugs, and those bugs can be exploited to bypass intended protections. The impact of a sanitizer bypass is context-dependent: it inherits the impact of XSS, which ranges from negligible in low-privilege contexts to full account takeover or remote code execution in high-privilege applications.
This foundational model — parse, filter, serialize — is where the architecture appears sound in theory. The structural failures emerge from the assumptions baked into the parsing and serialization steps.
Actionable Takeaways
- Audit every application that accepts and re-renders user-supplied HTML to confirm whether a sanitizer is in place and whether its allow-list configuration is appropriately restrictive. Enabling HTML features such as comments, certain block elements, or non-standard attributes expands the exploitable surface area proportionally.
- Treat sanitizer bypass as a realistic threat model, not an edge case. Because sanitizers are software, they carry CVE risk. Track the security advisories of any third-party sanitizer library in use the same way you track vulnerabilities in any other dependency.
Common Pitfalls
- Conflating HTML escaping with HTML sanitization. Escaping is appropriate when no HTML should be rendered at all, but applying it in contexts where rich HTML is intentionally supported breaks the user experience without providing a correct security boundary.
HTML 5 Parser Differentials and Server-Side Sanitization Vulnerabilities
Server-side HTML sanitization vulnerabilities are not edge cases — they are a structural inevitability rooted in how HTML 5 redefined the parsing specification. The critical assumption embedded in the sanitization model is that the parser used at sanitization time produces an identical DOM to the one the victim’s browser will produce when rendering the final output. HTML 5 invalidates this assumption at multiple levels.
The HTML 4 vs. HTML 5 Comment Parsing Differential
HTML 5 introduced new features and updated specifications that changed how browsers interpret markup — including comment syntax. In HTML 4, a comment block is straightforward: it starts with <!-- and ends with -->. HTML 5 introduced additional rules that affect where a comment terminates, including handling of sequences like --!> that were previously treated as part of the comment body.
An attacker can craft a payload that, when parsed by a server-side parser implementing HTML 4 semantics, appears as a single benign comment node. No malicious elements are visible in the resulting DOM tree, so the sanitizer passes it through untouched. When that same string is rendered by a modern HTML 5 browser, the comment terminates early and everything following it — including an injected <img onerror=...> or similar payload — is parsed as live HTML. The XSS fires on the victim’s machine even though the server-side sanitizer saw nothing to remove.
This is not a bug in any one sanitizer. It is a direct consequence of the parser used during sanitization implementing a different specification than the one the browser uses at render time.
HTML 5 vs HTML 4 Comment Parsing Differential Bypass
A server-side HTML sanitization bypass exploits the differential between the HTML 4 and HTML 5 comment parsing rules. A PHP-based sanitizer using libxml2[1] parses the injected markup as a benign comment (HTML 4 semantics) and strips nothing. The payload is then forwarded to the browser, which re-parses it under HTML 5 rules, terminates the comment prematurely, and renders an executable element: a complete XSS bypass.
How the bypass works step by step:
- In HTML 4, a comment begins with <!-- and ends with -->. A sanitizer relying on an HTML 4-compliant parser (PHP’s built-in parser backed by libxml2) classifies the entire injected string as a single comment node, leaving nothing to flag or remove.
- HTML 5 introduced an updated comment termination rule. Certain sequences (e.g., --!>) that HTML 4 treats as part of the comment body are treated by HTML 5 as comment terminators.
- The attacker constructs a payload: a string beginning with <!-- that includes an HTML 5-specific comment terminator, followed by <img src=x onerror=alert(1)>.
- The payload is submitted to the PHP-based sanitizer. libxml2 parses it as a single comment node. The sanitizer finds no disallowed elements and passes the payload through unchanged.
- The victim’s browser receives the serialized output and re-parses under HTML 5 rules. The comment terminates early. The <img> element is now a live DOM node.
- The onerror handler fires and the XSS executes. No non-default sanitizer configuration is required.
- Because multiple PHP sanitizers all delegate parsing to PHP’s built-in parser (backed by libxml2), any sanitizer that permits comments inherits this identical bypass.
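The comment-termination differential in the steps above can be modelled with two small functions. This is a simplified sketch: `comment_end_legacy` and `comment_end_html5` are hypothetical names that model only the `-->` versus `--!>` rule, not libxml2 or any real browser tokenizer, and the payload is one illustrative member of the class rather than the exact string used in the research.

```python
def comment_end_legacy(s: str, start: int) -> int:
    """Legacy model: a comment opened at `start` ends only at '-->'."""
    end = s.find("-->", start + 4)
    return len(s) if end == -1 else end + 3


def comment_end_html5(s: str, start: int) -> int:
    """Simplified HTML 5 model: '-->' or '--!>' closes, whichever comes first."""
    hits = []
    for terminator in ("-->", "--!>"):
        pos = s.find(terminator, start + 4)
        if pos != -1:
            hits.append((pos, len(terminator)))
    if not hits:
        return len(s)
    pos, length = min(hits)
    return pos + length


def markup_after_comment(s: str, comment_end) -> str:
    """Return whatever the given model treats as live markup after the comment."""
    if not s.startswith("<!--"):
        raise ValueError("expected input to start with a comment")
    return s[comment_end(s, 0):]


payload = "<!-- x --!><img src=x onerror=alert(1)> -->"

server_view = markup_after_comment(payload, comment_end_legacy)
browser_view = markup_after_comment(payload, comment_end_html5)

print(repr(server_view))   # '' : the whole string is one comment
print(repr(browser_view))  # '<img src=x onerror=alert(1)> -->'
```

Under the legacy model there is nothing left to sanitize, while under the HTML 5 model the `<img>` element sits outside the comment.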
Browser and Version Differentials Beyond Specification Version
Even if a server-side sanitizer correctly implemented the full HTML 5 specification, additional failure modes remain:
- Cross-browser differentials: HTML 5 parsing is not perfectly uniform across major browsers. XSS payloads have been observed that trigger in Chrome but not Firefox, or vice versa, due to subtle implementation differences in how each engine handles ambiguous markup.
- Browser version differentials: Because HTML is an evolving standard, a user running an outdated browser version may parse the same string differently than a current version. A server sanitizing for today’s spec cannot account for yesterday’s renderer on the victim’s machine.
Actionable Takeaways
- Audit all server-side sanitizers in your PHP stack to determine whether they use PHP's built-in HTML parser (backed by libxml2). Any sanitizer relying on this parser for HTML 5 content is structurally vulnerable to comment-differential and related bypasses regardless of its allow-list configuration. Replace or supplement with a client-side sanitization layer.
- When evaluating a sanitizer's security posture, explicitly test round-trip parsing behavior: serialize the sanitized DOM back to a string, re-parse it in a target browser, and inspect the resulting DOM for any injected nodes that were not present after the initial sanitization pass. This directly tests for mXSS exposure.
- Treat parser differentials as a class of vulnerability, not a one-off CVE. When a new HTML 5 feature is adopted in your application's allowed element or attribute set, re-evaluate whether your server-side parser correctly implements the relevant HTML 5 specification for that feature.
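One way to treat differentials as a class is to run a corpus of payloads through every parser in the pipeline and flag any disagreement. The sketch below shows only the comparison harness. `find_divergences` is a hypothetical helper, and `legacy_tags` and `modern_tags` are regex stand-ins for what would, in a real audit, be the server-side parser and a headless browser.

```python
import re
from typing import Callable, Dict, List

Parser = Callable[[str], List[str]]  # returns the element names a parser sees

TAG_NAME = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)")


def legacy_tags(html: str) -> List[str]:
    """Stand-in for an HTML 4-era parser: comments end only at '-->'."""
    without_comments = re.sub(r"<!--.*?(?:-->|$)", "", html, flags=re.S)
    return TAG_NAME.findall(without_comments)


def modern_tags(html: str) -> List[str]:
    """Stand-in for an HTML 5 parser: '--!>' also ends a comment."""
    without_comments = re.sub(r"<!--.*?(?:--!?>|$)", "", html, flags=re.S)
    return TAG_NAME.findall(without_comments)


def find_divergences(payloads, parsers: Dict[str, Parser]):
    """Return (payload, per-parser results) for every payload the parsers disagree on."""
    divergent = []
    for payload in payloads:
        results = {name: parse(payload) for name, parse in parsers.items()}
        if len({tuple(tags) for tags in results.values()}) > 1:
            divergent.append((payload, results))
    return divergent


corpus = [
    "<p>hi</p><!-- note -->",
    "<!-- x --!><img src=x onerror=alert(1)> -->",
]
for payload, results in find_divergences(corpus, {"legacy": legacy_tags, "modern": modern_tags}):
    print(payload, results)
```

Only the second payload is reported, because the two stand-ins agree on the first. Swapping in real parsers keeps the harness unchanged.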
Common Pitfalls
- Assuming that fixing a specific bypass payload closes the vulnerability class. Multiple PHP sanitizers received reports of differential-based bypasses, and several either closed advisories without fully addressing the root cause or argued that non-default configurations enabling the vulnerable feature were the user's responsibility.
- Conflating "parses HTML 5" with "parsing is equivalent to the victim's browser." A third-party library that claims HTML 5 support may still produce parse trees that diverge from Chrome or Firefox under ambiguous or edge-case markup.
The Cascade Effect — How One Parser Bug Compromises Many Applications
One of the most striking findings in this research is not a single CVE in a single library — it is the structural amplification pattern that causes one parsing differential to cascade silently through an entire ecosystem of PHP-based sanitizers and into the applications that depend on them.
The Amplification Pyramid
The cascade follows a straightforward dependency chain, but its security implications are severe:
- libxml2[1] — A widely used C library for parsing XML and HTML. Multiple language-level HTML parsers are built on top of it.
- PHP’s built-in HTML parser[5] — PHP ships a native HTML parser that internally uses libxml2. Because it is built-in and convenient, it is the natural default for any PHP developer building a sanitizer.
- Third-party PHP sanitizers — Libraries like DOM Sanitizer[6] (used in Grav), HTMLPurifier[2], and others used in platforms like Magento 2 and TYPO3[3] all leverage PHP’s built-in parser as their first-pass parsing step.
- End applications — Any application that depends on one of these sanitizers inherits the parser’s behavioral quirks.
One issue in libxml2 propagates upward to PHP’s parser, fans out horizontally across every sanitizer using that parser, and then fans out further to every application using those sanitizers.
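The fan-out can be made concrete as a breadth-first walk over a dependency graph. The edges below are a simplified reading of the chain described above, and node labels such as "sanitizer used in Magento 2" and "(apps using HTMLPurifier)" are placeholders for illustration, not real package names.

```python
from collections import deque

# Edges point from a component to the things built directly on top of it.
# Simplified from the chain described in the text; labels are illustrative.
DEPENDENTS = {
    "libxml2": ["PHP built-in HTML parser"],
    "PHP built-in HTML parser": [
        "DOM Sanitizer",
        "HTMLPurifier",
        "sanitizer used in Magento 2",
        "sanitizer used in TYPO3",
    ],
    "DOM Sanitizer": ["Grav"],
    "HTMLPurifier": ["(apps using HTMLPurifier)"],
    "sanitizer used in Magento 2": ["Magento 2"],
    "sanitizer used in TYPO3": ["TYPO3"],
}


def blast_radius(root, dependents=DEPENDENTS):
    """Return every component that transitively builds on `root`, in BFS order."""
    seen = {root}
    order = []
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


for component in blast_radius("libxml2"):
    print(component)
```

Running the same walk from "HTMLPurifier" returns a single entry, which shows how much wider the exposure is when the issue sits at the bottom of the chain.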
libxml2 and PHP Built-In Parser Cascade — Single Bug, Multiple Sanitizer Impact
Multiple PHP sanitizers (DOM Sanitizer/Grav, HTMLPurifier, Magento 2, TYPO3) all call PHP’s built-in HTML parser, which internally uses libxml2. Because libxml2 does not support HTML 5, its HTML 4-era parsing model creates a differential against every modern browser. The attacker crafts a comment-based payload exploiting the HTML 4/HTML 5 comment termination differential. Each sanitizer receives the payload, parses via libxml2, sees only a comment node, and passes the payload through unchanged. The victim’s browser re-parses under HTML 5, terminates the comment early, and renders the injected element as live HTML. All affected sanitizers were bypassed with the same payload class because the root cause was not in any sanitizer’s own logic — it was in the shared libxml2 dependency.
Disclosure Outcomes: The Blame Deflection Chain
The responsible disclosure process revealed systemic accountability gaps across the dependency chain:
- libxml2: Acknowledged they never claimed HTML 5 support. Directed findings downstream. An open discussion remains unresolved due to maintenance capacity issues.
- PHP: Acknowledged lack of HTML 5 support. Added a documentation warning advising against using the built-in parser for sanitization. PHP 8.4 later introduced HTML 5 parsing via a different underlying parser, but this does not resolve the fundamental problem of server-side sanitization.
- Third-party HTML 5 parser library: A library explicitly claiming HTML 5 support was also found to have differentials. Maintainers declined to fix it, citing that sanitization use was outside their scope.
- DOM Sanitizer (Grav)[6]: Vulnerable in default configuration. Maintainers claimed to fix it and closed the advisory — making it non-public — but the original payloads remained exploitable. Requests to reopen the advisory received no reply.
- HTMLPurifier[2]: Vulnerable under non-default configuration (e.g., when comments are explicitly enabled). Maintainer position was that enabling comments is equivalent to the user allowing XSS. No resolution.
- Magento 2: Vulnerable only in a non-default configuration, which is not used that way in production. Not fixed, citing limited capacity and the position that the parser differential is “not actionable.”
- TYPO3[3]: Vulnerable in default configuration. The only project that fixed the issue and assigned a CVE.
DOM Sanitizer (Grav CMS) Default Configuration Bypass
DOM Sanitizer was bypassable in its default configuration — no special developer settings required to reach the vulnerable state. The researcher’s colleague found an initial bypass. Within minutes, the researcher identified multiple additional bypasses using the same HTML 4/HTML 5 comment differential. A private GitHub security advisory was opened with specific payloads and root cause explanation. The vendor claimed a fix, closed the advisory (making it non-public), and no CVE was issued. Retesting confirmed the exact same payloads remained exploitable after the claimed fix. Requests to reopen the advisory received no response.
HTMLPurifier Non-Default Comment Allowlist Bypass
HTMLPurifier[2] is not vulnerable in its default configuration. The operator must explicitly enable HTML comments. With comments allowed, an attacker submits a payload using an HTML 5-specific comment terminator. libxml2 (via PHP’s parser) sees a complete comment; the browser terminates the comment early and renders the injected element. The maintainer argued that enabling comments in configuration is equivalent to the operator permitting XSS. The researchers challenged this — a user enabling comments expects to allow only comments, not XSS. No reply received, no fix issued.
TYPO3 Default Configuration XSS Bypass and CVE Assignment
TYPO3[3] used a PHP-based sanitizer relying on PHP’s built-in HTML parser (libxml2). Vulnerable in default configuration — no administrator misconfiguration required. Same comment-differential payload: libxml2 sees a comment node; the HTML 5 browser terminates the comment early and renders the injected element as live HTML. TYPO3 acknowledged the finding, produced a fix, and assigned a CVE — the only project in the entire disclosure chain to do so.
Actionable Takeaways
- Audit your dependency tree for any PHP-based HTML sanitizer that uses PHP's built-in HTML parser (which is backed by libxml2). Treat any such sanitizer as structurally untrustworthy for inputs that will be rendered in modern browsers, regardless of whether a specific CVE has been assigned.
- When evaluating third-party sanitizer libraries, check not only the library's own security record but also the parsing backend it delegates to. Treat closed-without-public-detail advisories (as seen with DOM Sanitizer) as a red flag for supply chain risk assessment.
- If your organization maintains applications that use PHP sanitizers affected by this cascade, treat client-side re-sanitization with DOMPurify as a compensating control until a structural architectural change can be implemented.
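A first pass at the audit described above can be a simple source scan for calls into PHP's built-in parser. The sketch below assumes the code reaches the parser through `DOMDocument::loadHTML` or `DOMDocument::loadHTMLFile`. The helper names `find_loadhtml_calls` and `scan_tree` are hypothetical, and a regex scan like this will miss indirect calls, so treat a clean result as a starting point rather than proof.

```python
import re
from pathlib import Path

# Matches `->loadHTML(` and `->loadHTMLFile(`; PHP method names are case-insensitive.
LOAD_HTML_CALL = re.compile(r"->\s*loadHTML(?:File)?\s*\(", re.IGNORECASE)


def find_loadhtml_calls(php_source: str):
    """Return (line_number, stripped_line) for each line that calls loadHTML/loadHTMLFile."""
    return [
        (number, line.strip())
        for number, line in enumerate(php_source.splitlines(), start=1)
        if LOAD_HTML_CALL.search(line)
    ]


def scan_tree(root):
    """Scan every .php file under `root` and map file path -> list of hits."""
    hits = {}
    for path in Path(root).rglob("*.php"):
        found = find_loadhtml_calls(path.read_text(encoding="utf-8", errors="replace"))
        if found:
            hits[str(path)] = found
    return hits


sample = """<?php
$doc = new DOMDocument();
$doc->loadHTML($userInput);
echo $doc->saveHTML();
"""
print(find_loadhtml_calls(sample))  # [(3, '$doc->loadHTML($userInput);')]
```

Pointing `scan_tree` at a vendor directory as well as first-party code helps surface sanitizers that delegate to the built-in parser on your behalf.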
Common Pitfalls
- Assuming that a sanitizer fix at one layer in the dependency chain resolves the vulnerability for all downstream consumers. A "fixed" status at the library level does not guarantee that applications using that library are protected. Independent verification of fix efficacy against the original payload set is required.
- Conflating "the parser was not designed for sanitization" with "the sanitizer is therefore safe to use." Sanitizers built on top of PHP's parser and libxml2 inherit their parsing behavior unconditionally.
Round-Trip Parsing, Mutation XSS, and the Limits of Server-Side Context
Even after controlling for HTML version differentials, server-side sanitization remains fundamentally broken at the architectural level. The speaker explicitly addresses this scenario — assuming identical parsers, identical HTML versions, and identical browser targets — and demonstrates that critical failure modes still exist.
The Parsing Round-Trip Divergence Problem
The core structural failure works as follows:
- The server receives untrusted input as a raw string.
- The server parses that string into a DOM tree object.
- The sanitizer iterates over the DOM tree and removes disallowed nodes and attributes, producing a clean DOM.
- To transmit the sanitized content to the client, the server must serialize that DOM tree back into an HTML string.
- The client receives that string and re-parses it into its own DOM.
The critical flaw is in step 5. Re-parsing the serialized output does not guarantee the same DOM structure. The browser may produce a materially different page from what the server’s sanitizer approved. As the speaker states: “it does not actually mean that they will have the same outcome — they can have a totally different page with suddenly malicious input in it.” This is not a bug in any specific implementation; it is the defined behavior of the HTML specification.
Round-Trip Parsing Divergence in Practice
Even with identical parsers on server and client, the parse-sanitize-serialize-reparse cycle can produce a materially different DOM. The server parses untrusted input into a DOM tree, sanitizes it (removing all disallowed nodes), then serializes the clean DOM back to an HTML string for transmission. The client receives the string and re-parses it. Due to HTML round-trip parsing divergence, the re-parsed DOM is not identical to the server’s sanitized DOM. Markup that was inert in the server DOM becomes an active element in the browser DOM. The unexpected element fires its event handler, executing attacker JavaScript in the victim’s session. This is explicitly documented HTML specification behavior and is the foundation of mutation XSS (mXSS) techniques.
Mutation XSS Bypass Techniques
Mutation XSS (mXSS) builds on round-trip divergence. It exploits the fact that certain HTML strings, when parsed and re-serialized by a browser’s own engine, mutate into a form that was not present in the original input. The speaker describes seeing recent mXSS discoveries, indicating this is an active and evolving area of exploitation that continues to produce novel bypasses even against well-maintained sanitizers.
Context-Dependent Parsing
Parsing behavior changes depending on the context in which markup is parsed — for example, whether a fragment is parsed as a child of a <div>, a <table>, or an <svg>. A server-side sanitizer may parse a fragment in one context while the browser renders it in a different context, producing divergent DOM trees from identical input strings.
The Architectural Conclusion
The compound effect of specification version differentials, cross-browser implementation variance, browser version variance, round-trip parsing instability, and context-dependent parsing means there is no server-side configuration that can reliably guarantee parse-time equivalence with the victim’s render-time environment. Each is an independent axis of failure. An attacker needs to exploit only one. A defender must close all of them simultaneously — an inherently asymmetric problem that server-side sanitization cannot structurally solve.
Actionable Takeaways
- Treat server-side sanitization as a defense-in-depth measure only, never as the primary or sole XSS control. Architect your pipeline so that final sanitization occurs client-side, immediately before content is inserted into the live DOM.
- When auditing applications that perform server-side HTML sanitization, explicitly test for round-trip divergence by comparing the DOM produced by the server's parser against the DOM produced after the browser re-parses the serialized output.
- Track mXSS research actively, as mutation-based bypasses represent an evolving threat class that continues to produce novel bypasses even against well-maintained sanitizers. Subscribe to advisories for any client-side sanitization libraries in use (such as DOMPurify).
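If final sanitization moves to the client, the server's remaining job is to deliver untrusted HTML as inert data rather than as markup. One common way to do that, sketched below as an assumption about how such a pipeline might look, is to embed the content as JSON with `<`, `>`, and `&` escaped so it cannot break out of the surrounding script element. `to_inline_json` and the `comment-data` element id are hypothetical names. The client would then read the JSON and sanitize the string, for example with DOMPurify, before inserting it into the DOM.

```python
import json


def to_inline_json(value) -> str:
    """Serialize `value` as JSON that is safe to place inside a <script> element."""
    text = json.dumps(value)
    # These characters only ever appear inside JSON strings, so replacing them
    # with \uXXXX escapes keeps the document valid JSON with the same meaning.
    for char, escaped in (("<", "\\u003c"), (">", "\\u003e"), ("&", "\\u0026")):
        text = text.replace(char, escaped)
    return text


record = {"comment_html": "</script><img src=x onerror=alert(1)>"}
inline = to_inline_json(record)

# The payload can no longer close the data block early.
page = f'<script id="comment-data" type="application/json">{inline}</script>'
print(page)

# Parsing the embedded JSON yields the original, still untrusted string.
assert json.loads(inline) == record
```

The string that comes back out is exactly as dangerous as the one that went in. The point of the escaping is only to get it to the browser intact, so that the sanitizer runs against the parser that will actually render it.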
Common Pitfalls
- Assuming that upgrading a server-side parser to full HTML 5 compliance resolves the sanitization bypass problem. Round-trip divergence and mXSS persist regardless of HTML version alignment between server and client.
- Treating a sanitizer's "fix" to a reported bypass as a reliable signal of security. The speaker describes a case where a sanitizer vendor closed an advisory claiming a fix, but the exact same payloads remained bypassable.
Client-Side Sanitization Best Practices and the Browser-Native Sanitizer API
The core argument for client-side XSS sanitization stems from a fundamental mismatch in where the vulnerability is triggered versus where the mitigation is applied. XSS is not triggered on the server — it is triggered on the victim’s browser. Sanitizing at the server means applying a mitigation at the wrong point in the attack chain.
The speaker makes a precise distinction between the three nodes in the XSS attack flow: the attacker (who generates untrusted input), the server (which hosts the page), and the victim (whose browser renders and executes it). For other server-side vulnerability classes — SQL injection, SSRF, RCE — sanitizing on the server is logical because the vulnerability is triggered there. XSS is different. The only defensible position is to sanitize where rendering actually occurs — on the victim’s client.
Current Recommendation: DOMPurify on the Client
For security engineers making deployment decisions today, the speaker recommends DOMPurify[7] as the most robust available option for client-side sanitization:
- It is explicitly designed and documented for client-side use, though it can technically be used server-side. The maintainers themselves recommend client-side deployment.
- As a third-party library, it is not immune to bypasses. Recent bypasses have been found, including at least one described as significant.
- Despite its limitations, DOMPurify operating in the browser parse context eliminates the class of parser differential attacks that affect server-side sanitizers.
Future Architecture: The Browser-Native Sanitizer API
The longer-term solution is the Sanitizer API[4], a browser-native client-side sanitization interface currently under development. Its key architectural advantages:
- No parser differential: Built by the same browsers that render HTML, it uses the same parsing logic. The entire class of server-vs-client parsing mismatches is eliminated by design.
- Context awareness: Being native to the browser, it can be context-aware in ways external libraries cannot replicate — parsing behavior adapts to the actual insertion point in the DOM.
- No round-trip parsing: Eliminates the serialization/re-parsing cycle that creates mutation XSS bypass opportunities. Sanitization happens on the DOM tree directly, without an intermediate string form.
At the time of the talk, the Sanitizer API was available under a feature flag in Firefox and had been partially rolled out in Chrome, though still under active development. Once stable and widely available, its accessibility is expected to drive developers to adopt client-side sanitization by default — making the architecturally correct approach also the easiest one.
Actionable Takeaways
- Migrate sanitization to the client side using DOMPurify, deployed and executed in the victim's browser at the point of rendering. Remove or deprecate any server-side HTML sanitization logic that relies on PHP's built-in HTML parser or libxml2. Treat server-side sanitization as an additional defense layer only, not the primary control.
- Audit your current sanitizer configurations, both server-side and client-side, for non-default settings that enable comments, exotic HTML 5 elements, or other features implicated in parser differential attacks. Restrict allowed HTML to the minimum necessary feature set.
- Track the Sanitizer API specification and browser implementation status and build a migration path into your future-state architecture. Plan to replace DOMPurify with the native Sanitizer API once it reaches stable availability across your target browser matrix.
Common Pitfalls
- Assuming that upgrading a server-side HTML parser to one that supports HTML 5 resolves the underlying vulnerability class. PHP 8.4 introduced an HTML 5-capable parser, but server-side parsing still suffers from browser version differentials, round-trip parsing inconsistencies, and mutation XSS bypass techniques.
- Treating a sanitizer vendor's closure of a security advisory as confirmation that the issue is resolved. At least one sanitizer (DOM Sanitizer) closed the advisory without actually fixing the reported bypasses, and the original payloads remained exploitable after the claimed fix.
Conclusion
Server-side HTML sanitization is not merely flawed in implementation — it is flawed in architecture. The research presented by Yaniv Nizry demonstrates that the parse-filter-serialize model breaks down across multiple independent axes: HTML specification version differentials, cross-browser parsing variance, browser version drift, round-trip serialization divergence, and mutation XSS. Each axis is an independent point of failure, and an attacker needs to exploit only one. The responsible disclosure chain — from libxml2 through PHP through multiple sanitizer libraries — revealed that no single vendor owns the problem, and most declined to fix it.
The path forward is clear: sanitize where content is rendered. DOMPurify[7] is today’s best option for application security teams deploying client-side sanitization. The browser-native Sanitizer API[4] represents the structural solution — built into the rendering engine itself, eliminating parser differentials and round-trip serialization by design. Security engineers should begin migrating sanitization logic to the client side now, treat server-side sanitization as defense-in-depth only, and track the Sanitizer API’s progress toward stable browser support.
For deeper context on the XSS vulnerability class and related defense strategies, explore our coverage of application security.
References & Tools
- [1] libxml2: A shared C library for parsing XML and HTML, used by PHP's built-in HTML parser. Does not support HTML 5 parsing specifications.
- [2] HTMLPurifier: A widely used PHP HTML sanitizer library found vulnerable to comment-differential bypasses under non-default configuration.
- [3] TYPO3 Security: The only vendor in the disclosure chain that acknowledged the default-configuration vulnerability, issued a fix, and assigned a CVE.
- [4] Sanitizer API (MDN): Browser-native client-side sanitization API under active development in Chrome and Firefox.
- [5] PHP DOMDocument: PHP's built-in HTML parser class that internally uses libxml2; documentation now warns against use for sanitization.
- [6] Grav CMS: Open-source flat-file CMS using DOM Sanitizer, found vulnerable in default configuration to HTML 4/HTML 5 parser differential bypasses.
- [7] DOMPurify: The most robust current option for client-side HTML sanitization; recommended for deployment in the victim's browser at the point of rendering.