Module tokenizer

Available on crate features html and http only.
A byte-faithful, low-allocation HTML tokenizer.

The tokenizer scans HTML into a stream of StartTag, EndTag, Text, Comment and Doctype events delivered to a TokenSink. It is the substrate for rama’s streaming HTML rewriting: token views borrow the input (no per-token allocation), and every byte of the input belongs to exactly one token’s raw() span, so an unmodified pass re-serializes to byte-identical output.
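The byte-identity property described above can be illustrated with a toy splitter. This is a minimal sketch and not rama's API: `Kind`, `Span`, `split` and `reserialize` are hypothetical names invented for this example, and the tag detection is deliberately naive (no attributes, comments, or text modes). It only shows the two ideas from the paragraph: spans borrow the input, and the spans partition the input so that concatenating them reproduces it byte for byte.

```rust
/// Hypothetical token kinds for this sketch only; not rama's types.
#[derive(Debug, PartialEq)]
enum Kind {
    Tag,
    Text,
}

/// A borrowed view into the input: no per-token allocation.
#[derive(Debug)]
struct Span<'a> {
    kind: Kind,
    raw: &'a [u8],
}

/// Toy splitter: anything from `<` through the next `>` is a Tag,
/// everything else is Text. An unterminated `<...` is emitted as Text
/// so that no byte is ever dropped.
fn split(input: &[u8]) -> Vec<Span<'_>> {
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < input.len() {
        let rest = &input[pos..];
        let (kind, len) = if rest[0] == b'<' {
            match rest.iter().position(|&b| b == b'>') {
                Some(end) => (Kind::Tag, end + 1),
                None => (Kind::Text, rest.len()),
            }
        } else {
            // Text runs up to (not including) the next `<`.
            let len = rest.iter().position(|&b| b == b'<').unwrap_or(rest.len());
            (Kind::Text, len)
        };
        out.push(Span { kind, raw: &rest[..len] });
        pos += len;
    }
    out
}

/// Concatenates the raw spans in order.
fn reserialize(spans: &[Span<'_>]) -> Vec<u8> {
    spans.iter().flat_map(|s| s.raw.iter().copied()).collect()
}

fn main() {
    let input = b"<p class=x>a &amp; b</p><br";
    let spans = split(input);
    let tags = spans.iter().filter(|s| s.kind == Kind::Tag).count();
    println!("{} spans, {} tags", spans.len(), tags);
    // Every input byte belongs to exactly one span, so the round trip is exact.
    assert_eq!(reserialize(&spans), input.to_vec());
}
```

Note that `&amp;` passes through untouched in the sketch, mirroring the point that text is exposed as raw bytes rather than decoded.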

Unlike a DOM parser, it builds no tree and decodes no character references: text and attribute values are exposed as raw bytes.

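It is resumable: input can be fed in chunks with write and finished with end. It handles the HTML text modes (<script> / <style> / <textarea> / <title> / <plaintext> / …) plus the SVG/MathML foreign-content context needed to distinguish real CDATA from bogus comments. The byte-identity property (an unmodified pass re-serializes to the exact input bytes) holds for all input, including malformed input.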
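To make the resumable model concrete, here is a minimal sketch of a chunk-fed scanner. It is not rama's `Tokenizer`: `ChunkScanner` is a hypothetical type, and the signatures of its `write` and `end` methods are assumptions chosen only to mirror the names mentioned above. It owns its output for simplicity, whereas the real tokenizer is described as delivering borrowed views to a sink. The point illustrated is that an unfinished construct is carried over between chunks, and that concatenating the emitted spans still reproduces the input.

```rust
/// Hypothetical resumable scanner for this sketch only; not rama's Tokenizer.
struct ChunkScanner {
    /// Bytes of an unfinished `<...` carried across chunk boundaries.
    pending: Vec<u8>,
    /// Completed token spans (owned here for simplicity).
    out: Vec<Vec<u8>>,
}

impl ChunkScanner {
    fn new() -> Self {
        ChunkScanner { pending: Vec::new(), out: Vec::new() }
    }

    /// Feeds one chunk. Complete spans are emitted; an unfinished tag waits.
    /// In this toy, a text run may be split at a chunk boundary.
    fn write(&mut self, chunk: &[u8]) {
        self.pending.extend_from_slice(chunk);
        let mut pos = 0;
        loop {
            let rest = &self.pending[pos..];
            if rest.is_empty() {
                break;
            }
            let len = if rest[0] == b'<' {
                match rest.iter().position(|&b| b == b'>') {
                    Some(end) => end + 1,
                    None => break, // tag not finished yet: wait for more input
                }
            } else {
                rest.iter().position(|&b| b == b'<').unwrap_or(rest.len())
            };
            self.out.push(rest[..len].to_vec());
            pos += len;
        }
        // Keep only the unconsumed tail for the next call.
        self.pending.drain(..pos);
    }

    /// Signals end of input and flushes any unfinished tail as a final span,
    /// so no byte is lost even when the input is truncated.
    fn end(mut self) -> Vec<Vec<u8>> {
        if !self.pending.is_empty() {
            let tail = std::mem::take(&mut self.pending);
            self.out.push(tail);
        }
        self.out
    }
}

fn main() {
    let mut scanner = ChunkScanner::new();
    // The end tag `</p>` is split across two chunks.
    scanner.write(b"<p>he");
    scanner.write(b"llo</");
    scanner.write(b"p>");
    let tokens = scanner.end();
    assert_eq!(tokens.concat(), b"<p>hello</p>".to_vec());
}
```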

Structs§

Attribute
A single attribute view.
Attributes
Iterator over a StartTag’s attributes.
Cdata
A CDATA section, e.g. <![CDATA[ x ]]> (only emitted inside foreign content — SVG/MathML; elsewhere <![CDATA[ is a bogus comment).
Comment
A comment, e.g. <!-- hi -->.
Doctype
A document type declaration, e.g. <!DOCTYPE html>.
EndTag
An end tag, e.g. </a>.
LocalNameHash
A 64-bit hash of an ASCII-lowercased tag name.
ParsingAmbiguityError
Raised when text-mode context can’t be determined in strict mode.
StartTag
A start tag, e.g. <a href="/x"> or <br/>.
Text
A run of character data (text). Raw bytes, not entity-decoded.
Tokenizer
A byte-faithful, low-allocation, resumable HTML tokenizer.

Enums§

HtmlTag
A classified HTML tag name.

Traits§

TokenSink
Receives token events as the tokenizer scans HTML.

Functions§

tokenize
Tokenizes input in one pass with the default (non-strict) tokenizer, dispatching events to sink.