Module tokenizer
Available on crate features html and http only.
A byte-faithful, low-allocation HTML tokenizer.
The tokenizer scans HTML into a stream of StartTag, EndTag,
Text, Comment and Doctype events delivered to a
TokenSink. It is the substrate for rama’s streaming HTML rewriting:
token views borrow the input (no per-token allocation), and every byte
of the input belongs to exactly one token’s raw() span, so an
unmodified pass re-serializes to byte-identical output.
Unlike a DOM parser it builds no tree and decodes no character references — text and attribute values are exposed as raw bytes.
It is resumable (write + end) and handles HTML text modes
(<script> / <style> / <textarea> / <title> / <plaintext> / …)
plus the SVG/MathML foreign-content context needed to distinguish real
CDATA from bogus comments. The identity property holds for all input.
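The identity property described above can be illustrated with a deliberately tiny scanner. This is a minimal sketch, not the module's implementation: `Kind`, `Span`, `split_spans` and `reserialize` are hypothetical names, and the only rule is "markup runs from `<` through the next `>`", with none of the text modes or foreign-content handling. It shows only that when every input byte lands in exactly one borrowed raw span, concatenating the spans reproduces the input.

```rust
/// Kind of a toy token. Hypothetical: far coarser than the module's real events.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Kind {
    /// Anything from `<` through the next `>` (or the end of input).
    Markup,
    /// Everything else.
    Text,
}

/// A token view that borrows the input; `raw` is the exact source span.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
struct Span<'a> {
    kind: Kind,
    raw: &'a [u8],
}

/// Splits `input` into spans that cover it exactly once, in order.
fn split_spans(input: &[u8]) -> Vec<Span<'_>> {
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < input.len() {
        let (kind, end) = if input[pos] == b'<' {
            // Markup includes the closing `>`; unterminated markup takes the rest.
            let end = input[pos..]
                .iter()
                .position(|&b| b == b'>')
                .map_or(input.len(), |i| pos + i + 1);
            (Kind::Markup, end)
        } else {
            // Text runs up to, but not including, the next `<`.
            let end = input[pos..]
                .iter()
                .position(|&b| b == b'<')
                .map_or(input.len(), |i| pos + i);
            (Kind::Text, end)
        };
        out.push(Span { kind, raw: &input[pos..end] });
        pos = end;
    }
    out
}

/// Re-serializes by concatenating the raw spans without modification.
fn reserialize(spans: &[Span<'_>]) -> Vec<u8> {
    spans.iter().flat_map(|s| s.raw.iter().copied()).collect()
}
```

Because no character references are decoded, a text span such as `Tom &amp; Jerry` comes back with the `&amp;` intact, and `reserialize(&split_spans(input))` equals `input` even for malformed input such as a trailing unterminated `<br`.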
Structs§
- Attribute
- A single attribute view.
- Attributes
- Iterator over a StartTag’s attributes.
- Cdata
- A CDATA section, e.g. <![CDATA[ x ]]> (only emitted inside SVG/MathML foreign content; elsewhere <![CDATA[ is a bogus comment).
- Comment
- A comment, e.g. <!-- hi -->.
- Doctype
- A document type declaration, e.g. <!DOCTYPE html>.
- EndTag
- An end tag, e.g. </a>.
- LocalNameHash
- A 64-bit hash of an ASCII-lowercased tag name.
- ParsingAmbiguityError
- Raised when text-mode context can’t be determined in strict mode.
- StartTag
- A start tag, e.g. <a href="/x"> or <br/>.
- Text
- A run of character data (text). Raw bytes, not entity-decoded.
- Tokenizer
- A byte-faithful, low-allocation, resumable HTML tokenizer.
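To make the LocalNameHash entry concrete, here is a minimal sketch of hashing an ASCII-lowercased tag name to 64 bits. The hash function the module actually uses is not stated on this page; FNV-1a is swapped in purely as an assumption, and `local_name_hash_sketch` is a hypothetical name.

```rust
/// FNV-1a 64-bit offset basis and prime (published constants).
const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// Hashes a tag name after ASCII-lowercasing each byte.
/// Assumption: FNV-1a stands in for whatever function the module really uses.
fn local_name_hash_sketch(name: &[u8]) -> u64 {
    name.iter().fold(FNV_OFFSET, |hash, &b| {
        // XOR in the lowercased byte, then multiply by the prime (wrapping).
        (hash ^ u64::from(b.to_ascii_lowercase())).wrapping_mul(FNV_PRIME)
    })
}
```

Lowercasing inside the hash means `DIV`, `Div` and `div` all map to the same value, so a tag name can be compared against a precomputed constant without allocating a lowercased copy.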
Enums§
- HtmlTag
- A classified HTML tag name.
Traits§
- TokenSink
- Receives token events as the tokenizer scans HTML.
Functions§
- tokenize
- Tokenizes input in one pass with the default (non-strict) tokenizer, dispatching events to sink.
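The one-pass, sink-driven shape described above can be sketched as follows. Nothing here is the module's real API: `SketchSink`, `tokenize_sketch` and `Stats` are hypothetical, the real TokenSink method names and token types are not shown on this page, and the toy rules treat `</...>` as an end tag and any other `<...>` (comments and doctypes included) as a start tag.

```rust
/// Hypothetical sink trait: default no-op handlers, so an implementor
/// overrides only the events it cares about.
trait SketchSink {
    fn start_tag(&mut self, _raw: &[u8]) {}
    fn end_tag(&mut self, _raw: &[u8]) {}
    fn text(&mut self, _raw: &[u8]) {}
}

/// One-pass driver: scans `input` and hands each borrowed span to `sink`.
fn tokenize_sketch<S: SketchSink>(input: &[u8], sink: &mut S) {
    let mut pos = 0;
    while pos < input.len() {
        let rest = &input[pos..];
        let len = if rest[0] == b'<' {
            // Markup runs through `>`; unterminated markup takes the rest.
            let len = rest
                .iter()
                .position(|&b| b == b'>')
                .map_or(rest.len(), |i| i + 1);
            if rest.starts_with(b"</") {
                sink.end_tag(&rest[..len]);
            } else {
                sink.start_tag(&rest[..len]);
            }
            len
        } else {
            // Text runs up to the next `<` or the end of input.
            let len = rest.iter().position(|&b| b == b'<').unwrap_or(rest.len());
            sink.text(&rest[..len]);
            len
        };
        pos += len;
    }
}

/// Example sink: counts tags and sums the length of text runs.
#[derive(Default)]
struct Stats {
    start_tags: usize,
    end_tags: usize,
    text_bytes: usize,
}

impl SketchSink for Stats {
    fn start_tag(&mut self, _raw: &[u8]) {
        self.start_tags += 1;
    }
    fn end_tag(&mut self, _raw: &[u8]) {
        self.end_tags += 1;
    }
    fn text(&mut self, raw: &[u8]) {
        self.text_bytes += raw.len();
    }
}
```

A caller would create `Stats::default()`, pass it to `tokenize_sketch` together with the input bytes, and read the counters afterwards. The sink only ever sees slices of the input, which mirrors the borrow-not-allocate design the module describes.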