Implement HTML charset parsing project
Simon Sapin edited this page Feb 17, 2020 · 3 revisions
Background information: Major browsers support parsing HTML content that does not declare its character encoding in the HTTP Content-Type header but declares it inline in the page in a `<meta>` element instead. This causes the bytes of the page to be reinterpreted in the requested character encoding. The goal of this project is to implement support for this delayed encoding interpretation in Servo as well, which will increase the number of passing tests and improve compatibility with existing web content that relies on this feature.
Tracking issue: (please ask questions in these issues)
Useful references:
- Guide to getting Servo building
- Documentation for types and modules inside Servo
Initial steps:
- email the mozilla.dev.servo mailing list (be sure to subscribe to it first!) introducing your group and asking any necessary questions
- create a new `prescan.rs` module in the html5ever repository and implement the byte stream prescanning algorithm.
  - add a new public function which accepts a `&[u8]` argument and returns `Result<&'static Encoding, AbortReason>`, where AbortReason is an enum representing `not enough bytes` or `no encoding detected within the first 1024 bytes`.
  - use Encoding::for_label to convert a named charset into an Encoding value
- add unit tests that cover success and failure cases for the algorithm (use `cargo test prescan` to run tests defined in the new `prescan.rs` module)
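The shape of the prescanning function described above could look roughly like the following sketch. It is a deliberately simplified stand-in, not the full prescan algorithm from the HTML specification: it only looks for a `charset=` attribute inside a closed `<meta ...>` tag and ignores comments, other tags, and the `http-equiv`/`content` pairing rules. To stay free of external crates it returns the charset label as a `String`; the real function would pass that label to `Encoding::for_label` and return `&'static Encoding`. The names `prescan_label`, `PRESCAN_LIMIT`, `find`, and `charset_in_tag` are hypothetical.

```rust
/// Why prescanning stopped without a result (the two cases named in the text above).
#[derive(Debug, PartialEq)]
pub enum AbortReason {
    /// Fewer than 1024 bytes are available and no declaration was found yet.
    NotEnoughBytes,
    /// The first 1024 bytes contain no usable declaration.
    NoEncodingDetected,
}

/// Only this many leading bytes are examined (hypothetical constant name).
const PRESCAN_LIMIT: usize = 1024;

/// Simplified stand-in for the prescan: returns the lowercased charset label.
/// Assumption: the real function would return `&'static Encoding` instead.
pub fn prescan_label(bytes: &[u8]) -> Result<String, AbortReason> {
    let window = &bytes[..bytes.len().min(PRESCAN_LIMIT)];
    let lower: Vec<u8> = window.iter().map(|b| b.to_ascii_lowercase()).collect();
    let mut pos = 0;
    while let Some(offset) = find(&lower[pos..], b"<meta") {
        let tag_start = pos + offset;
        // Only inspect tags that are already closed inside the window.
        let tag_end = match lower[tag_start..].iter().position(|&b| b == b'>') {
            Some(i) => tag_start + i,
            None => break,
        };
        if let Some(label) = charset_in_tag(&lower[tag_start..tag_end]) {
            return Ok(label);
        }
        pos = tag_end + 1;
    }
    if bytes.len() < PRESCAN_LIMIT {
        Err(AbortReason::NotEnoughBytes)
    } else {
        Err(AbortReason::NoEncodingDetected)
    }
}

/// Position of the first occurrence of `needle` in `haystack`.
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Extracts the value following `charset =` inside one tag, if any.
fn charset_in_tag(tag: &[u8]) -> Option<String> {
    let start = find(tag, b"charset")? + b"charset".len();
    let rest = &tag[start..];
    let mut i = 0;
    while i < rest.len() && rest[i].is_ascii_whitespace() {
        i += 1;
    }
    if rest.get(i) != Some(&b'=') {
        return None;
    }
    i += 1;
    while i < rest.len() && rest[i].is_ascii_whitespace() {
        i += 1;
    }
    let value: Vec<u8> = rest[i..]
        .iter()
        .copied()
        .skip_while(|&b| b == b'"' || b == b'\'')
        .take_while(|&b| {
            !b.is_ascii_whitespace() && b != b'"' && b != b'\'' && b != b';' && b != b'/'
        })
        .collect();
    if value.is_empty() {
        None
    } else {
        String::from_utf8(value).ok()
    }
}
```

Unit tests for the real module can follow the same split: one test per success case (for example a bare `charset` attribute and a `content="text/html; charset=..."` form) and one per `AbortReason` variant.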
Subsequent steps:
- Integrate the new prescan algorithm into Servo's HTML parser implementation following the encoding sniffing algorithm:
  - add a Cargo override that uses the locally-modified version of html5ever in Servo's Cargo.toml
  - modify `components/script/dom/servoparser/mod.rs` to create an enum with two states, `Prescanning(Vec<u8>)` and `Detected(NetworkDecoder)`, and replace the `network_decoder` field with this enum
  - in `push_bytes_input_chunk`, if the prescanning case is active then perform prescanning on any existing buffer along with the newest chunk, transitioning into the Detected phase if prescanning completes (and updating the associated Document's encoding with the detected encoding) (step 4)
  - if prescanning does not complete, no parsing should occur in `parse_bytes_chunk`
  - modify `new_inherited` to accept an `Option<&'static Encoding>` argument, which is used as an override that avoids prescanning any input (step 3)
  - when prescanning completes with no detected encoding, check document's browsing context's parent's document's encoding (step 5)
- Verify the failing automated tests pass with the new parser changes
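
The two-state design in the steps above can be sketched in isolation as follows. This is an illustration under stated assumptions, not Servo's code: `NetworkDecoder` here is a stand-in struct that only records a label, `toy_prescan` is a placeholder for the real prescan function (it can be fooled by a label split across chunks), encodings are represented as label strings rather than `&'static Encoding`, and `FALLBACK_LABEL` is an arbitrary placeholder for whatever the later sniffing steps would choose.

```rust
const PRESCAN_LIMIT: usize = 1024;
/// Assumption for this sketch only: used when neither prescan nor parent gives a label.
const FALLBACK_LABEL: &str = "windows-1252";

/// Hypothetical stand-in for Servo's `NetworkDecoder`; it only remembers the label.
#[derive(Debug, PartialEq)]
struct NetworkDecoder {
    label: String,
}

#[derive(Debug, PartialEq)]
enum DecoderState {
    /// Bytes buffered so far while no encoding is known.
    Prescanning(Vec<u8>),
    /// Encoding settled; later chunks go straight to the decoder.
    Detected(NetworkDecoder),
}

/// Toy detector (NOT the real prescan): reads the label after a literal `charset=`.
fn toy_prescan(buf: &[u8]) -> Option<String> {
    let key = b"charset=";
    let start = buf.windows(key.len()).position(|w| w == key)? + key.len();
    let label: Vec<u8> = buf[start..]
        .iter()
        .copied()
        .skip_while(|b| *b == b'"' || *b == b'\'')
        .take_while(|b| b.is_ascii_alphanumeric() || *b == b'-' || *b == b'_')
        .collect();
    if label.is_empty() {
        None
    } else {
        String::from_utf8(label).ok()
    }
}

impl DecoderState {
    /// Mirrors the `new_inherited` override: a known encoding skips prescanning.
    fn new(override_label: Option<&str>) -> Self {
        match override_label {
            Some(label) => DecoderState::Detected(NetworkDecoder {
                label: label.to_string(),
            }),
            None => DecoderState::Prescanning(Vec::new()),
        }
    }

    /// Returns the bytes that may now be parsed, or `None` while still prescanning.
    fn push_bytes_input_chunk(
        &mut self,
        chunk: &[u8],
        parent_label: Option<&str>,
    ) -> Option<Vec<u8>> {
        let buffered = match self {
            DecoderState::Detected(_) => return Some(chunk.to_vec()),
            DecoderState::Prescanning(buf) => {
                buf.extend_from_slice(chunk);
                buf
            }
        };
        let label = match toy_prescan(buffered.as_slice()) {
            Some(label) => label,
            // Not enough bytes yet: keep buffering and parse nothing.
            None if buffered.len() < PRESCAN_LIMIT => return None,
            // Nothing declared in the window: fall back to the parent document.
            None => parent_label.unwrap_or(FALLBACK_LABEL).to_string(),
        };
        // Hand the whole buffer to the parser and switch to the detected state.
        let ready = std::mem::take(buffered);
        *self = DecoderState::Detected(NetworkDecoder { label });
        Some(ready)
    }
}
```

Returning `None` while buffering makes the "no parsing should occur" rule explicit at the call site: the caller only feeds the tokenizer when it receives bytes back.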