NOTE: The current preferred location for bug reports is the GitHub issue tracker.
Bug 848 - make parser conform to current "Changing the encoding while parsing" algorithm
Status: NEW
Product: Validator.nu
Classification: Unclassified
Component: HTML parser
Version: HEAD
Hardware: All All
Importance: P2 normal
Assigned To: Nobody
URL: Changing the encoding while parsing
Depends on:
Blocks:
Reported: 2011-07-13 10:03 CEST by Michael[tm] Smith
Modified: 2011-07-13 10:03 CEST (History)
Description Michael[tm] Smith 2011-07-13 10:03:05 CEST
The parser code does not conform to the current requirements in the "Changing the encoding while parsing" algorithm. It should be updated to match the current requirements there.

Details:

When conforming parsers encounter a meta element that specifies a character encoding, the HTML5 spec requires them to run the "Changing the encoding while parsing" algorithm:

  http://dev.w3.org/html5/spec/parsing.html#changing-the-encoding-while-parsing

In the v.nu parser, that means the code calls the internalEncodingDeclaration(String internalCharset) method, where internalCharset is the character encoding specified by the meta element.

In the current code, the first condition checked in internalEncodingDeclaration(String internalCharset) is whether internalCharset specifies a UTF-16 encoding. If it does, the code changes the document character encoding to UTF-8 and logs this parse error: "Internal encoding declaration specified 'UTF-16' which is not an ASCII superset. Continuing as if the encoding had been 'UTF-8'."
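
For illustration, the current behavior can be sketched roughly as follows. This is a simplified stand-in and not the actual v.nu source: the class name, the errors list, the documentEncoding field, and the way UTF-16 variants are matched are all assumptions made for the example. Only the method name and the error message are taken from the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for the parser's encoding handling; not the real v.nu class.
class EncodingDeclarationSketch {
    // Assumption: parse errors are collected in a list; the real parser
    // presumably reports them through its own error handler.
    final List<String> errors = new ArrayList<>();

    // Assumption: the encoding the parser settles on is kept in a plain field.
    String documentEncoding;

    void internalEncodingDeclaration(String internalCharset) {
        String name = internalCharset.trim().toUpperCase(Locale.ROOT);

        // Checked first, mirroring step 1 of the 2009 working draft:
        // a UTF-16 declaration is treated as if it had been UTF-8.
        // Assumption: these three labels stand in for "a UTF-16 encoding".
        if (name.equals("UTF-16") || name.equals("UTF-16BE") || name.equals("UTF-16LE")) {
            // Message text as quoted in this bug report.
            errors.add("Internal encoding declaration specified 'UTF-16' which is "
                    + "not an ASCII superset. Continuing as if the encoding had "
                    + "been 'UTF-8'.");
            documentEncoding = "UTF-8";
            return;
        }

        // Any other declared encoding is accepted as-is in this sketch;
        // the real method does more work here.
        documentEncoding = name;
    }
}
```

In this sketch, calling internalEncodingDeclaration("UTF-16") leaves documentEncoding set to "UTF-8" and records one error, while a declaration such as "windows-1252" is simply stored.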

That behavior conforms to the requirements of the "Changing the encoding while parsing" algorithm in an earlier version of the spec:

  http://www.w3.org/TR/2009/WD-html5-20090423/syntax.html#changing-the-encoding-while-parsing

...in which step 1 read:

  "If the new encoding is a UTF-16 encoding, change it to UTF-8."

...but the requirements in the spec have since changed, and that step is no longer step 1, but instead step 3.