NOTE: The current preferred location for bug reports is the GitHub issue tracker.
Bug 961 - Format for IANA character sets has been changed
Status: RESOLVED FIXED
Product: Validator.nu
Classification: Unclassified
Component: Datatype library
Version: HEAD
Hardware: All All
Importance: P2 normal
Assigned To: Henri Sivonen
Reported: 2013-01-25 13:52 CET by Peter Wu
Modified: 2013-03-20 17:52 CET
Description Peter Wu 2013-01-25 13:52:43 CET
IANA's character sets database has changed from a plaintext format to XML (a text version[1] is also available, but it is not compatible with the old format). Due to this change, a document with <meta charset="utf-8"> results in a warning. This is observed on a local build, not on the validator.nu instance.

The new format[2] uses XML, which is not understood by syntax/relaxng/datatype/java/src/org/whattf/datatype/data/CharsetData.java. As a temporary workaround, I downloaded an old file from http://wayback.archive.org/web/20121014082804/http://www.iana.org/assignments/character-sets and put it in local-entities/www.iana.org/assignments/character-sets, but this is of course not a complete solution.

PS. this bugzilla is really slow, 5 seconds just to get a page loaded? :/

 [1]: http://www.iana.org/assignments/character-sets/character-sets.txt
 [2]: http://www.iana.org/assignments/character-sets/character-sets.xml
Comment 1 Michael[tm] Smith 2013-01-25 15:28:01 CET
Is there any reason we can't just switch the URL in the validator code to get http://www.iana.org/assignments/character-sets/character-sets.txt instead of http://www.iana.org/assignments/character-sets ? We could then just use the same parsing code we've already got, with no need to rewrite it to use the XML version.
Comment 2 Peter Wu 2013-01-25 15:55:34 CET
Parsing the text format seems more difficult to me; it is intended for humans, not computers. For example, the entry for US-ASCII:

...
  (a lot of spaces) ANSI_X3.4-1986
  (a lot of spaces) ISO_646.irv:1991
US-ASCII (spaces, name, number, source, reference) ISO646-US
  (a lot of spaces) US-ASCII
...

Entries overlap; there is no separation between the charsets unless you look at the XML page.

There is already an XML parser in use for the validator, so why not use the same one for parsing this document? It would also be nice to add a build test that checks whether the list of charsets actually contains any elements. The most trivial test would be one for utf-8 (assuming that this will always exist and never change).
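
A minimal sketch of that idea, not the actual CharsetData.java code: it parses a tiny inline stand-in for character-sets.xml with the JDK's DOM parser and then runs the utf-8 sanity check. The class name, the method names, and the sample data are hypothetical, and the element names (record, name, alias) are an assumption about the registry layout that should be checked against the real file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper; not the validator's real implementation.
class CharsetRegistrySketch {

    // Tiny stand-in for character-sets.xml. The element names used here
    // (record, name, alias) are an assumption about the registry layout.
    static final String SAMPLE_XML =
        "<registry xmlns=\"http://www.iana.org/assignments\">"
        + "<record><name>US-ASCII</name>"
        + "<alias>ANSI_X3.4-1986</alias><alias>ISO646-US</alias></record>"
        + "<record><name>UTF-8</name></record>"
        + "</registry>";

    // Collects every <name> and <alias> found inside a <record>, lowercased.
    static Set<String> parseCharsetNames(byte[] xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml));
        } catch (Exception e) {
            throw new IllegalArgumentException("unparseable charset registry", e);
        }
        Set<String> names = new TreeSet<>();
        NodeList records = doc.getElementsByTagName("record");
        for (int i = 0; i < records.getLength(); i++) {
            Element record = (Element) records.item(i);
            addAll(names, record.getElementsByTagName("name"));
            addAll(names, record.getElementsByTagName("alias"));
        }
        return names;
    }

    private static void addAll(Set<String> out, NodeList nodes) {
        for (int i = 0; i < nodes.getLength(); i++) {
            String text = nodes.item(i).getTextContent().trim();
            out.add(text.toLowerCase(Locale.ROOT));
        }
    }

    public static void main(String[] args) {
        Set<String> names =
            parseCharsetNames(SAMPLE_XML.getBytes(StandardCharsets.UTF_8));
        // Build-time sanity check: fail loudly if the list is empty
        // or does not contain utf-8.
        if (names.isEmpty() || !names.contains("utf-8")) {
            throw new IllegalStateException("charset list is empty or lacks utf-8");
        }
        System.out.println(names.size() + " charset names parsed");
    }
}
```

With a check like this in the build, a silent format change upstream would show up as a build failure instead of as a warning on every document.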

(PS. my initial statement of 5 seconds was too optimistic; actually measuring the delay gives something like 20-30 seconds)
Comment 3 Henri Sivonen 2013-01-25 16:18:42 CET
We should ignore IANA and switch over to the Encoding Standard.
Comment 4 Michael[tm] Smith 2013-01-25 16:26:58 CET
(In reply to comment #3)
> We should ignore IANA and switch over to the Encoding Standard.

Yeah, agreed.

Peter, note that the HTML spec was recently updated to no longer normatively reference the IANA Character Sets registry but instead normatively reference the Encoding Standard:

http://www.whatwg.org/specs/web-apps/current-work/multipage/infrastructure.html#encoding
http://www.whatwg.org/specs/web-apps/current-work/multipage/semantics.html#character-encoding-declaration
http://encoding.spec.whatwg.org/

Note in particular the parts of the HTML spec that now say:

"A character encoding, or just encoding where that is not ambiguous, is a defined way to convert between byte streams and Unicode strings, as defined in the WHATWG Encoding standard. An encoding has an encoding name and one or more encoding labels, referred to as the encoding's name and labels in the Encoding specification. [ENCODING]"

and

 "The character encoding name given must be an ASCII case-insensitive match for the name of the character encoding used to serialize the file. [ENCODING]"

So to be in conformance with the current HTML spec, we need to change the validator code to not use the IANA registry at all, but instead just use the (much smaller) list of allowed values that are in the Encoding Standard.
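
As a rough sketch of what such a check could look like (not the actual validator code), the value of a charset declaration can be compared against a fixed list of names using an ASCII case-insensitive match. The class and method names below are hypothetical, and the list is only an illustrative subset of the names in the Encoding Standard, not the full set. The sketch only tests whether a value is one of the listed names; it does not check that the name matches the encoding actually used to serialize the file.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; not the validator's actual datatype code.
class EncodingNameCheck {

    // Illustrative subset of Encoding Standard names, not the full list.
    static final List<String> ENCODING_NAMES = Arrays.asList(
        "UTF-8", "windows-1252", "ISO-8859-2", "Shift_JIS", "EUC-JP");

    // Lowercases only A-Z, so non-ASCII look-alike characters never match.
    static String asciiLowercase(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            sb.append(c >= 'A' && c <= 'Z' ? (char) (c + 0x20) : c);
        }
        return sb.toString();
    }

    // True if value is an ASCII case-insensitive match for a listed name.
    static boolean isEncodingName(String value) {
        String candidate = asciiLowercase(value);
        for (String name : ENCODING_NAMES) {
            if (asciiLowercase(name).equals(candidate)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isEncodingName("utf-8"));
        System.out.println(isEncodingName("Shift_jis"));
        System.out.println(isEncodingName("no-such-encoding"));
    }
}
```

The hand-written asciiLowercase is deliberate: String.equalsIgnoreCase applies Unicode case mappings, which is broader than the ASCII case-insensitive match the spec text asks for.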