NOTE: The current preferred location for bug reports is the GitHub issue tracker.
Bug 890 - Should be a conformance error if a UTF-16LE/BE file begins with the BOM
Status: NEW
Product: Validator.nu
Classification: Unclassified
Component: HTML parser
Version: HEAD
Hardware: All All
Importance: P2 normal
Assigned To: Nobody
http://malform.no/testing/utf/html/16...
Depends on:
Blocks:
Reported: 2011-12-29 07:16 CET by Leif Halvard Silli
Modified: 2011-12-29 07:21 CET

Attachments
File with HTTP level charset=utf-16be plus the "BOM". (350 bytes, text/html;charset=utf-16be)
2011-12-29 07:16 CET, Leif Halvard Silli

Description Leif Halvard Silli 2011-12-29 07:16:46 CET
Created attachment 215
File with HTTP level charset=utf-16be plus the "BOM". 

PROPOSAL: Including the BOM in a file sent (at HTTP level) with either the UTF-16LE label or the UTF-16BE label should be treated as an error. The validator should report the error and recommend using the 'utf-16' label instead. (Theoretically, one could recommend removing the BOM too, but I am reluctant to do that.)
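
As a rough illustration of the proposed check, here is a minimal Python sketch. It is not Validator.nu code: the function name `check_bom_conflict`, the message wording, and the decision to flag a BOM in either byte order are all assumptions made for this example.

```python
import codecs

# HTTP-level labels that, per RFC 2781 section 3.3, must not be combined
# with a prepended BOM.
BOMLESS_LABELS = {"utf-16le", "utf-16be"}


def check_bom_conflict(http_charset, body):
    """Return an error message if `body` starts with a UTF-16 BOM while the
    HTTP-level label is UTF-16LE or UTF-16BE; otherwise return None.

    Hypothetical helper: name, signature and message are illustrative only.
    """
    label = http_charset.strip().lower()
    if label not in BOMLESS_LABELS:
        return None
    # Assumption: a BOM in either byte order is flagged, whatever the label.
    if body.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return ("Error: document labelled %s begins with a BOM; "
                "use the label 'utf-16' instead." % label)
    return None
```

A document labelled 'utf-16' passes this check whether or not it starts with a BOM, which matches the recommendation above.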

BACKGROUND:

The UTF-16 spec states: <http://tools.ietf.org/html/rfc2781#section-3.3>

'Systems labelling UTF-16LE text MUST NOT prepend a BOM to the text.'
'Systems labelling UTF-16BE text MUST NOT prepend a BOM to the text.'

(Note that the above does not mean that the *text content* cannot begin with the BOM - it only means that one must add a BOM *before* the content.)
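
The difference between the labels can be seen in how a decoder treats the same leading bytes. The sketch below uses Python's standard codecs purely as an illustration (it says nothing about how browsers or Validator.nu are implemented): under the explicit big-endian label the leading U+FEFF stays in the text, while under the generic 'utf-16' label it is consumed as a byte order mark.

```python
# Big-endian BOM bytes followed by a big-endian encoded DOCTYPE.
data = b"\xfe\xff" + "<!DOCTYPE html>".encode("utf-16-be")

# Explicit big-endian label: U+FEFF is kept as the first character of the text.
as_be = data.decode("utf-16-be")

# Generic label: the BOM selects the byte order and is removed from the text.
as_generic = data.decode("utf-16")

print(repr(as_be[0]))    # the retained U+FEFF character
print(as_generic)        # text without the BOM
```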

As a matter of fact, however, Web browsers do accept files which contain the BOM before the DOCTYPE even if the file is labelled UTF-16LE/BE at HTTP level. Some browsers (IE/WebKit) *read* the BOM of such files before removing it from the output (to avoid quirks mode), while other browsers (Opera/Firefox) simply remove it from the output without reading it. Both methods indicate that the character is treated as a BOM since, per the UTF-16, the 

Effectively, this means that browsers treat UTF-16LE/BE as mislabeled UTF-16 - since otherwise, it would not be justified to remove the BOM. The disadvantage of not doing it this way is that the page would have to be placed in quirks mode because of the illegal BOM character before the DOCTYPE.
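
The two browser behaviours described above could be sketched roughly as follows. This is an interpretation for illustration, not browser source code: the function names are hypothetical, and it is assumed here that "reading" the BOM means letting it decide the byte order.

```python
import codecs


def decode_sniffing(body, label):
    """Behaviour attributed to IE/WebKit (as interpreted here): read the BOM,
    let it choose the byte order, and drop it from the output."""
    if body.startswith(codecs.BOM_UTF16_BE):
        return body[2:].decode("utf-16-be")
    if body.startswith(codecs.BOM_UTF16_LE):
        return body[2:].decode("utf-16-le")
    return body.decode(label)


def decode_stripping(body, label):
    """Behaviour attributed to Opera/Firefox (as interpreted here): decode
    using the label alone, then drop a leading U+FEFF from the output."""
    text = body.decode(label)
    return text[1:] if text.startswith("\ufeff") else text
```

For a file whose BOM agrees with its label, both functions return the same text starting at the DOCTYPE, which is why neither approach triggers quirks mode.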
Comment 1 Leif Halvard Silli 2011-12-29 07:21:56 CET
(In reply to comment #0)

>  - it only means that one must add a BOM *before* the content.)

Sorry. Typo. Meant: "one must NOT add a BOM *before* the content."