NOTE: The current preferred location for bug reports is the GitHub issue tracker.
Bug 95 - Make using a Win1252-specific byte when the document declared as ISO-8859-1 be a parse error.
Status: NEW
Product: Validator.nu
Classification: Unclassified
Component: HTML parser
Version: HEAD
Hardware: All All
Importance: P2 normal
Assigned To: Nobody
URL: http://svn.whatwg.org/webapps/source?...
Reported: 2008-03-03 13:09 CET by Nobody
Modified: 2011-10-15 13:38 CEST (History)
Description Nobody 2008-03-03 13:09:41 CET
Index: source
===================================================================
--- source	(revision 1263)
+++ source	(revision 1264)
@@ -35965,10 +35965,13 @@
   href="#refsIANACHARSET">[IANACHARSET]</a></p>
 
   <p>When a user agent would otherwise use the ISO-8859-1 encoding, it
-  must instead use the Windows-1252 encoding.</p>
+  must instead use the Windows-1252 encoding, except that any bytes in
+  the range 0x80 to 0x9F must, in addition to being interpreted as per
+  the Windows-1252 encoding, be considered <span title="parse
+  error">parse errors</span>.</p>
 
-  <p class="note">This requirement is a willful violation of the W3C
-  Character Model specification. <a
+  <p class="note">The requirement to treat ISO-8859-1 as Windows-1252
+  is a willful violation of the W3C Character Model specification. <a
   href="#refsCHARMOD">[CHARMOD]</a></p>
 
   <p>User agents must not support the CESU-8, UTF-7, BOCU-1 and SCSU
Comment 1 Henri Sivonen 2008-03-16 17:26:51 CET
Potential WONTFIX.
Comment 2 Jukka K. Korpela 2011-10-15 13:38:15 CEST
The current wording at
http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#character-encodings-0
does not seem to contain any requirement on treating some data as parse errors. It simply states that iso-8859-1 is to be treated as windows-1252.

However, I think it would be most useful to issue a warning about an octet that does not represent a character allowed by iso-8859-1, even if it is defined in windows-1252.

Although browsers almost uniformly treat iso-8859-1 as windows-1252, it's still a protocol error to declare a document as iso-8859-1 encoded when it is not. Moreover, such a situation often reflects an accidental error (e.g., the author entered a character he didn't mean to enter), and if the author in fact meant to enter non-iso-8859-1 windows-1252 characters, he can and should fix this by changing the character encoding information.
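
For illustration only, the check discussed in this bug could be sketched roughly as follows in Python. This is not Validator.nu code. The function name decode_declared_latin1 is hypothetical, and the fallback for the five bytes that Python's cp1252 codec leaves unmapped (0x81, 0x8D, 0x8F, 0x90, 0x9D) is an assumption: they are passed through as the C1 control with the same code point. The sketch decodes every byte as windows-1252 and separately records the offsets of bytes in the range 0x80 to 0x9F, so a caller can report them as errors or warnings.

```python
def decode_declared_latin1(data: bytes):
    """Decode bytes declared as ISO-8859-1 using windows-1252.

    Returns the decoded text and the offsets of all bytes in the
    range 0x80-0x9F, which a checker could report to the author.
    """
    chars = []
    flagged_offsets = []
    for offset, byte in enumerate(data):
        if 0x80 <= byte <= 0x9F:
            # Byte is outside the printable ISO-8859-1 repertoire.
            flagged_offsets.append(offset)
        try:
            chars.append(bytes([byte]).decode("cp1252"))
        except UnicodeDecodeError:
            # Assumption: Python's cp1252 codec has no mapping for
            # 0x81, 0x8D, 0x8F, 0x90 and 0x9D, so pass them through
            # as the C1 control with the same code point.
            chars.append(chr(byte))
    return "".join(chars), flagged_offsets


if __name__ == "__main__":
    text, offsets = decode_declared_latin1(b"caf\xe9 \x93hi\x94")
    print(text, offsets)
```

Whether the flagged offsets are then reported as parse errors (as in the diff above) or as warnings (as suggested in comment 2) is left to the caller.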