Bug 95 – Make using a Win1252-specific byte when the document declared as ISO-8859-1 be a parse error.

NOTE: The current preferred location for bug reports is the GitHub issue tracker.

Status:	NEW

Product:	Validator.nu
Classification:	Unclassified
Component:	HTML parser
Version:	HEAD
Hardware:	All All

Importance:	P2 normal
Assigned To:	Nobody

URL:	http://svn.whatwg.org/webapps/source?...

Depends on:
Blocks:

Reported:	2008-03-03 13:09 CET by Nobody
Modified:	2011-10-15 13:38 CEST
CC List:	1 user

See Also:

Description Nobody 2008-03-03 13:09:41 CET

Index: source
===================================================================
--- source	(revision 1263)
+++ source	(revision 1264)
@@ -35965,10 +35965,13 @@
   href="#refsIANACHARSET">[IANACHARSET]</a></p>
 
   <p>When a user agent would otherwise use the ISO-8859-1 encoding, it
-  must instead use the Windows-1252 encoding.</p>
+  must instead use the Windows-1252 encoding, except that any bytes in
+  the range 0x80 to 0x9F must, in addition to being interpreted as per
+  the Windows-1252 encoding, be considered <span title="parse
+  error">parse errors</span>.</p>
 
-  <p class="note">This requirement is a willful violation of the W3C
-  Character Model specification. <a
+  <p class="note">The requirement to treat ISO-8859-1 as Windows-1252
+  is a willful violation of the W3C Character Model specification. <a
   href="#refsCHARMOD">[CHARMOD]</a></p>
 
   <p>User agents must not support the CESU-8, UTF-7, BOCU-1 and SCSU
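
The rule added in revision 1264 can be sketched roughly as follows. This is a minimal illustration in Python, not the Validator.nu implementation: the function name decode_declared_latin1 is hypothetical, and the fallback for the five bytes that Python's cp1252 codec leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) is an assumption, since the diff above does not say how they should be handled.

```python
def decode_declared_latin1(data: bytes) -> tuple[str, list[int]]:
    """Decode bytes declared as ISO-8859-1 using Windows-1252.

    Returns the decoded text and the byte offsets that would count as
    parse errors under the proposed wording (bytes 0x80 to 0x9F).
    """
    chars = []
    error_offsets = []
    for offset, byte in enumerate(data):
        if 0x80 <= byte <= 0x9F:
            # Still interpreted as Windows-1252, but flagged as an error.
            error_offsets.append(offset)
        try:
            chars.append(bytes([byte]).decode("cp1252"))
        except UnicodeDecodeError:
            # Assumption: bytes that Python's cp1252 codec leaves undefined
            # map to the code point with the same numeric value.
            chars.append(chr(byte))
    return "".join(chars), error_offsets


text, errors = decode_declared_latin1(b"caf\xe9 \x93hi\x94")
print(text)    # the curly quotes come from bytes 0x93 and 0x94
print(errors)  # offsets of the two flagged bytes
```

Bytes at 0xA0 and above decode identically under both encodings, so only the 0x80 to 0x9F range needs special treatment.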

Comment 1 Henri Sivonen 2008-03-16 17:26:51 CET

Potential WONTFIX.

Comment 2 Jukka K. Korpela 2011-10-15 13:38:15 CEST

The current wording at
http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#character-encodings-0
does not seem to contain any requirement on treating some data as parse errors. It simply states that iso-8859-1 is to be treated as windows-1252.

However, I think it would be most useful to issue a warning about an octet that does not represent a character allowed in iso-8859-1, even if it is defined in windows-1252.

Although browsers almost uniformly treat iso-8859-1 as windows-1252, it's still a protocol error to declare a document as iso-8859-1 encoded when it is not. Moreover, such a situation often reflects an accidental error (e.g., the author entered a character he didn't mean to enter), and if the author in fact meant to enter non-iso-8859-1 windows-1252 characters, he can and should fix this by changing the character encoding information.
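
A warning of the kind suggested here might look like the sketch below. The function name latin1_label_warnings, the accepted label spellings, and the message wording are all hypothetical; this is only an illustration of the check, not how Validator.nu reports messages.

```python
def latin1_label_warnings(data: bytes, declared: str = "iso-8859-1") -> list[str]:
    """Warn about bytes in 0x80-0x9F in a document declared as ISO-8859-1."""
    # Hypothetical, deliberately short list of label spellings.
    if declared.lower() not in {"iso-8859-1", "latin1"}:
        return []
    warnings = []
    line, column = 1, 0
    for byte in data:
        if byte == 0x0A:
            # Line feed: start counting columns on the next line.
            line, column = line + 1, 0
            continue
        column += 1  # single-byte encoding, so one byte is one column
        if 0x80 <= byte <= 0x9F:
            warnings.append(
                f"line {line}, column {column}: byte 0x{byte:02X} is in the "
                "range 0x80-0x9F, which ISO-8859-1 does not assign to "
                "printable characters; consider declaring windows-1252"
            )
    return warnings


for message in latin1_label_warnings(b"ok\n\x80 5\n"):
    print(message)
```

The message points at the remedy mentioned above: if the Windows-1252 characters are intentional, the author changes the declared encoding rather than the content.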