NOTE: The current preferred location for bug reports is the GitHub issue tracker.
Bug 100 - Make UTF-16 turn to UTF-8 if the encoding is detected in an ASCII-compatible manner. Clarify some other things in the encoding detection algorithm.
Status: RESOLVED FIXED
Product: Validator.nu
Classification: Unclassified
Component: HTML parser
Version: HEAD
Hardware: All All
Importance: P2 normal
Assigned To: Henri Sivonen
http://svn.whatwg.org/webapps/source?...
Reported: 2008-03-03 13:11 CET by Nobody
Modified: 2008-03-20 16:11 CET
Description Nobody 2008-03-03 13:11:00 CET
Index: source
===================================================================
--- source	(revision 1271)
+++ source	(revision 1272)
@@ -35619,53 +35619,33 @@
          sniffed, then skip this inner set of steps, and jump to the
          second step in the overall "two step" algorithm.</p></li>
 
-         <li><p>Examine the attribute's name:</p>
+         <li><p>If the attribute's name is neither "<code
+         title="">charset</code>" nor "<code title="">content</code>",
+         then return to step 2 in these inner steps.</p></li>
+
+         <li><p>If the attribute's name is "<code
+         title="">charset</code>", let <var title="">charset</var> be
+         the attribute's value, interpreted as a character
+         encoding.</p></li>
+
+         <li><p>Otherwise, the attribute's name is "<code
+         title="">content</code>": apply the <span>algorithm for
+         extracting an encoding from a Content-Type</span>, giving the
+         attribute's value as the string to parse. If an encoding is
+         returned, let <var title="">charset</var> be that
+         encoding. Otherwise, return to step 2 in these inner
+         steps.</li>
+
+         <p>If <var title="">charset</var> is a UTF-16 encoding,
+         change it to UTF-8.</p>
+
+         <p>If <var title="">charset</var> is a supported character
+         encoding, then return the given encoding, with <span
+         title="concept-encoding-confidence">confidence</span>
+         <i>tentative</i>, and abort all these steps.</p>
 
-          <dl class="switch">
-
-           <dt>If it is 'charset'</dt>
-
-           <dd><p>If the attribute's value is a supported character
-           encoding, then return the given encoding, with <span
-           title="concept-encoding-confidence">confidence</span>
-           <i>tentative</i>, and abort all these steps. Otherwise, do
-           nothing with this attribute, and continue looking for other
-           attributes.</p></dd>
-
-           <dt>If it is 'content'</dt>
-
-           <dd>
-
-            <p>The attribute's value is now parsed.</p>
-
-            <ol>
-
-             <li>Apply the <span>algorithm for extracting an encoding
-             from a Content-Type</span>, giving the attribute's value
-             as the string to parse.</li>
-
-             <li>If an encoding was returned, and it is the name of a
-             supported character encoding, then return that encoding,
-             with the <span
-             title="concept-encoding-confidence">confidence</span>
-             <i>tentative</i>, and abort all these steps.</li>
-
-             <li>Otherwise, skip this 'content' attribute and continue
-             on with any other attributes.</li>
-
-            </ol>
-
-           <dd>
-
-           <dt>Any other name</dt>
-
-           <dd><p>Do nothing with that attribute.</p></dd>
-
-          </dl>
-
-         </li>
-
-         <li><p>Return to step 1 in these inner steps.</p></li>
+         <li><p>Otherwise, return to step 2 in these inner
+         steps.</p></li>
 
         </ol>
 
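The rewritten inner steps in the hunk above can be sketched roughly as follows. This is only an illustration of the control flow, not the spec's or Validator.nu's implementation. The names `charset_from_attribute`, `extract_from_content_type`, `SUPPORTED` and `UTF16_LABELS` are hypothetical, the regular expression is a simplified stand-in for the "algorithm for extracting an encoding from a Content-Type", and attribute names are assumed to arrive already lowercased. Returning `None` stands for "return to step 2 in these inner steps".

```python
import re
from typing import Optional

# Hypothetical stand-in for the set of "supported character encodings".
SUPPORTED = {"utf-8", "windows-1252", "iso-8859-1", "shift_jis"}
# Hypothetical set of labels treated as "a UTF-16 encoding".
UTF16_LABELS = {"utf-16", "utf-16le", "utf-16be"}


def extract_from_content_type(value: str) -> Optional[str]:
    """Simplified stand-in for extracting an encoding from a Content-Type."""
    match = re.search(r'charset\s*=\s*["\']?([^"\'\s;]+)', value, re.IGNORECASE)
    return match.group(1) if match else None


def charset_from_attribute(name: str, value: str) -> Optional[str]:
    """Return a tentative encoding, or None to move on to the next attribute."""
    # Attributes other than charset/content are ignored.
    if name not in ("charset", "content"):
        return None

    if name == "charset":
        charset = value
    else:
        charset = extract_from_content_type(value)
        if charset is None:
            return None

    charset = charset.strip().lower()

    # The prescan read the bytes in an ASCII-compatible manner, so a
    # declaration of UTF-16 found this way is changed to UTF-8.
    if charset in UTF16_LABELS:
        charset = "utf-8"

    # Only a supported encoding is returned (with tentative confidence).
    return charset if charset in SUPPORTED else None
```

For example, under these assumptions `charset_from_attribute("content", "text/html; charset=UTF-16")` yields `"utf-8"`, while an unrelated attribute such as `http-equiv` yields `None` and the caller continues with the next attribute.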
@@ -35732,7 +35712,7 @@
      this substep.</p></li>
 
      <li><p>If the byte at <var title="">position</var> is 0x3E (ASCII
-     '>'), then stop looking for an attribute. There isn't
+     '>'), then abort the "get an attribute" algorithm. There isn't
      one.</p></li>
 
      <li><p>Otherwise, the byte at <var title="">position</var> is the
@@ -35759,9 +35739,9 @@
 
        <dt>If it is 0x2F (ASCII '/') or 0x3E (ASCII '>')</dt>
 
-       <dd>Stop looking for an attribute. The attribute's name is the
-       value of <var title="">attribute name</var>, its value is the
-       empty string.</dd>
+       <dd>Abort the "get an attribute" algorithm. The attribute's
+       name is the value of <var title="">attribute name</var>, its
+       value is the empty string.</dd>
 
        <dt>If it is in the range 0x41 (ASCII 'A') to 0x5A (ASCII
        'Z')</dt>
@@ -35794,8 +35774,8 @@
      next byte, then, repeat this step.</p></li>
 
      <li><p>If the byte at <var title="">position</var> is
-     <em>not</em> 0x3D (ASCII '='), stop looking for an
-     attribute. Move <var title="">position</var> back to the previous
+     <em>not</em> 0x3D (ASCII '='), abort the "get an attribute"
+     algorithm. Move <var title="">position</var> back to the previous
      byte. The attribute's name is the value of <var
      title="">attribute name</var>, its value is the empty
      string.</p></li>
@@ -35827,10 +35807,10 @@
          byte.</li>
 
          <li>If the value of the byte at <var title="">position</var>
-         is the value of <var title="">b</var>, then stop looking for
-         an attribute. The attribute's name is the value of <var
-         title="">attribute name</var>, and its value is the value of
-         <var title="">attribute value</var>.</li>
+         is the value of <var title="">b</var>, then abort the "get an
+         attribute" algorithm. The attribute's name is the value of
+         <var title="">attribute name</var>, and its value is the
+         value of <var title="">attribute value</var>.</li>
 
          <li>Otherwise, if the value of the byte at <var
          title="">position</var> is in the range 0x41 (ASCII 'A') to
@@ -35851,9 +35831,9 @@
 
        <dt>If it is 0x3E (ASCII '>')</dt>
 
-       <dd>Stop looking for an attribute. The attribute's name is the
-       value of <var title="">attribute name</var>, its value is the
-       empty string.</dd>
+       <dd>Abort the "get an attribute" algorithm. The attribute's
+       name is the value of <var title="">attribute name</var>, its
+       value is the empty string.</dd>
 
 
        <dt>If it is in the range 0x41 (ASCII 'A') to 0x5A (ASCII
@@ -35885,9 +35865,9 @@
        VT), 0x0C (ASCII FF), 0x0D (ASCII CR), 0x20 (ASCII space), or
        0x3E (ASCII '>')</dt>
 
-       <dd>Stop looking for an attribute. The attribute's name is the
-       value of <var title="">attribute name</var> and its value is the
-       value of <var title="">attribute value</var>.</dd>
+       <dd>Abort the "get an attribute" algorithm. The attribute's
+       name is the value of <var title="">attribute name</var> and its
+       value is the value of <var title="">attribute value</var>.</dd>
 
        <dt>If it is in the range 0x41 (ASCII 'A') to 0x5A (ASCII
        'Z')</dt>