Bugzilla – Bug 100
Make UTF-16 turn to UTF-8 if the encoding is detected in an ASCII-compatible manner. Clarify some other things in the encoding detection algorithm.
Last modified: 2008-03-20 16:11:39 CET
Index: source
===================================================================
--- source (revision 1271)
+++ source (revision 1272)
@@ -35619,53 +35619,33 @@
 sniffed, then skip this inner set of steps, and jump to the
 second step in the overall "two step" algorithm.</p></li>

- <li><p>Examine the attribute's name:</p>
+ <li><p>If the attribute's name is neither "<code
+ title="">charset</code>" nor "<code title="">content</code>",
+ then return to step 2 in these inner steps.</p></li>
+
+ <li><p>If the attribute's name is "<code
+ title="">charset</code>", let <var title="">charset</var> be
+ the attribute's value, interpreted as a character
+ encoding.</p></li>
+
+ <li><p>Otherwise, the attribute's name is "<code
+ title="">content</code>": apply the <span>algorithm for
+ extracting an encoding from a Content-Type</span>, giving the
+ attribute's value as the string to parse. If an encoding is
+ returned, let <var title="">charset</var> be that
+ encoding. Otherwise, return to step 2 in these inner
+ steps.</li>
+
+ <p>If <var title="">charset</var> is a UTF-16 encoding,
+ change it to UTF-8.</p>
+
+ <p>If <var title="">charset</var> is a supported character
+ encoding, then return the given encoding, with <span
+ title="concept-encoding-confidence">confidence</span>
+ <i>tentative</i>, and abort all these steps.</p>

- <dl class="switch">
-
- <dt>If it is 'charset'</dt>
-
- <dd><p>If the attribute's value is a supported character
- encoding, then return the given encoding, with <span
- title="concept-encoding-confidence">confidence</span>
- <i>tentative</i>, and abort all these steps. Otherwise, do
- nothing with this attribute, and continue looking for other
- attributes.</p></dd>
-
- <dt>If it is 'content'</dt>
-
- <dd>
-
- <p>The attribute's value is now parsed.</p>
-
- <ol>
-
- <li>Apply the <span>algorithm for extracting an encoding
- from a Content-Type</span>, giving the attribute's value
- as the string to parse.</li>
-
- <li>If an encoding was returned, and it is the name of a
- supported character encoding, then return that encoding,
- with the <span
- title="concept-encoding-confidence">confidence</span>
- <i>tentative</i>, and abort all these steps.</li>
-
- <li>Otherwise, skip this 'content' attribute and continue
- on with any other attributes.</li>
-
- </ol>
-
- <dd>
-
- <dt>Any other name</dt>
-
- <dd><p>Do nothing with that attribute.</p></dd>
-
- </dl>
-
- </li>
-
- <li><p>Return to step 1 in these inner steps.</p></li>
+ <li><p>Otherwise, return to step 2 in these inner
+ steps.</p></li>

 </ol>

@@ -35732,7 +35712,7 @@
 this substep.</p></li>

 <li><p>If the byte at <var title="">position</var> is 0x3E (ASCII
- '>'), then stop looking for an attribute. There isn't
+ '>'), then abort the "get an attribute" algorithm. There isn't
 one.</p></li>

 <li><p>Otherwise, the byte at <var title="">position</var> is the
@@ -35759,9 +35739,9 @@

 <dt>If it is 0x2F (ASCII '/') or 0x3E (ASCII '>')</dt>

- <dd>Stop looking for an attribute. The attribute's name is the
- value of <var title="">attribute name</var>, its value is the
- empty string.</dd>
+ <dd>Abort the "get an attribute" algorithm. The attribute's
+ name is the value of <var title="">attribute name</var>, its
+ value is the empty string.</dd>

 <dt>If it is in the range 0x41 (ASCII 'A') to 0x5A (ASCII
 'Z')</dt>
@@ -35794,8 +35774,8 @@
 next byte, then, repeat this step.</p></li>

 <li><p>If the byte at <var title="">position</var> is
- <em>not</em> 0x3D (ASCII '='), stop looking for an
- attribute. Move <var title="">position</var> back to the previous
+ <em>not</em> 0x3D (ASCII '='), abort the "get an attribute"
+ algorithm. Move <var title="">position</var> back to the previous
 byte. The attribute's name is the value of <var
 title="">attribute name</var>, its value is the empty
 string.</p></li>
@@ -35827,10 +35807,10 @@
 byte.</li>

 <li>If the value of the byte at <var title="">position</var>
- is the value of <var title="">b</var>, then stop looking for
- an attribute. The attribute's name is the value of <var
- title="">attribute name</var>, and its value is the value of
- <var title="">attribute value</var>.</li>
+ is the value of <var title="">b</var>, then abort the "get an
+ attribute" algorithm. The attribute's name is the value of
+ <var title="">attribute name</var>, and its value is the
+ value of <var title="">attribute value</var>.</li>

 <li>Otherwise, if the value of the byte at <var
 title="">position</var> is in the range 0x41 (ASCII 'A') to
@@ -35851,9 +35831,9 @@

 <dt>If it is 0x3E (ASCII '>')</dt>

- <dd>Stop looking for an attribute. The attribute's name is the
- value of <var title="">attribute name</var>, its value is the
- empty string.</dd>
+ <dd>Abort the "get an attribute" algorithm. The attribute's
+ name is the value of <var title="">attribute name</var>, its
+ value is the empty string.</dd>


 <dt>If it is in the range 0x41 (ASCII 'A') to 0x5A (ASCII
@@ -35885,9 +35865,9 @@
 VT), 0x0C (ASCII FF), 0x0D (ASCII CR), 0x20 (ASCII space), or
 0x3E (ASCII '>')</dt>

- <dd>Stop looking for an attribute. The attribute's name is the
- value of <var title="">attribute name</var> and its value is the
- value of <var title="">attribute value</var>.</dd>
+ <dd>Abort the "get an attribute" algorithm. The attribute's
+ name is the value of <var title="">attribute name</var> and its
+ value is the value of <var title="">attribute value</var>.</dd>

 <dt>If it is in the range 0x41 (ASCII 'A') to 0x5A (ASCII
 'Z')</dt>
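For readers who want to see the revised inner steps from the first hunk in executable form, here is a rough Python sketch. It is an illustration under assumptions, not the specification's text: the function names are hypothetical, the Content-Type extraction is a simplified regular-expression stand-in for the spec's algorithm, and "supported character encoding" is approximated by whatever Python's `codecs` module recognises. The UTF-16 rule reflects the bug title: a declaration that could be read byte-wise as ASCII cannot come from a document that is really UTF-16.

```python
import codecs
import re


def extract_encoding_from_content_type(value):
    """Simplified stand-in (assumption) for the spec's 'algorithm for
    extracting an encoding from a Content-Type'."""
    match = re.search(r'charset\s*=\s*["\']?([^\s;"\']+)', value, re.IGNORECASE)
    return match.group(1) if match else None


def is_utf16(label):
    # Assumption: these three labels stand in for "a UTF-16 encoding".
    return label.strip().lower() in {"utf-16", "utf-16le", "utf-16be"}


def is_supported(label):
    # Assumption: "supported" means Python's codec registry knows the label.
    try:
        codecs.lookup(label)
        return True
    except LookupError:
        return False


def charset_from_meta_attributes(attributes):
    """attributes: iterable of (name, value) pairs, as the "get an attribute"
    algorithm would yield them. Returns (encoding, "tentative") or None."""
    for name, value in attributes:
        if name == "charset":
            charset = value
        elif name == "content":
            charset = extract_encoding_from_content_type(value)
            if charset is None:
                continue  # nothing extracted: go back and get the next attribute
        else:
            continue  # neither "charset" nor "content": get the next attribute
        if is_utf16(charset):
            charset = "utf-8"  # the rule this revision adds
        if is_supported(charset):
            return charset, "tentative"
        # unsupported encoding: keep looking at the remaining attributes
    return None
```

For example, `charset_from_meta_attributes([("charset", "utf-16")])` yields `("utf-8", "tentative")`, while an unsupported label followed by a `content` attribute with no charset parameter yields `None`.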