Bugzilla – Bug 639
HTML 4 and XHTML parsing too strict wrt enumerated attribute values
Last modified: 2009-09-04 15:12:36 CEST
SGML and XML perform whitespace normalization on all NMTOKEN attribute values, which includes all attributes declared with an enumeration. So the indicated document, as well as its XML equivalent http://www.ltg.ed.ac.uk/~ht/whitened-attr.xml, are valid HTML 4.01/XHTML 1.0 respectively. But, presumably because HTML5 does _not_ mandate whitespace normalization (why? -- a separate issue will be filed wrt this), validator.nu finds an error.
Browsers don't trim leading or trailing whitespace, so allowing it in the validator is not helpful for authors trying to find out why their markup isn't working in browsers. For instance, <input type=" checkbox"> will render a text field in browsers -- not a checkbox.
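To make the difference concrete, here is a minimal sketch of the two behaviours under discussion. It is not browser or Validator.nu code: the function names and the shortened list of input types are assumptions made for illustration, and `str.lower()` and `str.split()` only approximate ASCII case folding and XML whitespace handling.

```python
# Hypothetical subset of <input> type keywords, for illustration only.
KNOWN_INPUT_TYPES = {"text", "checkbox", "radio", "submit", "hidden"}


def browser_input_type(raw: str) -> str:
    """Browser-style lookup: case-insensitive, but whitespace is NOT trimmed.

    Any value that is not an exact keyword match falls back to "text".
    """
    candidate = raw.lower()  # approximation of ASCII case-insensitive matching
    return candidate if candidate in KNOWN_INPUT_TYPES else "text"


def nmtoken_normalized(raw: str) -> str:
    """Approximate XML normalization of a non-CDATA attribute value:
    drop leading/trailing whitespace and collapse internal runs to one space."""
    return " ".join(raw.split())


if __name__ == "__main__":
    raw = " checkbox"
    print(repr(browser_input_type(raw)))                      # no trimming
    print(repr(browser_input_type(nmtoken_normalized(raw))))  # after normalization
```

With the untrimmed value the lookup misses and falls back to a text field, which is the behaviour an author would see in a browser; only after DTD-style normalization does the value match the checkbox keyword.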
XML makes processing external entities optional. Validator.nu defaults to not processing external entities, because browsers don't process them (except abridged fake versions in some cases). Defaulting to processing external entities would hide problems that arise when sending XML files to browsers. When external entities aren't processed, the XML processor doesn't know which attributes are NMTOKEN attributes. (This demonstrates why optional features at a level as fundamental as parsing are a bad idea.)
To see the problem Simon described, I suggest changing the test case to say type=" checkbox" instead of type=" text".
You can configure Validator.nu to load external entities (without performing DTD-based validation) by choosing a different setting from the Parser menu: http://validator.nu/?parser=xmldtd
Now, validating gives a different error: http://validator.nu/?doc=http%3A%2F%2Fwww.ltg.ed.ac.uk%2F~ht%2Fwhitened-attr.xml&parser=xmldtd
The DTD causes a shape attribute to be defaulted on <a> elements, and the schema defaults to XHTML5+ARIA, SVG 1.1 plus MathML 2.0 (experimental) based on the namespace (since dispatching on the doctype would be a layering violation), but XHTML5 doesn't allow a shape attribute on the <a> element.
You can, however, also override the schema manually: http://validator.nu/?doc=http%3A%2F%2Fwww.ltg.ed.ac.uk%2F~ht%2Fwhitened-attr.xml&schema=http%3A%2F%2Fs.validator.nu%2Fxhtml10%2Fxhtml-transitional.rnc+http%3A%2F%2Fs.validator.nu%2Fxhtml10%2Fxhtml.sch++http%3A%2F%2Fc.validator.nu%2Fall-html4%2F&parser=xmldtd
HTML5 doesn't mandate whitespace normalization, because HTML5 tries to align with existing browser behavior for features that aren't new, and browsers don't perform whitespace normalization here.
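The earlier point, that an XML processor cannot normalize an enumerated attribute unless it has seen the declaration, can be sketched with Python's expat-based ElementTree parser. This is an illustration under assumptions, not how Validator.nu works: the internal DTD subset stands in for the external XHTML DTD (a processor always reads the internal subset, while external entities are optional), and the document strings and the `type_attr` helper are made up for the example.

```python
import xml.etree.ElementTree as ET

# The declaration is placed in the internal subset so the parser sees it
# without having to fetch an external entity.
WITH_DECLARATION = (
    '<!DOCTYPE input [<!ATTLIST input type (text|checkbox) #IMPLIED>]>'
    '<input type=" checkbox"/>'
)

# Same element, but the parser never learns the attribute's declared type,
# as when an external DTD is not processed.
WITHOUT_DECLARATION = '<input type=" checkbox"/>'


def type_attr(document: str) -> str:
    """Return the value of the root element's type attribute as reported by the parser."""
    return ET.fromstring(document).get("type")


if __name__ == "__main__":
    print(repr(type_attr(WITH_DECLARATION)))     # declared as an enumeration
    print(repr(type_attr(WITHOUT_DECLARATION)))  # treated as CDATA
```

When the declaration is visible, the leading space is removed before the application ever sees the value; when it is not, the value is passed through untouched, so two conforming processors can report different values for the same document.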
Thanks for the clarifications. Learn something new every day -- it had simply never occurred to me that existing browsers wouldn't follow the HTML 4.01 spec (which I agree only _allows_ whitespace normalization), or wouldn't choose to stay as compatible as possible with generic XML tools by including attribute types in the "abridged fake versions" of the DOCTYPE they use. Not a big deal, since the whitespace in question is unlikely to arise intentionally or accidentally.
(In reply to comment #3)
> wouldn't choose to stay
> as compatible as possible with generic XML tools by including attribute types
> in the "abridged fake versions" of the DOCTYPE they use.
I'd argue that making authors not include extra whitespace in attributes that take enumerated values leads to maximal compatibility with generic XML tools, since generic XML tools are permitted not to read external entities. In fact, it's a problem for generic XML tools that the abridged DTDs include character entities, since things like named character references can "work" in a browser but fail in generic XML tools that don't process external entities.