Bug 639 – HTML 4 and XHTML parsing too strict wrt enumerated attribute values

NOTE: The current preferred location for bug reports is the GitHub issue tracker.


Summary:	HTML 4 and XHTML parsing too strict wrt enumerated attribute values

Status:	RESOLVED NOTREPRODUCIBLE

Product:	Validator.nu
Classification:	Unclassified
Component:	HTML parser
Version:	HEAD
Hardware:	All All

Importance:	P2 normal
Assigned To:	Henri Sivonen

URL:	http://www.ltg.ed.ac.uk/~ht/whitened-...


Reported:	2009-09-03 18:56 CEST by Henry S. Thompson
Modified:	2009-09-04 15:12 CEST (History)
CC List:	2 users (show)


Description Henry S. Thompson 2009-09-03 18:56:49 CEST

SGML and XML perform whitespace normalization on all NMTOKEN attribute values, which includes all attributes declared with an enumeration.  So the indicated document, as well as its XML equivalent http://www.ltg.ed.ac.uk/~ht/whitened-attr.xml, are valid HTML 4.01/XHTML 1.0 respectively.  But, presumably because HTML5 does _not_ mandate whitespace normalization (why? -- a separate issue will be filed wrt this), validator.nu finds an error.
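
The normalization described above can be sketched roughly as follows. This is an illustrative approximation of XML attribute-value normalization for non-CDATA (tokenized or enumerated) attributes, not code from any validator; the function name is hypothetical, and character references and line-end handling are deliberately left out.

```python
import re


def normalize_tokenized_attr(raw: str) -> str:
    """Approximate XML normalization of a non-CDATA attribute value.

    Assumption: the input is the literal attribute text with no character
    or entity references, which a real XML processor would also expand.
    """
    # Step 1 (all attribute types): literal tab, LF, and CR become spaces.
    spaced = re.sub(r"[\t\n\r]", " ", raw)
    # Step 2 (non-CDATA types only): collapse runs of spaces, then drop
    # leading and trailing spaces.
    return re.sub(r" +", " ", spaced).strip(" ")


print(repr(normalize_tokenized_attr(" text")))
```

Under this rule a value such as " text" compares equal to the enumerated token "text", which is why a DTD-aware processor accepts the test document.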

Comment 1 Simon Pieters 2009-09-03 21:44:38 CEST

Browsers don't trim leading or trailing whitespace, so allowing it in the validator is not helpful for authors trying to find out why their markup isn't working in browsers.

For instance, <input type=" checkbox"> will render a text field in browsers -- not a checkbox.
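
A minimal sketch of the browser-side behaviour described here, assuming HTML5-style matching of enumerated attributes: the value is compared ASCII case-insensitively against the known keywords with no trimming, and anything else falls back to the text state. The keyword set below is a partial, illustrative list and the function name is hypothetical.

```python
import string

# Partial keyword list, for illustration only.
INPUT_TYPE_KEYWORDS = {"text", "checkbox", "radio", "submit", "hidden", "password"}

# Lowercase only A-Z, so the comparison is ASCII case-insensitive.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def effective_input_type(attr_value):
    """Return the input state a browser would pick for a type attribute."""
    if attr_value is None:
        return "text"  # missing attribute: default state
    lowered = attr_value.translate(_ASCII_LOWER)
    # No whitespace stripping: " checkbox" is simply not a known keyword.
    return lowered if lowered in INPUT_TYPE_KEYWORDS else "text"


print(effective_input_type(" checkbox"))
print(effective_input_type("checkbox"))
```

With a leading space the lookup misses and the control falls back to a text field, matching the example above.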

Comment 2 Henri Sivonen 2009-09-04 11:05:20 CEST

XML makes processing the external entities optional. Validator.nu defaults to not processing external entities, because browsers don't process external entities (except abridged fake versions in some cases). Defaulting to processing external entities would hide problems that arise when sending XML files to browsers. When external entities aren't processed, the XML processor doesn't know what attributes are NMTOKEN attributes. (This demonstrates how optional features on such a fundamental level as parsing are a bad idea.)
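
The dependence on seeing the declarations can be illustrated with Python's expat-based ElementTree parser. This is only a sketch under stated assumptions: the ATTLIST is placed in the internal subset as a stand-in for an external DTD (which this parser does not fetch), and the element names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Same element, once with the attribute type declared and once without.
WITH_DECL = """<!DOCTYPE doc [
  <!ATTLIST input type (checkbox|text) #IMPLIED>
]>
<doc><input type=" checkbox "/></doc>"""

WITHOUT_DECL = """<doc><input type=" checkbox "/></doc>"""


def type_value(xml_text):
    """Return the type attribute as reported by the parser."""
    root = ET.fromstring(xml_text)
    return root.find("input").get("type")


print(repr(type_value(WITH_DECL)))
print(repr(type_value(WITHOUT_DECL)))
```

When the parser has seen the enumerated declaration it should report the trimmed value; without it the attribute is treated as CDATA and the spaces survive, so two conforming processors can hand different values to the application.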

To see the problem Simon described, I suggest changing the test case to say type=" checkbox" instead of type=" text".

You can configure Validator.nu to load external entities (without performing DTD-based validation) by choosing a different setting from the Parser menu:
http://validator.nu/?parser=xmldtd

Now, validating gives a different error:
http://validator.nu/?doc=http%3A%2F%2Fwww.ltg.ed.ac.uk%2F~ht%2Fwhitened-attr.xml&parser=xmldtd

The DTD causes a shape attribute to be defaulted on <a> elements. The schema is chosen from the namespace (since dispatching on doctype would be a layering violation), so it defaults to XHTML5+ARIA, SVG 1.1 plus MathML 2.0 (experimental), and XHTML5 doesn't allow a shape attribute on the <a> element.

You can, however, also override the schema manually:
http://validator.nu/?doc=http%3A%2F%2Fwww.ltg.ed.ac.uk%2F~ht%2Fwhitened-attr.xml&schema=http%3A%2F%2Fs.validator.nu%2Fxhtml10%2Fxhtml-transitional.rnc+http%3A%2F%2Fs.validator.nu%2Fxhtml10%2Fxhtml.sch++http%3A%2F%2Fc.validator.nu%2Fall-html4%2F&parser=xmldtd
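
For reference, URLs like the ones above can be assembled programmatically. The sketch below uses only the query parameters that appear in this report (doc, parser, schema); the helper name is hypothetical, and joining several schema URLs with a single space is an assumption based on how the URL above is encoded.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def validator_url(doc, parser=None, schemas=None):
    """Build a validator.nu query URL from the parameters seen in this bug."""
    params = {"doc": doc}
    if schemas:
        # Assumption: multiple schemas are space-separated in one value.
        params["schema"] = " ".join(schemas)
    if parser:
        params["parser"] = parser
    return "http://validator.nu/?" + urlencode(params)


url = validator_url(
    "http://www.ltg.ed.ac.uk/~ht/whitened-attr.xml",
    parser="xmldtd",
    schemas=[
        "http://s.validator.nu/xhtml10/xhtml-transitional.rnc",
        "http://s.validator.nu/xhtml10/xhtml.sch",
        "http://c.validator.nu/all-html4/",
    ],
)
print(url)
# Decode the query again to check that the parameters round-trip.
print(parse_qs(urlsplit(url).query))
```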

HTML5 doesn't mandate whitespace normalization, because HTML5 tries to align with existing browser behavior for features that aren't new, and browsers don't perform whitespace normalization here.

Comment 3 Henry S. Thompson 2009-09-04 14:48:40 CEST

Thanks for the clarifications.  Learn something new every day -- it had simply never occurred to me that existing browsers wouldn't follow the HTML 4.01 spec (which I agree only _allows_ whitespace normalization), or that they wouldn't choose to stay as compatible as possible with generic XML tools by including attribute types in the "abridged fake versions" of the DOCTYPE they use.

Not a big deal, since the whitespace in question is unlikely to arise intentionally or accidentally...

Comment 4 Henri Sivonen 2009-09-04 15:12:36 CEST

(In reply to comment #3)
> wouldn't choose to stay
> as compatible as possible with generic XML tools by including attribute types
> in the "abridged fake versions" of the DOCTYPE they use.

I'd argue that making authors not include extra whitespace in attributes that take enumerated values leads to maximal compatibility with generic XML tools, since generic XML tools are permitted not to read external entities.

In fact, it's a problem for generic XML tools that the abridged DTDs include character entities, since things like &nbsp; can "work" in a browser but fail in generic XML tools that don't process external entities.
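
The &nbsp; problem can be reproduced with a generic XML parser that does not read the external DTD. The sketch below uses Python's ElementTree as one such tool; the helper name and the sample markup are illustrative only.

```python
import xml.etree.ElementTree as ET

XHTML_DOCTYPE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
)


def parses(body):
    """Report whether a paragraph with this body parses without the DTD."""
    doc = (
        XHTML_DOCTYPE
        + '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>'
        + body
        + "</p></body></html>"
    )
    try:
        ET.fromstring(doc)  # the external DTD is not fetched
        return True
    except ET.ParseError:
        return False


print(parses("a&nbsp;b"))   # named entity defined only in the external DTD
print(parses("a&#160;b"))   # numeric character reference needs no DTD
```

The numeric reference is handled by any XML processor, while the named entity is only known to tools that load the DTD or ship their own entity tables.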