Bug 666 – Different resolution of invalid attribute names

Summary:	Different resolution of invalid attribute names

Status:	RESOLVED INTENTIONAL

Product:	Validator.nu
Classification:	Unclassified
Component:	HTML parser
Version:	HEAD
Hardware:	All All

Importance:	P2 normal
Assigned To:	Henri Sivonen

URL:

Depends on:
Blocks:

Reported:	2009-10-06 02:33 CEST by Pavol Vaskovic
Modified:	2009-10-06 10:17 CEST (History)
CC List:	0 users

See Also:


Description Pavol Vaskovic 2009-10-06 02:33:21 CEST

I'm parsing invalid HTML in ALTER_INFOSET mode, relying on the parser to fix up the input into well-formed XML. Upgrading from nu.validation.htmlparser 1.2.0 to 1.2.1 changed the parser's fix-up behavior. I'm not sure whether the change is a bug fix or a new bug in 1.2.1.

My test input is:

<meta http-equiv="Content-Type" content="text/html; charset=windows-1250">
<h1>hlavička</h1><br>
<a class=lnm href=hladaj_osoba.asp?PR=Friča, vznik funkcie:22.08.2001 zánik 07.07.2003&MENO=Štefan&SID=0&T=f0&R=1>

Output in version 1.2.0 was:

<body xmlns="http://www.w3.org/1999/xhtml">
  <h1 xmlns="http://www.w3.org/1999/xhtml">hlavička</h1>
  <br xmlns="http://www.w3.org/1999/xhtml"/>
  <a U0003007.07.2003U000260meno="Štefan&amp;SID=0&amp;T=f0&amp;R=1" class="lnm" funkcieU0003A022.08.2001="" href="hladaj_osoba.asp?PR=Friča," vznik="" zánik="" xmlns="http://www.w3.org/1999/xhtml"/>
</body>

Output in version 1.2.1 was:
<body xmlns="http://www.w3.org/1999/xhtml">
  <h1 xmlns="http://www.w3.org/1999/xhtml">hlavička</h1>
  <br xmlns="http://www.w3.org/1999/xhtml"/>
  <a U0000307.07.2003U000026meno="Štefan&amp;SID=0&amp;T=f0&amp;R=1" class="lnm" funkcieU00003A22.08.2001="" href="hladaj_osoba.asp?PR=Friča," vznik="" zánik="" xmlns="http://www.w3.org/1999/xhtml"/>
</body>

The difference is in how attribute names containing characters that are not allowed in an XML name (a leading digit, ':' or '&' here) are converted to valid QNames.

funkcieU0003A022.08.2001="" in 1.2.0
funkcieU00003A22.08.2001="" in 1.2.1

U0003007.07.2003U000260meno in 1.2.0
U0000307.07.2003U000026meno in 1.2.1

I was unable to locate where this fix-up happens; I got lost inside the Tokenizer. I also don't know what the correct behavior is. Is this a bug?

Comment 1 Pavol Vaskovic 2009-10-06 10:17:34 CEST

Now I see that the values from version 1.2.1 are Unicode escapes of the offending characters in the input: ':' is U+003A, '&' is U+0026 and the leading '0' is U+0030. So this was apparently a bug fix. Sorry for the spam. :-)
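
For anyone comparing the two outputs, here is a minimal Java sketch that reproduces the 1.2.1 attribute names from this report. It is not the parser's actual code. The class and method names (NameEscapeSketch, escapeName, isNameStart, isNameChar) are made up for illustration, and the name-character checks only roughly approximate the XML NCName rules. The assumption, read off the 1.2.1 output above, is that each offending character is replaced by 'U' followed by its code point as six zero-padded uppercase hex digits.

```java
final class NameEscapeSketch {
    // Rough approximation of an NCName start character (assumption, not the full XML rule).
    static boolean isNameStart(int cp) {
        return cp == '_' || Character.isLetter(cp);
    }

    // Rough approximation of an NCName continuation character.
    static boolean isNameChar(int cp) {
        return isNameStart(cp) || Character.isDigit(cp) || cp == '.' || cp == '-';
    }

    // Replaces every character that is not allowed at its position with
    // 'U' + six-digit uppercase hex code point, as seen in the 1.2.1 output.
    static String escapeName(String name) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < name.length()) {
            int cp = name.codePointAt(i);
            boolean allowed = (i == 0) ? isNameStart(cp) : isNameChar(cp);
            if (allowed) {
                sb.appendCodePoint(cp);
            } else {
                sb.append(String.format("U%06X", cp));
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Attribute names as the tokenizer sees them (already lowercased).
        System.out.println(escapeName("funkcie:22.08.2001")); // funkcieU00003A22.08.2001
        System.out.println(escapeName("07.07.2003&meno"));    // U0000307.07.2003U000026meno
    }
}
```

Read this way, U0000307.07.2003U000026meno is U000030 (the leading '0') + 7.07.2003 + U000026 ('&') + meno. The 1.2.0 names do not decode to the input characters under the same scheme, which fits the conclusion that 1.2.1 fixed the escaping.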