Bug 666 - Different resolution of invalid attribute names
Classification: Unclassified
Component: HTML parser
Hardware: All All
Importance: P2 normal
Assigned To: Henri Sivonen
Reported: 2009-10-06 02:33 CEST by Pavol Vaskovic
Modified: 2009-10-06 10:17 CEST (History)
Description Pavol Vaskovic 2009-10-06 02:33:21 CEST
I'm parsing invalid HTML in ALTER_INFOSET mode, relying on the parser to fix up the input into well-formed XML. Upgrading nu.validator.htmlparser from version 1.2.0 to 1.2.1 changed the parser's fix-up behavior. I'm not sure whether this was a bug fix or is a new bug in 1.2.1.

My test input is:

<meta http-equiv="Content-Type" content="text/html; charset=windows-1250">
<a class=lnm href=hladaj_osoba.asp?PR=Friča, vznik funkcie:22.08.2001 zánik 07.07.2003&MENO=Štefan&SID=0&T=f0&R=1>

Output in version 1.2.0 was:

<body xmlns="">
  <h1 xmlns="">hlavička</h1>
  <br xmlns=""/>
  <a U0003007.07.2003U000260meno="Štefan&amp;SID=0&amp;T=f0&amp;R=1" class="lnm" funkcieU0003A022.08.2001="" href="hladaj_osoba.asp?PR=Friča," vznik="" zánik="" xmlns=""/>

Output in version 1.2.1 was:
<body xmlns="">
  <h1 xmlns="">hlavička</h1>
  <br xmlns=""/>
  <a U0000307.07.2003U000026meno="Štefan&amp;SID=0&amp;T=f0&amp;R=1" class="lnm" funkcieU00003A22.08.2001="" href="hladaj_osoba.asp?PR=Friča," vznik="" zánik="" xmlns=""/>

The difference is in how attribute names that start with a digit or contain other invalid characters are converted to valid QNames.

funkcieU0003A022.08.2001="" in 1.2.0
funkcieU00003A22.08.2001="" in 1.2.1

U0003007.07.2003U000260meno in 1.2.0
U0000307.07.2003U000026meno in 1.2.1

I was unable to locate where this fix-up happens... I got lost inside the Tokenizer. I also don't know what the correct behavior is. Is this a bug?
Comment 1 Pavol Vaskovic 2009-10-06 10:17:34 CEST
Now I see that the values from version 1.2.1 are Unicode escapes of the offending characters from the input. So this was apparently a bug fix. Sorry for the spam. :-)
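
For illustration, here is a minimal sketch of the kind of escaping the 1.2.1 output suggests: each character that is not allowed at its position in an XML name is replaced by "U" followed by its code point as six uppercase hex digits. This is not the parser's actual code. The class and method names (NameEscapeSketch, escapeName, isNameStart, isNameTrail) are hypothetical, and the validity checks are a rough approximation of the real NCName rules.

```java
final class NameEscapeSketch {
    // Rough stand-in for the XML NCName start-character rule (assumption).
    static boolean isNameStart(int cp) {
        return cp == '_' || Character.isLetter(cp);
    }

    // Rough stand-in for the NCName rule for non-initial characters (assumption).
    static boolean isNameTrail(int cp) {
        return isNameStart(cp) || Character.isDigit(cp) || cp == '-' || cp == '.';
    }

    // Replace every character that is invalid at its position with
    // 'U' plus the code point as six zero-padded uppercase hex digits.
    static String escapeName(String name) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < name.length(); ) {
            int cp = name.codePointAt(i);
            boolean ok = (i == 0) ? isNameStart(cp) : isNameTrail(cp);
            if (ok) {
                sb.appendCodePoint(cp);
            } else {
                sb.append(String.format("U%06X", cp));
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // ':' is U+003A
        System.out.println(escapeName("funkcie:22.08.2001"));
        // leading '0' is U+0030, '&' is U+0026; the name is already lowercased here
        System.out.println(escapeName("07.07.2003&meno"));
    }
}
```

With these inputs the sketch prints funkcieU00003A22.08.2001 and U0000307.07.2003U000026meno, matching the 1.2.1 attribute names above. The 1.2.0 names carry the same hex digits shifted one place to the left (0003A0 instead of 00003A), which is why they do not decode to the original characters.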