Bugzilla – Bug 666
Different resolution of invalid attribute names
Last modified: 2009-10-06 10:17:34 CEST
I'm parsing invalid HTML in ALTER_INFOSET mode, relying on the parser to fix the input up into well-formed XML. Upgrading from nu.validation.htmlparser 1.2.0 to version 1.2.1 changed the parser's fix-up behavior. I'm not sure whether this was a bug fix or is a new bug in 1.2.1.

My test input is:

    <meta http-equiv="Content-Type" content="text/html; charset=windows-1250">
    <h1>hlavička</h1><br>
    <a class=lnm href=hladaj_osoba.asp?PR=Friča, vznik funkcie:22.08.2001 zánik 07.07.2003&MENO=Štefan&SID=0&T=f0&R=1>

Output in version 1.2.0 was:

    <body xmlns="http://www.w3.org/1999/xhtml">
    <h1 xmlns="http://www.w3.org/1999/xhtml">hlavička</h1>
    <br xmlns="http://www.w3.org/1999/xhtml"/>
    <a U0003007.07.2003U000260meno="Štefan&SID=0&T=f0&R=1" class="lnm" funkcieU0003A022.08.2001="" href="hladaj_osoba.asp?PR=Friča," vznik="" zánik="" xmlns="http://www.w3.org/1999/xhtml"/>
    </body>

Output in version 1.2.1 was:

    <body xmlns="http://www.w3.org/1999/xhtml">
    <h1 xmlns="http://www.w3.org/1999/xhtml">hlavička</h1>
    <br xmlns="http://www.w3.org/1999/xhtml"/>
    <a U0000307.07.2003U000026meno="Štefan&SID=0&T=f0&R=1" class="lnm" funkcieU00003A22.08.2001="" href="hladaj_osoba.asp?PR=Friča," vznik="" zánik="" xmlns="http://www.w3.org/1999/xhtml"/>
    </body>

The difference is in how attribute names that start with a digit or contain other invalid characters (here ':' and '&') are converted to valid QNames:

    funkcieU0003A022.08.2001=""    in 1.2.0
    funkcieU00003A22.08.2001=""    in 1.2.1

    U0003007.07.2003U000260meno    in 1.2.0
    U0000307.07.2003U000026meno    in 1.2.1

I was unable to locate where this fix-up happens; I got lost inside the Tokenizer. I also don't know what the correct behavior is. Is this a bug?
Now I see that the values from version 1.2.1 are Unicode escapes of the offending characters from the input. So this was apparently a bug fix. Sorry for the spam. :-)
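
For anyone comparing the two outputs later, the 1.2.1 names are consistent with replacing each offending character by "U" followed by its code point as six uppercase hex digits. The sketch below reproduces that mapping for the two attribute names from this report. It is not the parser's actual code: the function name escape_attr_name is hypothetical, and the validity check is a simplified stand-in for the real XML NCName rules.

```python
def escape_attr_name(name: str) -> str:
    """Replace characters that are not allowed at their position in an
    XML name with 'U' + the code point as six uppercase hex digits.

    Assumption: the name-character test below is a simplification of the
    XML NCName production, good enough for the names in this report.
    """
    out = []
    for i, ch in enumerate(name):
        # Letters (including non-ASCII ones such as 'á') and '_' may start a name.
        ok_start = ch.isalpha() or ch == "_"
        # Digits, '-' and '.' are allowed only after the first character.
        ok_rest = ok_start or ch.isdigit() or ch in "-."
        if (i == 0 and ok_start) or (i > 0 and ok_rest):
            out.append(ch)
        else:
            # Zero-pad on the left to six hex digits, e.g. ':' -> U00003A.
            out.append("U%06X" % ord(ch))
    return "".join(out)


if __name__ == "__main__":
    for raw in ("funkcie:22.08.2001", "07.07.2003&meno", "zánik"):
        print(raw, "->", escape_attr_name(raw))
```

With this mapping, funkcie:22.08.2001 becomes funkcieU00003A22.08.2001 and 07.07.2003&meno becomes U0000307.07.2003U000026meno, matching the 1.2.1 output. The 1.2.0 names (U0003A0, U000300, U000260) do not correspond to the code points of ':', '0' and '&' under this scheme.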