Bugzilla – Bug 489
version 1.2 choking on ID attribute
Last modified: 2009-04-24 15:27:13 CEST
I just upgraded from 1.0.7 to 1.2.0 of the Java API, and I can't get it to parse HTML containing any ID attributes. Here's a simple unit test I wrote:

public class NuValidatorBrokenTest extends TestCase {
  public void testVersion12Broken() throws Exception {
    StringBuilder html = new StringBuilder();
    html
        .append("<!DOCTYPE HTML PUBLIC")
        .append(" \"-//W3C//DTD HTML 4.01 Transitional//EN\"")
        .append(" \"http://www.w3.org/TR/html4/loose.dtd\">")
        .append("<html><head>")
        .append("<meta http-equiv=\"content-type\"")
        .append(" content=\"text/html; charset=UTF-8\">")
        .append("</head><body>")
        .append("<span id=\"foo\">bar</span>.")
        .append("</body></html>")
        ;
    InputStream in = new ByteArrayInputStream(
        html.toString().getBytes(Charsets.UTF_8));
    HtmlDocumentBuilder validatorBuilder = new HtmlDocumentBuilder(
        nu.validator.htmlparser.common.XmlViolationPolicy.ALLOW);
    validatorBuilder.parse(new InputSource(in));
  }
}

I also tried with different permutations of:
- omitting the HTML, HEAD, and BODY tags
- different settings for errorHandler, doctypeExpectation, and XmlViolationPolicy
- parseFragment() instead of parse()

To no avail, I always get this error:

org.w3c.dom.DOMException: NOT_FOUND_ERR: An attempt is made to reference a node in a context where it does not exist.
  at nu.validator.htmlparser.impl.TreeBuilder.fatal(TreeBuilder.java:424)
  at nu.validator.htmlparser.dom.DOMTreeBuilder.createElement(DOMTreeBuilder.java:161)
  at nu.validator.htmlparser.dom.DOMTreeBuilder.createElement(DOMTreeBuilder.java:44)
  at nu.validator.htmlparser.impl.TreeBuilder.appendToCurrentNodeAndPushElementMayFoster(TreeBuilder.java:4580)
  at nu.validator.htmlparser.impl.TreeBuilder.startTag(TreeBuilder.java:2208)
  at nu.validator.htmlparser.impl.Tokenizer.emitCurrentTagToken(Tokenizer.java:1309)
  at nu.validator.htmlparser.impl.Tokenizer.stateLoop(Tokenizer.java:2148)
  at nu.validator.htmlparser.impl.Tokenizer.tokenizeBuffer(Tokenizer.java:1518)
  at nu.validator.htmlparser.io.Driver.runStates(Driver.java:301)
  at nu.validator.htmlparser.io.Driver.tokenize(Driver.java:217)
  at nu.validator.htmlparser.dom.HtmlDocumentBuilder.tokenize(HtmlDocumentBuilder.java:405)
  at nu.validator.htmlparser.dom.HtmlDocumentBuilder.parse(HtmlDocumentBuilder.java:204)
  at com.google.rewritely.htmlsecurity.NuValidatorBrokenTest.testVersion12Broken(NuValidatorBrokenTest.java:37)
I should add, if I remove the id="foo" attribute, it works correctly.
It also works fine for any attributes other than ID.
Which DOM implementation are you using? Your test case works for me with Xerces 2.6.2 DOM implementation.

This program:

public class NuValidatorBrokenTest {
  public static void main(String[] args) throws SAXException, IOException {
    System.out.println(Version.getVersion());
    StringBuilder html = new StringBuilder();
    html
        .append("<!DOCTYPE HTML PUBLIC")
        .append(" \"-//W3C//DTD HTML 4.01 Transitional//EN\"")
        .append(" \"http://www.w3.org/TR/html4/loose.dtd\">")
        .append("<html><head>")
        .append("<meta http-equiv=\"content-type\"")
        .append(" content=\"text/html; charset=UTF-8\">")
        .append("</head><body>")
        .append("<span id=\"foo\">bar</span>.")
        .append("</body></html>")
        ;
    InputStream in = new ByteArrayInputStream(
        html.toString().getBytes("UTF-8"));
    HtmlDocumentBuilder validatorBuilder = new HtmlDocumentBuilder(
        nu.validator.htmlparser.common.XmlViolationPolicy.ALLOW);
    Document doc = validatorBuilder.parse(new InputSource(in));
    System.out.println(doc.getElementById("foo").getLocalName());
    System.out.println("end");
  }
}

Prints:

Xerces-J 2.6.2
span
end
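For anyone unsure how to answer the question above, here is a minimal sketch of one way to see which JAXP factory and DOM implementation are picked up at runtime. It uses only standard javax.xml.parsers and org.w3c.dom calls; the class name DomImplementationProbe and the method describe() are made up for illustration, and it assumes (not confirmed in this thread) that the parser gets its DOM from the default JAXP DocumentBuilderFactory.

```java
import java.security.CodeSource;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMImplementation;

// Hypothetical helper, not part of the htmlparser API.
class DomImplementationProbe {

    /** Returns the factory class, DOM implementation class, and where the latter was loaded from. */
    static String describe() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DOMImplementation impl = factory.newDocumentBuilder().getDOMImplementation();
            // CodeSource is null for classes that ship with the JDK itself.
            CodeSource source = impl.getClass().getProtectionDomain().getCodeSource();
            String location = (source == null || source.getLocation() == null)
                    ? "(bundled with the JDK)"
                    : source.getLocation().toString();
            return "factory=" + factory.getClass().getName()
                    + ", dom=" + impl.getClass().getName()
                    + ", loadedFrom=" + location;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If a standalone Xerces jar is on the classpath, the loadedFrom part should point at that jar, which helps tell a bundled DOM apart from a separately installed one.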
Xerces Version = Xerces-J 2.9.1

Also, if I run this same code with nu.validator 1.0.7, I don't get the exception, but getElementById("foo") returns null.
1.0.7 didn't assign DOM IDness to id. 1.2.0 does.
Fixed in SVN. The problem was that the newer Xerces no longer treats "" as equivalent to null for the namespace argument of setIdAttributeNS().
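To illustrate the kind of workaround this implies, here is a minimal sketch that maps an empty namespace string to null before calling the standard DOM methods setAttributeNS() and setIdAttributeNS(). It is not the actual change committed to SVN; the class IdAttributeSketch and its helper methods are hypothetical, and it runs against whatever DOM implementation JAXP provides rather than against the htmlparser tree builder.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical illustration, not the code from the htmlparser repository.
class IdAttributeSketch {

    /** Maps the empty string to null, which is how the DOM API spells "no namespace". */
    static String normalizeNamespace(String namespaceUri) {
        return (namespaceUri == null || namespaceUri.isEmpty()) ? null : namespaceUri;
    }

    /** Builds a document containing <span id="foo"> and marks id as an ID attribute. */
    static Document buildDocumentWithId(String attributeNamespace) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().newDocument();

            Element span = doc.createElementNS("http://www.w3.org/1999/xhtml", "span");
            doc.appendChild(span);

            // Use the same normalized namespace for both calls so the lookup
            // inside setIdAttributeNS() finds the attribute that was just set.
            String ns = normalizeNamespace(attributeNamespace);
            span.setAttributeNS(ns, "id", "foo");
            span.setIdAttributeNS(ns, "id", true);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Passing "" here is safe because it is normalized to null first.
        Document doc = buildDocumentWithId("");
        System.out.println(doc.getElementById("foo").getLocalName());
    }
}
```

Normalizing at the call site keeps the code independent of whether a particular Xerces version happens to accept "" as "no namespace".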