Bug 489 – version 1.2 choking on ID attribute

NOTE: The current preferred location for bug reports is the GitHub issue tracker.

Summary:	version 1.2 choking on ID attribute

Status:	RESOLVED FIXED

Product:	Validator.nu
Classification:	Unclassified
Component:	HTML parser
Version:	HEAD
Hardware:	All All

Importance:	P2 major
Assigned To:	Henri Sivonen

Reported:	2009-04-18 01:00 CEST by Philip Tucker
Modified:	2009-04-24 15:27 CEST
CC List:	0 users

Description Philip Tucker 2009-04-18 01:00:45 CEST

I just upgraded from 1.0.7 to 1.2.0 of the Java API, and I can't get it to parse HTML containing any ID attributes. Here's a simple unit test I wrote:

// Imports are not shown in the report; these are the presumed ones.
import java.io.ByteArrayInputStream;
import java.io.InputStream;

import com.google.common.base.Charsets;  // presumably Guava's Charsets
import junit.framework.TestCase;
import nu.validator.htmlparser.dom.HtmlDocumentBuilder;
import org.xml.sax.InputSource;

public class NuValidatorBrokenTest extends TestCase {
  public void testVersion12Broken() throws Exception {
    StringBuilder html = new StringBuilder();
    html
    .append("<!DOCTYPE HTML PUBLIC")
    .append(" \"-//W3C//DTD HTML 4.01 Transitional//EN\"")
    .append(" \"http://www.w3.org/TR/html4/loose.dtd\">")
    .append("<html><head>")
    .append("<meta http-equiv=\"content-type\"")
    .append(" content=\"text/html; charset=UTF-8\">")
    .append("</head><body>")
    .append("<span id=\"foo\">bar</span>.")
    .append("</body></html>")
    ;
    InputStream in = new ByteArrayInputStream(
        html.toString().getBytes(Charsets.UTF_8));
    HtmlDocumentBuilder validatorBuilder = new HtmlDocumentBuilder(
        nu.validator.htmlparser.common.XmlViolationPolicy.ALLOW);
    // The DOMException below is thrown from this call.
    validatorBuilder.parse(new InputSource(in));
  }
}

I also tried different permutations of:
- omitting the HTML, HEAD, and BODY tags
- different settings for errorHandler, doctypeExpectation, and XmlViolationPolicy
- parseFragment() instead of parse()

All to no avail; I always get this error:

org.w3c.dom.DOMException: NOT_FOUND_ERR: An attempt is made to reference a node in a context where it does not exist.
        at nu.validator.htmlparser.impl.TreeBuilder.fatal(TreeBuilder.java:424)
        at nu.validator.htmlparser.dom.DOMTreeBuilder.createElement(DOMTreeBuilder.java:161)
        at nu.validator.htmlparser.dom.DOMTreeBuilder.createElement(DOMTreeBuilder.java:44)
        at nu.validator.htmlparser.impl.TreeBuilder.appendToCurrentNodeAndPushElementMayFoster(TreeBuilder.java:4580)
        at nu.validator.htmlparser.impl.TreeBuilder.startTag(TreeBuilder.java:2208)
        at nu.validator.htmlparser.impl.Tokenizer.emitCurrentTagToken(Tokenizer.java:1309)
        at nu.validator.htmlparser.impl.Tokenizer.stateLoop(Tokenizer.java:2148)
        at nu.validator.htmlparser.impl.Tokenizer.tokenizeBuffer(Tokenizer.java:1518)
        at nu.validator.htmlparser.io.Driver.runStates(Driver.java:301)
        at nu.validator.htmlparser.io.Driver.tokenize(Driver.java:217)
        at nu.validator.htmlparser.dom.HtmlDocumentBuilder.tokenize(HtmlDocumentBuilder.java:405)
        at nu.validator.htmlparser.dom.HtmlDocumentBuilder.parse(HtmlDocumentBuilder.java:204)
        at com.google.rewritely.htmlsecurity.NuValidatorBrokenTest.testVersion12Broken(NuValidatorBrokenTest.java:37)

Comment 1 Philip Tucker 2009-04-18 01:01:43 CEST

I should add that if I remove the id="foo" attribute, it parses correctly.

(In reply to comment #0)

Comment 2 Philip Tucker 2009-04-18 01:21:26 CEST

It also works fine for any attributes other than ID.

(In reply to comment #0)

Comment 3 Henri Sivonen 2009-04-20 10:05:58 CEST

Which DOM implementation are you using? Your test case works for me with Xerces 2.6.2 DOM implementation.

This program:
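// Imports are not shown in the original comment; these are the presumed ones.
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

import nu.validator.htmlparser.dom.HtmlDocumentBuilder;
import org.apache.xerces.impl.Version;  // presumed source of Version.getVersion()
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class NuValidatorBrokenTest {
  public static void main(String[] args) throws SAXException, IOException {
    // Print the Xerces version in use.
    System.out.println(Version.getVersion());
    StringBuilder html = new StringBuilder();
    html
    .append("<!DOCTYPE HTML PUBLIC")
    .append(" \"-//W3C//DTD HTML 4.01 Transitional//EN\"")
    .append(" \"http://www.w3.org/TR/html4/loose.dtd\">")
    .append("<html><head>")
    .append("<meta http-equiv=\"content-type\"")
    .append(" content=\"text/html; charset=UTF-8\">")
    .append("</head><body>")
    .append("<span id=\"foo\">bar</span>.")
    .append("</body></html>")
    ;
    InputStream in = new ByteArrayInputStream(
        html.toString().getBytes("UTF-8"));
    HtmlDocumentBuilder validatorBuilder = new HtmlDocumentBuilder(
        nu.validator.htmlparser.common.XmlViolationPolicy.ALLOW);
    Document doc = validatorBuilder.parse(new InputSource(in));
    // Look the element up by ID to confirm that id was registered as an ID attribute.
    System.out.println(doc.getElementById("foo").getLocalName());
    System.out.println("end");
  }
}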

Prints:
Xerces-J 2.6.2
span
end

Comment 4 Philip Tucker 2009-04-21 21:48:15 CEST

Xerces Version = Xerces-J 2.9.1

Also, if I run this same code with nu.validator 1.0.7, I don't get the exception, but getElementById("foo") returns null.

(In reply to comment #3)

Comment 5 Henri Sivonen 2009-04-24 15:09:21 CEST

1.0.7 didn't assign DOM IDness to the id attribute, which is why getElementById("foo") returned null there. 1.2.0 does.
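
To illustrate what "DOM IDness" means here: in the standard org.w3c.dom API, getElementById() only finds an element once one of its attributes has been flagged as an ID, for example via setIdAttributeNS(). The sketch below uses only the JDK's bundled DOM implementation, not the htmlparser code, and the names IdnessSketch, newDocument and newSpanWithId are made up for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class IdnessSketch {
    // Creates an empty, namespace-aware DOM document using the JDK default implementation.
    static Document newDocument() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().newDocument();
    }

    // Hypothetical helper: adds a <span id="..."> as the root element.
    // The id attribute is a plain attribute at this point, not yet an ID.
    static Element newSpanWithId(Document doc, String id) {
        Element span = doc.createElementNS("http://www.w3.org/1999/xhtml", "span");
        span.setAttributeNS(null, "id", id);
        doc.appendChild(span);
        return span;
    }

    public static void main(String[] args) throws Exception {
        Document doc = newDocument();
        Element span = newSpanWithId(doc, "foo");

        // Without IDness, the lookup is expected to find nothing.
        System.out.println("before: " + doc.getElementById("foo"));

        // Flag the no-namespace attribute "id" as an ID attribute.
        span.setIdAttributeNS(null, "id", true);

        // With IDness assigned, the lookup is expected to find the span.
        System.out.println("after: " + doc.getElementById("foo"));
    }
}
```

The first lookup corresponds to the 1.0.7 behaviour described in comment 4; the second corresponds to what 1.2.0 aims for.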

Comment 6 Henri Sivonen 2009-04-24 15:27:13 CEST

Fixed in SVN. The problem was that the newer Xerces no longer treats "" as equivalent to null for the namespace argument of setIdAttributeNS().
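
One way a caller could guard against this kind of difference between DOM implementations is to normalize the empty string to null before calling setIdAttributeNS(). This is only a sketch of the idea, not the actual change committed to SVN (which this report does not show). IdNamespaceSketch, normalizeNamespace and markAsId are hypothetical names, and the example runs against the JDK's bundled DOM implementation rather than Xerces 2.9.1.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class IdNamespaceSketch {
    // Maps "" to null so that "no namespace" is always expressed as null.
    static String normalizeNamespace(String ns) {
        return (ns == null || ns.isEmpty()) ? null : ns;
    }

    // Hypothetical helper: flags an attribute as an ID, tolerating "" as the namespace.
    static void markAsId(Element element, String ns, String localName) {
        element.setIdAttributeNS(normalizeNamespace(ns), localName, true);
    }

    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().newDocument();

        Element span = doc.createElementNS("http://www.w3.org/1999/xhtml", "span");
        span.setAttributeNS(null, "id", "foo");
        doc.appendChild(span);

        // The caller passes "" here; the helper turns it into null before the DOM call.
        markAsId(span, "", "id");

        System.out.println(doc.getElementById("foo").getLocalName());
    }
}
```

Normalizing in one place keeps the rest of the tree-building code free to use either convention for "no namespace".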