Bug 850 – Tree traversal microdata checker

NOTE: The current preferred location for bug reports is the GitHub issue tracker.

Bug 850 - Tree traversal microdata checker


Summary:	Tree traversal microdata checker

Status:	RESOLVED FIXED

Product:	Validator.nu
Classification:	Unclassified
Component:	Non-schema checkers
Version:	HEAD
Hardware:	All All

Importance:	P2 normal
Assigned To:	Philip Jägenstedt

URL:

Depends on:
Blocks:
	Show dependency tree / graph

Reported:	2011-07-28 00:14 CEST by Philip Jägenstedt
Modified:	2011-08-01 07:07 CEST (History)
CC List:	2 users (show)

See Also:

Attachments
validator glue (1.52 KB, patch) 2011-07-28 00:18 CEST, Philip Jägenstedt	Details
MicrodataChecker (11.78 KB, patch) 2011-07-28 00:19 CEST, Philip Jägenstedt	Details
MicrodataChecker (prettier capitalization of itemFoo) (11.78 KB, patch) 2011-07-28 02:19 CEST, Philip Jägenstedt	Details
MicrodataChecker (purge tabs) (11.95 KB, patch) 2011-07-28 02:22 CEST, Philip Jägenstedt	Details
MicrodataChecker (use ArrayDeque instead of Stack, don't need thread safety) (11.99 KB, patch) 2011-07-28 04:07 CEST, Philip Jägenstedt	Details
no more TokenList (11.87 KB, patch) 2011-07-28 22:28 CEST, Philip Jägenstedt	Details
MicrodataChecker (purge tabs again) (11.88 KB, patch) 2011-07-28 23:30 CEST, Philip Jägenstedt	Details
MicrodataChecker (refactor) (11.98 KB, patch) 2011-07-29 03:50 CEST, Philip Jägenstedt	Details
MicrodataChecker (oops) (11.76 KB, patch) 2011-07-29 04:36 CEST, Philip Jägenstedt	Details
Show Obsolete (7) Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Philip Jägenstedt 2011-07-28 00:14:32 CEST

Adding support for the constraints that require tree traversal, all of which are due to itemref.

I tried adhere to the style of the existing code, including redundant braces in many places. I tried to figure out when yoda conditions [1] should be used, but failed and therefore made it inconsistent just like the pre-existing code ;)

Some error messages were very hard to write in a way that make sense, please be ruthless when reviewing that. Also, I failed to find consistency about when to use past and present tense, but tried to use past tense for what was in the markup, and present tense for conditions that failed.

I deliberately did not support these two conditions because they are weird:[2]

 - The itemref attribute, if specified, must have a value that is an unordered set of unique space-separated tokens that are case-sensitive, consisting of IDs of elements in the same home subtree.

 - The itemref attribute must not be specified on elements that do not have an itemscope attribute specified.

[1] http://stackoverflow.com/questions/2349378/new-programming-jargon-you-coined/2430307#2430307
[2] http://www.w3.org/Bugs/Public/show_bug.cgi?id=13370

Comment 1 Philip Jägenstedt 2011-07-28 00:18:52 CEST

Created attachment 201 [details]
validator glue

Comment 2 Philip Jägenstedt 2011-07-28 00:19:10 CEST

Created attachment 202 [details]
MicrodataChecker

Comment 3 Philip Jägenstedt 2011-07-28 00:34:02 CEST

Oops, the unsupported constrains are:

- If a property's value is an absolute URL, the property must be specified
using a URL property element.

- If a property's value represents a date, time, or global date and time, the
property must be specified using the datetime attribute of a time element.

Comment 4 Philip Jägenstedt 2011-07-28 02:19:32 CEST

Created attachment 203 [details]
MicrodataChecker (prettier capitalization of itemFoo)

Comment 5 Philip Jägenstedt 2011-07-28 02:22:25 CEST

Created attachment 204 [details]
MicrodataChecker (purge tabs)

Comment 6 Philip Jägenstedt 2011-07-28 04:07:12 CEST

Created attachment 205 [details]
MicrodataChecker (use ArrayDeque instead of Stack, don't need thread safety)

Comment 7 Henri Sivonen 2011-07-28 16:08:49 CEST

> I tried adhere to the style of the existing code, including redundant braces 
> in many places.

The V.nu style is to always have braces around loop and conditional blocks.

> I tried to figure out when yoda conditions [1] should be used, but
> failed and therefore made it inconsistent just like the pre-existing code ;)

Are there non-string Yoda conditions? With "foo".equals(bar), the Yoda condition deals nicely with bar being null. "foo" == bar comes out of the habit of doing .equals() string comparisons with the literal first. (Also, at one point when working on the parser code, I was assuming that string literals would translate to a special C++ class for Gecko and then the order would matter for operator overloading. That doesn't explain Yoda conditions outside the parser, though.)

> Also, I failed to find consistency about when to
> use past and present tense, but tried to use past tense for what was in the
> markup, and present tense for conditions that failed.

Ouch. I guess the existing messages suck. :-(
 
> I deliberately did not support these two conditions because they are weird:[2]
> 
>  - The itemref attribute, if specified, must have a value that is an unordered
> set of unique space-separated tokens that are case-sensitive, consisting of IDs
> of elements in the same home subtree.

What's weird about this requirement?

> +            String[] tokens = input.split("[ \\t\\n\\f\\r]+");

This ends up compiling a new regular expression every time. Please use org.whattf.checker.AttributeUtil.split() instead.

> +            for (int i = 0; i < tokens.length; i++)
> +                this.add(tokens[i]);

Please use braces around the loop block for code style consistency.

Also, is the LinkedList nature of TokenList needed anywhere? Why not just hold onto the String[] that split() returns? OTOH, since org.whattf.checker.AttributeUtil.split() uses a LinkedList() internally, maybe it would make sense to have a variant of org.whattf.checker.AttributeUtil.split() that returns the LinkedList() and use that directly without having a subclass, since the subclass doesn't appear to provide any methods anyway.

Would it work to factor out the LinkedList part of org.whattf.checker.AttributeUtil.split() and not have a specific class for TokenList?

> +        TokenList itemProp = null;

This TokenList is never traversed, is it? Is it there for future vocabulary validation or am I missing something?

> +                         checkItem(current, parents);

To my recollection, the HTML5 validation layer currently doesn't have recursive algorithms whose recursion depth is controlled by untrusted input, but I guess the recursion here is OK. If it becomes a Real Problem, it can be addressed then.

> +            err("The \u201Citemref\u201D attribute contained redundant references.", root.locator);

I was about to suggest different error text, but then I read the code again and realized my suggestion would have been wrong. I think this error message doesn't make it quite clear what the problem is, but I don't have better suggestions.

But where in the spec does it say that it's an error to have redundant references (other than duplicate tokens in one itemref attribute value)? That is, should this be just a warning (which would require a separate error for the case when one itemref attribute has duplicate tokens)?

Comment 8 Philip Jägenstedt 2011-07-28 16:43:36 CEST

(In reply to comment #7)

> > I tried to figure out when yoda conditions [1] should be used, but
> > failed and therefore made it inconsistent just like the pre-existing code ;)
> 
> Are there non-string Yoda conditions? With "foo".equals(bar), the Yoda
> condition deals nicely with bar being null. "foo" == bar comes out of the habit
> of doing .equals() string comparisons with the literal first. (Also, at one
> point when working on the parser code, I was assuming that string literals
> would translate to a special C++ class for Gecko and then the order would
> matter for operator overloading. That doesn't explain Yoda conditions outside
> the parser, though.)

No, I've only seen string Yoda conditions. If what I have now looks good to you, I'm happy as well.

> > I deliberately did not support these two conditions because they are weird:[2]
> > 
> >  - The itemref attribute, if specified, must have a value that is an unordered
> > set of unique space-separated tokens that are case-sensitive, consisting of IDs
> > of elements in the same home subtree.
> 
> What's weird about this requirement?

Sorry, I pasted the wrong constraints, see later comment.

> > +            String[] tokens = input.split("[ \\t\\n\\f\\r]+");
> 
> This ends up compiling a new regular expression every time. Please use
> org.whattf.checker.AttributeUtil.split() instead.

Will do.

> > +            for (int i = 0; i < tokens.length; i++)
> > +                this.add(tokens[i]);
> 
> Please use braces around the loop block for code style consistency.

Will do.

> Also, is the LinkedList nature of TokenList needed anywhere? Why not just hold
> onto the String[] that split() returns? OTOH, since
> org.whattf.checker.AttributeUtil.split() uses a LinkedList() internally, maybe
> it would make sense to have a variant of
> org.whattf.checker.AttributeUtil.split() that returns the LinkedList() and use
> that directly without having a subclass, since the subclass doesn't appear to
> provide any methods anyway.
> 
> Would it work to factor out the LinkedList part of
> org.whattf.checker.AttributeUtil.split() and not have a specific class for
> TokenList?

The Collection-ness is used in for (String id : root.itemRef). I'll look at using AttributeUtil instead of TokenList, it should be possible.

> > +        TokenList itemProp = null;
> 
> This TokenList is never traversed, is it? Is it there for future vocabulary
> validation or am I missing something?

Oops, it should actually be used to check itemProp.length>0 before recursing into sub-items, as itemprop="" doesn't actually add a property. It'll also be used when/if I add full support for collecting property names and values.

> > +                         checkItem(current, parents);
> 
> To my recollection, the HTML5 validation layer currently doesn't have recursive
> algorithms whose recursion depth is controlled by untrusted input, but I guess
> the recursion here is OK. If it becomes a Real Problem, it can be addressed
> then.

Flattening this out would probably make it hard to follow given that the tree traversal is also flattened. Will fix if/when someone finds a real-world example that causes a stack overflow.

> > +            err("The \u201Citemref\u201D attribute contained redundant references.", root.locator);
> 
> I was about to suggest different error text, but then I read the code again and
> realized my suggestion would have been wrong. I think this error message
> doesn't make it quite clear what the problem is, but I don't have better
> suggestions.
>
> But where in the spec does it say that it's an error to have redundant
> references (other than duplicate tokens in one itemref attribute value)? That
> is, should this be just a warning (which would require a separate error for the
> case when one itemref attribute has duplicate tokens)?

The problem is that there are two things that can go wrong here:

1. items that try to include themselves: <div itemscope id=x itemref=x>
2. redundant use of itemref: <div itemscope itemref="x y"> where x and y have a parent-child relationship or are the same.

Both of these are just called "microdata error" in the spec. I'm not sure how to phrase the error, but pointing squarely to itemref is certainly part of it.

The spec special-cases root because we (Opera) implemented it in a way that never always stop traversing at the root, for simplicity. Do you have any opinions? I'm sure both Opera and the spec could change if we can come up with something better.

Comment 9 Henri Sivonen 2011-07-28 17:16:29 CEST

> Both of these are just called "microdata error" in the spec.

Oh OK. I missed the part that gave algorithmically phrased conformance requirements.

Comment 10 Philip Jägenstedt 2011-07-28 22:28:00 CEST

Created attachment 206 [details]
no more TokenList

It turns out that for (String id : itemRef) works even when itemRef is String[], you learn something new every day...

Comment 11 Philip Jägenstedt 2011-07-28 23:30:27 CEST

Created attachment 207 [details]
MicrodataChecker (purge tabs again)

Dammit, with no mode line Emacs just won't behave!

Comment 12 Philip Jägenstedt 2011-07-29 03:50:31 CEST

Created attachment 208 [details]
MicrodataChecker (refactor)

Slightly refactored the ElementStack mechanism. It makes more sense this way with coming changes to collect textContent for itemValue. Sorry for the patch flood :-/

Comment 13 Philip Jägenstedt 2011-07-29 04:30:01 CEST

(In reply to comment #8)
> (In reply to comment #7)
> > > +        TokenList itemProp = null;
> > 
> > This TokenList is never traversed, is it? Is it there for future vocabulary
> > validation or am I missing something?
> 
> Oops, it should actually be used to check itemProp.length>0 before recursing
> into sub-items, as itemprop="" doesn't actually add a property. It'll also be
> used when/if I add full support for collecting property names and values.

Strike that, that's not what the spec actually says. For now, it could be replaced with a boolean attribute, but I won't unless you ask me to.

Comment 14 Philip Jägenstedt 2011-07-29 04:36:55 CEST

Created attachment 209 [details]
MicrodataChecker (oops)

Correct the brain malfunction, itemProp does not need any special treatment for length==0

Comment 15 Henri Sivonen 2011-07-29 17:30:28 CEST

Awesome. Thank you.

Pushed:
https://bitbucket.org/validator/syntax/changeset/c0ce11b9a706
https://bitbucket.org/validator/validator/changeset/0a49bcad4ce5

Also deployed to Validator.nu.