Bugzilla – Bug 841
Allow href='http://example.org/demo?id=hello&copy=1&world=fun' (as per the updated definition of “ambiguous ampersand”)
Last modified: 2015-04-13 19:26:24 CEST
Match the spec regarding ambiguous ampersands for attribute values as well. IRC discussion: http://krijnhoetmer.nl/irc-logs/whatwg/20110614#l-618
8.1.2.3 says: "Attribute values are a mixture of text and character references, except with the additional restriction that the text cannot contain an ambiguous ampersand." <http://dev.w3.org/html5/spec/Overview.html#attributes-0> How does that allow ambiguous ampersands?
It doesn't. The definition of ambiguous ampersands has changed since it was implemented in v.nu though.
Clarifying... So the ampersand in &world=fun is not ambiguous. When did this change? I recall this was discussed in the WG and rejected at some point; thus I'm surprised it's now allowed.
(In reply to comment #3)
> Clarifying... So the ampersand in
>
> &world=fun
>
> is not ambiguous.

Indeed. http://www.whatwg.org/specs/web-apps/current-work/complete/syntax.html#syntax-ambiguous-ampersand says:

> An ambiguous ampersand is a U+0026 AMPERSAND character (&) that is
> followed by one or more characters in the range U+0030 DIGIT ZERO (0)
> to U+0039 DIGIT NINE (9), U+0061 LATIN SMALL LETTER A to U+007A LATIN
> SMALL LETTER Z, and U+0041 LATIN CAPITAL LETTER A to U+005A LATIN
> CAPITAL LETTER Z, followed by a U+003B SEMICOLON character (;), where
> these characters do not match any of the names given in the named
> character references section.

Because of the “followed by a U+003B SEMICOLON character” part in the above definition, the ampersand in your example is not ambiguous.
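To make the quoted definition concrete, here is a minimal Python sketch that flags ambiguous ampersands in a string. It assumes that Python's standard `html.entities.html5` table is an adequate stand-in for the spec's named character references section. The function name `ambiguous_ampersands` is made up for illustration and is not how v.nu implements the check.

```python
import re
from html.entities import html5  # keys are names such as "amp;" and legacy "amp"

# "&", one or more ASCII alphanumerics, then ";": the shape named in the quoted definition.
_CANDIDATE = re.compile(r"&([0-9A-Za-z]+);")


def ambiguous_ampersands(text):
    """Return the substrings of text that are ambiguous ampersands per the quoted definition."""
    return [
        match.group(0)
        for match in _CANDIDATE.finditer(text)
        # Ambiguous only if the name (with its semicolon) is not a known named reference.
        if match.group(1) + ";" not in html5
    ]


# Neither "&copy=1" nor "&world=fun" ends in ";", so nothing is flagged.
print(ambiguous_ampersands("http://example.org/demo?id=hello&copy=1&world=fun"))  # []
# A known named reference is not ambiguous either.
print(ambiguous_ampersands("AT&amp;T"))  # []
```

Note that this only covers the "ambiguous ampersand" definition itself. Whether a semicolon-less `&copy` is otherwise acceptable in an attribute value is a separate question governed by the parsing rules.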
I've written up a fix for this and will get it to Henri for review.
Thanks for the patch Michael. Will this be integrated and released sometime soon?
(In reply to comment #6)
> Thanks for the patch Michael. Will this be integrated and released sometime
> soon?

I've pushed the patch to http://qa-dev.w3.org:8888/ for testing. You can try it there and let me know if you find any problems. Dunno how soon it might land in the sources. It's not been reviewed yet and it's possible I may need to rewrite it.
(In reply to comment #7)
> (In reply to comment #6)
> > Thanks for the patch Michael. Will this be integrated and released sometime
> > soon?
>
> I've pushed the patch to http://qa-dev.w3.org:8888/ for testing. You can try it
> there and let me know if you find any problems.

I’ve tested the following markup:

<!DOCTYPE html>
<title></title>
&123;
&abc;
foo &0; bar
foo &lolwat; bar

If I understand correctly, all four ampersands in this example are ambiguous ampersands (and thus invalid). However, only two errors are thrown.

Is this a bug, or am I misunderstanding something?
(In reply to comment #8)
> I’ve tested the following markup:
>
> <!DOCTYPE html>
> <title></title>
> &123;
> &abc;
> foo &0; bar
> foo &lolwat; bar
>
> If I understand correctly, all four ampersands in this example are ambiguous
> ampersands (and thus invalid). However, only two errors are thrown.
>
> Is this a bug, or am I misunderstanding something?

Nope, you understand correctly. They should all be reported as errors. Will take a look and see what I missed, and push a change out again to http://qa-dev.w3.org:8888/ once I've got it fixed. Thanks much for testing.
(In reply to comment #8)
> <!DOCTYPE html>
> <title></title>
> &123;
> &abc;
> foo &0; bar
> foo &lolwat; bar
>
> If I understand correctly, all four ampersands in this example are ambiguous
> ampersands (and thus invalid). However, only two errors are thrown.

OK, fixed and pushed to http://qa-dev.w3.org:8888/

It should now show all four errors.
(In reply to comment #7)
> I've pushed the patch to http://qa-dev.w3.org:8888/ for testing.

The server does not respond now when port 8888 is used.

I thought this bug had been fixed, but it’s still there. Since it’s about two years old and affects many people, should there be a “Known bugs” list, or link to such a list, on the validator’s front page, to reduce confusion?
Any update on getting this into a new release?
Reassigning to Henri (the change for this needs to be made in the HTML parser code).
(In reply to comment #11)
> (In reply to comment #7)
>
> > I've pushed the patch to http://qa-dev.w3.org:8888/ for testing.
>
> The server does not respond now when port 8888 is used.

Sorry about that. I've restarted it now.
(In reply to comment #11)
> I thought this bug had been fixed, but it’s still there. Since it’s about two
> years old and affects many people, should there be a “Known bugs” list, or link
> to such a list, on the validator’s front page, to reduce confusion?

Yes -- I will try to make some time to add that.
(In reply to comment #12)
> Any update on getting this into a new release?

I'll talk with Henri and see if we can get something landed within the next week.
I thought we were supposed to get the spec changed so that this sort of thing was invalid. What happened?
(In reply to comment #17)
> I thought we were supposed to get the spec changed so that this sort of thing
> was invalid. What happened?

What happened is I didn't file a new spec bug to revert the change that made this allowed instead of the spec saying it was an error (as the spec used to say). I can file a bug but I'm not sure what chance we have of convincing Hixie. For one thing, there are a bunch of other people (Simon, for one -- maybe Mathias too) who really seem to want it to not be an error, which is part of why Hixie changed it to begin with. So maybe it would be productive to first try to get (re)agreement about whether it should be an error or not. I guess I can send a message to the list about it.
# [11:09] <zcorpan> MikeSmith: i think href="&copy=" is an error currently. "However, if this next character is in fact a U+003D EQUALS SIGN character (=), then this is a parse error, because some legacy user agents will misinterpret the markup in those cases."
# [11:09] <zcorpan> http://www.whatwg.org/specs/web-apps/current-work/#tokenizing-character-references
# [11:11] <zcorpan> also, i don't have a strong opinion on the document conformance around ampersands. i have had strong opinions about how they should be parsed, though :-)
# [11:12] <zcorpan> http://html5.org/r/7679 is the change
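The rule zcorpan quotes can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the tokenizer algorithm itself: it uses Python's `html.entities.html5` table as the list of names, picks the longest name that the text after "&" starts with, and reports only the case quoted above, where a name matched without its semicolon is directly followed by "=". The function name `legacy_refs_before_equals` is hypothetical.

```python
from html.entities import html5  # keys look like "copy;" and, for legacy names, "copy"


def legacy_refs_before_equals(attr_value):
    """Return semicolon-less named references in attr_value that are directly followed by '='."""
    hits = []
    for i, ch in enumerate(attr_value):
        if ch != "&":
            continue
        rest = attr_value[i + 1:]
        # Longest name in the table that the text after "&" starts with.
        best = max((name for name in html5 if rest.startswith(name)), key=len, default=None)
        if best is None or best.endswith(";"):
            continue  # no match, or a properly terminated reference
        if rest[len(best):len(best) + 1] == "=":
            hits.append("&" + best + "=")
    return hits


# "&copy" matches a legacy name and is followed by "="; "&world" matches no name at all.
print(legacy_refs_before_equals("http://example.org/demo?id=hello&copy=1&world=fun"))
```

Under these assumptions the example URL from the bug title trips on `&copy=` but not on `&world=`, which is the distinction the IRC remark draws.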
I'm not sure I understand the current status of this. Is "href='http://example.org/demo?id=hello&copy=1&world=fun'" still considered valid, and its rejection a defect in validator.nu?
I'm also not sure I understand the current status of this. Facebook's API (for one common example) does not normalize its URLs, and this validator flags all '&' characters etc. as warnings. Can this particular check just be turned off? Thanks, Steve
I think this bug should now be closed as INVALID, since the behavior is not a bug according to the W3C HTML5 Recommendation. This follows from the parsing rules for HTML syntax in section 8. A construct consisting of “&” and a nonempty sequence of alphanumeric characters is not valid (in an attribute value or elsewhere) if not immediately followed by “;”. This is expressed in a rather complicated way, and it was a wrong move IMHO, but it’s what the spec now says.

Notes:
1) This is an error message, not a warning, and this is consistent with the spec: “parse error” is meant to mean an error condition.
2) Users can use http://validator.w3.org/nu/ and its “Message filtering” feature to filter out error messages related to this issue, if they think they are unnecessary and make it more difficult to see the “real errors” in markup.
Jukka, the standard says [1], paraphrasing, that you should use "&amp;" instead of "&" for the 106 annoying legacy exceptions (`grep -v ';"' entities.json`, [2]). If you're not dealing with one of those cases, you just need to avoid ambiguous ampersands, which are quite clearly defined [3]: "An ambiguous ampersand is a U+0026 AMPERSAND character (&) that is followed by one or more alphanumeric ASCII characters, followed by a ";" (U+003B) character, where these characters do not match any of the names given in the named character references section."

You may feel this allows too much, that it encourages bad practices, and you may even convince HTML editors to change the rules, but as of now "S&P " is unquestionably valid [4], while "S&P;" is not. A very engaging article [5] covering this was written by Mathias Bynens (the bug submitter).

Now, I didn't really dive into the parsing section. If it says what you claim, then the standard editors really need to fix this inconsistency, one way or another. Again, "& " is not an ambiguous ampersand, as it is followed neither by the necessary alphanumeric ASCII characters nor by the semicolon.

By the way, currently the W3C validator is OK with "& " (and "&& "), but not with "&P ".

[1] http://www.w3.org/TR/html5/syntax.html#character-references
[2] http://www.w3.org/TR/html5/entities.json
[3] http://www.w3.org/TR/html5/syntax.html#syntax-ambiguous-ampersand
[4] "P" is not a named character reference in [2], so appending "P" and a space to an ampersand makes it unambiguous.
[5] https://mathiasbynens.be/notes/ambiguous-ampersands
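For readers without a local copy of entities.json, the `grep` above can be approximated in Python. This sketch assumes that the standard library's `html.entities.html5` table mirrors entities.json, with names stored without the leading "&". The variable name `legacy_names` is made up for illustration.

```python
from html.entities import html5  # maps names such as "amp;" and "amp" to their characters

# Names that are recognized even without a trailing semicolon: the legacy exceptions.
legacy_names = sorted(name for name in html5 if not name.endswith(";"))

print(len(legacy_names))       # number of legacy exceptions in this table
print("amp" in legacy_names)   # "&amp" without a semicolon is one of them
print("P" in legacy_names)     # "P" is not, which is why "S&P " is fine
```

If the table does match entities.json, the printed count should agree with the 106 mentioned in the comment above.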
FYI, opened https://bugzilla.mozilla.org/show_bug.cgi?id=1153920 for this and submitted patches there