Bug 841 – Allow href='http://example.org/demo?id=hello&copy=1&world=fun' (as per the updated definition of “ambiguous ampersand”)

NOTE: The current preferred location for bug reports is the GitHub issue tracker.


Summary:	Allow href='http://example.org/demo?id=hello&copy=1&world=fun' (as per the up...

Status:	NEW

Product:	Validator.nu
Classification:	Unclassified
Component:	HTML parser
Version:	HEAD
Hardware:	All All

Importance:	P2 normal
Assigned To:	Henri Sivonen

URL:	http://www.whatwg.org/specs/web-apps/...

Depends on:
Blocks:

Reported:	2011-06-14 12:45 CEST by Mathias Bynens
Modified:	2015-04-13 19:26 CEST
CC List:	8 users

See Also:	https://bugzilla.mozilla.org/show_bug.cgi?id=1153920

Description Mathias Bynens 2011-06-14 12:45:35 CEST

Match the spec regarding ambiguous ampersands for attribute values as well.

IRC discussion: http://krijnhoetmer.nl/irc-logs/whatwg/20110614#l-618

Comment 1 Julian Reschke 2011-06-14 13:14:45 CEST

8.1.2.3 says:

"Attribute values are a mixture of text and character references, except with the additional restriction that the text cannot contain an ambiguous ampersand."

<http://dev.w3.org/html5/spec/Overview.html#attributes-0>

How does that allow ambiguous ampersands?

Comment 2 Simon Pieters 2011-06-15 12:47:46 CEST

It doesn't. The definition of ambiguous ampersands has changed since it was implemented in v.nu though.

Comment 3 Julian Reschke 2011-06-15 12:54:41 CEST

Clarifying... So the ampersand in

  &world=fun

is not ambiguous.

When did this change? I recall this was discussed in the WG and rejected at some point; thus I'm surprised it's now allowed.

Comment 4 Mathias Bynens 2011-06-15 13:24:22 CEST

(In reply to comment #3)
> Clarifying... So the ampersand in
> 
>   &world=fun
> 
> is not ambiguous.

Indeed. http://www.whatwg.org/specs/web-apps/current-work/complete/syntax.html#syntax-ambiguous-ampersand says:

> An ambiguous ampersand is a U+0026 AMPERSAND character (&) that is
> followed by one or more characters in the range U+0030 DIGIT ZERO (0)
> to U+0039 DIGIT NINE (9), U+0061 LATIN SMALL LETTER A to U+007A LATIN
> SMALL LETTER Z, and U+0041 LATIN CAPITAL LETTER A to U+005A LATIN
> CAPITAL LETTER Z, followed by a U+003B SEMICOLON character (;), where
> these characters do not match any of the names given in the named
> character references section.

Because of the “followed by a U+003B SEMICOLON character” part in the above definition, the ampersand in your example is not ambiguous.
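
For illustration, the quoted definition can be sketched in Python. This is only a sketch, not the validator's code: the function name `ambiguous_ampersands` is hypothetical, and it assumes that Python's `html.entities.html5` table is an adequate stand-in for the spec's named character references section.

```python
import re
from html.entities import html5  # keys look like "amp;" (and "amp" for legacy names)

# "&", then one or more ASCII alphanumerics, then ";" (per the quoted definition).
_CANDIDATE = re.compile(r"&([0-9A-Za-z]+);")


def ambiguous_ampersands(text):
    """Return the offsets of ampersands that are ambiguous under the quoted definition."""
    return [
        match.start()
        for match in _CANDIDATE.finditer(text)
        # Ambiguous only if the name is NOT a known named character reference.
        if match.group(1) + ";" not in html5
    ]


url = "http://example.org/demo?id=hello&copy=1&world=fun"
print(ambiguous_ampersands(url))                 # no "&name;" sequence at all
print(ambiguous_ampersands("foo &lolwat; bar"))  # "&lolwat;" is not a known name
print(ambiguous_ampersands("a &amp; b"))         # "&amp;" is a real reference
```

Under this sketch, neither ampersand in the URL from the bug title is reported, because neither is followed by a run of alphanumerics ending in a semicolon.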

Comment 5 Michael[tm] Smith 2012-10-03 09:20:18 CEST

I've written up a fix for this and will get it to Henri for review.

Comment 6 Brian Bergstrom 2012-10-25 22:43:21 CEST

Thanks for the patch Michael.  Will this be integrated and released sometime soon?

Comment 7 Michael[tm] Smith 2012-10-26 06:03:28 CEST

(In reply to comment #6)
> Thanks for the patch Michael.  Will this be integrated and released sometime
> soon?

I've pushed the patch to http://qa-dev.w3.org:8888/ for testing. You can try it there and let me know if you find any problems.

Dunno how soon it might land in the sources. It's not been reviewed yet and it's possible I may need to rewrite it.

Comment 8 Mathias Bynens 2012-10-26 11:32:02 CEST

(In reply to comment #7)
> (In reply to comment #6)
> > Thanks for the patch Michael.  Will this be integrated and released sometime
> > soon?
> 
> I've pushed the patch to http://qa-dev.w3.org:8888/ for testing. You can try it
> there and let me know if you find any problems.

I’ve tested the following markup:

    <!DOCTYPE html>
    <title></title>
    &123;
    &abc;
    foo &0; bar
    foo &lolwat; bar

If I understand correctly, all four ampersands in this example are ambiguous ampersands (and thus invalid). However, only two errors are thrown.

Is this a bug, or am I misunderstanding something?

Comment 9 Michael[tm] Smith 2012-10-26 11:42:27 CEST

(In reply to comment #8)
> I’ve tested the following markup:
> 
>     <!DOCTYPE html>
>     <title></title>
>     &123;
>     &abc;
>     foo &0; bar
>     foo &lolwat; bar
> 
> If I understand correctly, all four ampersands in this example are ambiguous
> ampersands (and thus invalid). However, only two errors are thrown.
> 
> Is this a bug, or am I misunderstanding something?

Nope, you understand correctly. They should all be reported as errors. Will take a look and see what I missed, and push a change out again to http://qa-dev.w3.org:8888/ once I've got it fixed.

Thanks much for testing

Comment 10 Michael[tm] Smith 2012-10-26 14:01:22 CEST

(In reply to comment #8)
>     <!DOCTYPE html>
>     <title></title>
>     &123;
>     &abc;
>     foo &0; bar
>     foo &lolwat; bar
> 
> If I understand correctly, all four ampersands in this example are ambiguous
> ampersands (and thus invalid). However, only two errors are thrown.

OK, fixed and pushed to http://qa-dev.w3.org:8888/ It should now show all four errors.

Comment 11 Jukka K. Korpela 2013-05-28 08:33:54 CEST

(In reply to comment #7)

> I've pushed the patch to http://qa-dev.w3.org:8888/ for testing. 

The server does not respond now when port 8888 is used.

I thought this bug had been fixed, but it’s still there. Since it’s about two years old and affects many people, should there be a “Known bugs” list, or link to such a list, on the validator’s front page, to reduce confusion?

Comment 12 Brian Bergstrom 2013-06-06 17:50:14 CEST

Any update on getting this into a new release?

Comment 13 Michael[tm] Smith 2013-06-07 04:54:01 CEST

Reassigning to Henri (the change for this needs to be made in the HTML parser code).

Comment 14 Michael[tm] Smith 2013-06-07 05:02:22 CEST

(In reply to comment #11)
> (In reply to comment #7)
> 
> > I've pushed the patch to http://qa-dev.w3.org:8888/ for testing. 
> 
> The server does not respond now when port 8888 is used.

Sorry about that. I've restarted it now.

Comment 15 Michael[tm] Smith 2013-06-07 05:16:32 CEST

(In reply to comment #11)
> I thought this bug had been fixed, but it’s still there. Since it’s about two
> years old and affects many people, should there be a “Known bugs” list, or link
> to such a list, on the validator’s front page, to reduce confusion?

Yes -- I will try to make some time to add that.

Comment 16 Michael[tm] Smith 2013-06-07 05:22:48 CEST

(In reply to comment #12)
> Any update on getting this into a new release?

I'll talk with Henri and see if we can get something landed within the next week.

Comment 17 Henri Sivonen 2013-06-07 15:54:58 CEST

I thought we were supposed to get the spec changed so that this sort of thing was invalid. What happened?

Comment 18 Michael[tm] Smith 2013-06-10 04:24:24 CEST

(In reply to comment #17)
> I thought we were supposed to get the spec changed so that this sort of thing
> was invalid. What happened?

What happened is I didn't file a new spec bug to revert the change that made this allowed instead of the spec saying it was an error (as the spec used to say). I can file a bug but I'm not sure what chance we have of convincing Hixie. For one thing there are a bunch of other people (Simon, for one -- maybe Mathias too) who really seem to want it to not be an error. Which is part of why Hixie changed it to begin with. So maybe it would be productive to first try to get (re)agreement about whether it should be an error or not. I guess I can send a message to the list about it.

Comment 19 Simon Pieters 2013-06-10 12:05:35 CEST

# [11:09] <zcorpan> MikeSmith: i think href="&copy=" is an error currently. "However, if this next character is in fact a U+003D EQUALS SIGN character (=), then this is a parse error, because some legacy user agents will misinterpret the markup in those cases."  
# [11:09] <zcorpan> http://www.whatwg.org/specs/web-apps/current-work/#tokenizing-character-references  
# [11:11] <zcorpan> also, i don't have a strong opinion on the document conformance around ampersands. i have had strong opinions about how they should be parsed, though :-)  
# [11:12] <zcorpan> http://html5.org/r/7679 is the change
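
The tokenizer rule quoted in the log above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the parser's implementation: the function name `equals_sign_parse_errors` is hypothetical, Python's `html.entities.html5` table is assumed to mirror the spec's named character references, and only the "legacy name followed by `=`" case in attribute values is modelled.

```python
from html.entities import html5  # includes legacy names without ";" such as "copy"

_MAX_NAME = max(len(name) for name in html5)


def equals_sign_parse_errors(attr_value):
    """Offsets of "&" where a semicolon-less named reference is followed by "="."""
    errors = []
    for pos, ch in enumerate(attr_value):
        if ch != "&":
            continue
        rest = attr_value[pos + 1 : pos + 1 + _MAX_NAME]
        # Find the longest table entry that is a prefix of the text after "&".
        for length in range(len(rest), 0, -1):
            if rest[:length] in html5:
                name = rest[:length]
                break
        else:
            continue  # no named reference matched, so this rule does not apply
        end = pos + 1 + len(name)
        following = attr_value[end : end + 1]
        if not name.endswith(";") and following == "=":
            errors.append(pos)
    return errors


url = "http://example.org/demo?id=hello&copy=1&world=fun"
print(equals_sign_parse_errors(url) == [url.index("&copy")])  # True
```

In this sketch only `&copy=` is flagged: `copy` is a legacy name that matches without a semicolon, whereas `world` matches nothing in the table.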

Comment 20 Brian Bergstrom 2013-09-23 20:43:19 CEST

I'm not sure I understand the current status of this.  Is "href='http://example.org/demo?id=hello&copy=1&world=fun'" still considered valid, and is rejecting it a defect in validator.nu?

Comment 21 Steve Steiner 2015-02-04 15:04:34 CET

I'm also not sure I understand the current status of this.  

Facebook's API (for one common example) does not normalize its URLs, and this validator flags all '&' characters etc. as warnings.

Can this particular check just be turned off?

Thanks,

Steve

Comment 22 Jukka K. Korpela 2015-02-04 15:58:47 CET

I think this bug should now be closed as INVALID, since the behavior is not a bug according to the W3C HTML5 Recommendation. This follows from the parsing rules for HTML syntax in section 8. A construct consisting of “&” and a nonempty sequence of alphanumeric characters is not valid (in an attribute value or elsewhere) if not immediately followed by “;”. This is expressed in a rather complicated way, and it was a wrong move IMHO, but it’s what the spec now says.

Notes: 1) This is an error message, not a warning, and this is consistent with the spec: “parse error” is meant to mean an error condition. 2) Users can use http://validator.w3.org/nu/ and its “Message filtering” feature to filter out error messages related to this issue, if they think they are unnecessary and make it more difficult to see the “real errors” in markup.

Comment 23 Ezequiel Garzon 2015-02-28 23:24:36 CET

Jukka, the standard says [1], paraphrasing, that you should use "&amp;" instead of "&amp" for the 106 annoying legacy exceptions (`grep -v ';"' entities.json`, [2]). If you're not dealing with one of those cases, you just need to avoid ambiguous ampersands, which are quite clearly defined [3]:

"An ambiguous ampersand is a U+0026 AMPERSAND character (&) that is followed by one or more alphanumeric ASCII characters, followed by a ";" (U+003B) character, where these characters do not match any of the names given in the named character references section."

You may feel this allows too much, that it encourages bad practices, and you may even convince HTML editors to change the rules, but as of now "S&P " is unquestionably valid [4], while "S&P;" is not. A very engaging article [5] covering this was written by Mathias Bynens (the bug submitter).

Now, I didn't really dive into the parsing section. If it says what you claim, then the standard editors really need to fix this inconsistency, one way or another. Again, "& " is not an ambiguous ampersand, as it's followed neither by the necessary alphanumeric ASCII character nor, after that, by the semicolon. By the way, currently the W3C validator is OK with "& " (and "&& "), but not with "&P ".

[1] http://www.w3.org/TR/html5/syntax.html#character-references
[2] http://www.w3.org/TR/html5/entities.json
[3] http://www.w3.org/TR/html5/syntax.html#syntax-ambiguous-ampersand
[4] "P" is not a named character reference in [2], so appending "P" and a space to an ampersand makes it unambiguous.
[5] https://mathiasbynens.be/notes/ambiguous-ampersands
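
As a rough Python counterpart to the `grep` above, the legacy exceptions can be listed from the standard library's copy of the table. This is a sketch only: the helper name `legacy_names` is hypothetical, and it assumes `html.entities.html5` matches the contents of entities.json [2].

```python
from html.entities import html5  # keys end in ";" except for the legacy names


def legacy_names():
    """Named character references that are also recognised without a trailing ";"."""
    return sorted(name for name in html5 if not name.endswith(";"))


names = legacy_names()
print(len(names))          # size of the legacy exception list
print("amp" in names)      # "&amp" without ";" is one of the exceptions
print("P;" in html5)       # "P" is not a named character reference at all
```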

Comment 24 Michael[tm] Smith 2015-04-13 19:25:40 CEST

FYI, opened https://bugzilla.mozilla.org/show_bug.cgi?id=1153920 for this and submitted patches there