Bug 452 – <script charset="">var foo;</script> causes misleading errors

NOTE: The current preferred location for bug reports is the GitHub issue tracker.

Bug 452 - <script charset="">var foo;</script> causes misleading errors


Summary:	<script charset="">var foo;</script> causes misleading errors

Status:	RESOLVED FIXED

Product:	Validator.nu
Classification:	Unclassified
Component:	HTML5 schema
Version:	HEAD
Hardware:	All All

Importance:	P2 normal
Assigned To:	Michael[tm] Smith

URL:

Depends on:	794
Blocks:
	Show dependency tree / graph

Reported:	2009-02-26 01:49 CET by Adam Roben
Modified:	2010-12-03 15:03 CET (History)
CC List:	1 user (show)

See Also:

Attachments
patch (4.82 KB, patch) 2010-12-01 00:59 CET, Michael[tm] Smith	Details
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Adam Roben 2009-02-26 01:49:01 CET

Validate the following document:

<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<script charset="">var foo;</script>
</body>
</html>

Two errors are produced:

1. Error: Required attributes missing on element script. From line 7, column 1; to line 7, column 19
2. Error: Text not allowed in element script in this context. From line 7, column 20; to line 7, column 27

I think that this document should only produce a single error:

charset attribute not allowed when src is not specified

See <http://www.whatwg.org/specs/web-apps/current-work/multipage/semantics.html#attr-script-charset>

Comment 1 Michael[tm] Smith 2010-09-24 04:46:37 CEST

Hi Adam,

We have implemented a fix for this case. To test it, please try validating your doc here:

  http://www.w3.org/html/check

That validator instance is running some code from my workspace that I've not pushed upstream yet.

Anyway, you should get this message:

Error: Element script is missing required attribute src.
From line 7, column 1; to line 7, column 19
d>↩<body>↩<script charset="">var fo

(You'll also get a message "Bad value  for attribute charset on element script: The empty string is not a valid character encoding name" but that is correct and expected).

Please let me know if you think that is an improvement.

Note that with a grammar-based validator like the RelaxNG-based one that validator.nu uses as its core, it's not possible to emit a message in the form "charset attribute not allowed when src is not specified". It is possible to have an additional message emitted for this case -- from another part of the code that does assertions-based checking -- but I would prefer to not add that unless you really think the "Element script is missing required attribute src" error message does not provide enough info for a user to act on.

See also the discussion at bug 766.

Comment 2 Adam Roben 2010-09-24 18:02:22 CEST

(In reply to comment #1)
> Anyway, you should get this message:
> 
> Error: Element script is missing required attribute src.
> From line 7, column 1; to line 7, column 19
> d>↩<body>↩<script charset="">var fo

This doesn't seem correct. <script> isn't required to have a src attribute. In fact, the spec says:

"When used to include data blocks (as opposed to scripts), the data must be embedded inline, the format of the data must be given using the type attribute, and the src attribute must not be specified."

> (You'll also get a message "Bad value  for attribute charset on element script:
> The empty string is not a valid character encoding name" but that is correct
> and expected).

That part sounds fine.

Comment 3 Michael[tm] Smith 2010-09-24 18:21:38 CEST

(In reply to comment #2)
> (In reply to comment #1)
> > Anyway, you should get this message:
> > 
> > Error: Element script is missing required attribute src.
> > From line 7, column 1; to line 7, column 19
> > d>↩<body>↩<script charset="">var fo
> 
> This doesn't seem correct. <script> isn't required to have a src attribute. In
> fact, the spec says:

It's saying it's required in this case because the charset attribute is specified, and the charset attribute is not valid unless the src attribute is also specified.

Comment 4 Adam Roben 2010-09-24 18:27:42 CEST

(In reply to comment #3)
> (In reply to comment #2)
> > (In reply to comment #1)
> > > Anyway, you should get this message:
> > > 
> > > Error: Element script is missing required attribute src.
> > > From line 7, column 1; to line 7, column 19
> > > d>↩<body>↩<script charset="">var fo
> > 
> > This doesn't seem correct. <script> isn't required to have a src attribute. In
> > fact, the spec says:
> 
> It's saying it's required in this case because the charset attribute is
> specified, and the charset attribute is not valid unless the src attribute is
> also specified.

Making the decision in this order ("Is charset specified? Then there'd better be a src attribute!") feels backwards to me. Consider this case:

Example A:

<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<script charset="utf-8">var foo;</script>
</body>
</html>

Right now the validator says "Element script is missing required attribute src." But adding a src attribute isn't going to fix the problem, it's only going to give you a more confusing error message:

Example B:

<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<script charset="utf-8" src="foo">var foo;</script>
</body>
</html>

"The text content of element script was not in the required format: Expected space, tab, newline, or slash but found v instead."

It seems like the high-order bit should be whether the script is inline or not. Then after that is figured out we can move on to whether charset is specified. So Example A should say something like "charset not allowed on inline scripts", and Example B should say something like "A script can't both have a src attribute and be inline".

Comment 5 Michael[tm] Smith 2010-09-25 08:01:21 CEST

My short answer here is that the current error reporting for this case is a compromise between the need to report something as useful as possible to users and the need to keep "special case" handling in the backend to a minimum -- and to not add significant additional complexity to the validation backend unless what it's buying us is clearly worth it in terms of the implementation and maintenance costs. 

I also want to say that validation is not meant as a complete substitute for actually reading the spec or for actually reading some other document that explains the complexities of the script element. 

This is a perhaps a case where people are arguably going to be better off ultimately if they do read and understand the constraints in this part of the spec -- or read some guide that explains them well. And arguably, despite any the effort that we put into finessing the reporting around this is, it is still going to be confusing to anybody who has not yet read and understood those constraints.

The fact is, the rules around what attributes are allowed or not allowed on the script element -- and what text content is allowed in it -- are a relatively complex part of the HTML language.  The rules for most HTML elements are really quite simple to check and report on usefully. But the rules for this case are not.

So, trying to programmatically capture and check all that complexity in short error messages emitted by a validator is quite a challenge. Each error message for this case is sort of a just a view of an elaborately decorated room through a small keyhole in a different door to that room.

Anyway, I'm not saying we can't improve the reporting for this case, and I guess we should try to put some more thought into how to actually do that. But actually doing it is a different matter, because the architecture of the backend places some limitations on what can be done practically. If you want to know the details, read on.

(In reply to comment #4)
> Making the decision in this order ("Is charset specified? Then there'd better
> be a src attribute!") feels backwards to me.

It is backwards. Or at least it certainly is for a human. But a grammar-based validator has a different way of looking at it: Its world view on things is that all elements and attributes are disallowed by default and need to be explicitly allowed in the grammar/schema in order to be valid.

> Right now the validator says "Element script is missing required attribute
> src."

So that message is coming from a library called "jing", which is a RelaxNG (grammar-based) validator and which is essentially the core validation piece in the validator.nu architecture. That is the equivalent of the (legacy) DTD-based checker that W3C validator uses. And having a grammar-based checker like this is generally the minimal essential part of what's normally needed for making a validator; that is, a conformance checker can add other types of validation features in addition to the core grammar-based validator, and have them do checks that can't be done easily or at all using grammar-based validation.

And in that architecture, the way that things are usually done is to try to model as many of the constraints of the language -- HTML in this case -- as it is possible or practical to model in a grammar.

So the current RelaxNG grammar/schema that validator.nu uses tries its best to model everything it can for the constraints that are specified in the HTML5 specification.

In the case of the script element, the relevant part of the RelaxNG schema are in this file:

http://syntax.whattf.org/relaxng/core-scripting.rnc

And the relevant parts of that file are these:

        script.elem.embedded =
                element script { script.inner.embedded & script.attrs.embedded }
        script.attrs.embedded =
                (       common.attrs
                &       script.attrs.type?
                &       script.attrs.language? # restricted in Schematron
                )   
        script.elem.imported =
                element script { script.inner.imported & script.attrs.imported }
        script.attrs.imported =
                (       common.attrs
                &       script.attrs.src
                &       script.attrs.defer?
                &       script.attrs.async?
                &       script.attrs.type?
                &       script.attrs.charset?
                &       script.attrs.language? # restricted in Schematron
                )   

So, that is essentially modeling two possible high-level flavors of the script element:

A. "no attributes required" flavor. This flavor is not required to have any attributes at all, but may have the type and language attributes only (in addition to global attributes)

B. "src required" flavor. This flavor *must* have a required "src" attribute, and that may have other attributes, such as, again, the type and language attributes, just like "no attributes required" flavor -- but that may also have some additional attributes; namely, the "charset", "defer", and "async" attributes, which are not explicitly permitted on the "no attributes required" flavor (following the grammar-based validator world view that anything that is not explicitly allowed is by default not valid).

So in that model, to the RelaxNG/Jing part of the backend, the case of <script charset="foo"> is seen as, "Oh, this is *not* a valid case of the "no attributes required" flavor, because the "charset" attribute is not explicitly allowed for that case. But I see "charset" is allowed only for the "src required" flavor, so it must be intended to be an instance of that; however, there's no "src" attribute, so I need to emit an error message saying that the src attribute is missing."

So, emitting that message for this case is the *minimum* reporting that we need to do, if we want to have the core schema actually model as well as it can the constraints in the spec that it is capable of modeling.

> But adding a src attribute isn't going to fix the problem, it's only
> going to give you a more confusing error message:
> "The text content of element script was not in the required format: Expected
> space, tab, newline, or slash but found v instead."

So that message is coming from a different part of the backend, not the grammar-based checker, but instead some code (written in Java) that does data type checking -- mostly checking of dataypes for attribute values, but also (for one or two cases only) checking of the text contents of particular elements.

RelaxNG and all other grammar-based validation mechanisms I know of do not really themselves provide any useful means for doing datatype checking. RelaxNG and W3C XML Schema do provide a way to specify a regular expression that a particular attribute value must match against or that some particular text content must match against -- but if that's used, the only error message a user gets back is a simple binary "OK, valid, it matched" or "Oops, not valid, didn't match."

So what we do with validator.nu is to hand of the datatype checking to a specific HTML Datatype library that's written in an actual programming language (in this case, Java) and is able to do more fine-grained error reporting. What it actually does in the case of script is just to run the script-element text content through a normal state-machine parser : a very simple one custom-written just to parse instances the script -element microsyntax, which is defined by two ABNF productions in the HTML5 spec -

for the "src required" flavor:
http://dev.w3.org/html5/spec-author-view/scripting-1.html#inline-documentation-for-external-scripts

for the "no attributes required" flavor:
http://dev.w3.org/html5/spec-author-view/scripting-1.html#restrictions-for-contents-of-script-elements

Note that the "src required" production is a superset of the "no attributes required" one, so it's possible for text contents of a script to match both productions.

Anyway, ,when you do <script charset="utf-8" src="foo">var foo;</script> the Jing/RelaxNG part of the backend notes that this is a "src required" case, so it then hands it off to HTML5 Datatype library checker. And that checker expects the text content to start with either whitespace or a Javascript comment delimiter, and it finds that it actually does not, so it tries to report a useful error message.

> It seems like the high-order bit should be whether the script is inline or not.

I completely agree that is should be. That is certainly how any human looking at it is going to evaluate it. However, checking it programatically is a very different matter. That is because of the limitations of the architecture of the backend. And do note that those "limitations" are actually a *feature* of that backend that are optimal for the vast majority of the checks the backend does; this is just one of the small number of cases where it is not.

To describe it more specifically: I we were to take the approach of making "script is inline or not?" the high-order bit for checking in this case, it would amount to turning the backend upside down on its head.

What I mean it, it would require first checking the text content of the element. There are in fact currently no other elements in the entire HTML language for which the spec defines any relationship between their text contents and their attribute model. So the script element is absolutely unique in that regard, and checking it is a very special case.

(The spec does also place restrictions on the text contents of the style element, but those restrictions apply regardless of what attributes are specified on the element.)

So, in practical terms, what making the high-order bit be "script inline or not" would amount to is, we'd basically first need to change the attribute model for the script element  in the RelaxNG grammar to always permit charset, async, and defer -- even if the src attribute is not specified -- and move the attribute-checking logic to a different part of the backend code that it exercised later.

The reason we'd need to do that is, if we don't, the first message a user is going to see is the "src attribute missing" message. We could have them see messages in addition to that, but as long as we attempt to capture the attribute constraints in the grammar, users are going to get that message first.

So if we did that, then the text-contents checking step could theoretically be done first. Though that said, it would then have to not be a *checking* of the text contents, because the check of what's allowed depends on whether the src attribute is present or not. It instead would just be determining which of the two "script contents" ABNF productions, if any.

And keep in mind that the text contents can match both productions (because one production is a superset of the other).

Consider the case of an inline script that is commented of for some reason , or that consists entirely of comments. 

(Admittedly, that's not a common case, but it is valid; it is both a valid instance of the Javascript language and also valid as far as the HTML language is concerned. And it is conceivable that an author might want for some reason to have an inline script in a document and temporarily have it commented out during some particular authoring phase or something.)

The checker has no way of knowing whether that instance is intended to be an actual script -- and is just commented out temporarily, or whatever -- or whether it's supposed to entirely just be comments for an external script, in which case, the src attribute needs to be specified.

So we have an ambiguity in that case.

But setting aside that case, and looking at the <script charset="">var foo;</script> case you gave in your original bug report: That case is unambiguous. The text contents are not entirely comments, so it is unambiguously intended to be an inline script, and we can have the backend programatically make that determination.

So what we'd next need to do -- after determining that we have an inline script -- is go back and then start checking what attributes the element has.

But with our current backend, as far as I can tell, it would be extremely difficult -- if not impossible in practice -- to reasonably accomplish the checking in that order. The backend is simply not built to do things in that order. And changing it to enable to do it would be a major piece of work.

But even if the backend were changed to enable checking in that order, it would be useful only for the case of this one element -- because as I mentioned before, there are no other elements in the language for which there is a relationship between their attribute mode and their text contents.

And it would fail for the case of an inline script that was commented out. It would not be able to determine whether the author intends for that to be an inline script, or comments for an external script.    

And going back to the issue of the grammar, note that removing attribute-constraint checking on script from the grammar (which we would need to do if we made its text-content checking the high-order bit) means that the grammar no longer models the minimum possible set of constraints for the  script element.

If you keep in mind that that grammar is not only useful within the context of validator.nu -- it's still useful just on its on own, for doing standalone checking of documents from the command line or whatever -- than it is probably not a good idea to remove constraint checking from the grammar based on the assumption that it will always be getting done elsewhere, in some other other part of a more complex multi-component system such as the one that validator.nu uses.

> Then after that is figured out we can move on to whether charset is specified.
> So Example A should say something like "charset not allowed on inline scripts",
> and Example B should say something like "A script can't both have a src
> attribute and be inline".

I still fully agree that's exactly what we should be reporting -- *if* we could have validator.nu report it that way. But in practice, it would be very difficult to change the current backend to make it possible to emit those messages for those cases, for the reasons I've tried to outline above.

Comment 6 Michael[tm] Smith 2010-12-01 00:59:26 CET

Created attachment 187 [details]
patch

Comment 7 Michael[tm] Smith 2010-12-01 15:59:44 CET

syntax r578

https://github.com/validator/syntax/commit/5a47f12c006e1f123451b18fa4472dc421e142c6

Comment 8 Michael[tm] Smith 2010-12-01 16:12:18 CET

Adam,

I've checked in a fix for this bug that causes the validator to emit the following message instead:

"Error: Element script must not have attribute charset unless attribute src is also specified."

You can test it yourself here:

http://www.w3.org/html/check

...and it will get deployed at http://validator.nu/ and http://html5.validator.nu/ soonish

Comment 9 Michael[tm] Smith 2010-12-03 11:11:21 CET

https://bitbucket.org/validator/syntax/changeset/fb810925ebb8

Comment 10 Adam Roben 2010-12-03 15:03:12 CET

Looks great. Thank you!