Bugzilla – Bug 883
Please issue a warning when UTF-8 isn't used.
Last modified: 2016-02-04 17:49:03 CET
Spec says: ]] Authors are encouraged to use UTF-8. Conformance checkers may advise authors against using legacy encodings. [RFC3629] [[ I suggest that the validator implement this MAY and start advising authors to use and declare UTF-8 as the encoding.
I think a warning makes sense when the page contains a form element, because it's harmful to use encodings that can't represent all possible user input in form submissions. When the encoding has been declared but isn't UTF-8 and the page doesn't have forms, I think complaining about the encoding would be overkill.
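The rule proposed in the comment above could be sketched roughly as follows. This is only an illustration in Python, not Validator.nu code: `FormDetector`, `encoding_warning` and the message text are hypothetical names, and the declared encoding is assumed to have been extracted already by the caller.

```python
from html.parser import HTMLParser


class FormDetector(HTMLParser):
    """Records whether the document contains a <form> start tag."""

    def __init__(self):
        super().__init__()
        self.has_form = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == "form":
            self.has_form = True


def encoding_warning(markup, declared_encoding):
    """Return a warning string, or None when no warning is due.

    declared_encoding is the label the page declares, or None when nothing
    is declared (that case is outside the scope of this sketch).
    """
    if declared_encoding is None:
        return None
    if declared_encoding.strip().lower() in ("utf-8", "utf8"):
        return None
    detector = FormDetector()
    detector.feed(markup)
    if detector.has_form:
        # Hypothetical wording; the real message would be up to the validator.
        return ("The page contains a form but is declared as %s, which may "
                "not represent all user input; consider UTF-8."
                % declared_encoding)
    # Declared, non-UTF-8, no forms: stay silent, as suggested above.
    return None
```

With this sketch, a windows-1252 page containing a form yields a warning, while the same page without a form, or any page declared as UTF-8, yields nothing.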
(0) GENERAL:
* Do you use NOTES, and not only WARNINGS and ERRORS? I consider a NOTE to be the statement of a SHOULD rule.
* If a document "breaks" a SHOULD rule, such as the rule to use UTF-8, then I propose that the validator issue a NOTE. Except when the page contains forms content or some of the content listed under (3) below - then I suggest a WARNING (or, if you can convince Hixie: ERROR).

(1) FORMS: Yes, I agree that there is an increased need for a warning when the page contains form elements. (In fact, this situation belongs together with HTML5's recommendation that all new content/documents should be UTF-8 encoded - see (2).) But note that if such a forms content page is used to produce another page/content - such as in guest books and page comments - then that other page also needs to use UTF-8-compatible ...

(2) OVERKILL OR AUTHORING TOOL: The validator is an authoring tool. So could we please relate the MAY that is the subject of this bug report to the following SHOULD in HTML5? "Authoring tools should default to using UTF-8 for newly-created documents. [RFC3629]" http://dev.w3.org/html5/spec/semantics.html#charset

What does the above recommendation mean to Validator.nu? I suggest that it should mean the following:

a) When a page uses an HTML5 DOCTYPE, then the page should be considered as newly created, and thus the validator should expect UTF-8. (To keep people from speculating on the effects of the DOCTYPE, any document that is tested for HTML5 validity should be considered "new", even if it uses a legacy DOCTYPE.)

THUS:

b) If the encoding isn't declared, but the page otherwise is UTF-8 compatible (a.k.a. US-ASCII), then warn against the failure to declare UTF-8. (You said elsewhere that you yourself wanted this to be an error, but that Hixie said no. However, you have Hixie's blessing to issue a warning instead. Only, I think you should warn not just that a declaration is missing but specifically that a UTF-8 declaration is missing.)
c) If the page DECLARES a non-UTF-8 encoding (with which it is compatible), then simply issue a NOTE (or eventually: WARNING) that HTML5, due to such and such reasons, recommends the use of UTF-8.

d) But if it declares an encoding with which it is compatible BUT contains forms content OR other risky content - such as the items I list under point (3) below - then issue a WARNING (and not a NOTE).

Validator.nu and/or its code will be used/embedded/etc. in many authoring tools, and hence it could have a good effect, one should think, if it whines when non-UTF-8 is used with HTML5 documents, no? Note in this regard your recommendation on the WHATWG mailing list that I focus on convincing authoring tool makers. For example, this would mean that pages created by Komodo would trigger a warning or a note until it changes its default encoding to UTF-8: http://bugs.activestate.com/show_bug.cgi?id=90424

(3) RISKY CONTENT that increases the need for a warning against use of non-UTF-8:

a) Use of non-ASCII in links (perhaps it could be formulated as "use of IRIs" - or better: "use of IRIs with confusing semantics"). Links that contain non-ASCII have dubious semantics unless the page is UTF-8 encoded.

b) Use of non-ASCII inside @id attributes. Otherwise, linking to and IDREFerencing those fragments risks failing due to disagreeing UA implementations and the semantics of non-UTF-8 IDREFs, which are confusing for authors.

c) Use of non-empty @srcdoc attributes (and possibly use of data URIs inside @src of <iframe> or @data of <object>).

d) Use of legacy encodings when the page only contains ASCII. Such pages are likely to be templates (Lorem Ipsum). Or if they are used for real output - let's say for English - then the encoding limits which characters can be used. E.g. «smart quotes» are not possible inside Shift-JIS, I believe - and at least not with US-ASCII or KOI8-R etc.
(4) HOW TO WARN: It would perhaps be a good idea to have a "NON-UTF-8 IN USE" heading, and to collect the different reasons against using non-UTF-8 under that heading. This is because I don't like the way the W3 HTML4 validator works, where one often has to think hard before understanding why fixing something in one corner opens up an error in another corner, etc.

Comments? Especially on (0), (2) and (3)? (Btw, I realize that (3d) and (2b) perhaps are the same thing.)
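To make the proposal in (0), (2), (3) and (4) easier to discuss, here is one way the severity tiers and the grouped heading might be sketched in Python. Everything in it is an assumption for illustration: `PageFacts`, `non_utf8_messages`, `render` and the message texts are hypothetical, the facts about the page are assumed to be collected elsewhere, and the case of an undeclared encoding on a page with non-ASCII content is left out because the comment does not cover it.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PageFacts:
    """Hypothetical summary of what a checker found in one document."""
    declared_encoding: Optional[str]  # None when no encoding is declared
    ascii_only: bool = False          # content is pure US-ASCII
    has_form: bool = False            # (1) forms content
    non_ascii_links: bool = False     # (3a)
    non_ascii_ids: bool = False       # (3b)
    non_empty_srcdoc: bool = False    # (3c)


def is_utf8(label: Optional[str]) -> bool:
    return label is not None and label.strip().lower() in ("utf-8", "utf8")


def non_utf8_messages(page: PageFacts) -> List[Tuple[str, str]]:
    """Return (severity, reason) pairs following points (2b) to (2d)."""
    if is_utf8(page.declared_encoding):
        return []
    if page.declared_encoding is None:
        # (2b): undeclared but ASCII-only, so warn that UTF-8 is not declared.
        if page.ascii_only:
            return [("WARNING", "no encoding declared; declare UTF-8")]
        return []  # not covered by the proposal
    risks = []
    if page.has_form:
        risks.append("page contains form content")
    if page.non_ascii_links:
        risks.append("links contain non-ASCII characters")
    if page.non_ascii_ids:
        risks.append("id attributes contain non-ASCII characters")
    if page.non_empty_srcdoc:
        risks.append("non-empty srcdoc attribute")
    if page.ascii_only:
        risks.append("legacy encoding declared on an ASCII-only page")  # (3d)
    if risks:
        # (2d): risky content upgrades the NOTE to a WARNING.
        return [("WARNING", reason) for reason in risks]
    # (2c): declared, compatible, no risky content.
    return [("NOTE", "HTML5 recommends the use of UTF-8")]


def render(messages: List[Tuple[str, str]]) -> str:
    """Group all reasons under the single heading suggested in (4)."""
    if not messages:
        return ""
    lines = ["NON-UTF-8 IN USE"]
    lines += ["  %s: %s" % (severity, reason) for severity, reason in messages]
    return "\n".join(lines)
```

For example, `render(non_utf8_messages(PageFacts("koi8-r", has_form=True)))` would produce the heading followed by a single WARNING line about form content, whereas a KOI8-R page with no risky content would get only a NOTE.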