Bug 883 - Please issue a warning when UTF-8 isn't used.
Status: RESOLVED FIXED
Product: Validator.nu
Classification: Unclassified
Component: General
Version: HEAD
Hardware: All  OS: All
Importance: P2 normal
Assigned To: Nobody
URL: http://dev.w3.org/html5/spec/semantic...
Reported: 2011-12-05 21:15 CET by Leif Halvard Silli
Modified: 2016-02-04 17:49 CET

Description Leif Halvard Silli 2011-12-05 21:15:23 CET
Spec says:

"Authors are encouraged to use UTF-8. Conformance checkers may advise authors against using legacy encodings. [RFC3629]"

I suggest that the validator implement this MAY and start advising authors to use and declare UTF-8 as the encoding.
Comment 1 Henri Sivonen 2011-12-09 14:38:42 CET
I think a warning makes sense when the page contains a form element, because it's harmful to use encodings that can't represent all possible user input in form submissions.

When the encoding has been declared but isn't UTF-8 and the page doesn't have forms, I think complaining about the encoding would be overkill.
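
To make the repertoire problem concrete, here is a small standalone Java illustration; it is not Validator.nu code, just a sketch of why a legacy-encoded page cannot faithfully carry arbitrary user input. (In real form submissions browsers typically substitute numeric character references for characters the encoding cannot represent, which is lossy in a different way; Java's default replacement with '?' is used here only to show the same underlying limitation.)

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FormEncodingDemo {
    public static void main(String[] args) {
        String userInput = "Łódź, Šiauliai, 東京"; // plausible input for a form field
        Charset legacy = Charset.forName("ISO-8859-1");

        // Round-tripping through the legacy charset shows the loss:
        // characters outside its repertoire are replaced with '?'.
        String viaLegacy = new String(userInput.getBytes(legacy), legacy);
        String viaUtf8 = new String(userInput.getBytes(StandardCharsets.UTF_8),
                                    StandardCharsets.UTF_8);

        System.out.println("via ISO-8859-1: " + viaLegacy); // ?ód?, ?iauliai, ??
        System.out.println("via UTF-8:      " + viaUtf8);   // round-trips intact
    }
}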
Comment 2 Leif Halvard Silli 2011-12-09 19:11:02 CET
(0) GENERAL: 
     * Do you use NOTEs and not only WARNINGs and ERRORs? I consider a NOTE to be the statement of a SHOULD rule.

     * If a document "breaks" a SHOULD rule, such as the rule to use UTF-8, then I propose that the validator issue a NOTE, except when the page contains forms content or some of the content listed under (3) below; in that case I suggest a WARNING (or, if you can convince Hixie, an ERROR).

(1) FORMS: Yes, I agree that there is an increased need for a warning when the page contains form elements. (In fact, this is a situation which belongs together with HTML5's recommendation that all new content/documents should be UTF-8 encoded - see (2).) But note that if such a forms page is used to produce another page or piece of content - such as in guest books and page comments - then that other page also needs to use a UTF-8-compatible encoding ...

(2) OVERKILL OR AUTHORING TOOL: The validator is an authoring tool. So could we please relate the MAY that is the subject of this bug report to the following SHOULD in HTML5?

 "Authoring tools should default to using UTF-8 for newly-created documents. [RFC3629]" 
  http://dev.w3.org/html5/spec/semantics.html#charset

  What does the above recommendation mean for Validator.nu? I suggest that it should mean the following (a rough code sketch of these rules follows at the end of this point):

  a) When a page uses an HTML5 DOCTYPE, then the page should be
      considered as newly created, and thus the validator should expect UTF-8.
      (To avoid people speculating about the effects of the DOCTYPE, any document
      that is tested for HTML5 validity should be considered "new", even
      if it uses a legacy DOCTYPE.)
      THUS:
  b) If the encoding isn't declared, but the page otherwise is UTF-8
      compatible (i.e. US-ASCII), then warn about the failure to declare
      UTF-8. (You said elsewhere that you yourself wanted this to be an
      error, but that Hixie said no. However, you have Hixie's blessing to
      issue a warning instead. Only, I think the warning should point out
      not just the lack of a declaration but specifically that UTF-8 is not declared.)
  c) If the page DECLARES a non-UTF-8 encoding (with which it is compatible),
      then simply issue a NOTE (or possibly a WARNING) that HTML5, for
      such and such reasons, recommends the use of UTF-8.
  d) But if it declares an encoding with which it is compatible BUT contains
      forms content OR other risky content - such as the items I list under
      point (3) below - then issue a WARNING (and not a NOTE).

      Validator.nu and/or its code will be used/embedded/etc in many authoring tools, and hence it could have a good effect, one should think, if it whines when non-UTF-8 is used with HTML5 documents, no? 

      Note in this regard your recommendation on the WHATWG mailing list that I focus on convincing authoring tool makers. For example, this would mean that pages created by Komodo would trigger a warning or a note until it changes its default encoding to UTF-8: http://bugs.activestate.com/show_bug.cgi?id=90424
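
      To summarise (a)-(d), here is a rough sketch in plain Java of the decision ladder I am proposing. It is not Validator.nu's actual message API; the names (Severity, PageFacts, utf8Advice) are made up for illustration, it treats every checked document as "new" per (a), and the thresholds simply mirror the points above.

public class Utf8AdviceSketch {

    enum Severity { NONE, NOTE, WARNING }

    // Hypothetical summary of what the checker has observed about a document.
    static final class PageFacts {
        boolean encodingDeclared;
        boolean encodingIsUtf8;
        boolean hasForms;
        boolean hasOtherRiskyContent; // the items listed under (3)
    }

    static Severity utf8Advice(PageFacts p) {
        if (!p.encodingDeclared) {
            return Severity.WARNING;  // (b): nothing declared, advise declaring UTF-8
        }
        if (p.encodingIsUtf8) {
            return Severity.NONE;     // declared UTF-8: nothing to complain about
        }
        if (p.hasForms || p.hasOtherRiskyContent) {
            return Severity.WARNING;  // (d): declared legacy encoding plus risky content
        }
        return Severity.NOTE;         // (c): declared legacy encoding, otherwise harmless
    }

    public static void main(String[] args) {
        PageFacts p = new PageFacts();
        p.encodingDeclared = true;    // e.g. <meta charset=windows-1252>
        p.encodingIsUtf8 = false;
        p.hasForms = true;
        System.out.println(utf8Advice(p)); // prints WARNING
    }
}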

(3) RISKY CONTENT that increases the need for a warning against use of non-UTF-8 (a detection sketch follows after this list):

a)  use of non-ASCII in links (perhaps it could be formulated as "use of IRIs" - or better: "use of IRIs with confusing semantics"). Links that contain non-ASCII have dubious semantics unless the page is UTF-8 encoded.

b) use of non-ASCII inside @id attributes. Otherwise, linking to and
    IDREFerencing those fragments risks failing due to
    disagreeing UA implementations and the confusing
    semantics, for authors, of non-UTF-8 IDREFs.

c) use of non-empty @srcdoc attributes (and possibly use of
    data URIs inside @src of <iframe> or @data of <object>).

d) Use of legacy encodings when the page only contains ASCII.
    Such pages are likely to be templates (Lorem Ipsum). Or, if
    they are used for real output - let's say for English - then it
    limits which characters can be used. E.g. «smart
    quotes» are not possible inside Shift-JIS, I believe - and at
    least not with US-ASCII or KOI8-R etc.
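
To show how (a)-(c) above could be detected mechanically, here is a minimal sketch written against the standard org.w3c.dom API rather than Validator.nu's actual SAX-based pipeline; the class and method names are hypothetical. (Detecting (d) needs the document's encoding, which a bare DOM does not carry, so it is omitted.)

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class RiskyContentSketch {

    static boolean isAscii(String s) {
        return s.chars().allMatch(c -> c < 0x80);
    }

    // (a) non-ASCII in link targets, (b) non-ASCII in @id,
    // (c) non-empty @srcdoc on <iframe>.
    static boolean hasRiskyContent(Document doc) {
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (!isAscii(e.getAttribute("href")) || !isAscii(e.getAttribute("id"))) {
                return true;
            }
            if ("iframe".equalsIgnoreCase(e.getTagName())
                    && !e.getAttribute("srcdoc").isEmpty()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        // Well-formed XHTML is used here only because DocumentBuilder parses XML.
        String xhtml = "<html><body><a href='http://example.org/æøå'>x</a></body></html>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(hasRiskyContent(doc)); // true: non-ASCII in a link target
    }
}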
    

(4) HOW TO WARN: It would perhaps be a good idea to have a "NON-UTF-8 IN USE" heading, and to collect the different reasons against using non-UTF-8 under that heading. This is because I don't like the way the W3C HTML4 validator works, where one often has to think a lot before one understands why fixing something in one corner opens up an error in another corner, etc.

Comments? Especially on (0), (2) and (3)? (Btw, I realize that (3d) and (2b) are perhaps the same thing.)