Bug 883 – Please issue a warning when UTF-8 isn't used.

NOTE: The current preferred location for bug reports is the GitHub issue tracker.

Summary:	Please issue a warning when UTF-8 isn't used.

Status:	RESOLVED FIXED

Product:	Validator.nu
Classification:	Unclassified
Component:	General
Version:	HEAD
Hardware:	All All

Importance:	P2 normal
Assigned To:	Nobody

URL:	http://dev.w3.org/html5/spec/semantic...

Reported:	2011-12-05 21:15 CET by Leif Halvard Silli
Modified:	2016-02-04 17:49 CET
CC List:	2 users

Description Leif Halvard Silli 2011-12-05 21:15:23 CET

Spec says:

]] Authors are encouraged to use UTF-8. Conformance checkers may advise authors against using legacy encodings. [RFC3629] [[

I suggest that the validator implement this MAY and start advising authors to use and declare UTF-8 as the encoding.

Comment 1 Henri Sivonen 2011-12-09 14:38:42 CET

I think a warning makes sense when the page contains a form element, because it's harmful to use encodings that can't represent all possible user input in form submissions.

When the encoding has been declared but isn't UTF-8 and the page doesn't have forms, I think complaining about the encoding would be overkill.

Comment 2 Leif Halvard Silli 2011-12-09 19:11:02 CET

(0) GENERAL:
* Do you use NOTES and not only WARNINGS and ERRORS? I consider a NOTE to be the statement of a SHOULD rule.

* If a document "breaks" a SHOULD rule, such as the rule to use UTF-8, then I propose that the validator could issue a NOTE. The exception is when the page contains forms content or some of the content listed under (3) below - then I suggest a WARNING (or, if you can convince Hixie, an ERROR).

(1) FORMS: Yes, I agree that there is an increased need for a warning when the page contains form elements. (In fact, this is a situation which belongs together with HTML5's recommendation that all new content/documents should be UTF-8 encoded - see (2).) But note that if such a forms content page is used to produce another page/content - such as in guest books and page comments - then that other page also needs to use UTF-8-compatible ...

(2) OVERKILL OR AUTHORING TOOL: The validator is an authoring tool. So could we please relate the MAY that is the subject of this bug report to the following SHOULD in HTML5?

"Authoring tools should default to using UTF-8 for newly-created documents. [RFC3629]"
http://dev.w3.org/html5/spec/semantics.html#charset

What does the above recommendation mean to Validator.nu? I suggest that it should mean the following:

a) When a page uses an HTML5 DOCTYPE, then the page should be
considered as newly created, and thus the validator should expect UTF-8.
(To keep people from speculating in the effects of the DOCTYPE, any document
that is tested for HTML5 validity should be considered "new", even
if it uses a legacy DOCTYPE.)
THUS:
b) If the encoding isn't declared, but the page otherwise is UTF-8
compatible (a.k.a. US-ASCII), then warn about the missing UTF-8
declaration. (You said elsewhere that you yourself wanted this to be an
error, but that Hixie said no. However, you have Hixie's blessing to
issue a warning instead. Only I think you should warn not just about
the lack of a declaration but specifically that UTF-8 isn't declared.)
c) If the page DECLARES a non-UTF-8 encoding (with which it is compatible),
then simply issue a NOTE (or eventually: WARNING) that HTML5, due to
such and such reasons, recommends the use of UTF-8.
d) But if it declares an encoding with which it is compatible BUT contains
forms content OR other risky content - such as those I list under
point (3) below - then issue a WARNING (and not a NOTE).
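
Rules (b) to (d) can be read as a small decision table. What follows is a minimal Python sketch of that reading, not Validator.nu's actual code: the names `Severity`, `PageFacts` and `encoding_advice` are hypothetical, and rule (b) is simplified to "no declaration at all" without checking that the bytes really are US-ASCII.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    NONE = "none"
    NOTE = "note"
    WARNING = "warning"


@dataclass
class PageFacts:
    # All fields are hypothetical inputs a checker would have to compute.
    declared_encoding: Optional[str]  # None when no encoding is declared
    has_forms: bool = False           # page contains form elements, see (1)
    has_risky_content: bool = False   # any item from list (3)


def _is_utf8(label: str) -> bool:
    return label.strip().lower() in {"utf-8", "utf8"}


def encoding_advice(page: PageFacts) -> Severity:
    """Map the proposed rules (b), (c) and (d) to a message severity."""
    if page.declared_encoding is None:
        # Rule (b): missing declaration. Assumption: the page is ASCII-only;
        # this sketch does not inspect the bytes.
        return Severity.WARNING
    if _is_utf8(page.declared_encoding):
        return Severity.NONE
    if page.has_forms or page.has_risky_content:
        # Rule (d): legacy encoding plus forms or risky content.
        return Severity.WARNING
    # Rule (c): declared legacy encoding, nothing risky on the page.
    return Severity.NOTE
```

For example, `encoding_advice(PageFacts("windows-1252"))` yields a NOTE, while the same page with `has_forms=True` yields a WARNING.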

Validator.nu and/or its code will be used/embedded/etc in many authoring tools, and hence it could have a good effect, one should think, if it whines when non-UTF-8 is used with HTML5 documents, no?

Note in this regard your recommendation on the WHATWG mailing list that I focus on convincing authoring tool makers. For example, this would mean that pages created by Komodo would trigger a warning or a note until Komodo changes its default encoding to UTF-8: http://bugs.activestate.com/show_bug.cgi?id=90424

(3) RISKY CONTENT that increases the need for a warning against use of non-UTF-8:

a) use of non-ASCII in links (perhaps it could be formulated as "use of IRIs" - or better: "use of IRIs with confusing semantics"). Links that contain non-ASCII have dubious semantics unless the page is UTF-8 encoded.

b) use of non-ASCII inside @id attributes. Otherwise linking to and
IDREFerencing those fragments risks failing due to
disagreeing UA implementations and the confusing
semantics of non-UTF-8 IDREFs for authors.

c) use of non-empty @srcdoc attributes (and possibly use of
data URIs inside @src of <iframe> or @data of <object>)

d) Use of legacy encodings when the page only contains ASCII.
Such pages are likely to be templates (Lorem Ipsum). Or if
they are used for real output - let's say for English - then it
limits which characters can be used. E.g. «smart
quotes» are not possible inside Shift-JIS, I believe - and at
least not with US-ASCII or KOI8-R etc.
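
Items (3a) to (3c) are checks on attribute values, so they can be sketched with Python's standard `html.parser` module. This is only an illustration under stated assumptions: `RiskyContentFinder`, `find_risky_content` and the `LINK_ATTRS` set are hypothetical names and choices, the data URI case in (3c) is left out, and (3d) is not covered because it needs the raw bytes and the declared encoding rather than the markup.

```python
from html.parser import HTMLParser

# Assumption: these attributes are treated as "links" for item (3a).
LINK_ATTRS = {"href", "src", "action", "data"}


def _has_non_ascii(value: str) -> bool:
    return any(ord(ch) > 127 for ch in value)


class RiskyContentFinder(HTMLParser):
    """Collects (item, tag, attribute) tuples for items (3a) to (3c)."""

    def __init__(self):
        super().__init__()
        self.findings = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:  # attribute present without a value
                continue
            if name in LINK_ATTRS and _has_non_ascii(value):
                self.findings.append(("3a", tag, name))
            elif name == "id" and _has_non_ascii(value):
                self.findings.append(("3b", tag, name))
            elif name == "srcdoc" and value.strip():
                self.findings.append(("3c", tag, name))


def find_risky_content(html_text: str):
    finder = RiskyContentFinder()
    finder.feed(html_text)
    finder.close()
    return finder.findings
```

A non-empty result from `find_risky_content` would be one possible way to decide between the NOTE of (2c) and the WARNING of (2d).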

(4) HOW TO WARN: It would perhaps be a good idea to have a "NON-UTF-8 IN USE" heading, and to collect the different reasons for not using non-UTF-8 under that heading. This is because I don't like the way the W3C HTML4 validator works, where one often has to think a lot before one understands why fixing something in one corner opens up an error in another corner etc.

Comments? Especially on (0), (2) and (3)? (Btw, I realize that (3d) and (2b) perhaps are the same thing.)