NOTE: The current preferred location for bug reports is the GitHub issue tracker.
Bug 640 - Many valid language codes are rejected
Status: RESOLVED FIXED
Product: Validator.nu
Classification: Unclassified
Component: General
Version: HEAD
Hardware: All All
Importance: P2 normal
Assigned To: Michael[tm] Smith
Depends on: 643 679
Blocks:
Reported: 2009-09-03 23:39 CEST by Aryeh Gregor
Modified: 2009-11-25 12:39 CET (History)
Description Aryeh Gregor 2009-09-03 23:39:26 CEST
I'm trying to convert the interlanguage front page of Wikipedia, http://www.wikipedia.org/, to HTML 5.  There are a bunch of easily fixable errors, but the page contains tons of language codes that the validator reports as invalid (scroll down):

http://validator.nu/?doc=http://www.wikipedia.org&schema=http://s.validator.nu/html5/html5full.rnc+http://s.validator.nu/html5/assertions.sch+http://c.validator.nu/all/&parser=html5

I've been told by MediaWiki developers who know more about language codes than me that the large majority of the "invalid language code"-type errors are bogus, and probably result from the validator using an outdated language code database.  It was suggested that perhaps it was using ISO 639-2 instead of ISO 639-3.  Some of the codes *are* apparently invalid, but it's hard for me to tell which based on the validator's output.

At this point I'm not completely clear on whether this is also a spec bug, or only a validator bug -- i.e., I'm not sure if the spec currently mandates the outdated language code list.  If it does, it has to be fixed too, I think.

It would be nice if we could say the front page of Wikipedia validated as HTML 5 sometime soon, and this seems to be a blocker for that.  Saying "it's really valid and the validator is wrong" somehow isn't quite the same.  :)
Comment 1 Michael[tm] Smith 2009-09-04 01:52:54 CEST
The actual language-tag list that the validator uses is the one at http://www.iana.org/assignments/language-subtag-registry

I happen to be sitting in the same room with fantasai today (working on something unrelated) and she pointed out that it looks like most of the codes that the validator is reporting as errors are three-letter language tags. So maybe there's just a bug in the validator that causes it to not handle three-letter tags correctly. 

Anyway, I'll take a quick look at the code and see if I can spot anything obvious.
Comment 2 Michael[tm] Smith 2009-09-04 02:54:06 CEST
Actually, checking further, the validator code looks fine. It's reporting all the three-letter codes as bad because all of those cases were added to http://www.iana.org/assignments/language-subtag-registry on 2009-07-29 and validator.nu has not been deployed since then and so has not picked up the update yet.

After updating the language-subtag-registry in my workspace and running the validator locally, the only "Bad ISO language part in language tag" that it reports for the wikipedia.org page now is for lang="eml", and that error is correct because eml is not in http://www.iana.org/assignments/language-subtag-registry

It does still report "mo" as deprecated, but that report is correct according to the latest language-subtag-registry file.

The remaining language-tag errors it reports are about 12 other values, e.g., "zh-yue", which it reports as "Found reserved language extension subtag". As far as I can tell from taking a quick look at the validator.nu code, it's reporting those for any three-letter subtags immediately following the primary subtag -- even if that three-letter subtag is actually in the language-subtag-registry. So that seems like it might be a bug in validator.nu that needs to be fixed.
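
For illustration only, the registry lookup described in this comment can be sketched roughly as below. This is not the validator.nu code. The function names, the message wording, and the sample records are all assumptions: the records are hand-written in the registry's record format (fields separated into records by "%%") rather than copied from the registry, and the Deprecated date is a placeholder. Folded continuation lines and repeated fields, which the real file contains, are ignored here.

```python
# Hand-written sample in the registry's record format; NOT a verbatim excerpt.
# "YYYY-MM-DD" is a placeholder: the sketch only checks that the field exists.
SAMPLE_REGISTRY = """\
Type: language
Subtag: zh
Description: Chinese
%%
Type: language
Subtag: yue
Description: Yue Chinese
%%
Type: extlang
Subtag: yue
Description: Yue Chinese
Prefix: zh
%%
Type: language
Subtag: mo
Description: Moldavian
Deprecated: YYYY-MM-DD
Preferred-Value: ro
"""


def parse_records(text):
    """Split the registry text on '%%' and turn each record into a dict."""
    records = []
    for chunk in text.split("%%"):
        record = {}
        for line in chunk.strip().splitlines():
            key, _, value = line.partition(":")
            record[key.strip()] = value.strip()
        if record:
            records.append(record)
    return records


# Index subtag records by (type, subtag) for direct lookup.
INDEX = {
    (r["Type"], r["Subtag"]): r
    for r in parse_records(SAMPLE_REGISTRY)
    if "Subtag" in r
}


def check_language_tag(tag, index=INDEX):
    """Return a list of (level, message) findings for the first two subtags."""
    subtags = tag.lower().split("-")
    findings = []
    primary = index.get(("language", subtags[0]))
    if primary is None:
        return [("error", f"unregistered language subtag: {subtags[0]}")]
    if "Deprecated" in primary:
        findings.append(("warning", f"deprecated language subtag: {subtags[0]}"))
    # A three-letter alphabetic subtag directly after the primary subtag is an
    # extended language subtag: look it up instead of rejecting it outright.
    if len(subtags) > 1 and len(subtags[1]) == 3 and subtags[1].isalpha():
        extlang = index.get(("extlang", subtags[1]))
        if extlang is None:
            findings.append(("error", f"unregistered extlang subtag: {subtags[1]}"))
        elif extlang.get("Prefix") != subtags[0]:
            findings.append(("error", f"extlang {subtags[1]} does not fit prefix {subtags[0]}"))
    return findings
```

With this sample data, "eml" yields an error because it has no record, "mo" yields only a warning, and "zh-yue" passes because "yue" is found as an extlang with prefix "zh".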
Comment 3 Michael[tm] Smith 2009-09-05 05:38:00 CEST
(In reply to comment #2)
> The remaining language-tag errors it reports are about 12 other values, e.g.,
> "zh-yue", which it reports as "Found reserved language extension subtag". As
> far as I can tell from taking a quick look at the validator.nu code, it's
> reporting those for any three-letter subtags immediately following the primary
> subtag -- even if that three-letter subtag is actually in the
> language-subtag-registry. So that seems like it might be a bug in validator.nu
> that needs to be fixed.

On close inspection, I see it's not simply a bug in the validator. "zh-yue" and the other tags it is reporting as  "Found reserved language extension subtag" are actually all marked as "redundant" in http://www.iana.org/assignments/language-subtag-registry

So what needs to be done is that some specific handling for tags marked "redundant" needs to be added (similar to the handling for "grandfathered").
Comment 4 Michael[tm] Smith 2009-09-06 00:47:26 CEST
(In reply to comment #3)
> On close inspection, I see it's not simply a bug in the validator. "zh-yue" and
> the other tags it is reporting as  "Found reserved language extension subtag"
> are actually all marked as "redundant" in
> http://www.iana.org/assignments/language-subtag-registry
> 
> So what needs to be done is that some specific handling for tags marked
> "redundant" needs to be added (similar to the handling for "grandfathered").

And RFC 5646, on inspection, says that grandfathered and redundant language tags that are deprecated are, nonetheless, valid.

So it seems that validator.nu should emit warnings for those deprecated tags, instead of errors.
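
A minimal sketch of that policy, assuming whole-tag records are looked up before any subtag-by-subtag parsing. It is not the change that was actually made in validator.nu. The table, the function name, and the message wording are hypothetical, the entries are hand-written rather than copied from the registry, and the Deprecated values are placeholders.

```python
# Hand-written whole-tag records ("Type: redundant" / "Type: grandfathered").
# "YYYY-MM-DD" is a placeholder; only the presence of "Deprecated" matters here.
WHOLE_TAG_RECORDS = {
    "zh-yue": {"Type": "redundant", "Deprecated": "YYYY-MM-DD", "Preferred-Value": "yue"},
    "i-klingon": {"Type": "grandfathered", "Deprecated": "YYYY-MM-DD", "Preferred-Value": "tlh"},
    "zh-hant": {"Type": "redundant"},
}


def classify_whole_tag(tag, records=WHOLE_TAG_RECORDS):
    """Classify a tag that is registered as a whole.

    Returns None when the tag has no whole-tag record, so the caller can
    fall back to validating it subtag by subtag.
    """
    record = records.get(tag.lower())
    if record is None:
        return None
    if "Deprecated" in record:
        preferred = record.get("Preferred-Value")
        hint = f'; consider "{preferred}" instead' if preferred else ""
        # Deprecated whole tags are still valid, so this is a warning, not an error.
        return ("warning", f'{record["Type"]} tag "{tag}" is deprecated{hint}')
    return ("ok", f'{record["Type"]} tag "{tag}" is valid')
```

Under these assumptions, "zh-yue" produces a warning that points at "yue", "zh-Hant" is accepted silently, and a tag such as "en-US" returns None and goes through the ordinary subtag checks.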
Comment 5 Henri Sivonen 2009-09-07 13:31:44 CEST
I thought "redundant" means that the language tag can be derived from the current registrations and RFC rules so it doesn't need to be in the registry on its own. Maybe there are other kinds of "redundant" that I overlooked.
Comment 6 Michael[tm] Smith 2009-11-25 12:26:02 CET
resolved by changes made for bug 679
Comment 7 Michael[tm] Smith 2009-11-25 12:39:10 CET
(In reply to comment #5)
(quoting Henri)
> I thought "redundant" means that the language tag can be derived from the
> current registrations and RFC rules so it doesn't need to be in the registry on
> its own. Maybe there are other kinds of "redundant" that I overlooked.

There are some language tags in the registry that are marked as redundant but that are also marked as deprecated. That's the case we want to report. They are still valid, so not errors, but it's a case that it seems like we should warn users about.