NOTE: The current preferred location for bug reports is the GitHub issue tracker.
Bug 958 - Try UTF8 by default and then if there is a malformed input rollback to CP1252
Status: NEW
Product: Validator.nu
Classification: Unclassified
Component: HTML parser
Version: HEAD
Hardware: All All
Importance: P2 normal
Assigned To: Nobody
Depends on:
Blocks:
Reported: 2013-01-17 17:20 CET by Nikola Petrov
Modified: 2013-01-17 17:22 CET (History)
Attachments
The patch that fixes the problem (1.71 KB, patch)
2013-01-17 17:20 CET, Nikola Petrov
UTF8 encoded file with lira character as a test (99 bytes, text/html)
2013-01-17 17:21 CET, Nikola Petrov
Windows1252 encoded file just for testing the old behaviour (98 bytes, text/html)
2013-01-17 17:22 CET, Nikola Petrov
Description Nikola Petrov 2013-01-17 17:20:56 CET
Created attachment 226 [details]
The patch that fixes the problem

Currently Validator.nu tries to guess the encoding using several sniffing policies, and if none of the sniffers succeeds, it assumes that the encoding is windows-1252. This assumption is not always right, and we are having problems with some of our documents. Attached is a patch that fixes the problem, along with two test documents: one in UTF-8 (lira-symbol-test.html) containing the problematic character, and one in CP1252 (windows-cp1252-test.html) to make sure that such documents are still parsed as well as before.

My patch uses the following strategy: first set the encoding to UTF-8 and, if decoding fails on malformed input, reinitialize the decoder with the CP1252 encoding. Note that with the file that is CP1252 we did not get any malformed-content error; the decoder just silently decoded it, replacing one of the characters with two.
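
For readers without access to the attachment, here is a minimal sketch of the fallback strategy described above. It is not the attached patch and does not use Validator.nu's own decoder classes; it relies only on the standard `java.nio.charset` API. The class name `FallbackDecoder` and the method `decode` are hypothetical, and U+00A3 is used as a stand-in character because the contents of the test attachments are not reproduced here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only (not part of Validator.nu).
final class FallbackDecoder {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Try strict UTF-8 first; on malformed input, decode again as windows-1252.
    static String decode(byte[] input) {
        CharsetDecoder utf8 = StandardCharsets.UTF_8.newDecoder()
                // REPORT makes the decoder throw instead of silently
                // substituting U+FFFD for bad byte sequences.
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return utf8.decode(ByteBuffer.wrap(input)).toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: fall back to the old default encoding.
            return new String(input, CP1252);
        }
    }

    public static void main(String[] args) {
        // Assumption: U+00A3 as sample character (2 bytes in UTF-8, 1 in CP1252).
        byte[] utf8Bytes = {(byte) 0xC2, (byte) 0xA3};
        byte[] cp1252Bytes = {(byte) 0xA3};

        System.out.println(decode(utf8Bytes).length());              // 1
        System.out.println(decode(cp1252Bytes).length());            // 1
        // Decoding the UTF-8 bytes as CP1252 raises no error
        // and yields two characters instead of one.
        System.out.println(new String(utf8Bytes, CP1252).length());  // 2
    }
}
```

The last line of `main` shows why the windows-1252 default cannot detect the problem by itself: the two bytes 0xC2 and 0xA3 are both valid windows-1252 characters, so no malformed-input error can ever be raised. A real implementation inside a streaming parser would also have to handle input that has already been consumed before the error is detected, which this sketch sidesteps by decoding a complete byte array.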
Comment 1 Nikola Petrov 2013-01-17 17:21:42 CET
Created attachment 227 [details]
UTF8 encoded file with lira character as a test
Comment 2 Nikola Petrov 2013-01-17 17:22:21 CET
Created attachment 228 [details]
Windows1252 encoded file just for testing the old behaviour