Bug 958 – Try UTF8 by default and then if there is a malformed input rollback to CP1252

NOTE: The current preferred location for bug reports is the GitHub issue tracker.


Summary:	Try UTF8 by default and then if there is a malformed input rollback to CP1252

Status:	NEW

Product:	Validator.nu
Classification:	Unclassified
Component:	HTML parser
Version:	HEAD
Hardware:	All All

Importance:	P2 normal
Assigned To:	Nobody

URL:

Depends on:
Blocks:

Reported:	2013-01-17 17:20 CET by Nikola Petrov
Modified:	2013-01-17 17:22 CET
CC List:	0 users

See Also:

Attachments
The patch that fixes the problem (1.71 KB, patch) 2013-01-17 17:20 CET, Nikola Petrov
UTF8 encoded file with lira character as a test (99 bytes, text/html) 2013-01-17 17:21 CET, Nikola Petrov
Windows1252 encoded file just for testing the old behaviour (98 bytes, text/html) 2013-01-17 17:22 CET, Nikola Petrov


Description Nikola Petrov 2013-01-17 17:20:56 CET

Created attachment 226
The patch that fixes the problem

Currently Validator.nu tries to guess the encoding using several sniffing policies, and if none of the sniffers succeeds, it assumes that the encoding is Windows-1252. Sadly this is not always right, and we are having problems with some documents. Here is a patch that fixes the problem, along with two test documents: one in UTF-8 (lira-symbol-test.html) containing the problematic character, and one in CP1252 (windows-cp1252-test.html) to make sure that we still parse those as well as before.

My patch uses the following strategy: first set the encoding to UTF-8, and if decoding fails, reinitialize the decoder with the CP1252 encoding. Note that with the CP1252 file we did not get any malformed-content error; the input was just silently decoded, with one of the characters replaced by two.
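
To illustrate the strategy outside the parser, here is a minimal standalone sketch. It is not the attached patch: the class name FallbackDecoder and the decode method are hypothetical, and it assumes the whole input is available as a byte array, whereas the real parser decodes a stream. It only uses the standard java.nio.charset API, with the UTF-8 decoder configured to report malformed input instead of silently replacing it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Validator.nu.
final class FallbackDecoder {
    private FallbackDecoder() {}

    // Decode as strict UTF-8 first; on malformed input, decode again as windows-1252.
    static String decode(byte[] input) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes the decoder throw instead of emitting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(input))
                    .toString();
        } catch (CharacterCodingException e) {
            // The bytes are not valid UTF-8, so start over with the old default.
            return new String(input, Charset.forName("windows-1252"));
        }
    }
}
```

With this sketch, the UTF-8 bytes E2 82 A4 decode to the single lira sign (U+20A4), while a lone byte A3, which is not valid UTF-8, falls back to windows-1252 and decodes to the pound sign (U+00A3). Pure ASCII input decodes identically either way. One limitation of trying UTF-8 first is that a CP1252 document whose bytes happen to form valid UTF-8 sequences would be decoded as UTF-8.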

Comment 1 Nikola Petrov 2013-01-17 17:21:42 CET

Created attachment 227
UTF8 encoded file with lira character as a test

Comment 2 Nikola Petrov 2013-01-17 17:22:21 CET

Created attachment 228
Windows1252 encoded file just for testing the old behaviour