Bugzilla – Bug 958
Try UTF-8 by default and then, if there is malformed input, fall back to CP1252
Last modified: 2013-01-17 17:22:21 CET
Created attachment 226 [details] The patch that fixes the problem

Currently validatorNU tries to guess the encoding using several sniffing policies, and if none of the sniffers succeeds, it assumes that the encoding is Windows-1252 (CP1252). Sadly, this assumption is not always right, and we are having problems with some of our documents.

Here is a patch that fixes the problem, along with two test documents: one in UTF-8 (lira-symbol-test.html) containing the problematic character, and one in CP1252 (windows-cp1252-test.html) to make sure that we still parse such documents as well as before.

My patch implements the following strategy: first set the encoding to UTF-8, and if decoding fails, reinitialize the decoder with the CP1252 encoding.

Note that with the file that is CP1252 we did not get any malformed-content error; the decoder just silently decoded it, replacing one of the characters with two.
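
For readers without access to the attachment, the strategy described above can be sketched roughly as follows. This is not the code from attachment 226: the class name FallbackDecoder and the method decode are hypothetical, and the sketch decodes a complete byte array in one go, whereas the real patch reinitializes a decoder inside validatorNU. It only assumes the standard java.nio.charset API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating "try UTF-8 first, fall back to CP1252".
// It is an illustration only, not the patch attached to this bug.
final class FallbackDecoder {

    // Assumption: windows-1252 is the fallback encoding, as in the bug description.
    private static final Charset CP1252 = Charset.forName("windows-1252");

    static String decode(byte[] input) {
        // A strict decoder: malformed or unmappable input raises an
        // exception instead of being replaced silently.
        CharsetDecoder utf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return utf8.decode(ByteBuffer.wrap(input)).toString();
        } catch (CharacterCodingException e) {
            // The input is not valid UTF-8, so decode it again as CP1252.
            return new String(input, CP1252);
        }
    }
}
```

The strict REPORT action matters here: convenience APIs such as new String(bytes, charset) substitute replacement characters for malformed input instead of signalling an error, so they give the caller no chance to fall back.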
Created attachment 227 [details] UTF8 encoded file with lira character as a test
Created attachment 228 [details] Windows1252 encoded file just for testing the old behaviour