Bugzilla – Bug 860
Unicode dashes become garbled coming out of the html parser
Last modified: 2011-10-06 08:04:02 CEST
The following dashes:
Figure Dash ‒: U+2012
En Dash –: U+2013
Em Dash —: U+2014
Horizontal Bar ―: U+2015
These all become garbled text coming out of the HTML parser. Note that the En Dash and Em Dash are parsed fine if they are properly HTML-escaped (&ndash; and &mdash;).
Do you have an online test case? On the face of things, this looks like a failure to declare the character encoding of the input correctly.
Indeed, this was an encoding problem. Htmlparser guessed the wrong encoding. If we set the correct encoding, the dashes are parsed correctly.
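For anyone hitting the same symptom, here is a minimal Python sketch of how this kind of garbling arises. It does not use the HTML parser itself. It assumes the input bytes were UTF-8 and that the wrongly guessed encoding was windows-1252; the report does not say which encoding was actually guessed, so treat that as an illustration only. The names DASHES, decode_as and to_char_refs are made up for this example.

```python
# Sketch only: reproduces dash garbling by decoding UTF-8 bytes with the
# wrong encoding. windows-1252 (cp1252) is an assumed stand-in for whatever
# encoding the parser guessed; the bug report does not name it.

DASHES = {
    "FIGURE DASH": "\u2012",
    "EN DASH": "\u2013",
    "EM DASH": "\u2014",
    "HORIZONTAL BAR": "\u2015",
}


def decode_as(text, source_encoding="utf-8", guessed_encoding="cp1252"):
    """Encode text as the real source encoding, then decode it as the guess."""
    return text.encode(source_encoding).decode(guessed_encoding)


def to_char_refs(text):
    """Replace every non-ASCII character with a numeric character reference.

    ASCII-only markup survives a wrong guess among ASCII-compatible encodings,
    which is consistent with the escaped dashes parsing fine.
    """
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    for name, ch in DASHES.items():
        wrong = decode_as(ch)                      # three junk characters
        right = decode_as(ch, guessed_encoding="utf-8")  # the original dash
        print(f"{name}: U+{ord(ch):04X} wrong={wrong!r} "
              f"right={right!r} escaped={to_char_refs(ch)}")
```

Each dash is three bytes in UTF-8, so a single-byte decoder turns it into three unrelated characters. Declaring the encoding explicitly (or escaping the dashes as character references) avoids depending on the parser's guess.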