NOTE: The current preferred location for bug reports is the GitHub issue tracker.
Bug 860 - Unicode dashes become garbled coming out of the html parser
Classification: Unclassified
Component: HTML parser
Hardware: All All
Importance: P2 normal
Assigned To: Nobody
Reported: 2011-10-05 00:06 CEST by yz
Modified: 2011-10-06 08:04 CEST (History)
Description yz 2011-10-05 00:06:11 CEST
The following dashes:
Figure Dash ‒: U+2012
En Dash –: U+2013
Em Dash —: U+2014
Horizontal Bar ―: U+2015

All of these come out of the HTML parser as garbled text.

Note that the En Dash and Em Dash are parsed correctly when they are written as HTML character references (&ndash; and &mdash;).
Comment 1 Henri Sivonen 2011-10-05 08:45:26 CEST
Do you have an online test case? On the face of things, this looks like a failure to declare the character encoding of the input correctly.
Comment 2 yz 2011-10-05 23:06:01 CEST
This was indeed an encoding problem. The HTML parser guessed the wrong encoding. Once we set the correct encoding, the dashes are parsed correctly.
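
For anyone landing here with the same symptom, the effect can be reproduced without the parser at all. The following is a minimal Python sketch, not the parser's own code. It assumes the input bytes were UTF-8 and that the wrongly guessed encoding was windows-1252 (the report does not say which encoding was actually guessed). The helper name decode_as is made up for this illustration.

```python
# Illustration only: shows how UTF-8 dashes garble when the bytes are
# decoded with the wrong encoding. windows-1252 is an assumed wrong guess.
DASHES = {
    "FIGURE DASH": "\u2012",
    "EN DASH": "\u2013",
    "EM DASH": "\u2014",
    "HORIZONTAL BAR": "\u2015",
}


def decode_as(text, assumed_encoding):
    """Encode text as UTF-8, then decode the bytes with assumed_encoding."""
    return text.encode("utf-8").decode(assumed_encoding)


for name, ch in DASHES.items():
    garbled = decode_as(ch, "cp1252")  # wrong guess: one dash becomes three characters
    correct = decode_as(ch, "utf-8")   # right encoding: the dash round-trips
    print(f"{name}: wrong guess -> {garbled!r}, correct -> {correct!r}")

# Character references are plain ASCII, so they survive either decoding.
print(decode_as("&ndash; &mdash;", "cp1252"))
```

This is also consistent with the escaped forms working in the original report: the references contain only ASCII bytes, which decode identically under both encodings.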