Bugzilla – Bug 490
Charset.availableCharsets is inefficient
Last modified: 2013-03-21 11:08:37 CET
From the Javadoc for java.nio.charset.Charset.availableCharsets():

    The invocation of this method, and the subsequent use of the resulting map, may cause time-consuming disk or network I/O operations to occur. This method is provided for applications that need to enumerate all of the available charsets, for example to allow user charset selection. This method is not used by the {@link #forName forName} method, which instead employs an efficient incremental lookup algorithm.

This method is called from the nu.validator.htmlparser.io.Encoding static initializer. As far as I can tell, there is no way to avoid this file access. Nor is there any way via nu.validator.htmlparser.dom.HtmlDocumentBuilder to skip charset sniffing (for my use, I'd be happy to force it to UTF-8 for all parsing). Am I missing an alternative?
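To make the distinction in the Javadoc concrete, here is a minimal sketch contrasting the two JDK calls. The class and method names (CharsetLookupDemo, lookup, countAvailable) are made up for illustration and are not part of the parser; only Charset.forName and Charset.availableCharsets are real JDK API.

```java
import java.nio.charset.Charset;
import java.util.SortedMap;

// Hypothetical demo class, not part of nu.validator.htmlparser.
final class CharsetLookupDemo {
    // Single-name lookup. Per the Javadoc, forName does not go through
    // availableCharsets() and uses an incremental lookup instead.
    static Charset lookup(String name) {
        return Charset.forName(name);
    }

    // Full enumeration. This is the call the Javadoc warns may trigger
    // time-consuming disk or network I/O.
    static int countAvailable() {
        SortedMap<String, Charset> all = Charset.availableCharsets();
        return all.size();
    }

    public static void main(String[] args) {
        System.out.println(lookup("utf-8").name());
        System.out.println(countAvailable() + " charsets available");
    }
}
```

The count printed by the second line depends on the JVM and on any installed charset providers, so it will differ between environments.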
The static initializer for Encoding runs once per app lifetime, so under normal circumstances, its efficiency shouldn't matter. I realize that it might matter on an AppEngine-like setup. Do you have a proposal for what to do instead without sacrificing correctness and while allowing autodiscovery of non-JDK rough ASCII superset decoders? Should I offer a magic system property that bypasses the encoding enumeration and pokes a hardcoded list of encodings instead? (Can one set system properties on AppEngine apps?)

HtmlDocumentBuilder should skip charset sniffing if you call .setEncoding("utf-8") on the InputSource object before passing it to the parser.
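The setEncoding suggestion can be sketched roughly as follows. The block only uses JDK types so that it compiles on its own; the class and method names (ForcedEncodingParse, utf8Source, parse) are hypothetical, and passing an HtmlDocumentBuilder as the builder argument assumes that class extends javax.xml.parsers.DocumentBuilder.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilder;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of nu.validator.htmlparser.
final class ForcedEncodingParse {
    // Wrap raw bytes in an InputSource and declare the encoding up front,
    // so the parser is told the encoding instead of having to sniff it.
    static InputSource utf8Source(byte[] bytes) {
        InputSource source = new InputSource(new ByteArrayInputStream(bytes));
        source.setEncoding("utf-8");
        return source;
    }

    // Parse with any DocumentBuilder. With the HTML parser on the classpath,
    // the caller would pass an HtmlDocumentBuilder instance here (assumption:
    // it is a DocumentBuilder subclass).
    static Document parse(DocumentBuilder builder, byte[] bytes)
            throws SAXException, IOException {
        return builder.parse(utf8Source(bytes));
    }
}
```

Note that this only addresses sniffing at parse time. It does not avoid the Encoding static initializer, which still runs when the class is first loaded.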
For my use, I'd like an option to configure the parser to use UTF-8 and skip sniffing altogether. Or perhaps set a default encoding such that, when this value is unset, the parser works as currently implemented, but when it is set, the parser only overrides it if an explicit charset is in the HTML stream. That would eliminate this problem, and also make parsing more efficient when we don't need to auto-detect the charset.

- Philip

(In reply to comment #1)
> The static initializer for Encoding runs one per app lifetime, so under normal circumstances, its efficiency shouldn't matter. I realize that it might matter on an AppEngine-like setup. Do you have a proposal what to do instead without sacrificing correctness and while allowing autodiscovery of non-JDK rough ASCII superset decoders? Should I offer a magic system property that bypasses the encoding enumeration and pokes a hardcoded list of encodings instead? (Can one set system properties on AppEngine apps?)
>
> HtmlDocumentBuilder should skip charset sniffing if you call .setEncoding("utf-8") on the InputSource object before passing it to the parser.
The notion of correctness has changed here. The plan is to hard-code alias data from the Encoding Standard and not to support encodings outside that spec.
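As a rough illustration of what hard-coded alias data could look like, here is a minimal sketch. It is not the parser's actual implementation: the class name EncodingLabels and the resolve method are hypothetical, and the table holds only a handful of labels from the Encoding Standard rather than the full list.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of a hard-coded label table; a small subset only.
final class EncodingLabels {
    // Label -> encoding name, following the Encoding Standard, which maps
    // labels such as "latin1" and "ascii" to windows-1252.
    private static final Map<String, String> LABELS = Map.of(
        "utf-8", "UTF-8",
        "utf8", "UTF-8",
        "unicode-1-1-utf-8", "UTF-8",
        "windows-1252", "windows-1252",
        "iso-8859-1", "windows-1252",
        "latin1", "windows-1252",
        "us-ascii", "windows-1252",
        "ascii", "windows-1252");

    // Trim surrounding whitespace and lowercase before the lookup.
    // trim() approximates the spec's ASCII-whitespace stripping.
    // Labels outside the table are rejected rather than discovered.
    static Optional<String> resolve(String label) {
        String key = label.trim().toLowerCase(Locale.ROOT);
        return Optional.ofNullable(LABELS.get(key));
    }
}
```

A lookup like this needs no call to Charset.availableCharsets(), which is the point of the original report. The trade-off is that encodings outside the table are no longer supported.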