NOTE: The current preferred location for bug reports is the GitHub issue tracker.
Bug 490 - Charset.availableCharsets is inefficient
Status: ASSIGNED
Product: Validator.nu
Classification: Unclassified
Component: HTML parser
Version: HEAD
Hardware: All All
Importance: P2 normal
Assigned To: Henri Sivonen
Depends on:
Blocks:
Reported: 2009-04-18 01:25 CEST by Philip Tucker
Modified: 2013-03-21 11:08 CET
Description Philip Tucker 2009-04-18 01:25:39 CEST
From the javadoc for java.nio.charset.Charset.availableCharsets():

     * <p> The invocation of this method, and the subsequent use of the
     * resulting map, may cause time-consuming disk or network I/O operations
     * to occur.  This method is provided for applications that need to
     * enumerate all of the available charsets, for example to allow user
     * charset selection.  This method is not used by the {@link #forName
     * forName} method, which instead employs an efficient incremental lookup
     * algorithm.

This method is called from the static initializer of nu.validator.htmlparser.io.Encoding.

As far as I can tell, there is no way to avoid this file access. Nor is there any way via nu.validator.htmlparser.dom.HtmlDocumentBuilder to skip charset sniffing (for my use, I'd be happy to force it to UTF-8 for all parsing).

Am I missing an alternative?
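
To make the contrast in the quoted javadoc concrete, here is a minimal sketch using only JDK classes. The class and method names (CharsetLookupDemo, lookupUtf8, enumerateAll) are made up for illustration and are not part of the parser; the sketch only shows the two JDK calls side by side and makes no claim about how expensive either one is on a given platform.

```java
import java.nio.charset.Charset;
import java.util.SortedMap;

public class CharsetLookupDemo {
    // Single-name lookup: per the javadoc, this does not go through
    // availableCharsets() and uses an incremental lookup instead.
    static Charset lookupUtf8() {
        return Charset.forName("UTF-8");
    }

    // Full enumeration: per the javadoc, this may trigger time-consuming
    // disk or network I/O. Returns the number of charsets found.
    static int enumerateAll() {
        SortedMap<String, Charset> all = Charset.availableCharsets();
        return all.size();
    }

    public static void main(String[] args) {
        System.out.println("forName: " + lookupUtf8().name());
        System.out.println("availableCharsets: " + enumerateAll() + " charsets");
    }
}
```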
Comment 1 Henri Sivonen 2009-04-20 10:20:26 CEST
The static initializer for Encoding runs once per app lifetime, so under normal circumstances, its efficiency shouldn't matter. I realize that it might matter on an AppEngine-like setup. Do you have a proposal for what to do instead without sacrificing correctness and while allowing autodiscovery of non-JDK rough ASCII superset decoders? Should I offer a magic system property that bypasses the encoding enumeration and pokes a hardcoded list of encodings instead? (Can one set system properties on AppEngine apps?)
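
For illustration only, the system-property idea could look roughly like the sketch below. The property name example.encoding.hardcoded, the class EncodingListSketch, and the contents of the hardcoded list are all hypothetical; nothing here is an existing parser option, and the real list would need to be chosen with care.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;
import java.util.ArrayList;
import java.util.List;

public class EncodingListSketch {
    // Hypothetical property name, assumed for this sketch only.
    static final String PROP = "example.encoding.hardcoded";

    // Hypothetical hardcoded list; a real one would be much longer.
    static final String[] HARDCODED = {
        "UTF-8", "windows-1252", "ISO-8859-1", "US-ASCII"
    };

    static List<Charset> candidateCharsets() {
        List<Charset> result = new ArrayList<>();
        if (Boolean.getBoolean(PROP)) {
            // Bypass enumeration: resolve each known name individually.
            for (String name : HARDCODED) {
                try {
                    result.add(Charset.forName(name));
                } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
                    // This JVM lacks the charset; skip it.
                }
            }
        } else {
            // Default behavior: enumerate everything the JVM can find.
            result.addAll(Charset.availableCharsets().values());
        }
        return result;
    }

    public static void main(String[] args) {
        System.setProperty(PROP, "true");
        System.out.println(candidateCharsets());
    }
}
```

The trade-off is the one raised above: the hardcoded branch avoids the enumeration but cannot discover decoders that are not on the list.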

HtmlDocumentBuilder should skip charset sniffing if you call .setEncoding("utf-8") on the InputSource object before passing it to the parser.
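
A minimal sketch of that call sequence, assuming the input is available as a byte array. ForcedEncodingSketch and utf8Source are illustrative names; the parser call itself is left as a comment so the snippet compiles with the JDK alone.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import org.xml.sax.InputSource;

public class ForcedEncodingSketch {
    // Wraps raw bytes in an InputSource with the encoding declared up front.
    static InputSource utf8Source(byte[] html) {
        InputSource source = new InputSource(new ByteArrayInputStream(html));
        source.setEncoding("utf-8");
        return source;
    }

    public static void main(String[] args) {
        byte[] html = "<!DOCTYPE html><p>hi".getBytes(StandardCharsets.UTF_8);
        InputSource source = utf8Source(html);
        System.out.println(source.getEncoding());
        // With the htmlparser jar on the classpath, the source would then be
        // passed to the parser, for example:
        //   org.w3c.dom.Document doc =
        //       new nu.validator.htmlparser.dom.HtmlDocumentBuilder().parse(source);
    }
}
```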
Comment 2 Philip Tucker 2009-04-21 21:49:52 CEST
For my use, I'd like an option to configure the parser to use UTF-8 and skip sniffing altogether. Or perhaps an option to set a default encoding: when this value is unset, the parser works as currently implemented, but when it is set, the parser only overrides it if an explicit charset appears in the HTML stream.

That would eliminate this problem, and also make the parsing more efficient when we don't need to auto-detect charset.

- Philip
Comment 3 Henri Sivonen 2013-03-21 11:08:37 CET
The notion of correctness has changed here. The plan is to hard-code alias data from the Encoding Standard and not to support encodings outside that spec.
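
As a rough illustration of what hard-coded alias data might look like, the sketch below maps a small excerpt of Encoding Standard labels to encoding names without touching Charset.availableCharsets(). LabelTableSketch and encodingForLabel are hypothetical names, the table is far from complete, and String.trim() only approximates the whitespace stripping the spec describes.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class LabelTableSketch {
    // Small excerpt of label-to-encoding data; the full table in the
    // Encoding Standard is much larger.
    private static final Map<String, String> LABELS = new HashMap<>();
    static {
        LABELS.put("utf-8", "UTF-8");
        LABELS.put("utf8", "UTF-8");
        LABELS.put("unicode-1-1-utf-8", "UTF-8");
        LABELS.put("windows-1252", "windows-1252");
        LABELS.put("iso-8859-1", "windows-1252");
        LABELS.put("latin1", "windows-1252");
        LABELS.put("us-ascii", "windows-1252");
        LABELS.put("ascii", "windows-1252");
    }

    // Returns the encoding name for a label, or null if the label is not
    // in the table (i.e. the encoding is not supported).
    static String encodingForLabel(String label) {
        // Labels are matched after trimming surrounding whitespace and
        // lowercasing; trim() is an approximation of the spec's rule.
        String key = label.trim().toLowerCase(Locale.ROOT);
        return LABELS.get(key);
    }

    public static void main(String[] args) {
        System.out.println(encodingForLabel(" Latin1 "));
        System.out.println(encodingForLabel("UTF8"));
        System.out.println(encodingForLabel("x-unknown"));
    }
}
```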