Bugzilla – Bug 444
Catch CoderMalfunctionError in Encoding.asciiMapsToBasicLatin()
Last modified: 2009-02-09 13:38:48 CET
On the IBM Java 5 JRE, the ISCII CharsetDecoder throws CoderMalfunctionError in Encoding.asciiMapsToBasicLatin(). Because this is an Error, not an Exception, it is not caught by the "catch (Exception e)" line. Although this is probably IBM's bug, it would be good if the HTML parser worked anyway.

The stack trace is:

java.nio.charset.CoderMalfunctionError: java.nio.BufferOverflowException
    at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:489)
    at sun.nio.cs.StreamDecoder$CharsetSD.implRead(StreamDecoder.java:511)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:252)
    at sun.nio.cs.StreamDecoder.read0(StreamDecoder.java:201)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:187)
    at java.io.InputStreamReader.read(InputStreamReader.java:196)
    at nu.validator.htmlparser.io.Encoding.asciiMapsToBasicLatin(Encoding.java:211)
    at nu.validator.htmlparser.io.Encoding.<clinit>(Encoding.java:110)
    at java.lang.J9VMInternals.initializeImpl(Native Method)
    at java.lang.J9VMInternals.initialize(J9VMInternals.java:194)
    at nu.validator.htmlparser.io.HtmlInputStreamReader.<init>(HtmlInputStreamReader.java:135)
    at nu.validator.htmlparser.io.Driver.tokenize(Driver.java:199)
    at nu.validator.htmlparser.dom.HtmlDocumentBuilder.tokenize(HtmlDocumentBuilder.java:405)
    at nu.validator.htmlparser.dom.HtmlDocumentBuilder.parse(HtmlDocumentBuilder.java:204)
    at zcarey.html.HTML5Document.main(HTML5Document.java:22)
Caused by: java.nio.BufferOverflowException
    at java.nio.Buffer.nextPutIndex(Buffer.java:434)
    at java.nio.HeapCharBuffer.put(HeapCharBuffer.java:160)
    at com.ibm.nio.cs.ISCII91$Decoder.decodeLoop(ISCII91.java:211)
    at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:485)
    ... 14 more

I can reproduce this bug (and a bug with ISCII in Sun Java 5 as well) with this simple code:

    new InputStreamReader(new ByteArrayInputStream(new byte[3]), Charset.forName("ISCII").newDecoder()).read();
I don't have access to an IBM JRE at the moment, but I have added a catch block for CoderMalfunctionError on svn trunk. Should I also add ISCII to the list of banned encodings? If so, what's the preferred name that the IBM JRE returns for ISCII?
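For readers following along, here is a minimal sketch of the kind of catch block described above. It is not the code committed to svn trunk: the class name AsciiProbe and the body of the probe are assumptions for illustration, and only the method name asciiMapsToBasicLatin and the "catch (Exception e)" line are taken from this report. The point it shows is that CoderMalfunctionError extends Error, so it needs its own catch clause.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.CoderMalfunctionError;

// Hypothetical stand-in for the probe in Encoding; not the parser's real code.
public class AsciiProbe {

    /**
     * Returns true if the charset decodes bytes 0x00..0x7F to U+0000..U+007F.
     * A decoder that fails in any way is treated as "does not map".
     */
    static boolean asciiMapsToBasicLatin(Charset cs) {
        byte[] bytes = new byte[0x80];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) i;
        }
        try {
            Reader reader = new InputStreamReader(
                    new ByteArrayInputStream(bytes), cs.newDecoder());
            for (int i = 0; i < bytes.length; i++) {
                if (reader.read() != i) {
                    return false;
                }
            }
            return true;
        } catch (CoderMalfunctionError e) {
            // An Error, not an Exception: "catch (Exception e)" alone
            // would let this escape from a static initializer.
            return false;
        } catch (Exception e) {
            // Ordinary decoding or I/O failures.
            return false;
        }
    }
}
```

Because the trace above shows the probe running from Encoding.<clinit>, an uncaught Error there would prevent the class from initializing at all, which is why catching it narrowly seems preferable to letting it propagate.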
I don't think there's anything wrong with ISCII itself, only with IBM's implementation of the decoder in their Java 5. After looking into it further, it appears that IBM switched to Sun's implementation for Java 6, where it works properly. IBM Java 5 calls the encoding "ISCII91", while Sun Java 5, IBM Java 6 and Sun Java 6 call it "x-ISCII91" with "ISCII91" as an alias, so it seems reasonable to ban the broken version. You can download the IBM JDK for free, bundled with Eclipse, from http://www.ibm.com/developerworks/java/jdk/eclipse/index.html. Version 210/211 comes with IBM Java 5, while version 300 comes with IBM Java 6.
Added iscii91 to the banned list in svn. Marking fixed. Thanks.
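A rough sketch of how a ban keyed on the canonical charset name could look, assuming a lower-cased comparison. The class BannedCharsets, its methods, and the set layout are hypothetical and are not the parser's actual banned list; only the entry "iscii91" comes from this bug. Matching on Charset.name() rather than on aliases is what would keep the ban limited to the IBM Java 5 decoder, since the other JREs mentioned above report "x-ISCII91" as the canonical name.

```java
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration; not the parser's real banned-encoding code.
public class BannedCharsets {

    // Only "iscii91" is taken from this bug report.
    private static final Set<String> BANNED =
            new HashSet<String>(Arrays.asList("iscii91"));

    /** Checks a canonical charset name against the banned set, ignoring case. */
    static boolean isBannedName(String canonicalName) {
        return BANNED.contains(canonicalName.toLowerCase(Locale.ROOT));
    }

    /** Uses the canonical name only, so aliases do not trigger the ban. */
    static boolean isBanned(Charset cs) {
        return isBannedName(cs.name());
    }
}
```

With this shape, a charset whose canonical name is "ISCII91" is rejected, while one named "x-ISCII91" that merely lists "ISCII91" as an alias is still accepted.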