Bugzilla – Bug 921
Fix the POM to apply the hotspot workaround to Tokenizer.java
Last modified: 2012-06-05 15:56:38 CEST
Created attachment 218 [details] Patch to the POM file to apply the hotspot workaround to Tokenizer.java htmlparser version 1.3.1 is not on Maven Central for the following reason: (quoting Henri Sivonen): Currently, if you compile the source code of the parser as it is a stored in version control, the tokenizer class ends up with a method that's larger in terms of bytes of byte code that the magic threshold a method size above which HotSpot refuses to compile methods to native code by default. If deployed like this, the parser will either be slow or requires a special command line switch for the JVM when starting that JVM. It seems like a bad idea to distribute a binary in this condition. That's why I transform the source so that the large and method is split into two smaller ones before I compile it for the binary distribution. This way, the parser performs well without special switches for the JVM. (end of quote) Attached is a patch to the POM from the Mercurial version (rev 5b026546075d) that: - removes the unused Tokenizer.java import in the ApplyHotSpotWorkaround - adds an antrun goal to the POM which is executed during the process-sources lifecycle phase and compiles the ApplyHotSpotWorkaround class and uses it to patch Tokenizer.java
This patch makes mvn compile succeeded on the first run. However, because the tokenizer is modified in place, running further mvn commands, such as, mvn test-compile or mvn package fails, because the hotspot workaround is applied again and applying it twice breaks the tokenizer. It seems to me that instead of modifying the tokenizer in place in the original source directory, Maven should generate a second transient directory of .java files that's a copy of src/, apply the hotspot workaround there, feed those .java files to javac and clean up the transient directory. I have no idea how to do this with Maven, though. :-(
Correct, I thought this was OK since it was my understanding of how you're currently building htmlparser, with manual reverts after each builds. I'll have a look at what it takes to improve it if I can find some time anytime soon.
Created attachment 219 [details] new POM that first copies the sources to a transient directory
The POM attached in comment #3 uses an antrun goal at the "initialize" lifecycle phase to first copy the entire source directory to a transient target/src directory, which is in turn declared as the source directory to be used by maven. The hotspot workaround processing is unchanged but is consequently applied to this new transient directory instead.
Created attachment 220 [details] pom.xml that copies the source for hotspot workaround plus updated scm and version metadata
Thank you. The person who originally contributed the pom.xml instructed me to generate the Maven release by running mvn source:jar javadoc:jar repository:bundle-create When I did that, Maven complained about the SCM location, which had rotted. The latest attachment here updates the SCM location. Still, when I try to run mvn source:jar javadoc:jar repository:bundle-create I get symptoms that indicate that the hotspot workaround is being applied twice. It seems that this solution still needs some work to make sure that the tokenizer-hotspot-workaround task runs exactly once per initialize-sources run. :-( (Also, it seems that the goalposts have shifted when it comes to getting stuff uploaded to the central repository. It appears that GPG integration into the Maven build is now required, too. I'll file another bug about that.)
Filed bug 923 and bug 924.
Created attachment 221 [details] variant of the previous POM that forces a cleanup by deleting the transient source directory in the init phase
Created attachment 222 [details] alternative POM that patches Tokenizer.java directly in the source directory, previously backing it up and reverting after compile
Attached are two alternative POMs that adopt different approaches at fixing the issue in command #6. For full understanding of the situation, here's an explanation: Every step in a Maven build is bound to a certain lifecycle phase [1]. The lifecycle is obviously ordered, when you invoke the execution of a phase, all the previous phases are necessarily executed. The command you use has 3 goals: * source:jar invokes the execution of the lifecycle phase generate-sources prior to executing itself [2] * javadoc:jar does not invoke execution of lifecycle phases when executed independently [3] * repository:bundle-create invokes the execution of the lifecycle phase package prior to executing itself [4]. The problem mentioned in comment #6 is that no cleanup is done between "source:jar" and "repository:bundle" and the patched of Tokenizer.java is run twice. This could be fixed by invoking instead: mvn source:jar javadoc:jar clean repository:bundle-create I also attached to alternative POMs that fix it and do not mandate an explicit intermediary cleanup. A) In pom-A.xml attached (variant of the previously attached POM): 1. the sources are copied to the target/src directory in the "initialize" phase and the 2. the Tokenizer.java in the target/src is patched in the "process-sources" phase. 3. the build goes on, using target/src as the source directory. To avoid any issue with successive invokes of 1-2-3 I added an ant instruction in the "initialize" phase to delete any remains of the transient directory before copying the sources in it. The drawback of this approach is that it kind of forces a "clean" of the sources because they are systematically cleaned and copied over in the initialize phase. B) In pom-B.xml attached: 1. Tokenizer.java is backed-up as Tokenizer.java.bak in the "process-sources" phase 2. the Tokenizer.java in the ${basedir}/src is patched in the "process-sources" phase 3. Tokenizer.java is reverted from the backup at the end of the "compile" phase The benefit of this approach is that there is no forced "cleanup"" of the sources (except for the Tokenizer.java file). So successive calls to "mvn compile" will not re-compile everything, only Tokenizer.java. The only drawback is that if the "process-sources" phase is executed and not followed by the "compile" phase, it will leave a patched Tokenizer.java in the sources. That should however not happen in a regular use of maven commands (people generally don't invoke "mvn process-sources" by its own). I would recommend using the alternative B. Finally, note that all these approach are sort of workarounds, but that's the best I could come up with based on this particular use case and without having to reorganize the source tree. I'm relatively new to Maven, but it seems I'm the de facto "expert" here :p. If you now Maven experts, feel free to ask for advice / review. [1] http://maven.apache.org/guides/introduction/introduction-to-the-lifecycle.html [2] http://maven.apache.org/plugins/maven-source-plugin/jar-mojo.html [3] http://maven.apache.org/plugins/maven-javadoc-plugin/jar-mojo.html [4] http://maven.apache.org/plugins/maven-repository-plugin/bundle-create-mojo.html
Thank you. Following my own intuition about overwriting source code as part of the build, I chose your option A. http://hg.mozilla.org/projects/htmlparser/rev/913d3e0c4f6b http://hg.mozilla.org/projects/htmlparser/rev/7d583b96856f