NOTE: The current preferred location for bug reports is the GitHub issue tracker.
Bug 921 - Fix the POM to apply the hotspot workaround to Tokenizer.java
Fix the POM to apply the hotspot workaround to Tokenizer.java
Status: RESOLVED FIXED
Product: Validator.nu
Classification: Unclassified
Component: HTML parser
HEAD
All All
: P2 normal
Assigned To: Nobody
Depends on:
Blocks: 923
  Show dependency treegraph
 
Reported: 2012-05-18 15:04 CEST by Romain Deltour
Modified: 2012-06-05 15:56 CEST (History)
1 user (show)

See Also:


Attachments
Patch to the POM file to apply the hotspot workaround to Tokenizer.java (2.80 KB, application/octet-stream)
2012-05-18 15:04 CEST, Romain Deltour
Details
new POM that first copies the sources to a transient directory (7.51 KB, text/xml)
2012-05-24 02:27 CEST, Romain Deltour
Details
pom.xml that copies the source for hotspot workaround plus updated scm and version metadata (7.51 KB, patch)
2012-05-24 11:29 CEST, Henri Sivonen
Details
variant of the previous POM that forces a cleanup by deleting the transient source directory in the init phase (7.65 KB, text/xml)
2012-05-24 16:04 CEST, Romain Deltour
Details
alternative POM that patches Tokenizer.java directly in the source directory, previously backing it up and reverting after compile (7.87 KB, text/xml)
2012-05-24 16:05 CEST, Romain Deltour
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Romain Deltour 2012-05-18 15:04:36 CEST
Created attachment 218 [details]
Patch to the POM file to apply the hotspot workaround to Tokenizer.java

htmlparser version 1.3.1 is not on Maven Central for the following reason:

(quoting Henri Sivonen):
Currently, if you compile the source
code of the parser as it is a stored in version control, the tokenizer
class ends up with a method that's larger in terms of bytes of byte
code that the magic threshold a method size above which HotSpot
refuses to compile methods to native code by default.  If deployed
like this, the parser will either be slow or requires a special
command line switch for the JVM when starting that JVM.

It seems like a bad idea to distribute a binary in this condition.
That's why I transform the source so that the large and method is
split into two smaller ones before I compile it for the binary
distribution.  This way, the parser performs well without special
switches for the JVM.
(end of quote)

Attached is a patch to the POM from the Mercurial version (rev 5b026546075d) that:
- removes the unused Tokenizer.java import in the ApplyHotSpotWorkaround
- adds an antrun goal to the POM which is executed during the process-sources lifecycle phase and compiles the ApplyHotSpotWorkaround class and uses it to patch Tokenizer.java
Comment 1 Henri Sivonen 2012-05-23 17:04:32 CEST
This patch makes mvn compile succeeded on the first run. However, because the tokenizer is modified in place, running further mvn commands, such as, mvn test-compile or mvn package fails, because the hotspot workaround is applied again and applying it twice breaks the tokenizer.

It seems to me that instead of modifying the tokenizer in place in the original source directory, Maven should generate a second transient directory of .java files that's a copy of src/, apply the hotspot workaround there, feed those .java files to javac and clean up the transient directory.

I have no idea how to do this with Maven, though. :-(
Comment 2 Romain Deltour 2012-05-23 17:07:48 CEST
Correct, I thought this was OK since it was my understanding of how you're currently building  htmlparser, with manual reverts after each builds.

I'll have a look at what it takes to improve it if I can find some time anytime soon.
Comment 3 Romain Deltour 2012-05-24 02:27:08 CEST
Created attachment 219 [details]
new POM that first copies the sources to a transient directory
Comment 4 Romain Deltour 2012-05-24 02:29:30 CEST
The POM attached in comment #3 uses an antrun goal at the "initialize" lifecycle phase to first copy the entire source directory to a transient target/src directory, which is in turn declared as the source directory to be used by maven.

The hotspot workaround processing is unchanged but is consequently applied to this new transient directory instead.
Comment 5 Henri Sivonen 2012-05-24 11:29:36 CEST
Created attachment 220 [details]
pom.xml that copies the source for hotspot workaround plus updated scm and version metadata
Comment 6 Henri Sivonen 2012-05-24 11:38:32 CEST
Thank you.

The person who originally contributed the pom.xml instructed me to generate the Maven release by running
mvn source:jar javadoc:jar repository:bundle-create

When I did that, Maven complained about the SCM location, which had rotted. The latest attachment here updates the SCM location.

Still, when I try to run
mvn source:jar javadoc:jar repository:bundle-create
I get symptoms that indicate that the hotspot workaround is being applied twice.

It seems that this solution still needs some work to make sure that the tokenizer-hotspot-workaround task runs exactly once per initialize-sources run. :-(

(Also, it seems that the goalposts have shifted when it comes to getting stuff uploaded to the central repository. It appears that GPG integration into the Maven build is now required, too. I'll file another bug about that.)
Comment 7 Henri Sivonen 2012-05-24 11:49:25 CEST
Filed bug 923 and bug 924.
Comment 8 Romain Deltour 2012-05-24 16:04:08 CEST
Created attachment 221 [details]
variant of the previous POM that forces a cleanup by deleting the transient source directory in the init phase
Comment 9 Romain Deltour 2012-05-24 16:05:09 CEST
Created attachment 222 [details]
alternative POM that patches Tokenizer.java directly in the source directory, previously backing it up and reverting after compile
Comment 10 Romain Deltour 2012-05-24 16:05:32 CEST
Attached are two alternative POMs that adopt different approaches at fixing the issue in command #6.

For full understanding of the situation, here's an explanation:

Every step in a Maven build is bound to a certain lifecycle phase [1]. The lifecycle is obviously ordered, when you invoke the execution of a phase, all the previous phases are necessarily executed.
The command you use has 3 goals:

 * source:jar invokes the execution of the lifecycle phase generate-sources prior to executing itself [2]
 * javadoc:jar does not invoke execution of lifecycle phases when executed independently [3]
 * repository:bundle-create invokes the execution of the lifecycle phase package prior to executing itself [4].
 
The problem mentioned in comment #6 is that no cleanup is done between "source:jar" and "repository:bundle" and the patched of Tokenizer.java is run twice. This could be fixed by invoking instead:
  mvn source:jar javadoc:jar clean repository:bundle-create

I also attached to alternative POMs that fix it and do not mandate an explicit intermediary cleanup.
 
A) In pom-A.xml attached (variant of the previously attached POM):

 1. the sources are copied to the target/src directory in the "initialize" phase and the 
 2. the Tokenizer.java in the target/src is patched in the "process-sources" phase.
 3. the build goes on, using target/src as the source directory.
 
To avoid any issue with successive invokes of 1-2-3 I added an ant instruction in the "initialize" phase to delete any remains of the transient directory before copying the sources in it.

The drawback of this approach is that it kind of forces a "clean" of the sources because they are systematically cleaned and copied over in the initialize phase.

B) In pom-B.xml attached:

 1. Tokenizer.java is backed-up as Tokenizer.java.bak in the "process-sources" phase
 2. the Tokenizer.java in the ${basedir}/src is patched in the "process-sources" phase
 3. Tokenizer.java is reverted from the backup at the end of the "compile" phase
 
The benefit of this approach is that there is no forced "cleanup"" of the sources (except for the Tokenizer.java file). So successive calls to "mvn compile" will not re-compile everything, only Tokenizer.java.
The only drawback is that if the "process-sources" phase is executed and not followed by the "compile" phase, it will leave a patched Tokenizer.java in the sources. That should however not happen in a regular use of maven commands (people generally don't  invoke "mvn process-sources" by its own).

I would recommend using the alternative B.

Finally, note that all these approach are sort of workarounds, but that's the best I could come up with based on this particular use case and without having to reorganize the source tree. I'm relatively new to Maven, but it seems I'm the de facto "expert" here :p. If you now Maven experts, feel free to ask for advice / review.


[1] http://maven.apache.org/guides/introduction/introduction-to-the-lifecycle.html
[2] http://maven.apache.org/plugins/maven-source-plugin/jar-mojo.html
[3] http://maven.apache.org/plugins/maven-javadoc-plugin/jar-mojo.html
[4] http://maven.apache.org/plugins/maven-repository-plugin/bundle-create-mojo.html
Comment 11 Henri Sivonen 2012-06-05 15:56:38 CEST
Thank you. Following my own intuition about overwriting source code as part of the build, I chose your option A.

http://hg.mozilla.org/projects/htmlparser/rev/913d3e0c4f6b
http://hg.mozilla.org/projects/htmlparser/rev/7d583b96856f