1 Reply Latest reply on Aug 17, 2011 8:42 PM by rhauch

    tika/msoffice/poi library incompatibility

    f4tq

      Hi,

           I've been trying to use modeshape-2.6.0.Beta2, specifically the JBoss AS 6 kit, together with modeshape-extractor-tika, without success. When AS6 boots I get this (modeshape/log/boot.log):

       

      --snip--

      17:21:22,652 INFO  [InjectableHandlerRegistry] Registering injectable handler: topic - org.torquebox.messaging.injection.DestinationInjectableHandler@7a6de1c4
      17:22:28,066 WARN  [ClassLoaderManager] Unexpected error during load of:org.apache.poi.hssf.eventusermodel.EventWorkbookBuilder$StubHSSFWorkbook: java.lang.VerifyError: Cannot inherit from final class
             at java.lang.ClassLoader.defineClass1(Native Method) [:1.6.0_26]
              at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631) [:1.6.0_26]
              at java.lang.ClassLoader.defineClass(ClassLoader.java:615) [:1.6.0_26]
              at org.jboss.classloader.spi.base.BaseClassLoader.access$200(BaseClassLoader.java:52) [jboss-classloader.jar:2.2.0.GA]
      

      I tracked down the jar/library issue by checking out and building 2.6.0.Beta2 (see below). In short, there is a library incompatibility between POI 3.6 and POI 3.7 (used by the msoffice sequencer and the Tika extractor, respectively). My short-term fix is to add modeshape-extractor-tika and remove poi*jar from $JBOSS_HOME/server/modeshape/deploy/modeshape-services.jar, but I'd love to get this sorted out for the long run. Also, who knows whether this breaks something else...

       

      BTW, modeshape-extractor-tika was not part of the jboss-as6 kit.  Is this why?

       

      Gritty details:

      # git clone -b modeshape-2.6.0.Beta2  https://github.com/ModeShape/modeshape.git 
      # mvn clean install
      # cd  deploy/jbossas/modeshape-jbossas-service
      # mvn dependency:tree
      
      --snip--
      
      [INFO] +- org.modeshape:modeshape-sequencer-msoffice:jar:2.6-SNAPSHOT:compile
      [INFO] |  +- org.apache.poi:poi:jar:3.6:compile
      [INFO] |  \- org.apache.poi:poi-scratchpad:jar:3.6:compile
      [INFO] +- org.modeshape:modeshape-sequencer-teiid:jar:2.6-SNAPSHOT:compile
      

      Now running the same for tika,

      # cd extensions/modeshape-extractor-tika
      # mvn dependency:tree
      org.modeshape:modeshape-extractor-tika:jar:2.6-SNAPSHOT
      [INFO] +- org.modeshape:modeshape-graph:jar:2.6-SNAPSHOT:compile
      [INFO] |  +- org.modeshape:modeshape-common:jar:2.6-SNAPSHOT:compile
      [INFO] |  \- joda-time:joda-time:jar:1.6:compile
      [INFO] +- org.modeshape:modeshape-graph:test-jar:tests:2.6-SNAPSHOT:test
      [INFO] +- org.modeshape:modeshape-common:test-jar:tests:2.6-SNAPSHOT:test
      [INFO] +- org.apache.tika:tika-parsers:jar:0.9:compile
      [INFO] |  +- org.apache.tika:tika-core:jar:0.9:compile
      [INFO] |  +- org.apache.pdfbox:pdfbox:jar:1.4.0:compile
      [INFO] |  |  +- org.apache.pdfbox:fontbox:jar:1.4.0:compile
      [INFO] |  |  +- org.apache.pdfbox:jempbox:jar:1.4.0:compile
      [INFO] |  |  \- commons-logging:commons-logging:jar:1.1.1:compile
      [INFO] |  +- org.bouncycastle:bcmail-jdk15:jar:1.45:compile
      [INFO] |  +- org.bouncycastle:bcprov-jdk15:jar:1.45:compile
      [INFO] |  +- org.apache.poi:poi:jar:3.7:compile
      [INFO] |  +- org.apache.poi:poi-scratchpad:jar:3.7:compile
      [INFO] |  +- org.apache.poi:poi-ooxml:jar:3.7:compile
      [INFO] |  |  +- org.apache.poi:poi-ooxml-schemas:jar:3.7:compile
      

       

      Thanks,

      Chris

        • 1. Re: tika/msoffice/poi library incompatibility
          rhauch

          Thanks for figuring this out, Chris! Obviously this is a bug and should be fixed. Would you care to log a defect in our JIRA stating that the "modeshape-sequencer-msoffice" uses a different version of Apache POI than the one inherited by the "modeshape-extractor-tika" module? We can then fix it pretty easily. If you want to fork our GitHub repository, fix it locally, and create a pull-request, we'd gladly accept it!

           

          Also, it's probably worth another defect to say that the JBoss AS kit should include the Tika extractor and its libraries. I think that was just an oversight.

           

          I can think of two ways of fixing this locally:

           

          1) If you are not using the MS Office sequencer: just remove the "modeshape-sequencer-msoffice-2.6.0.Beta2.jar" from your JBoss AS installation, and make sure it's no longer in the ModeShape configuration.

           

          2) If you do want to use both the Tika text extractor and the MS Office sequencer: In your local codebase, try changing the "modeshape-sequencer-msoffice" module to use Apache POI 3.7 and then build ModeShape locally. (Hopefully that will succeed; if not, then there are API compatibility issues between 3.6 and 3.7!) You could then use the generated "modeshape-sequencer-msoffice-2.6.0.Beta2.jar" file and the Apache POI 3.7 JARs in your JBoss AS installation.

           

          I wish there were an easy way of using Maven to make the "modeshape-sequencer-msoffice" module depend on the same version of POI that Tika does (without depending on Tika and excluding everything but POI, which is gross and fragile). The easiest way is just to change the "modeshape-sequencer-msoffice" POM file to match the version used by Tika. Maybe I'm missing something.
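
          For what it's worth, here is a rough sketch of what that POM change might look like, assuming the module declares its POI dependencies directly. The coordinates are the ones shown in the dependency:tree output above; the "poi.version" property name is hypothetical and may not match what the ModeShape POMs actually use.

          ```xml
          <!-- Sketch only: fragment for the modeshape-sequencer-msoffice pom.xml. -->
          <!-- "poi.version" is a hypothetical property name, not taken from the real POM. -->
          <properties>
            <!-- Keep in step with the POI version that tika-parsers 0.9 pulls in -->
            <poi.version>3.7</poi.version>
          </properties>

          <dependencies>
            <dependency>
              <groupId>org.apache.poi</groupId>
              <artifactId>poi</artifactId>
              <version>${poi.version}</version>
            </dependency>
            <dependency>
              <groupId>org.apache.poi</groupId>
              <artifactId>poi-scratchpad</artifactId>
              <version>${poi.version}</version>
            </dependency>
          </dependencies>
          ```

          This still has to be bumped by hand whenever the Tika version changes, so it only puts the version in one place rather than solving the general problem.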

           

          Thanks again!