29 Replies Latest reply on Feb 20, 2013 9:39 AM by rhauch

    Error while trying to setup Tika text extractor in modeshape

    satyakishor.m

      I am running into an issue while trying to set up the Tika text extractor in ModeShape. The following is the error I get when running my application with the Tika text extractor.

       

      17:00:55,076 ERROR [stderr] (modeshape-text-extractor-7-thread-1) Exception in thread "modeshape-text-extractor-7-thread-1" java.lang.ExceptionInInitializerError

      17:00:55,091 ERROR [stderr] (modeshape-text-extractor-7-thread-1)           at org.apache.poi.openxml4j.opc.internal.unmarshallers.PackagePropertiesUnmarshaller.<clinit>(PackagePropertiesUnmarshaller.java:49)

      17:00:55,091 ERROR [stderr] (modeshape-text-extractor-7-thread-1)           at org.apache.poi.openxml4j.opc.OPCPackage.init(OPCPackage.java:154)

      17:00:55,091 ERROR [stderr] (modeshape-text-extractor-7-thread-1)           at org.apache.poi.openxml4j.opc.OPCPackage.<init>(OPCPackage.java:141)

      17:00:55,107 ERROR [stderr] (modeshape-text-extractor-7-thread-1)           at org.apache.poi.openxml4j.opc.Package.<init>(Package.java:54)

      17:00:55,107 ERROR [stderr] (modeshape-text-extractor-7-thread-1)           at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:99)

      17:00:55,107 ERROR [stderr] (modeshape-text-extractor-7-thread-1)           at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:207)

      17:00:55,123 ERROR [stderr] (modeshape-text-extractor-7-thread-1)           at org.apache.tika.parser.pkg.ZipContainerDetector.detectOfficeOpenXML(ZipContainerDetector.java:194)

      17:00:55,123 ERROR [stderr] (modeshape-text-extractor-7-thread-1)           at org.apache.tika.parser.pkg.ZipContainerDetector.detectZipFormat(ZipContainerDetector.java:134)

      17:00:55,123 ERROR [stderr] (modeshape-text-extractor-7-thread-1)           at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:77)

      17:00:55,138 ERROR [stderr] (modeshape-text-extractor-7-thread-1)           at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)

      17:00:55,138 ERROR [stderr] (modeshape-text-extractor-7-thread-1)           at org.modeshape.jcr.mimetype.TikaMimeTypeDetector.mimeTypeOf(TikaMimeTypeDetector.java:126)

      17:00:55,138 ERROR [stderr] (modeshape-text-extractor-7-thread-1)           at org.modeshape.jcr.mimetype.MimeTypeDetectors.mimeTypeOf(MimeTypeDetectors.java:74)

      17:00:55,154 ERROR [stderr] (modeshape-text-extractor-7-thread-1)           at org.modeshape.jcr.value.binary.AbstractBinaryStore.getMimeType(AbstractBinaryStore.java:161)

      17:00:55,154 ERROR [stderr] (modeshape-text-extractor-7-thread-1)           at org.modeshape.jcr.value.binary.StoredBinaryValue.getMimeType(StoredBinaryValue.java:69)

      17:00:55,154 ERROR [stderr] (modeshape-text-extractor-7-thread-1)           at org.modeshape.jcr.TextExtractors$Worker.run(TextExtractors.java:175)

      17:00:55,170 ERROR [stderr] (modeshape-text-extractor-7-thread-1)           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)

      17:00:55,170 ERROR [stderr] (modeshape-text-extractor-7-thread-1)           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)

      17:00:55,170 ERROR [stderr] (modeshape-text-extractor-7-thread-1)           at java.lang.Thread.run(Thread.java:722)

      17:00:55,185 ERROR [stderr] (modeshape-text-extractor-7-thread-1) Caused by: java.lang.ClassCastException: org.dom4j.DocumentFactory cannot be cast to org.dom4j.DocumentFactory

      17:00:55,185 ERROR [stderr] (modeshape-text-extractor-7-thread-1)           at org.dom4j.DocumentFactory.getInstance(DocumentFactory.java:97)

      17:00:55,185 ERROR [stderr] (modeshape-text-extractor-7-thread-1)           at org.dom4j.tree.AbstractNode.<clinit>(AbstractNode.java:39)

      17:00:55,201 ERROR [stderr] (modeshape-text-extractor-7-thread-1)      ... 18 more

       

      I am not sure why I am running into this issue. I checked my classpath for multiple dom4j JARs and found only one.

       

      Following is my jcr configuration

      {

          "name" : "jerms",

          "jndiName" : "jcr/jerms",

          "storage" : {

              "transactionManagerLookup" = "org.infinispan.transaction.lookup.DummyTransactionManagerLookup"

          },

          "query" : {

              "indexStorage" : {

                  "type" : "ram"

              },

              "textExtracting": {

                  "extractors" : {

                      "tikaExtractor":{

                          "name" : "Tika content-based extractor",

                          "classname" : "tika"

                      }

                  }

              }         

          }

      }

       

      I have been stuck on this issue for a couple of hours; any help is appreciated.

        • 1. Re: Error while trying to setup Tika text extractor in modeshape
          rhauch

          This is definitely a classpath problem, but it could be one of two things:

           

          1. The JAR file is duplicated on the classpath.
          2. There are duplicate DOM4J classes on the classpath. It may be that there are indeed multiple JARs (e.g., if your application is a web application, your web application might include it as a dependency while the container/server also provides it), or it may be that the JAR is not duplicated but one of the JARs on the classpath (provided by you or by your deployment environment) duplicates the classes (e.g., a non-ModeShape library might include it within its JAR, and the DOM4J JAR is brought in by ModeShape/Tika as a transitive dependency).

           

          Are you using Maven to build your application? If so, run 'mvn dependency:tree' and look for duplicates. You may try to add an explicit dependency on the problem artifact in your POM file, and give it a "provided" or "runtime" scope.

           

          And how are you deploying your application? If it is a web application, then make sure that your web application doesn't have a JAR that is also provided by the container/server.
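A ClassCastException of the form "X cannot be cast to X" generally means that the same class name was defined by two different class loaders. As a minimal diagnostic sketch (the class `ClassOriginCheck` and its helper methods are hypothetical names, not part of ModeShape, Tika, or dom4j), the following lists every location from which a class loader can see a given class file and reports which loader defined a class that is already loaded. It uses only standard JDK calls; how completely `getResources` reflects what a JBoss module class loader can see is an assumption you would need to verify in your own environment.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.security.CodeSource;
import java.util.Collections;
import java.util.List;

public class ClassOriginCheck {

    /** Converts a class name such as "org.dom4j.DocumentFactory" to its resource path. */
    static String toResourceName(String className) {
        return className.replace('.', '/') + ".class";
    }

    /** Lists every location from which the given loader can see the class file. */
    static List<URL> findCopies(ClassLoader loader, String className) {
        try {
            return Collections.list(loader.getResources(toResourceName(className)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Describes which loader defined an already-loaded class and where it came from. */
    static String describe(Class<?> type) {
        ClassLoader loader = type.getClassLoader(); // null means the bootstrap loader
        CodeSource source = type.getProtectionDomain().getCodeSource();
        URL location = (source == null) ? null : source.getLocation();
        return type.getName()
                + " defined by " + (loader == null ? "bootstrap loader" : loader)
                + " from " + (location == null ? "unknown location" : location);
    }

    public static void main(String[] args) {
        // The default class name is taken from the stack trace in this thread.
        String name = args.length > 0 ? args[0] : "org.dom4j.DocumentFactory";
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        List<URL> copies = findCopies(loader, name);
        System.out.println(copies.size() + " location(s) for " + name);
        for (URL url : copies) {
            System.out.println("  " + url);
        }
    }
}
```

More than one location printed for the dom4j class would point to a duplicate. With dom4j on the compile classpath, calling `describe(org.dom4j.DocumentFactory.class)` from both the web application code and the code path that fails would show whether two different loaders are involved.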

          • 2. Re: Error while trying to setup Tika text extractor in modeshape
            satyakishor.m

            We provide a dom4j JAR file with our web application, and I see that Tika also provides a dom4j JAR as part of its module. We are using Ivy to build our application. Even though I set dom4j to the runtime scope in ivy.xml, I still get the same error.

             

            Our application is deployed on a JBoss server. I am not sure what the root cause of this issue could be.

            • 3. Re: Error while trying to setup Tika text extractor in modeshape
              rhauch

              Did you try "provided"? I was mistaken and should never have suggested "runtime", which means it's not needed for compilation but is needed at deployment/runtime. "Provided" means that it is required for compilation, but is not needed at deployment/runtime because the environment (JBoss AS in this case) provides it for the web app.

              • 4. Re: Error while trying to setup Tika text extractor in modeshape
                satyakishor.m

                I used the "compile" scope for the dom4j JAR in the ivy.xml file and I still see the same error.

                • 5. Re: Error while trying to setup Tika text extractor in modeshape
                  rhauch

                  And "provided"?

                  • 6. Re: Error while trying to setup Tika text extractor in modeshape
                    satyakishor.m

                    I am wondering whether I am setting up the tika extractor correctly or not.

                     

                    Currently I added the tika extractor module "<module name="org.modeshape.extractor.tika" />" within dependencies section under \modules\org\modeshape\main\module.xml. Since tika is already including the apache.tika module which actually includes dom4j, I think due to this dom4j is included twice.

                     

                    So, my question here is how to include the tika extractor module under jboss.

                    • 7. Re: Error while trying to setup Tika text extractor in modeshape
                      rhauch

                      I am wondering whether I am setting up the tika extractor correctly or not.

                       

                      Currently I added the tika extractor module "<module name="org.modeshape.extractor.tika" />" within dependencies section under \modules\org\modeshape\main\module.xml. Since tika is already including the apache.tika module which actually includes dom4j, I think due to this dom4j is included twice.

                       

                      So, my question here is how to include the tika extractor module under jboss.

                      So you're deploying to AS7? No, this is not the way to do it, and you shouldn't be using the JSON configuration, either.

                       

                      Follow the instructions in our documentation for installing and configuring ModeShape:

                       

                      1. Install the ModeShape kit into your AS7 installation. (It sounds like you've already done this, but maybe not.)
                      2. Configure the repositories, either by:
                        1. editing the standard "standalone.xml" file (we provide a "standalone-modeshape.xml" sample file that you can use; see what this looks like for 3.1.0.Final here), or
                        2. use the CLI to add/remove/configure repositories in an already-running AS7

                       

                      Using the CLI is much preferred, since it works against a running AS7 and can actually be scripted (which means you can easily add/remove/configure repositories in a development server and do pretty much the same configuration in a staging and/or production server, even though the rest of the AS7 configuration would likely be very different). But this is a bit harder to learn, so editing an XML configuration might be a good way to start. Note that using the CLI will cause the server to update its configuration file, so don't be surprised to see comments disappear or for the file to change while the server is running.

                       

                      (Note that the CLI approach fits directly into the AS7 management philosophy, which leverages its large and very powerful AS7 management mechanism. It may seem harder to use for one server, but it really comes into its own when managing multiple clusters of servers.)

                       

                      Then in your application, all you have to do is look up the Repository instance, using one of several available techniques. And because all of ModeShape's libraries are installed into AS7 with our kit, your application doesn't need to include any of ModeShape's JARs in your WAR. Instead, simply use ModeShape's BOM for AS7, which specifies all of ModeShape's libraries with "provided" scope. See our documentation for all the details.
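One of the lookup techniques mentioned above is a plain JNDI lookup. The following is only a sketch under stated assumptions: the class `RepositoryLookup` is a hypothetical name, and the JNDI name "jcr/jerms" is taken from the JSON configuration posted earlier in this thread, so the name under which AS7 actually binds your repository may differ. To keep the example free of non-JDK dependencies it returns a plain `Object`; inside the server you would cast the result to `javax.jcr.Repository`.

```java
import java.util.Optional;
import javax.naming.InitialContext;
import javax.naming.NamingException;

public class RepositoryLookup {

    // Assumption: borrowed from the "jndiName" field of the JSON posted above.
    static final String ASSUMED_JNDI_NAME = "jcr/jerms";

    /**
     * Looks up whatever is bound under the given JNDI name.
     * Returns an empty Optional when the name is not bound or when no
     * naming provider is available (for example, outside the server).
     */
    static Optional<Object> lookup(String jndiName) {
        try {
            InitialContext context = new InitialContext();
            return Optional.ofNullable(context.lookup(jndiName));
        } catch (NamingException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        Optional<Object> repository = lookup(ASSUMED_JNDI_NAME);
        // Inside AS7 the value, if present, would be cast to javax.jcr.Repository.
        System.out.println(repository.isPresent()
                ? "Found: " + repository.get().getClass().getName()
                : "Nothing bound under " + ASSUMED_JNDI_NAME);
    }
}
```

Because the repository comes from the server, the web application does not need to bundle ModeShape's JARs or its dependencies for this to work.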

                       

                      We also have a completely self-contained example of a web application that is deployed to AS7+ModeShape. (The JSON file in the project is extra and is not used.) Notice that it uses ModeShape's BOM for AS7 in the Maven dependencies.

                       

                      Hope this helps!

                      • 8. Re: Error while trying to setup Tika text extractor in modeshape
                        satyakishor.m

                        Yes, we are deploying the application to AS7. Currently I am just using an in-memory (RAM) JSON config to test in the dev environment. So, are you saying that we should not use a JSON configuration to set up the repository?

                         

                        I already installed the Modeshape Kit into my AS7. But the only difference with your steps is that we configured the repository using JSON configuration.

                         

                        I will setup the repository using step 2.1 or 2.2 and test the Tika text extractor. I will let you know if I run into any issue.

                         

                        Thanks for your help.

                        • 9. Re: Error while trying to setup Tika text extractor in modeshape
                          rhauch

                          Yes, we are deploying the application to AS7. Currently I am just using an in-memory (RAM) JSON config to test in the dev environment. So, are you saying that we should not use a JSON configuration to set up the repository?

                          I am suggesting that you do not use JSON configuration in any way with AS7, and only use the AS7 configuration mechanism. The reason is that with the AS7 configuration mechanism, the ModeShape, Infinispan, JGroups, clustering, and security components are managed by AS7. This is exactly how we test the system, and you'll have . You can configure an in-memory repository or any other configuration, and you can even use Arquillian to set up a testable environment with custom standalone configuration files that Arquillian uses when it starts AS7. (See one of our integration modules for an example; it even uses Maven to download and install AS7, the kit, and the test-specific configuration file. This is really a great way to run integration tests with real components, and is surprisingly fast considering it's a full-blown system integration test.)

                           

                          If you use a JSON configuration, you're creating new, unmanaged ModeShape, Infinispan, JGroups, clustering and security components that are separate from the managed ones. Not only do we not test the JSON approach, but it's likely to have all kinds of weird behavior and potential errors.

                          • 10. Re: Error while trying to setup Tika text extractor in modeshape
                            satyakishor.m

                            I followed your steps to configure the ModeShape repository and text extractor. I was able to upload txt, pdf, and 97-2003 MS Office files (doc, xls, ppt) into the repository, and search returns the correct nodes with attachments. But when I try to upload MS Office Open XML files (docx, xlsx, pptx), I still get the same error (java.lang.ClassCastException: org.dom4j.DocumentFactory cannot be cast to org.dom4j.DocumentFactory).

                             

                            Is this a bug in the Tika extractor, or am I missing something here?

                            • 11. Re: Error while trying to setup Tika text extractor in modeshape
                              rhauch

                              I don't think it's a bug, since we have integration tests that check this very thing (see the standalone-modeshape.xml file that we include as an example in our kit for AS7).

                               

                              However, since it's not working for you, something is obviously wrong. I'm not sure if it's configuration or related to the build, or whether there is some unknown bug in the TikaTextExtractor. Can you share your project with us? If so, please try to strip it down to the core elements, so that we can build it locally and help diagnose the problem.

                              • 12. Re: Error while trying to setup Tika text extractor in modeshape
                                satyakishor.m

                                Sorry for the late response.

                                 

                                You can reproduce this issue by following the below steps

                                 

                                1. Try to read/parse an excel spread sheet

                                2. While the read/parse is in progress, try to save another excel spread sheet as attachment into JCR repository.

                                • 13. Re: Error while trying to setup Tika text extractor in modeshape
                                  rhauch

                                  You can reproduce this issue by following the below steps

                                   

                                  1. Try to read/parse an excel spread sheet

                                  2. While the read/parse is in progress, try to save another excel spread sheet as attachment into JCR repository.

                                   

                                  So when you do this, do you really get the same ClassCastException that you described above?

                                   

                                  Perhaps Tika's Excel extractor is not thread-safe. Could you file a bug and add the information to it? If this is the case, we'll likely have to make the extractor thread-safe. (But I still won't understand the error if it is a ClassCastException.)
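For a bug report, the two overlapping steps could be driven by a small harness like the one below. This is a sketch only: `ConcurrentRepro` and `runTogether` are hypothetical names, and the two tasks in `main` are placeholders for "parse a spreadsheet" and "save a spreadsheet into the repository", which are not shown here. The harness holds every task at a latch so that they start at roughly the same time, then collects whatever each task threw, including errors such as ExceptionInInitializerError.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConcurrentRepro {

    /** Runs all tasks at the same time (the list must not be empty) and returns what they threw. */
    static List<Throwable> runTogether(List<Callable<?>> tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(tasks.size());
        CountDownLatch start = new CountDownLatch(1);
        List<Future<Object>> futures = new ArrayList<>();
        List<Throwable> failures = new ArrayList<>();
        try {
            for (Callable<?> task : tasks) {
                // Each worker waits at the latch so the tasks overlap in time.
                Callable<Object> gated = () -> {
                    start.await();
                    return task.call();
                };
                futures.add(pool.submit(gated));
            }
            start.countDown(); // release all workers together
            for (Future<Object> future : futures) {
                try {
                    future.get();
                } catch (ExecutionException e) {
                    failures.add(e.getCause()); // the Throwable raised inside the task
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            failures.add(e);
        } finally {
            pool.shutdownNow();
        }
        return failures;
    }

    public static void main(String[] args) {
        List<Callable<?>> tasks = new ArrayList<>();
        tasks.add(() -> "placeholder: parse a spreadsheet here");
        tasks.add(() -> "placeholder: save a spreadsheet to the repository here");
        for (Throwable failure : runTogether(tasks)) {
            failure.printStackTrace();
        }
    }
}
```

If the failure appears only when both tasks run together and never when either runs alone, that observation would be worth including in the bug report.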

                                  • 14. Re: Error while trying to setup Tika text extractor in modeshape
                                    satyakishor.m

                                    Yes. I am getting the ClassCastException when I follow the above steps. Sure, I will file a bug with all the details.
