7 Replies Latest reply on Jun 7, 2012 11:10 AM by rhauch

    Testing subnodes

    edcorners

      My Jackrabbit repository has problems handling more than 10,000 child nodes under a single parent node. I wanted to test that on ModeShape, so I made a repository based on the "Cars" example, but using a JPA data source. I also added a "photo" property to the CND.

       

      [car:Car] > nt:unstructured
        - car:maker (string)
      ...
        - car:engine (string)
        - car:photo (binary)                                     // cars' pic
      

       

      I tried to create about 2000 nodes under a particular parent with this definition, and it gets to 1100 nodes before it starts throwing:

       

      15:01:20,031 ERROR [STDERR] Exception in thread "Lucene Merge Thread #0" Exception in thread "RMI TCP Connection(idle)" org.apache.lucene.index.MergePolicy$MergeException: java.lang.OutOfMemoryError: Java heap space
      15:01:20,031 ERROR [STDERR]           at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:517)
      15:01:20,046 ERROR [STDERR]           at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
      15:01:20,046 ERROR [STDERR] Caused by: java.lang.OutOfMemoryError: Java heap space
      15:01:20,046 ERROR [STDERR]           at org.apache.lucene.store.RAMFile.newBuffer(RAMFile.java:89)
      15:01:20,046 ERROR [STDERR]           at org.apache.lucene.store.RAMFile.addBuffer(RAMFile.java:62)
      15:01:20,046 ERROR [STDERR]           at org.apache.lucene.store.RAMOutputStream.switchCurrentBuffer(RAMOutputStream.java:132)
      15:01:20,046 ERROR [STDERR]           at org.apache.lucene.store.RAMOutputStream.writeByte(RAMOutputStream.java:107)
      15:01:20,046 ERROR [STDERR]           at org.apache.lucene.store.DataOutput.writeVInt(DataOutput.java:74)
      15:01:20,046 ERROR [STDERR]           at org.apache.lucene.index.FormatPostingsDocsWriter.addDoc(FormatPostingsDocsWriter.java:101)
      15:01:20,046 ERROR [STDERR]           at org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:590)
      15:01:20,046 ERROR [STDERR]           at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:538)
      15:01:20,046 ERROR [STDERR]           at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:470)
      15:01:20,046 ERROR [STDERR]           at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:109)
      15:01:20,046 ERROR [STDERR]           at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4273)
      15:01:20,046 ERROR [STDERR]           at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3917)
      15:01:20,046 ERROR [STDERR]           at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:388)
      15:01:20,046 ERROR [STDERR]           at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:456)
      15:01:20,046 ERROR [STDERR] java.lang.OutOfMemoryError: Java heap space
      15:01:20,046 ERROR [STDERR]           at java.io.BufferedInputStream.<init>(Unknown Source)
      15:01:20,046 ERROR [STDERR]           at java.io.BufferedInputStream.<init>(Unknown Source)
      15:01:20,046 ERROR [STDERR]           at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown Source)
      15:01:20,046 ERROR [STDERR]           at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown Source)
      15:01:20,046 ERROR [STDERR]           at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
      15:01:20,046 ERROR [STDERR]           at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
      15:01:20,046 ERROR [STDERR]           at java.lang.Thread.run(Unknown Source)
      
      

       

      Since my app is very likely to hit this case, it would be helpful to know if there's a better way to handle this situation in ModeShape without running into "Java heap space" errors so soon.

      Any suggestions? Thanks.

        • 1. Re: Testing subnodes
          rhauch

          Hi, Ed. Thanks for giving ModeShape a try.

           

          The memory footprint is largely a function of

          1. the data being stored as nodes and properties, and
          2. how the repository is configured, namely where data and indexes are stored.

           

          You've said that your repository already uses a JPA source (so I'm assuming you're using ModeShape 2.x, and hopefully 2.8.1.Final). But based upon the exception, your repository is using an in-memory Lucene index (the default) and you're likely running out of heap space. You can either increase the available heap (via JVM parameters), or configure the repository to store the indexes on the local file system (the ModeShape 2.x documentation shows how to do this).
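
          If you go the heap route, it is worth confirming which limit the JVM actually ended up with after changing the JVM parameters (for example `-Xmx1024m`; that value is only an illustration, not a recommendation). A minimal sketch using only the standard `java.lang.Runtime` calls; the class name `HeapCheck` is made up for this example:

```java
/** Prints the heap limit and current heap usage of the running JVM. */
class HeapCheck {

    private static final long MIB = 1024L * 1024L;

    /** Maximum heap the JVM will attempt to use, in MiB. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / MIB;
    }

    /** Heap currently occupied by objects (live or not yet collected), in MiB. */
    static long usedHeapMiB() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / MIB;
    }

    public static void main(String[] args) {
        System.out.println("max heap (MiB):  " + maxHeapMiB());
        System.out.println("used heap (MiB): " + usedHeapMiB());
    }
}
```

          Logging `usedHeapMiB()` every few hundred nodes during the test would show whether usage keeps climbing with the in-memory index.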

           

          That should get you running. But I would like to mention a couple of other things.

           

          Another variable (especially with performance) is where the data is being stored. You're using JPA storage, but there are multiple connectors to choose from and they each have pros and cons (see How To Select The Right Connectors). Even if you stick with JPA and a relational database, be sure to look at How To Tune ModeShape for Better Performance.

           

          Finally, one of the reasons we made a big architectural change for ModeShape 3 was to allow repositories to grow to very large sizes, in terms of the total number of nodes, the size of content (e.g., binary property values), and the number of child nodes under a single parent. And even though ModeShape 3 is still in alpha, a ModeShape 3 repository can be substantially larger and faster, with a smaller memory footprint, than a similarly-configured ModeShape 2.x repository. In fact, we've tested repositories with millions of nodes, and have created tens of thousands of child nodes under a single parent without noticing much difference in performance. So I encourage you to give ModeShape 3 at least a cursory try, because I think you'll be very pleasantly surprised! Be aware, however, that configuring ModeShape 3 is very different (and hopefully much simpler), and its documentation is far from complete. If you're interested in running ModeShape in a Java SE environment, just ask for help or take a look at the schema for a repository configuration (which is a JSON Schema) and an example configuration file. Or, install ModeShape into AS7 where the data and indexes are stored by default in the server's data directory.

           

           

          Hope that helps!

          • 2. Re: Testing subnodes
            edcorners

            Thanks Randall, your answer was very useful!

            Temporarily, I used queryIndex config to complete some tests.

            <mode:option jcr:name="queryIndexDirectory" mode:value="${jboss.server.data.dir}/modeshape/repositories/store/indexes"/>

            <mode:option jcr:name="queryIndexesRebuiltSynchronously" mode:value="true"/>

            <mode:option jcr:name="rebuildQueryIndexOnStartup" mode:value="ifMissing"/>

             

            It's working now.

            I think I'll be trying ModeShape 3 sooner rather than later. My biggest concerns when using a JCR repository are 1) performance with large data and big sets of subnodes, and 2) transaction management when using a repository from EJBs. I had a rough experience with those aspects in the past, so any improvements are of interest to me.

            • 3. Re: Testing subnodes
              rhauch

              Temporarily, I used queryIndex config to complete some tests.

              <mode:option jcr:name="queryIndexDirectory" mode:value="${jboss.server.data.dir}/modeshape/repositories/store/indexes"/>

              <mode:option jcr:name="queryIndexesRebuiltSynchronously" mode:value="true"/>

              <mode:option jcr:name="rebuildQueryIndexOnStartup" mode:value="ifMissing"/>

               

              It's working now.

              I think I'll be trying ModeShape 3 sooner rather than later.

              Ah, so you're using ModeShape within JBoss AS - which version of AS are you using? ModeShape 3 + AS7.1 is a great combination.

              My biggest concerns when using a JCR repository are 1) performance with large data and big sets of subnodes, and 2) transaction management when using a repository from EJBs. I had a rough experience with those aspects in the past, so any improvements are of interest to me.

               

              Good to hear.

               

              Regarding performance, is your use case typically write- or read-intensive, and is a given set of nodes typically read by many sessions/users, or might each session/user have its own content area?

               

              Are you using container-managed transactions? We're part-way there with 3.0, especially considering that the storage layer already is using transactions, but haven't yet implemented XAResource or a JCA adapter (see MODE-1498).

              • 4. Re: Testing subnodes
                edcorners

                 

                which version of AS are you using?

                 

                I'm using JBoss AS 6.1 currently, but there's no problem on switching to AS 7.

                 

                Regarding performance, is your use case typically write- or read-intensive, and is a given set of nodes typically read by many sessions/users, or might each session/user have its own content area?

                 

                All users work against the same workspace. I was using the same session for every user, but I've realized that is not a good idea, so I'll be working with a session per user. Nodes are created or changed (even the parent can change, depending on user decisions) more often than they are read, so I would say my case is a write-intensive scenario. Many users could be working over a set of subnodes, but only one user per node, although that could change in the future according to business rules.

                 

                Are you using container-managed transactions? We're part-way there with 3.0, especially considering that the storage layer already is using transactions, but haven't yet implemented XAResource or a JCA adapter (see MODE-1498).

                 

                Yes, I use container-managed transactions. I looked at the ModeShape 3 design based on Infinispan and it looks promising :).

                • 5. Re: Testing subnodes
                  rhauch

                  All users work against the same workspace. I was using the same session for every user, but I've realized that is not a good idea, so I'll be working with a session per user. Nodes are created or changed (even the parent can change, depending on user decisions) more often than they are read, so I would say my case is a write-intensive scenario. Many users could be working over a set of subnodes, but only one user per node, although that could change in the future according to business rules.

                  In case you're not aware, the JCR specification states that Session instances are not required to be threadsafe, and in most implementations (including ModeShape 2) they are not. Instead, the specification says that Session instances should be lightweight enough that you can create a new Session with each request. (An alternative is to use some pool of Session objects; just be sure to clear out any unsaved transiently-modified state before returning a Session to the pool.)
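
                  The pooling alternative can be sketched roughly as follows. This is a generic, standard-library-only illustration and not ModeShape code: `ResettingPool` and its method names are invented for the example, and a `StringBuilder` stands in for a Session. With real JCR sessions the reset hook would be the place to discard pending changes (for instance by calling `Session.refresh(false)`, which throws `RepositoryException` and would need wrapping).

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** A tiny fixed-size pool that resets every object before it can be reused. */
class ResettingPool<T> {

    private final BlockingQueue<T> idle;
    private final Consumer<T> reset;

    /**
     * @param size    number of pooled objects created up front
     * @param factory creates one pooled object (hypothetically: a login call)
     * @param reset   discards leftover state (hypothetically: drop unsaved changes)
     */
    ResettingPool(int size, Supplier<T> factory, Consumer<T> reset) {
        this.idle = new ArrayBlockingQueue<>(size);
        this.reset = reset;
        for (int i = 0; i < size; i++) {
            idle.add(factory.get());
        }
    }

    /** Blocks until a pooled object is free, then hands it to the caller. */
    T borrow() {
        try {
            return idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a pooled object", e);
        }
    }

    /** Clears leftover state first, then makes the object available again. */
    void giveBack(T item) {
        reset.accept(item);
        idle.add(item);
    }

    public static void main(String[] args) {
        // StringBuilder is only a stand-in for a session with transient state.
        ResettingPool<StringBuilder> pool =
                new ResettingPool<>(2, StringBuilder::new, sb -> sb.setLength(0));

        StringBuilder first = pool.borrow();
        first.append("unsaved change");
        pool.giveBack(first);               // state is cleared here, before reuse

        System.out.println("length after give-back: " + first.length());
    }
}
```

                  Doing the reset inside `giveBack` rather than in `borrow` means a forgotten cleanup in caller code cannot leak one request's unsaved changes into the next.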

                   

                  Not being threadsafe does create issues for some applications, though. So we explicitly designed ModeShape 3 Sessions so that they are indeed threadsafe. So multiple threads can very easily and safely use a single ModeShape 3 Session object to read content, and the content will automatically and correctly reflect state, even as it's changed by other sessions. However, because each Session does maintain the modified transient state and persists it only upon Session.save(), your application should never have multiple threads use a single Session object to write/update content, since the modified transient state created by one thread will be visible to other threads (even before save), and the interleaving of Session.save() operations can cause the persisted data to be inconsistent from a business-logic perspective.

                   

                  As for container-managed transactions, I do hope to have this working in a few days.

                  • 6. Re: Testing subnodes
                    edcorners

                    Hi,

                     

                    Thanks for the advice on sessions. I've been checking the connector and performance tuning recommendations, but I'm still not sure about a few things. My repository is the kind that "is mostly storing files, but needs to decorate the files with custom metadata". It holds over 1.5 TB of files now. I want my new repository to support this in the best way possible, giving priority to performance while still preserving file integrity.

                     

                    Now with ModeShape 3, I want to work with the RESTful API, but I'm very new to it. Is that approach suitable for high loads of data, or should I keep working with a JNDI connection?

                     

                    Finally, I can't find a REST client .jar for ModeShape 3. Should I use the 2.8 version of it?

                    • 7. Re: Testing subnodes
                      rhauch

                      Now with ModeShape 3, I want to work with the RESTful API, but I'm very new to it. Is that approach suitable for high loads of data, or should I keep working with a JNDI connection?

                       

                      The biggest issue we have with the RESTful API is that large BINARY values are not handled well for GETs, PUTs or POSTs, since they have to be embedded within the JSON representation of the nodes. We need to enhance the API to support uploading and downloading files via a streaming mechanism. Downloading would be straightforward, since we can simply embed a URL in the JSON representation. Uploading is the challenge, since that would constitute multiple requests and (as it stands now) separate sessions for each request. Most likely, we'll have to handle multi-part requests.
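
                      To put a rough number on the embedding cost: assuming the binary value travels as Base64 text inside the JSON (the usual way to carry bytes in JSON; treat this as an assumption about the wire format rather than a statement about the implementation), the payload grows by about a third and has to be held in memory as one string. A small sketch using only `java.util.Base64`; the class name `Base64Overhead` is made up:

```java
import java.util.Base64;

/** Shows how much larger a binary value becomes once Base64-encoded. */
class Base64Overhead {

    /** Number of characters in the padded Base64 encoding of n bytes. */
    static long encodedLength(long n) {
        return 4 * ((n + 2) / 3);   // every 3 input bytes become 4 output characters
    }

    public static void main(String[] args) {
        byte[] photo = new byte[3 * 1024 * 1024];   // stand-in for a 3 MiB picture
        String encoded = Base64.getEncoder().encodeToString(photo);

        System.out.println("raw bytes:       " + photo.length);
        System.out.println("encoded chars:   " + encoded.length());
        System.out.println("predicted chars: " + encodedLength(photo.length));
    }
}
```

                      For a repository that mostly stores files, that overhead applies to every upload and download, which is why a streaming mechanism matters here.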

                       

                      The RESTful API was always intended to be a general-purpose API for remote clients. That may not be the best approach for all use cases. For example, in many cases a domain-specific RESTful interface would be much better, and that'd be pretty easy to write using JAX-RS, mapping from a specific repository structure to JSON or XML. (In fact, it'd probably be even simpler than our RESTful service implementation.)

                      Finally, I can't find a REST client .jar for ModeShape 3. Should I use the 2.8 version of it?

                      It's not yet been included in the alpha releases, but we hope to include it soon. But I would expect the 2.8 version of the RESTful client should work (as long as you use 2.8-only JARs on the client), since the RESTful API hasn't changed since 2.7 (IIRC).