5 Replies Latest reply on Jul 2, 2013 12:32 PM by rhauch

    General questions on indexes and nodes

    sebastien.michea

      I have a few questions for which I didn't find the answer in the docs.

      In all of these questions I'm only considering non-clustered mode.

       

      Indexes :

      I understand we can persist indexes in different ways (cache, files, DB, ...), but is this persistence used only when we shut down the server?

      I mean, when the server is up, are the indexes completely loaded in memory?

       

      Metadata (nodes) :

      I saw how to store binary files and indexes, but everything else is stored in the Infinispan cache, correct?

      Does that mean everything is in memory?

       

      Now a question about clustered mode:

      Can we switch from a non-clustered JBoss instance to a clustered replication topology by simply changing some configuration and restarting the servers, or do we need to back up and restore the repository?

       

      Thank you in advance !

        • 1. Re: General questions on indexes and nodes
          rhauch

          Indexes :

          I understand we can persist indexes in different ways (cache, files, DB, ...), but is this persistence used only when we shut down the server?

          I mean, when the server is up, are the indexes completely loaded in memory?

          No, the indexes are not completely loaded into memory. If you really need that and your repository is relatively small, consider using RAM storage for the indexes and completely rebuilding the indexes each time ModeShape is restarted.

           

          For other non-clustered cases, storing the indexes on the file system will likely be sufficient and quite efficient.

           

           

          Metadata (nodes) :

          I saw how to store binary files and indexes, but everything else is stored in the Infinispan cache, correct?

          Does that mean everything is in memory?

          All the node content can be stored only in memory (i.e., with no other persistence), though that's really only suggested for larger clusters where you can configure Infinispan to store multiple copies across your cluster. See our documentation about topologies.

           

          For other non-clustered and (not large) clustered topologies, you probably do want to persist the node data. And, yes, you configure this through Infinispan and its cache stores. See here and here.

           

          Note that both ModeShape and Infinispan have caches, so that recently-accessed data is kept in memory as much as possible.

           

          As for binaries, we recommend that you at least use the file system, a database, or MongoDB. You generally do not want to store binaries in-memory (on the heap), because they can be very large.

           

           

          Now a question about clustered mode:

          Can we switch from a non-clustered JBoss instance to a clustered replication topology by simply changing some configuration and restarting the servers, or do we need to back up and restore the repository?

          Yes, you can make the switch, but it will take a bit of effort to make sure the new servers' configurations are compatible. For example, be sure to start first the now-clustered server that has access to the existing persistent content (nodes, binaries, indexes). That way, when the other processes are started, the state from the first process will be transferred to the others, and any shuffling (for Infinispan's distributed mode) will happen at that time.

          • 2. Re: General questions on indexes and nodes
            sebastien.michea

            Thank you very much for your answer Randall.

             

            It took me quite some time to learn more about Infinispan, and little by little I'm starting to figure out how it works.

             

            My current configuration is :

               non-cluster mode,

               index stored in filesystem:

            <subsystem xmlns="urn:jboss:domain:modeshape:1.0">
                <repository name="nbeportal" cache-name="nbeportal" cache-container="modeshape">
                    <local-file-index-storage
                        path="modeshape/nbeportal/indexes"
                        relative-to="jboss.server.data.dir"
                        access-type="auto"
                        locking-strategy="native">
                    </local-file-index-storage>
                </repository>
            </subsystem>

             

               node content stored in filesystem :

             

            <local-cache name="nbeportal">
                <transaction mode="NON_XA"/>
                <eviction strategy="LIRS" max-entries="10000"/>
                <file-store relative-to="jboss.server.data.dir" path="modeshape/store/nbeportal" passivation="false" purge="false"/>
            </local-cache>

             

            I still have a few questions:

            1) How can I find out and configure the amount of memory used by the indexes loaded in memory?

            2) I have to do some reporting on the nodes, for instance counting how many nodes have a certain property set to some value. It looks like there is no way to do aggregation in JCR queries, so we end up loading all the nodes and doing the aggregation in Java. Unfortunately, it's really slow and often throws an OutOfMemory error. What is the best way to generate the report?
            We are thinking of configuring the Infinispan cache so that nodes are stored in a database that we can query directly, but is there a way to migrate the current content of the cache to the database, or will we lose everything?

            3) When querying nodes using a limit (using 3.1.1.Final on JBoss 7.1.1), it seems that all nodes are loaded in memory (in the log I see they are "materialized") regardless of the value of the limit. Is that normal?

            4) We are trying to migrate to 3.3.0.Final on JBoss EAP 6.1, but when we copy the nodes stored by Infinispan we get an error at startup when rebuilding the indexes:

            11:21:25,885 ERROR [org.infinispan.loaders.file.FileCacheStore] (ServerService Thread Pool -- 56) ISPN000062: Error while reading from file: /Users/babyfooter/Documents/jboss-eap-6.1/standalone/data/modeshape/store/nbeportal/nbeportal/2121488384: java.io.StreamCorruptedException: Unexpected byte found when reading an object: 0

                      at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:750)

            When trying the same procedure on another JBoss 7.1.1 instance, it works correctly. Any idea?

             

             

            Thank you

            • 3. Re: General questions on indexes and nodes
              rhauch

              I still have a few questions:

              1) How can I find out and configure the amount of memory used by the indexes loaded in memory?

              Since you're storing your indexes on the file system, ModeShape will use Lucene's FSDirectory (or one of its subclasses based upon the "access-type" XML attribute on the "local-file-index-storage" XML element in the AS7/EAP configuration, or the "fileSystemAccessType" field in the JSON configuration file for non-AS7/EAP uses). I'm not sure that Lucene allows us to say how much memory will be used for the indexes (I suspect relatively little, since IIRC the files are used when responding to queries).

               

              ModeShape, on the other hand, will materialize nodes as it processes criteria. (The reason is a little arcane and has to do with the JCR spec's requirement that a session's queries see the session's view of all the nodes.) So, yes, if your query is accessing a lot of nodes, you may run into a problem.

               

              One way to address that is to configure each workspace's cache. Internally, ModeShape maintains a shared cache of nodes for each workspace, and every session using a particular workspace will share that workspace's cache. (This workspace cache is really a cache in the truest sense of the word, and it is not the same Infinispan "cache" used for persisting the content of a repository.) By default, ModeShape uses an in-memory Infinispan cache with eviction and expiration, and this default Infinispan configuration is loaded from a file in ModeShape's JARs. The limit is 10K nodes, which is probably too high for your nodes. The good news is that you can override this default configuration; the bad news is that it's not terribly straightforward, and the way Infinispan is configured in AS7/EAP makes it even harder (since you have to explicitly define each cache).

               

              In the EAP configuration:

               

              1. Define a new cache container in the Infinispan subsystem that you'll use expressly for these workspace caches. For example, "nbeportalWorkspaceCaches".
              2. Define a local in-memory cache for each of your workspaces inside this cache container, using expiration and eviction values and no cache store and no transactions, and name this cache to be "yourRepositoryName/yourWorkspaceName" (obviously using the actual repository and workspace names). Something like this for the "default" workspace (which is equivalent to our default configuration), but with lower values:

                                <local-cache name="nbeportal/default">
                                    <transaction mode="NON_XA"/>
                                    <eviction strategy="LIRS" max-entries="10000"/>
                                    <expiration max-idle="120000"/>
                                </local-cache>

              3. In your repository configuration, define the workspaces and set the cache container to the name of the cache container you set up in step 1. For example:

                      <subsystem xmlns="urn:jboss:domain:modeshape:1.0">
                          <repository name="nbeportal">
                              <workspaces cache-container="nbeportalWorkspaceCaches">
                                  <workspace name="default"/>
                              </workspaces>
                              ...
                          </repository>
                      </subsystem>

               

              When the repository starts up, it will look for an existing cache named "{repositoryName}/{workspaceName}" in the designated cache container, and it will use that for the workspace's cache of nodes. The "max-entries" is what limits the size that will be kept in the cache.

               

              For other readers not using AS7/EAP and instead using JSON repository configuration files, setting this up is much easier. Simply define the location of the Infinispan configuration file (that will look something like the default configuration file) in the "cacheConfiguration" field under the "workspaces" nested document. For example:

               

              {
                  "name" : "MyRepository",
                  "workspaces" : {
                      "predefined" : ["otherWorkspace"],
                      "default" : "default",
                      "cacheConfiguration" : "path/to/infinispan-config.xml",
                      "allowCreation" : true
                  },
                  ...
              }

               

               

              2) I have to do some reporting on the nodes, for instance counting how many nodes have a certain property set to some value. It looks like there is no way to do aggregation in JCR queries, so we end up loading all the nodes and doing the aggregation in Java. Unfortunately, it's really slow and often throws an OutOfMemory error. What is the best way to generate the report?

              We are thinking of configuring the Infinispan cache so that nodes are stored in a database that we can query directly, but is there a way to migrate the current content of the cache to the database, or will we lose everything?

              Unfortunately, neither JCR nor ModeShape defines aggregate functions, so the best way to do this is to walk the structure yourself. Is this something you can do in the background, or is it important that it be done quickly and immediately?

               

              Configuring and tuning the workspace cache as described above will also address the OutOfMemory problem.

               

              I'm not so sure that storing the content inside a database will help, since ModeShape (rather, Infinispan's JDBC cache stores) will store the node definition in BLOB form.
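              The "walk the structure yourself" approach can be sketched roughly as follows. This is only an illustration: SimpleNode is a hypothetical in-memory stand-in for javax.jcr.Node (not a ModeShape or JCR class), and the "status" property used in main is made up. The point is that the walk keeps a running count and a stack of pending nodes, rather than collecting every matching node into a list before counting.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these types belong to ModeShape or the JCR API.
final class CountWalkSketch {

    /** Stand-in for javax.jcr.Node: named string properties plus child nodes. */
    static final class SimpleNode {
        final Map<String, String> properties = new HashMap<>();
        final List<SimpleNode> children = new ArrayList<>();

        SimpleNode with(String name, String value) {
            properties.put(name, value);
            return this;
        }

        SimpleNode add(SimpleNode child) {
            children.add(child);
            return this;
        }
    }

    /**
     * Depth-first walk that counts nodes whose property equals the given value.
     * Only a running count and the stack of not-yet-visited nodes are retained.
     */
    static long countMatching(SimpleNode root, String property, String value) {
        long count = 0;
        Deque<SimpleNode> pending = new ArrayDeque<>();
        pending.push(root);
        while (!pending.isEmpty()) {
            SimpleNode node = pending.pop();
            if (value.equals(node.properties.get(property))) {
                count++;
            }
            for (SimpleNode child : node.children) {
                pending.push(child);
            }
        }
        return count;
    }

    public static void main(String[] args) {
        SimpleNode root = new SimpleNode()
            .add(new SimpleNode().with("status", "approved"))
            .add(new SimpleNode().with("status", "draft")
                .add(new SimpleNode().with("status", "approved")));
        System.out.println(countMatching(root, "status", "approved"));
    }
}
```

              With real JCR nodes the same loop shape would apply, but each visited node is still materialized by the repository, so the workspace cache settings discussed above continue to matter.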

               

               

              3) When querying nodes using a limit (using 3.1.1.Final on JBoss 7.1.1), it seems that all nodes are loaded in memory (in the log I see they are "materialized") regardless of the value of the limit. Is that normal?

              Yes, it's normal, although it technically depends on what other constraints you use in the query. Most such limit queries will require materialization.

               

               

              4) We are trying to migrate to 3.3.0.Final on JBoss EAP 6.1, but when we copy the nodes stored by Infinispan we get an error at startup when rebuilding the indexes:

              11:21:25,885 ERROR [org.infinispan.loaders.file.FileCacheStore] (ServerService Thread Pool -- 56) ISPN000062: Error while reading from file: /Users/babyfooter/Documents/jboss-eap-6.1/standalone/data/modeshape/store/nbeportal/nbeportal/2121488384: java.io.StreamCorruptedException: Unexpected byte found when reading an object: 0

                        at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:750)

              When trying the same procedure on another JBoss 7.1.1 instance, it works correctly. Any idea?

              No, I'm not sure what might be the culprit here. We try very hard to keep those files backward compatible. Can you provide more of the stack trace?

               

              The challenge is that ModeShape 3.1.1 uses Infinispan 5.1.x, while ModeShape 3.3.0 uses Infinispan 5.2.x. That means the culprit could be in either one. I'd suggest looking at our backup and restore feature to back up the 3.1.1 instance and then restore it using the 3.3.0 instance.

              • 4. Re: General questions on indexes and nodes
                sebastien.michea

                Thank you for the answer Randall.

                 

                one of its subclasses based upon the "access-type" XML attribute on the "local-file-index-storage"

                 

                Where can I find the possible values? I guess I would like to use NIOFSDirectory...

                 

                 

                About workspace cache configuration, in your example

                                    <eviction strategy="LIRS" max-entries="10000"/>

                This is the value I should lower, right?

                To understand how it is used, let's imagine I set max-entries to 1. When counting the nodes that satisfy some criteria, ModeShape will use the index to get the IDs of the nodes; then each node will be materialized (i.e., loaded into the workspace cache?), the result will be incremented, the entry will be evicted, and the process will continue with the next node. Is this correct? If so, memory usage should not increase regardless of how many nodes I have to count.

                About aggregation :

                is it important that it be done quickly and immediately?

                For the reporting it is not necessary to do it quickly; I could store the result and then visualize it later.

                A problem comes up when I try to display the documents of a customer (documents are extensions of nt:file in some nt:directory tree).
                I have a screen that allows searching with custom criteria on the indexed fields of the documents.

                The result is displayed in a data table with pagination.

                But in order to paginate, I have to count how many documents are returned by the query. This is slow when I have lots of nodes to display (around 10k).

                 

                I'm experimenting with the workspace cache, but I had a problem with the configuration you proposed: JBoss complains that I cannot use transactions in the default workspace. I had to set

                <local-cache name="nbeportal/default">

                                    <transaction mode="NON_XA"/>

                 

                Thank you for the link to the backup/restore feature; I'll try it. I'll also provide more of the stack trace.

                • 5. Re: General questions on indexes and nodes
                  rhauch

                  Thank you for the answer Randall.

                   

                  one of its subclasses based upon the "access-type" XML attribute on the "local-file-index-storage"

                  Where can I find the possible values? I guess I would like to use NIOFSDirectory...

                  Since you're using AS7/EAP, the possible values are defined in the XSD for the ModeShape subsystem. The attribute is defined and documented here, and the enumeration of values is defined here.

                   

                   

                  About workspace cache configuration, in your example

                                      <eviction strategy="LIRS" max-entries="10000"/>

                  This is the value I should lower, right?

                  Yes. I'd try 1000 to start with, but definitely play with different values. Note that the effect will depend on the total memory allotted to AS7/EAP. Also, ModeShape performs better the more memory it has available.

                  To understand how it is used, let's imagine I set max-entries to 1. When counting the nodes that satisfy some criteria, ModeShape will use the index to get the IDs of the nodes; then each node will be materialized (i.e., loaded into the workspace cache?), the result will be incremented, the entry will be evicted, and the process will continue with the next node. Is this correct? If so, memory usage should not increase regardless of how many nodes I have to count.

                  Yes, your summary is accurate. When a node is materialized, its binary serialized form is read from persistent storage and converted into an in-memory object representation, and this in-memory object representation is placed inside the workspace cache. So, the downside of having too small a max-entries value for the workspace cache is that you'll be doing a lot of "paging" (evict some node in the cache, then read the serialized form, convert it into the object representation, and put that into the workspace cache), which will decrease throughput. Of course, the downside of too large a max-entries value is OutOfMemory exceptions.

                   

                  Note that it doesn't matter how many sessions are actively reading content from the same workspace, since all sessions for a given workspace will share the same workspace cache. However, if you have lots of sessions reading/scanning/querying lots of different nodes, the nodes they work with will compete for space inside the workspace cache, and the result will be excess paging and decreased throughput.
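                  To make the paging trade-off concrete, here is a toy simulation. It is only a sketch under stated assumptions: the class names are hypothetical, it uses a simple LRU policy built on java.util.LinkedHashMap instead of the LIRS strategy from the configuration above, and the "node" values are just strings. It counts how many times a node has to be loaded from storage when the same set of nodes is scanned repeatedly with different max-entries values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical simulation; this is not ModeShape's or Infinispan's implementation.
final class WorkspaceCacheSketch {

    /** Access-ordered map that evicts the least recently used entry beyond maxEntries. */
    static final class BoundedCache<K, V> extends LinkedHashMap<K, V> {
        private final int maxEntries;

        BoundedCache(int maxEntries) {
            super(16, 0.75f, true); // true = order entries by access (LRU)
            this.maxEntries = maxEntries;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > maxEntries;
        }
    }

    /**
     * Scans node ids 0..nodeCount-1 sequentially, 'passes' times, and returns how
     * often a node had to be loaded (a cache miss, i.e. one "materialization").
     */
    static long countLoads(int maxEntries, int nodeCount, int passes) {
        BoundedCache<Integer, String> cache = new BoundedCache<>(maxEntries);
        long loads = 0;
        for (int pass = 0; pass < passes; pass++) {
            for (int id = 0; id < nodeCount; id++) {
                if (cache.get(id) == null) {
                    loads++;                      // simulate reading from persistent storage
                    cache.put(id, "node-" + id);  // may evict the eldest entry
                }
            }
        }
        return loads;
    }

    public static void main(String[] args) {
        System.out.println("max-entries=1:   " + countLoads(1, 100, 2) + " loads");
        System.out.println("max-entries=100: " + countLoads(100, 100, 2) + " loads");
    }
}
```

                  In this toy model, memory stays bounded by max-entries in both cases; the small cache simply reloads every node on every pass, while the cache that fits the whole working set loads each node once.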

                   

                   

                  About aggregation :

                  is it important that it be done quickly and immediately?

                   

                  For the reporting it is not necessary to do it quickly; I could store the result and then visualize it later.

                  A problem comes up when I try to display the documents of a customer (documents are extensions of nt:file in some nt:directory tree).
                  I have a screen that allows searching with custom criteria on the indexed fields of the documents.

                  The result is displayed in a data table with pagination.

                  But in order to paginate, I have to count how many documents are returned by the query. This is slow when I have lots of nodes to display (around 10k).

                  This is where LIMIT and OFFSET are useful. You shouldn't need to count the number of documents to page through results, unless you really need to show the total to the user.

                   

                  BTW, do you know that you can get the number of rows from the QueryResult? It's not easy to see in the standard JCR API, but it's there and it will save you from having to iterate through all of the nodes or rows:  queryResult.getRows().getSize()
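                  As a rough sketch of the paging arithmetic, the helper below appends LIMIT and OFFSET to a query statement for a given page and derives the page count from a row total such as the one returned by getSize(). The class, the method names, and the [acme:document] node type in main are hypothetical and exist only for illustration. Note that getSize() is allowed to return -1 when the size is unknown, which the sketch passes through. With the standard JCR 2.0 API you could instead call Query.setLimit(long) and Query.setOffset(long) rather than editing the statement text.

```java
// Hypothetical helper; not part of ModeShape or the JCR API.
final class PagingSketch {

    /** Appends LIMIT/OFFSET for a zero-based page index to a query statement. */
    static String pageOf(String baseStatement, int pageIndex, int pageSize) {
        if (pageIndex < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("pageIndex must be >= 0 and pageSize > 0");
        }
        long offset = (long) pageIndex * pageSize;
        return baseStatement + " LIMIT " + pageSize + " OFFSET " + offset;
    }

    /** Pages needed to show totalRows rows; a negative total means "unknown". */
    static long pageCount(long totalRows, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be > 0");
        }
        if (totalRows < 0) {
            return -1; // mirrors getSize() returning -1 when the size is not known
        }
        return (totalRows + pageSize - 1) / pageSize; // ceiling division
    }

    public static void main(String[] args) {
        // [acme:document] is a made-up node type used only for this example.
        String base = "SELECT * FROM [acme:document]";
        System.out.println(pageOf(base, 2, 25));
        System.out.println(pageCount(10000, 25));
    }
}
```

                  Only the page being displayed needs to be fetched this way; the total is needed only if the UI shows a page count.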

                   

                   

                  I'm experimenting with the workspace cache, but I had a problem with the configuration you proposed: JBoss complains that I cannot use transactions in the default workspace. I had to set

                  <local-cache name="nbeportal/default">

                                      <transaction mode="NON_XA"/>

                  That's the same thing I provided, which I think was a copy/paste error on my part. Did you try the following?

                   

                       <transaction mode="NONE"/>