8 Replies Latest reply on Nov 14, 2012 8:50 AM by multanis

    Total content size: query slow

    multanis

      Hello,

       

My problem is probably a common issue, but I cannot find any useful tips for handling it.

       

I want to get the total size of all jcr:data nodes under a folder. Here is the query I use:

       

      select node.[jcr:data] from [nt:resource] as node where [jcr:name] = 'jcr:content' and [jcr:path] LIKE '" + path + "/%';
      

       

Then I iterate over the rows like this:

       

       

      RowIterator rowIterator = query.execute().getRows();
      long totalSize = 0;
      while (rowIterator.hasNext()) {
          Row row = rowIterator.nextRow();
          totalSize += row.getValue("jcr:data").getBinary().getSize();
      }
      

       

It works fine, but then I tried it on a folder containing 50 subfolders with 100 files each (5000 files in total), and it takes 10 to 18 seconds to complete.

       

Any idea how I can improve this? My goal is to restrict the disk usage of certain folders with a quota, so I need to check the total disk usage on every write; I cannot afford such a slow query.
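One common pattern for this kind of quota enforcement is to avoid running an aggregate query on every write at all: keep a running total per quota-controlled folder and adjust it as content is written or removed. A minimal stand-alone sketch of that idea (class and method names are hypothetical, and it deliberately involves no JCR at all):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical write-time accounting: instead of re-querying the whole
// subtree on every write, keep a running total per quota-controlled folder
// and update it whenever a file is added, replaced, or removed.
public final class QuotaTracker {
    private final Map<String, Long> totals = new HashMap<>();

    // Returns true if the write fits within the quota and records the delta;
    // returns false (and records nothing) if the quota would be exceeded.
    public synchronized boolean tryWrite(String folder, long deltaBytes, long quotaBytes) {
        long current = totals.getOrDefault(folder, 0L);
        if (current + deltaBytes > quotaBytes) {
            return false; // reject the write: quota would be exceeded
        }
        totals.put(folder, current + deltaBytes);
        return true;
    }

    // Current accounted usage for a folder (0 if never written to).
    public synchronized long usage(String folder) {
        return totals.getOrDefault(folder, 0L);
    }
}
```

The totals would need to be seeded once (for example by a single full scan at startup) and then kept in sync with every write and delete, so the expensive query runs once rather than on each write.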

       

      Modeshape: 2.8.3

      Source connector: FileSystemSource

       

Thanks!

        • 1. Re: Total content size: query slow
          rhauch

Then I iterate over the rows like this:

          RowIterator rowIterator = query.execute().getRows();
          long totalSize = 0;
          while (rowIterator.hasNext()) {
              Row row = rowIterator.nextRow();
              totalSize += row.getValue("jcr:data").getBinary().getSize();
          }
          

           

It works fine, but then I tried it on a folder containing 50 subfolders with 100 files each (5000 files in total), and it takes 10 to 18 seconds to complete.

           

          The problem is that you're explicitly loading the node, its properties, and asking for the binary value's size for all 5000 files, and I think you're just hitting the performance limit of the FileSystemSource.

           

          One way of getting around this is to store the size as a property on either the "jcr:content" node or the parent "nt:file" node (whichever suits your needs better), and then use that in your queries.
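To illustrate the suggestion above in isolation: once each file's size is stored alongside it, answering the size question becomes a lookup over precomputed numbers rather than a read of every binary. A stand-alone sketch using a sorted map keyed by path (the names are hypothetical, and a real implementation would store the size as a JCR property and query that property instead):

```java
import java.util.TreeMap;

// Stand-in for "store the size as a property": an index from node path to
// its recorded size, summed with a path-prefix filter instead of opening
// each binary value. Illustrative only; not a real JCR structure.
public final class SizeIndex {
    private final TreeMap<String, Long> sizesByPath = new TreeMap<>();

    // Record (or update) the size for one file path.
    public void record(String path, long sizeBytes) {
        sizesByPath.put(path, sizeBytes);
    }

    // Sum the recorded sizes of all entries under the given folder.
    public long totalUnder(String folder) {
        String prefix = folder.endsWith("/") ? folder : folder + "/";
        // All keys starting with the prefix fall in [prefix, prefix + '\uffff').
        return sizesByPath.subMap(prefix, prefix + Character.MAX_VALUE)
                          .values().stream()
                          .mapToLong(Long::longValue)
                          .sum();
    }
}
```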

          • 2. Re: Total content size: query slow
            multanis

            Thanks for your (really) fast answer Randall,

             

I already tried storing the size in a property; it is a bit better but still not fast enough (6 to 10 seconds):

             

            select node.[" + JCR_SIZE_PROPERTY + "] from [nt:file] as node where [jcr:path] LIKE '" + normalizePath(path) + "/%'
            

             

             

            Query query = manager.createQuery(queryStatement, Query.JCR_SQL2);
            RowIterator rowIterator = query.execute().getRows();
            long totalSize = 0;
            while (rowIterator.hasNext()) {
                Row row = rowIterator.nextRow();
                totalSize += row.getValue(JCR_SIZE_PROPERTY).getLong();
            }
            

             

Does the FileSystemSource still have work to do when the query is used like this?

             

I thought the properties were indexed by Lucene, so I was hoping Lucene alone would be able to answer this query. Am I wrong?

            • 3. Re: Total content size: query slow
              rhauch

              If I remember correctly, accessing a query result row in 2.x does access the node, which means the file system connector is still involved. This is due to the requirement in JCR that the results have to reflect the node's state within the Session.

              1 of 1 people found this helpful
              • 4. Re: Total content size: query slow
                multanis

Ok, so if I understand correctly, Lucene is used to retrieve the ids of the nodes matching the query, and then the entire nodes are always fetched through the connector.

After a bit more testing, it seems most of the time is spent in the:

                 

            while (rowIterator.hasNext()) {
                Row row = rowIterator.nextRow();
                totalSize += row.getValue(JCR_SIZE_PROPERTY).getLong();
            }
                

                 

and not during the query execution itself, which supports the idea that Lucene is only used to retrieve ids.
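One simple way to confirm where the time goes is to time the two phases separately. This stand-in harness contains no JCR calls; the two runnables are placeholders for query.execute() and for the row-iteration loop (a sketch, assuming System.nanoTime is a suitable monotonic clock for coarse timing):

```java
// Illustrative timing harness: measure the "execute" phase separately from
// the "iterate" phase. The runnables are stand-ins for real workloads.
public final class PhaseTimer {
    // Returns { executeNanos, iterateNanos }.
    public static long[] timePhases(Runnable executePhase, Runnable iteratePhase) {
        long t0 = System.nanoTime();
        executePhase.run();            // e.g., query.execute().getRows()
        long t1 = System.nanoTime();
        iteratePhase.run();            // e.g., the while (rowIterator.hasNext()) loop
        long t2 = System.nanoTime();
        return new long[] { t1 - t0, t2 - t1 };
    }
}
```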

                 

Any idea how I can improve this? I guess changing the connector will not help, since the same mechanism will still be used, right?

                 

Another thing: the query execution (which I guess only involves Lucene) takes 400 to 500 ms, which is better but still huge compared to the same request directly against a database (10 ms). Is there something I can configure to improve performance?

                 

Thanks!

                • 5. Re: Total content size: query slow
                  rhauch

Yes, Lucene is basically used as an index to return the identifiers of the nodes, but the nodes themselves are always fetched when the results are accessed.

                   

                  Unfortunately, in 2.x the file system connector does not have a caching mechanism, though even that would not help if the nodes returned by the query haven't been accessed recently.

                   

                  The choice of connector will have a dramatic effect on the performance, since the connector is at the heart of storing and retrieving nodes. The total time for accessing the query results does seem excessive, but then again the file system connector isn't the fastest. Have you tried using a different connector? Perhaps the JPA connector, disk-based connector, or Infinispan connector? I presume you're using the file system connector because you're accessing (as nodes) the files and directories on a file system. If that's not your goal, then you should not be using the file system connector.

                  • 6. Re: Total content size: query slow
                    multanis

Yes, I'm using the file system connector because I want the repository content to be readable (and I already have files on that file system that I want ModeShape to handle).

Actually, what I've done is use the FileSystem connector with a custom customPropertiesFactory that stores and retrieves properties from a (MySQL) database. The customPropertiesFactory was probably not designed for this usage, but it works. However, performance is not improved by this database-backed properties handler.

                     

I think I'm going to bypass the JCR query system for this request and go directly to the database. It's not really good design, but I can't find any other acceptable solution.
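Since the FileSystemSource is backed by ordinary files on disk, another possible bypass, besides querying the database directly, is to walk the directory tree with the plain java.nio.file API and sum file sizes, sidestepping JCR altogether. A self-contained sketch (it assumes the repository path maps directly to a file system path, and it only counts regular files):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Sketch: sum the on-disk size of every regular file under a folder by
// walking the backing file system directly, bypassing JCR entirely.
public final class DiskUsage {
    public static long totalSize(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .mapToLong(p -> {
                            try {
                                return Files.size(p);
                            } catch (IOException e) {
                                return 0L; // skip files that vanished or are unreadable
                            }
                        })
                        .sum();
        }
    }
}
```

The tradeoff is that this sees only what is on disk, so it would miss any metadata or content ModeShape keeps elsewhere (for example, properties stored through a custom properties factory).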

                     

Just for my knowledge, as I cannot afford the migration: could using ModeShape 3.x solve this problem in any way?

                    • 7. Re: Total content size: query slow
                      rhauch

Yes, I'm using the file system connector because I want the repository content to be readable (and I already have files on that file system that I want ModeShape to handle).

Actually, what I've done is use the FileSystem connector with a custom customPropertiesFactory that stores and retrieves properties from a (MySQL) database. The customPropertiesFactory was probably not designed for this usage, but it works. However, performance is not improved by this database-backed properties handler.

                      I would be surprised if this performs satisfactorily. You might try the query against the file system connector with the default file-based properties factory, just to see the performance overhead of your custom properties factory.

                       

Another option might be to use a different (storage) connector, and to import all of the files/directories into the JCR repository directly. Remember, with WebDAV you have the ability to treat the repository as a network file share. (This approach may not be feasible for you; it all depends on whether/how other systems still access the file system directly.)

                       

Just for my knowledge, as I cannot afford the migration: could using ModeShape 3.x solve this problem in any way?

                       

ModeShape 3.0 can persist content in a variety of storage technologies (the equivalent of 2.x's storage connectors), but it doesn't yet have connectors that access external systems. ModeShape 3.1 (due early next month) will have those connectors and will include a file system connector (among others). But 3.x has a far superior caching system that improves performance by reducing the number of times the connector is asked to load nodes from the external system. It is also faster than 2.x overall and does a much better job storing large BINARY values (e.g., file content).

                      • 8. Re: Total content size: query slow
                        multanis

Using the default file-based properties factory or mine doesn't change the performance (of my query) at all. I cannot use WebDAV (or another connector), as I want to keep my repository readable from the file system.

                         

So I will keep querying the database directly...

                         

Thanks for your time!