7 Replies Latest reply on Mar 5, 2013 5:44 AM by paulpa63

    Indexing Problem with Infinispan 5.2.1 (updated) - Continues

    paulpa63

      Hi,

       

      This problem report is going to appear a little vague, but I have spent a day trying to diagnose / trace it without much success, so I thought it was worth seeing if anybody else has seen a similar problem and/or has any clues as to what is going on...

       

      Our application has been developed on Infinispan 5.1.5; it makes use of indexing to retrieve objects and defines transactions on the caches.  We have recently been evaluating Infinispan 5.2 / 5.2.1 because it contains some fixes related to indexing and transactions which we really need.  We have a test data generator which we use to populate our caches for testing.  Under 5.1.5 this has always operated without problems.  With 5.2.1 we now see errors from the Lucene indexing system which cause the data load to fail and consequently make the application unusable.  Specifically, these errors are:

       

      20857 [Lucene Merge Thread #0 for index com.hp.ampa.dd.domain.AbstractObject] ERROR org.hibernate.search.exception.impl.LogErrorHandler - HSEARCH000058: HSEARCH000118: Exception during index Merge operation

      org.apache.lucene.index.MergePolicy$MergeException: java.io.FileNotFoundException: /tmp/lucene4/com.hp.ampa.dd.domain.AbstractObject/_fd.cfs (Too many open files)

      at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:509)

      at org.hibernate.search.backend.impl.lucene.overrides.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:58)

      at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)

      Caused by: java.io.FileNotFoundException: /tmp/lucene4/com.hp.ampa.dd.domain.AbstractObject/_fd.cfs (Too many open files)

      at java.io.RandomAccessFile.open(Native Method)

      at java.io.RandomAccessFile.<init>(RandomAccessFile.java:216)

      at org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput$Descriptor.<init>(SimpleFSDirectory.java:71)

      at org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput.<init>(SimpleFSDirectory.java:98)

      at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.<init>(NIOFSDirectory.java:92)

      at org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:79)

      at org.apache.lucene.index.CompoundFileReader.<init>(CompoundFileReader.java:65)

      at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:75)

      at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:116)

      at org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:696)

      at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4238)

      at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3908)

      at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:388)

      at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:456)

       

      Initially I tried experimenting with the various configuration properties for the Lucene / Hibernate Search file indexes, e.g. merge_factor, max_merge_docs, etc.  I tried most of them with little effect; the problem appears to be more fundamental...
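
      For reference, these are standard Hibernate Search index-writer properties, passed through the Infinispan indexing configuration roughly as in the sketch below (illustrative only; the values and the index path are examples rather than our actual settings):

      import org.infinispan.configuration.cache.Configuration;
      import org.infinispan.configuration.cache.ConfigurationBuilder;
      import org.infinispan.transaction.TransactionMode;

      public final class IndexTuningExample {

          // Illustrative only: a transactional, indexed cache configuration with
          // Lucene index-writer tuning properties passed through to Hibernate Search.
          static Configuration indexedCacheConfiguration() {
              return new ConfigurationBuilder()
                  .transaction().transactionMode(TransactionMode.TRANSACTIONAL)
                  .indexing().enable()
                      .addProperty("default.directory_provider", "filesystem")
                      .addProperty("default.indexBase", "/tmp/lucene4")
                      .addProperty("default.indexwriter.merge_factor", "10")
                      .addProperty("default.indexwriter.max_merge_docs", "10000")
                  .build();
          }
      }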

       

      I then switched to monitoring the application's use of file handles by running the Linux "lsof" command while the data load is in progress.  I can see that the count for our process increases linearly as objects are added, until it presumably exceeds the "ulimit -n" setting.  If I monitor the same situation with 5.1.5 in use, the count of open files reported by lsof is stable.  The only difference is switching between Infinispan 5.1.5 and 5.2.1.
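
      (For reference, a trivial helper along the lines below, assuming a Linux /proc filesystem, reports roughly the same count from inside the JVM as running lsof against the process:)

      import java.io.File;

      // Rough equivalent of watching `lsof -p <pid>` from inside the JVM.
      // Assumes a Linux /proc filesystem.
      public final class FdMonitor {

          static int openFileDescriptorCount() {
              File[] fds = new File("/proc/self/fd").listFiles();
              return fds == null ? -1 : fds.length;
          }

          public static void main(String[] args) throws InterruptedException {
              while (true) {
                  System.out.println("open file descriptors: " + openFileDescriptorCount());
                  Thread.sleep(5000L);
              }
          }
      }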

       

      The test involves injecting 10,000 objects into the cache, although with 5.2.1 only a couple of hundred objects are successfully processed before the exceptions start.  To add a little further confusion, I have found that within our set of object types/caches this problem affects some but not others, even though their configurations are identical.  The objects obviously have different structures; they are all defined as @Indexed with various fields marked as @Field.
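
      For illustration only, the affected types are ordinary annotated domain classes along these lines (the class and field names here are made up, not our real model):

      import java.io.Serializable;

      import org.hibernate.search.annotations.Field;
      import org.hibernate.search.annotations.Indexed;
      import org.hibernate.search.annotations.ProvidedId;

      // Made-up example of how the domain types are annotated; the cache key is
      // used as the Lucene document id via @ProvidedId.
      @Indexed
      @ProvidedId
      public class ExampleDomainObject implements Serializable {

          @Field
          private String name;

          @Field
          private String category;

          public ExampleDomainObject(String name, String category) {
              this.name = name;
              this.category = category;
          }
      }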

       

      The runtime environment is Java 1.6.0_32 on RHEL 5.6.

       

      Any suggestions / ideas are appreciated,

       

      Paul

        • 1. Re: Indexing Problem with Infinispan 5.2.1 (updated)
          paulpa63

          OK,

           

          I have today taken my application apart line-by-line and traced the single statement which was triggering this problem.  The method which creates a new object of the offending type includes an index query to check whether a similar object is already present in the cache.  It appears that as the number of objects increased and the index grew, this query took longer and longer relative to the object creation (which was running in a tight loop), until, for a reason I do not yet understand, the background Lucene indexer began running out of file handles.  This seems to affect only Infinispan >= 5.2.  Now that I have identified the trigger statement I realise it was very inefficient: for convenience I was reusing an existing query which retrieved the underlying set of objects, whereas I only needed to check the hit count.  So the fix was painless, but the full explanation of why this only started to behave badly in Infinispan >= 5.2 (and whether there are other "danger" scenarios now) remains a bit of a mystery.
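
          In other words, the fix was to ask the index only for a hit count rather than materialising the matching objects; roughly this shape (a sketch only, and the "name" field and query here are illustrative, not our real code):

          import org.apache.lucene.search.Query;

          import com.hp.ampa.dd.domain.AbstractObject;

          import org.infinispan.Cache;
          import org.infinispan.query.CacheQuery;
          import org.infinispan.query.Search;
          import org.infinispan.query.SearchManager;

          public final class ExistenceCheck {

              // Illustrative shape of the fix: count hits instead of retrieving the objects.
              static boolean alreadyPresent(Cache<String, Object> cache, String candidateName) {
                  SearchManager searchManager = Search.getSearchManager(cache);
                  Query luceneQuery = searchManager.buildQueryBuilderForClass(AbstractObject.class).get()
                          .keyword().onField("name").matching(candidateName).createQuery();
                  CacheQuery query = searchManager.getQuery(luceneQuery, AbstractObject.class);

                  // Previously: query.list() and inspecting the returned objects.
                  // Now: only the index-level hit count is needed.
                  return query.getResultSize() > 0;
              }
          }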

           

          HTH

           

          Paul

          • 2. Re: Indexing Problem with Infinispan 5.2.1 (updated)
            paulpa63

            Hi,

             

            The problem with Hibernate Search / Lucene indexing under Infinispan 5.2.1 is still affecting our application, and at the moment it means that 5.2.1 is unusable for us.

             

            My previously described correction to our application allows the data load to complete during initialisation of our stress test.  However, when I then let the stress test execute (driven by JMeter), we reach about 1000 samples/operations and then the index errors described in my original post reappear.  Please note once more that this fault only happens with 5.2.1: if I switch Infinispan to 5.1.5 I have successfully run the stress test over a weekend, reaching >200,000 samples, and I have no doubt it would run indefinitely in that configuration.  When running against 5.2.1 I have again been monitoring the file utilisation of our server JVM using lsof; when comparing successive runs of this command I see lots of newly created Lucene index files and also lots of index files marked "(deleted)".

             

            My opinion is that this problem may be related to the patch introduced, at my request, to fix the management of remote indexes when running in a transactional context (see discussion thread https://community.jboss.org/thread/205234).  I think that something in this patch is causing the Lucene indexer to not properly clean up its index files.

             

            Paul

            • 3. Re: Indexing Problem with Infinispan 5.2.1 (updated)
              paulpa63

              I have now managed to produce a minimal test program which demonstrates this problem; see the code attached.  (The code had to be transcribed due to restrictions in my work environment, so it may contain minor syntax errors.)

               

              The test environment is Java 1.6.0_32 running on RHEL 5.6.  If the test is first run against Infinispan 5.1.5 and the index files are monitored using "lsof -p <java_process_id>", the number of open files is reasonably stable (in my environment it fluctuates around 90).  However, if the test is run against Infinispan 5.2.1, the number of files quickly ramps up to > 1000 and there are many index entries marked "(deleted)".  Surprisingly, in this simple test the program does not crash or start reporting the exceptions described in my original post, but the other symptoms are identical.
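
              In outline, the test does something along the lines of the sketch below (an illustrative reconstruction rather than the exact attached code, assuming the infinispan-query module is on the classpath; the entity, cache name and index path are just placeholders):

              import java.io.Serializable;

              import org.hibernate.search.annotations.Field;
              import org.hibernate.search.annotations.Indexed;
              import org.hibernate.search.annotations.ProvidedId;
              import org.infinispan.Cache;
              import org.infinispan.configuration.cache.ConfigurationBuilder;
              import org.infinispan.manager.DefaultCacheManager;
              import org.infinispan.query.CacheQuery;
              import org.infinispan.query.Search;
              import org.infinispan.query.SearchManager;

              public class IndexFileLeakTest {

                  @Indexed
                  @ProvidedId
                  public static class Entry implements Serializable {
                      @Field
                      String name;

                      Entry(String name) {
                          this.name = name;
                      }
                  }

                  public static void main(String[] args) throws Exception {
                      DefaultCacheManager cacheManager = new DefaultCacheManager();
                      cacheManager.defineConfiguration("leak-test", new ConfigurationBuilder()
                              .indexing().enable()
                                  .addProperty("default.directory_provider", "filesystem")
                                  .addProperty("default.indexBase", "/tmp/lucene-leak-test")
                              .build());
                      Cache<String, Entry> cache = cacheManager.getCache("leak-test");
                      SearchManager searchManager = Search.getSearchManager(cache);

                      for (int i = 0; i < 10000; i++) {
                          cache.put("key" + i, new Entry("name" + i));

                          // Query the index after every insert and walk the results with the
                          // enhanced-for loop over CacheQuery, i.e. its default iterator.
                          CacheQuery query = searchManager.getQuery(
                                  searchManager.buildQueryBuilderForClass(Entry.class).get()
                                          .all().createQuery(), Entry.class);
                          int hits = 0;
                          for (Object ignored : query) {
                              hits++;
                          }
                          if (i % 100 == 0) {
                              System.out.println(i + " objects inserted, " + hits + " hits");
                          }
                      }
                      cacheManager.stop();
                  }
              }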

              • 4. Re: Indexing Problem with Infinispan 5.2.1 (updated)
                sannegrinovero

                Hi Paul,

                 thanks a lot for the test; this looks quite critical, it seems to be a file handle leak.

                 

                I've opened

                https://issues.jboss.org/browse/ISPN-2874

                • 5. Re: Indexing Problem with Infinispan 5.2.1 (updated)
                  paulpa63

                  Hi Sanne,

                   

                   Thanks for taking a look at this problem.  I hope there is a simple solution... we really want to move to Infinispan 5.2.1, but at the moment this issue will prevent us from upgrading.

                   

                  Paul

                  • 6. Re: Indexing Problem with Infinispan 5.2.1 (updated)
                    sannegrinovero

                    Hi Paul,

                     the issue is resolved; you can either apply the patch to your copy of Infinispan (it's linked in the JIRA issue), wait for the next release, or simply avoid using the "enhanced for" loop.  Please see the description on JIRA: the problem is that 5.2 was defaulting the Iterator to LAZY fetch mode, whereas 5.1.5 used EAGER fetch, and that is the cause of your problem.  Specifying EAGER explicitly will also work around the issue.
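
                     Something like the sketch below should do as a work-around (assuming the 5.2 FetchOptions API; the cacheQuery parameter stands in for your own query):

                     import java.util.Iterator;

                     import org.infinispan.query.CacheQuery;
                     import org.infinispan.query.FetchOptions;

                     public final class EagerIterationWorkaround {

                         // Work-around sketch: iterate with an explicitly EAGER iterator instead of
                         // the enhanced-for loop, which in 5.2.x delegates to the LAZY iterator.
                         static int countMatches(CacheQuery cacheQuery) {
                             Iterator<Object> it = cacheQuery.iterator(
                                     new FetchOptions().fetchMode(FetchOptions.FetchMode.EAGER));
                             int hits = 0;
                             while (it.hasNext()) {
                                 it.next();
                                 hits++;
                             }
                             return hits;
                         }
                     }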

                    • 7. Re: Indexing Problem with Infinispan 5.2.1 (updated)
                      paulpa63

                      Hi Sanne,

                       

                      That was fast work, and an interesting little "gotcha"!  I can confirm that with your patches to CacheQuery and CacheQueryImpl applied to our Infinispan 5.2.1 environment I can now execute our stress test, and the number of open file handles for the process remains stable.

                       

                      Thanks,

                       

                      Paul

                       

                      PS For interest, while attempting to diagnose the problem myself I did think of experimenting with the other CacheQuery iterator which allows the fetch mode to be specified, but I assumed that the default would be EAGER, so in my experiments I only set the value to LAZY :-(