1 2 3 4 5 Previous Next 68 Replies Latest reply on Sep 13, 2010 9:09 AM by clebert.suconic Go to original post Branched to a new discussion.
      • 15. Re: Journaling errors
        clebert.suconic

        >> Are you starting the test with a clean directory?

         

        I could answer that question myself as your test is validating sequences.. so you can only be doing it with a cleaned journal / data directory.

         

         

        I still have the other two questions:

         

        - How Long

        - Are you sure you are using the updated version

        • 16. Re: Journaling errors
          ronnys

          Hi Clebert,

           

          >> Are you starting the test with a clean directory?

           

          I could answer that question myself as your test is validating sequences.. so you can only be doing it with a cleaned journal / data directory.

           

          Yes.

           

          - How Long

           

          How long did it take to fail? It stopped processing messages completely after 22h, but the first errors (the ones I posted Friday afternoon CET) occured with 1 or 2 hours. Or have you meant something else?

          - Are you sure you are using the updated version

           

          Yes. I explicitely built it from your SVN trunk and the errors are different now.

           

          Best regards,

          Ronny

          • 17. Re: Journaling errors
            clebert.suconic

            I can't replicate the ones you posted Friday your afternoon.

             

            As a matter of fact... I just woke up and came to look at the test... everything is fine after 10 Hours run. and that's about the third time I put the test to run.

             

             

            If you could please double check the version you posted (configs.. etc) against your latest post

            • 18. Re: Journaling errors
              ronnys
              I can't replicate the ones you posted Friday your afternoon.

               

              As a matter of fact... I just woke up and came to look at the test... everything is fine after 10 Hours run. and that's about the third time I put the test to run.

               

               

              If you could please double check the version you posted (configs.. etc) against your latest post

               

              The posted data looks fine. I just checked out the latest trunk revision (9426) and will run the test again over night my time.

              • 19. Re: Journaling errors
                clebert.suconic

                One important information for me: What OS version are you using?

                 

                Can you also describe what kind of hardware you're testing on please? Maybe we have a failure that won't happen on a XYZ combination.

                • 20. Re: Journaling errors
                  ronnys

                  Tested the latest trunk revision tonight. The test is still running, but a lot of error messages have been created. Log file attached.

                  One important information for me: What OS version are you using?

                   

                  SUSE Linux Enterprise Server 10 SP2 (x86_64); Kernel 2.6.16.60-0.21-smp #1 SMP Tue May 6 12:41:02 UTC 2008

                  Can you also describe what kind of hardware you're testing on please? Maybe we have a failure that won't happen on a XYZ combination.

                   

                  Dell PowerEdge R200 w/ 1 x Intel Xeon X3360 (Quad core), 8 Gb memory

                  • 21. Re: Journaling errors
                    clebert.suconic

                    Thanks Ronny,

                     

                     

                    I did a lot of tests on trying to replicate it.. and I didn't have much success. I may try a different environment with more hardware power... allthough the one I have now is pretty much similar to yours, only difference is I'm using Fedora. But the kernel version and CPU power is similar to yours.. (that's why I asked what env you had).

                     

                    I also added a stress test that will play all the possible combinations with adds, updates.. tx.. etc and starting compacting, to make sure there isn't any tricky situation I didn't handle. It's passing

                     

                    I also, changed an internal writing operation on compacting, to avoid any possible races on the OS with writing. I don't have any indication the problem is here... but the change won't hurt anyway.

                     

                    I will leave the test running here again.. and I will see how it behaves overnight. If you could do the same... (please make sure you do a ./build.sh clean after the update... just to be safe)

                     

                     

                    Thanks for the help

                     

                     

                    Clebert

                    • 22. Re: Journaling errors
                      ronnys

                      Hi Clebert,

                      I did a lot of tests on trying to replicate it.. and I didn't have much success. I may try a different environment with more hardware power... allthough the one I have now is pretty much similar to yours, only difference is I'm using Fedora. But the kernel version and CPU power is similar to yours.. (that's why I asked what env you had).

                       

                      I also added a stress test that will play all the possible combinations with adds, updates.. tx.. etc and starting compacting, to make sure there isn't any tricky situation I didn't handle. It's passing

                       

                      I also, changed an internal writing operation on compacting, to avoid any possible races on the OS with writing. I don't have any indication the problem is here... but the change won't hurt anyway.

                       

                      I will leave the test running here again.. and I will see how it behaves overnight. If you could do the same... (please make sure you do a ./build.sh clean after the update... just to be safe)

                       

                      Ok, will keep the test running. Btw, the likelihood of this issue to appear might be related to the disk I/O subsystem performance. We ran the test on an SSD with less (but still some) errors. It might help if you can somehow limit the disk I/O bandwidth for the HornetQ process if that's possible.

                       

                      Best regards,

                      Ronny

                      • 23. Re: Journaling errors
                        clebert.suconic

                        a good news is I could replicate the error.. and I found a bug on readJournal that is used on compactor. That could lead to errors on compacting.

                         

                         

                        I just committed this fix and I am putting your test to run again. Lets see.

                        • 24. Re: Journaling errors
                          clebert.suconic

                          I'm 5 hours on this test.. so far so good.

                           

                          I am leaving the test running overnight on three different computers.

                           

                          - My Laptop (limited disk)

                          - I have a workstation for tests at home.. 4G, SCSI disk... 8 cores .. so a medium box.

                          - one instance of my performance lab (a powerful box with RAID disks... and a much faster CPU)

                           

                           

                          Lets see how it goes overnight.

                          • 25. Re: Journaling errors
                            ronnys

                            Thanks a lot!

                             

                            The test with revision 9426 looks good as well: There were some errors logged this night, but no new kinds of errors (I'm attaching the log file anyway). The test is still running after ~ 1.35 billion messages, no messages reported as lost or duplicate yet. I'm going to stop it now for some other tests and will start it again this afternoon with an updated load test tool and the latest HornetQ revision.

                            • 26. Re: Journaling errors
                              clebert.suconic

                              Those are definitely gone:

                               

                               

                              java.nio.BufferUnderflowException
                                      at java.nio.Buffer.nextGetIndex(Buffer.java:474)
                                      at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:205)
                                      at org.hornetq.core.journal.impl.JournalImpl.readJournalFile(JournalImpl.java:501)
                                      at org.hornetq.core.journal.impl.JournalImpl.compact(JournalImpl.java:1536)
                                      at org.hornetq.core.journal.impl.JournalImpl$5.run(JournalImpl.java:2334)
                                      at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
                                      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
                                      at java.lang.Thread.run(Thread.java:619)
                              [Thread-3 (group:HornetQ-server-threads1623247029-1609377764)] 19:35:00,768 WARNING [org.hornetq.core.postoffice.impl.DuplicateIDCacheImpl]  Error on deleting duplicate cache
                              java.lang.IllegalStateException: Cannot find add info 68065214
                                      at org.hornetq.core.journal.impl.JournalImpl.appendDeleteRecord(JournalImpl.java:918)
                                      at org.hornetq.core.persistence.impl.journal.JournalStorageManager.deleteDuplicateID(JournalStorageManager.java:542)
                                      at org.hornetq.core.postoffice.impl.DuplicateIDCacheImpl.addToCacheInMemory(DuplicateIDCacheImpl.java:177)
                                      at org.hornetq.core.postoffice.impl.DuplicateIDCacheImpl.access$000(DuplicateIDCacheImpl.java:42)
                                      at org.hornetq.core.postoffice.impl.DuplicateIDCacheImpl$AddDuplicateIDOperation.process(DuplicateIDCacheImpl.java:218)
                                      at org.hornetq.core.postoffice.impl.DuplicateIDCacheImpl$AddDuplicateIDOperation.afterCommit(DuplicateIDCacheImpl.java:238)
                                      at org.hornetq.core.transaction.impl.TransactionImpl.afterCommit(TransactionImpl.java:436)
                                      at org.hornetq.core.transaction.impl.TransactionImpl$2.done(TransactionImpl.java:275)
                                      at org.hornetq.core.persistence.impl.journal.OperationContextImpl$1.run(OperationContextImpl.java:207)
                                      at org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run(OrderedExecutorFactory.java:96)
                                      at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
                                      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
                                      at java.lang.Thread.run(Thread.java:619)

                               

                               

                              I couldn't yet replicate those ones, but they might be related:

                               

                               

                              [pool-4-thread-1] 19:10:43,030 WARNING [org.hornetq.core.journal.impl.JournalCompactor]  Couldn't find addRecord information for record 64312022 during compacting
                              [pool-4-thread-1] 19:10:43,030 WARNING [org.hornetq.core.journal.impl.JournalCompactor]  Couldn't find addRecord information for record 64312023 during compacting
                              [pool-4-thread-1] 19:10:43,030 WARNING [org.hornetq.core.journal.impl.JournalCompactor]  Couldn't find addRecord information for record 64312024 during compacting
                              [pool-4-thread-1] 19:10:43,031 WARNING [org.hornetq.core.journal.impl.JournalCompactor]  Couldn't find addRecord information for record 64312025 during compacting

                               

                               

                               

                              Also: you will need to update the client libraries on your next tests also.

                              • 27. Re: Journaling errors
                                clebert.suconic

                                I'm still expecting those at your tests:

                                 

                                 

                                [pool-4-thread-1] 19:10:43,030 WARNING [org.hornetq.core.journal.impl.JournalCompactor]  Couldn't find addRecord information for record 64312022 during compacting
                                [pool-4-thread-1] 19:10:43,030 WARNING [org.hornetq.core.journal.impl.JournalCompactor]  Couldn't find addRecord information for record 64312023 during compacting
                                [pool-4-thread-1] 19:10:43,030 WARNING [org.hornetq.core.journal.impl.JournalCompactor]  Couldn't find addRecord information for record 64312024 during compacting

                                 

                                 

                                 

                                As I could replicate it. However it took me 10 hours last time.

                                 

                                 

                                I'm debugging it now.. have added some logging to my copy and lets see if I can get any info. I'm sure I will fix it.. but it could take me a another day or two as this takes time to be replicated. (I feel like a scientist putting his mice to run and take results.. it just takes time).

                                 

                                But I'm still accepting help if you have any other tests that will replicate it easier :-)

                                • 28. Re: Journaling errors
                                  clebert.suconic

                                  BTW: those *seem* harmless so far. But I want to understand it and fix them.

                                  • 29. Re: Journaling errors
                                    ronnys

                                    Hi Clebert,

                                     

                                    just ran my load test against r9453 (I used r9457 before, but this one introduces some
                                    other issues; the client was not always able to connect to the server thru the looked
                                    up connection factory and messages were disappearing). Unfortunately, r9453 has some
                                    issues, too; especially a memory leak. Log file attached.

                                     

                                    r9453 fails for me with an OOM error after ~30min; I ran the test against r9426 without
                                    OOM error to verify that the updated test isn't causing the issue. A leak report is
                                    attached; I can send you the heap dump as well if you've a FTP/SFTP location I can
                                    drop it to.

                                     

                                    Best regards & have a nice weekend,
                                    Ronny