1 2 3 Previous Next 68 Replies Latest reply on Sep 13, 2010 9:09 AM by clebert.suconic Branched to a new discussion.

    Journaling errors

    ronnys

      Hi,

       

      I've encountered a few strange errors in the HornetQ (2.1.1.final) log, such as

       

      [pool-4-thread-1] 15:53:38,513 SEVERE [org.hornetq.core.journal.impl.JournalImpl]  Inconsistency during compacting: Delete record being read on an existent record
      java.lang.IllegalStateException: Inconsistency during compacting: Delete record being read on an existent record
              at org.hornetq.core.journal.impl.JournalCompactor.onReadDeleteRecord(JournalCompactor.java:305)
              at org.hornetq.core.journal.impl.JournalImpl.readJournalFile(JournalImpl.java:629)
              at org.hornetq.core.journal.impl.JournalImpl.compact(JournalImpl.java:1517)
              at org.hornetq.core.journal.impl.JournalImpl$5.run(JournalImpl.java:2315)
              at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
              at java.lang.Thread.run(Thread.java:619)

       

      [Thread-6 (group:HornetQ-server-threads1485988078-970926436)] 15:56:53,925 WARNING [org.hornetq.core.server.impl.QueueImpl]  Unable to remove message id = 2201750 please remove manually
      java.lang.IllegalStateException: Cannot find add info 2201750
          at org.hornetq.core.journal.impl.JournalImpl.appendDeleteRecord(JournalImpl.java:913)
          at org.hornetq.core.persistence.impl.journal.JournalStorageManager.deleteMessage(JournalStorageManager.java:514)
          at org.hornetq.core.server.impl.QueueImpl.postAcknowledge(QueueImpl.java:1415)
          at org.hornetq.core.server.impl.QueueImpl.access$400(QueueImpl.java:65)
          at org.hornetq.core.server.impl.QueueImpl$RefsOperation.afterCommit(QueueImpl.java:1510)
          at org.hornetq.core.transaction.impl.TransactionImpl$2.done(TransactionImpl.java:294)
          at org.hornetq.core.persistence.impl.journal.OperationContextImpl$1.run(OperationContextImpl.java:207)
          at org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run(OrderedExecutorFactory.java:96)
          at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
          at java.lang.Thread.run(Thread.java:619)

       

      [Old I/O server worker (parentId: 2071272094, channelId: 1292454077, null => ttwsrvdev02.dev.ttw/10.115.151.40:61445)] 15:59:59,093 WARNING [org.hornetq.core.server.impl.QueueImpl]  Error on checkDLQ
      java.lang.IllegalStateException: Cannot find add info 3963098
          at org.hornetq.core.journal.impl.JournalImpl.appendUpdateRecord(JournalImpl.java:841)
          at org.hornetq.core.persistence.impl.journal.JournalStorageManager.updateDeliveryCount(JournalStorageManager.java:689)
          at org.hornetq.core.server.impl.QueueImpl.checkDLQ(QueueImpl.java:946)
          at org.hornetq.core.server.impl.QueueImpl$RefsOperation.afterRollback(QueueImpl.java:1471)
          at org.hornetq.core.transaction.impl.TransactionImpl$3.done(TransactionImpl.java:362)
          at org.hornetq.core.persistence.impl.journal.OperationContextImpl.executeOnCompletion(OperationContextImpl.java:158)
          at org.hornetq.core.persistence.impl.journal.JournalStorageManager.afterCompleteOperations(JournalStorageManager.java:393)
          at org.hornetq.core.transaction.impl.TransactionImpl.rollback(TransactionImpl.java:344)
          at org.hornetq.core.server.impl.ServerConsumerImpl.close(ServerConsumerImpl.java:296)
          at org.hornetq.core.server.impl.ServerSessionImpl.closeConsumer(ServerSessionImpl.java:931)
          at org.hornetq.core.protocol.core.ServerSessionPacketHandler.handlePacket(ServerSessionPacketHandler.java:447)
          at org.hornetq.core.protocol.core.impl.ChannelImpl.handlePacket(ChannelImpl.java:471)
          at org.hornetq.core.protocol.core.impl.RemotingConnectionImpl.doBufferReceived(RemotingConnectionImpl.java:451)
          at org.hornetq.core.protocol.core.impl.RemotingConnectionImpl.bufferReceived(RemotingConnectionImpl.java:412)
          at org.hornetq.core.remoting.server.impl.RemotingServiceImpl$DelegatingBufferHandler.bufferReceived(RemotingServiceImpl.java:459)
          at org.hornetq.core.remoting.impl.netty.HornetQChannelHandler.messageReceived(HornetQChannelHandler.java:67)
          at org.jboss.netty.channel.SimpleChannelHandler.handleUpstream(SimpleChannelHandler.java:100)
          at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:545)
          at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:754)
          at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:287)
          at org.hornetq.core.remoting.impl.netty.HornetQFrameDecoder2.decode(HornetQFrameDecoder2.java:169)
          at org.hornetq.core.remoting.impl.netty.HornetQFrameDecoder2.messageReceived(HornetQFrameDecoder2.java:134)
          at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:80)
          at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:545)
          at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:540)
          at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:274)
          at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:261)
          at org.jboss.netty.channel.socket.oio.OioWorker.run(OioWorker.java:90)
          at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
          at org.jboss.netty.util.internal.IoWorkerRunnable.run(IoWorkerRunnable.java:46)
          at org.jboss.netty.util.VirtualExecutorService$ChildExecutorRunnable.run(VirtualExecutorService.java:181)
          at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
          at java.lang.Thread.run(Thread.java:619)
         
      The log file contains hundreds of these errors. HornetQ is running on Linux with enabled AIO; standalone, unclustered
      installation, data directory on local ext3 FS. A separate load test (source attached) can be used to reproduce them

      within a few minutes when running it on a fresh HornetQ installation with empty data directory. Issues appears with

      NIO journal as well. Config files are attached.

       

      In order to reproduce the issue start the load test with the following options

       

      org.jnp.interfaces.NamingContextFactory jnp://yourserver:yourport ConnectionFactory queue/inbound_messages 3600000 100000 25000 5 25 10 2048 5000 1000000

      Keep it running for 5min or so.

       

      Best regards,

      Ronny

        • 1. Re: Journaling errors
          clebert.suconic

          I really appreciate the test.. thanks. I will work on it.

           

          Thanks

          • 2. Re: Journaling errors
            ronnys

            Hi,

             

            one more exception when sending more messages that caused the client to fail:

             

            [Thread-23 (group:HornetQ-server-threads1485988078-970926436)] 17:10:33,874 WARNING [org.hornetq.core.protocol.core.ServerSessionPacketHandler]  Sending unexpected exception to the client
            java.lang.NullPointerException
                    at org.hornetq.core.journal.impl.JournalTransaction.commit(JournalTransaction.java:285)
                    at org.hornetq.core.journal.impl.JournalImpl.appendCommitRecord(JournalImpl.java:1253)
                    at org.hornetq.core.persistence.impl.journal.JournalStorageManager.commit(JournalStorageManager.java:649)
                    at org.hornetq.core.transaction.impl.TransactionImpl.commit(TransactionImpl.java:266)
                    at org.hornetq.core.transaction.impl.TransactionImpl.commit(TransactionImpl.java:221)
                    at org.hornetq.core.server.impl.ServerSessionImpl.commit(ServerSessionImpl.java:542)
                    at org.hornetq.core.protocol.core.ServerSessionPacketHandler.handlePacket(ServerSessionPacketHandler.java:306)
                    at org.hornetq.core.protocol.core.impl.ChannelImpl.handlePacket(ChannelImpl.java:471)
                    at org.hornetq.core.protocol.core.impl.RemotingConnectionImpl.doBufferReceived(RemotingConnectionImpl.java:451)
                    at org.hornetq.core.protocol.core.impl.RemotingConnectionImpl.access$000(RemotingConnectionImpl.java:43)
                    at org.hornetq.core.protocol.core.impl.RemotingConnectionImpl$1.run(RemotingConnectionImpl.java:391)
                    at org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run(OrderedExecutorFactory.java:96)
                    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
                    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
                    at java.lang.Thread.run(Thread.java:619)

             

            Parameters for the load test were

             

            org.jnp.interfaces.NamingContextFactory jnp://yourserver:yourport ConnectionFactory queue/inbound_messages 43200000 2500000 25000 5 25 10 2048 5000 1000000

             

            Best regards,

            Ronny

            • 3. Re: Journaling errors
              ronnys

              Hi Clebert,

               

              thanks a lot for taking a look into these issues.

               

              I've got another exception while running the tests:

               

              [pool-4-thread-1] 09:28:30,650 WARNING [org.hornetq.core.journal.impl.JournalImpl]  NULL
              java.lang.IllegalArgumentException
                      at java.nio.Buffer.position(Buffer.java:218)
                      at org.hornetq.core.journal.impl.JournalImpl.readJournalFile(JournalImpl.java:408)
                      at org.hornetq.core.journal.impl.JournalCleaner.fixDependency(JournalCleaner.java:246)
                      at org.hornetq.core.journal.impl.JournalCleaner.fixDependencies(JournalCleaner.java:188)
                      at org.hornetq.core.journal.impl.JournalImpl.cleanUp(JournalImpl.java:2247)
                      at org.hornetq.core.journal.impl.JournalImpl.access$800(JournalImpl.java:81)
                      at org.hornetq.core.journal.impl.JournalImpl$4.run(JournalImpl.java:2153)
                      at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
                      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
                      at java.lang.Thread.run(Thread.java:619)

               

              Unfortunately, I can't really enforce the errors deterministically by producing exactly the same messages in each test.

               

              Best regards,

              Ronny

              • 4. Re: Journaling errors
                lbershad

                I am seeing a very similar set of errors using 2.1.0.Final (RHEL 5.4, ext3, very vanilla - uses hornetq defaults).  Usually takes 5-10 minutes of serious pounding to occur.

                 

                {noformat}

                Jul 12, 2010 9:57:53 PM org.hornetq.core.logging.impl.JULLogDelegate error
                SEVERE: Inconsistency during compacting: Delete record being read on an existent record
                java.lang.IllegalStateException: Inconsistency during compacting: Delete record being read on an existent record
                        at org.hornetq.core.journal.impl.JournalCompactor.onReadDeleteRecord(JournalCompactor.java:305)
                        at org.hornetq.core.journal.impl.JournalImpl.readJournalFile(JournalImpl.java:629)
                        at org.hornetq.core.journal.impl.JournalImpl.compact(JournalImpl.java:1517)
                        at org.hornetq.core.journal.impl.JournalImpl$5.run(JournalImpl.java:2310)
                        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
                        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
                        at java.lang.Thread.run(Thread.java:619)
                Jul 12, 2010 9:57:53 PM org.hornetq.core.logging.impl.JULLogDelegate warn
                WARNING: Error on deleting duplicate cache
                java.lang.IllegalStateException: Cannot find add info 333437
                        at org.hornetq.core.journal.impl.JournalImpl.appendDeleteRecord(JournalImpl.java:913)
                        at org.hornetq.core.persistence.impl.journal.JournalStorageManager.deleteDuplicateID(JournalStorageManager.java:542)
                        at org.hornetq.core.postoffice.impl.DuplicateIDCacheImpl.addToCacheInMemory(DuplicateIDCacheImpl.java:177)
                        at org.hornetq.core.postoffice.impl.DuplicateIDCacheImpl.access$000(DuplicateIDCacheImpl.java:42)
                        at org.hornetq.core.postoffice.impl.DuplicateIDCacheImpl$AddDuplicateIDOperation.process(DuplicateIDCacheImpl.java:218)
                        at org.hornetq.core.postoffice.impl.DuplicateIDCacheImpl$AddDuplicateIDOperation.afterCommit(DuplicateIDCacheImpl.java:238)
                        at org.hornetq.core.transaction.impl.TransactionImpl$2.done(TransactionImpl.java:294)
                        at org.hornetq.core.persistence.impl.journal.OperationContextImpl$1.run(OperationContextImpl.java:207)
                        at org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run(OrderedExecutorFactory.java:96)
                        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
                        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
                        at java.lang.Thread.run(Thread.java:619)
                Jul 12, 2010 9:57:53 PM org.hornetq.core.logging.impl.JULLogDelegate warn
                WARNING: Unable to remove message id = 341732 please remove manually
                java.lang.IllegalStateException: Cannot find add info 341732
                        at org.hornetq.core.journal.impl.JournalImpl.appendDeleteRecord(JournalImpl.java:913)
                        at org.hornetq.core.persistence.impl.journal.JournalStorageManager.deleteMessage(JournalStorageManager.java:514)
                        at org.hornetq.core.server.impl.QueueImpl.postAcknowledge(QueueImpl.java:1415)
                        at org.hornetq.core.server.impl.QueueImpl.access$400(QueueImpl.java:65)
                        at org.hornetq.core.server.impl.QueueImpl$RefsOperation.afterCommit(QueueImpl.java:1510)
                        at org.hornetq.core.transaction.impl.TransactionImpl$2.done(TransactionImpl.java:294)
                        at org.hornetq.core.persistence.impl.journal.OperationContextImpl$1.run(OperationContextImpl.java:207)
                        at org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run(OrderedExecutorFactory.java:96)
                        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
                        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
                        at java.lang.Thread.run(Thread.java:619)
                Jul 12, 2010 9:57:53 PM org.hornetq.core.logging.impl.JULLogDelegate warn
                WARNING: Error on deleting duplicate cache
                java.lang.IllegalStateException: Cannot find add info 333440
                        at org.hornetq.core.journal.impl.JournalImpl.appendDeleteRecord(JournalImpl.java:913)
                        at org.hornetq.core.persistence.impl.journal.JournalStorageManager.deleteDuplicateID(JournalStorageManager.java:542)
                        at org.hornetq.core.postoffice.impl.DuplicateIDCacheImpl.addToCacheInMemory(DuplicateIDCacheImpl.java:177)
                        at org.hornetq.core.postoffice.impl.DuplicateIDCacheImpl.access$000(DuplicateIDCacheImpl.java:42)
                        at org.hornetq.core.postoffice.impl.DuplicateIDCacheImpl$AddDuplicateIDOperation.process(DuplicateIDCacheImpl.java:218)
                        at org.hornetq.core.postoffice.impl.DuplicateIDCacheImpl$AddDuplicateIDOperation.afterCommit(DuplicateIDCacheImpl.java:238)
                        at org.hornetq.core.transaction.impl.TransactionImpl$2.done(TransactionImpl.java:294)
                        at org.hornetq.core.persistence.impl.journal.OperationContextImpl$1.run(OperationContextImpl.java:207)
                        at org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run(OrderedExecutorFactory.java:96)
                        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
                        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
                        at java.lang.Thread.run(Thread.java:619)

                 

                ....... Last Error repeated hundreds of times ..............

                {noformat}

                • 5. Re: Journaling errors
                  ronnys

                  Hi again,

                   

                  just some impact information. I did some more load tests and compared the message sequence of sent and received messages.

                   

                  • The reported NullPointerException leads to an unresponsive HornetQ server in most cases, MessageConsumer#receive() does not return anymore (at least not within 5min, the server appeared to be idle in the meantime) although there are still messages in the queue. Reconnecting to the HornetQ server helps sometimes. The NPE appears to happen only when committing received messages; havn't seen it for sent messages yet. Although the commit() call fails, a part(!) of or all received messages have been committed actually.
                  • The IllegalStateException thrown from org.hornetq.core.journal.impl.JournalImpl.appendUpdateRecord(JournalImpl.java:841) leads into lost messages.
                  • It takes pretty long for the server to come up after these errors and it consumes a lot of CPU time afterwards even if idle.

                   

                  I havn't seen any impact on the messages caused by the other errors yet, but their presence concerns me anyway

                   

                  One more NPE I just found while investigating the other NPEs is (unfortunately without stacktrace):

                   

                  [Thread-13 (group:HornetQ-server-threads1485988078-970926436)] 15:14:07,136 WARNING [org.hornetq.core.server.impl.QueueImpl]  Unable to remove message id = 2349928 please remove manually
                  java.lang.NullPointerException

                   

                  Best regards,

                  Ronny

                  • 6. Re: Journaling errors
                    clebert.suconic

                    It seems that all the other errors are caused by the IllegalStateException as compacting is interrupted when it happens.

                     

                    The issues I have found so far:

                     

                    I - You are paging with TX, and on the case large transactions, and the page-sizes are small (1 MiB). As a result a TX will have message along different page files, and there's a bug on the paging code. PagingTransaction are leaking, what's making the Journal to compact more often than needed.

                     

                    II - I could add System.outs to the server, and I could see that the bug will happen when an append is done just after the system entered into compact mode. The way routing works is there's an append and an update. I believe the issue will happen when compacting happened between the append and the update. (i'm writing tests for this)

                     

                    I will not fix I until I have fixed II. And when I fix I, I will make sure I have a test to validate I.

                     

                     

                    As a note: this was a bit of a Heinsenbug for me, and that's why I was taking a little longer on fixing it.

                    (http://en.wikipedia.org/wiki/Heisenbug#Heisenbug)

                     

                    Whenever I added System.outs to the server's log, I couldn't replicate any issue due to the syncs on the system.out.

                     

                    I could do some trick and make it fail with logs by doing this:

                     

                    System.setOut(new PrintStream(new BufferedOutputStream(new FileOutputStream("someFile.log")));

                     

                    ^^ note the BufferedOutputStream is important here ^^

                     

                     

                    And adding a bunch of System.out on my local copy.

                     

                     

                     

                    ^^ That will be a trick I will use again later when I have Heinsenbugs like this. It won't always help.. but I will at least try cases like this ^^

                    • 7. Re: Journaling errors
                      ronnys

                      Thanks a lot for the update! I'll increase the page size for now, in order to lower the number of used page files per transaction.

                      • 8. Re: Journaling errors
                        clebert.suconic

                        Almost fixed now: https://jira.jboss.org/browse/HORNETQ-440

                         

                         

                        I can replicate it in a simple unit test without needing a long running test. It's much easier to fix it now

                        • 9. Re: Journaling errors
                          clebert.suconic

                          I fixed it on trunk. I could run your test for 4 hours straight without any failures.. and my tests are also passing.

                           

                           

                          If you could also run the tests yourself on trunk.

                          • 10. Re: Journaling errors
                            ronnys

                            Thanks a lot for the quick resolution Clebert! I've just started the test and will keep it running over the weekend.

                            • 11. Re: Journaling errors
                              ronnys

                              Hi Clebert,

                               

                              looks much better now. The first run failed for some reason when the HornetQ server suddenly stopped responding to client requests. Looks like the JVM locked up. Could not reproduce this yet. The second test is running now.

                               

                              During the second test I received 4999 of the following messages:

                              [pool-4-thread-1] 14:52:53,910 WARNING [org.hornetq.core.journal.impl.JournalCompactor]  Couldn't find addRecord information for record 94806219 during compacting

                               

                              within one second. It looks like no messages were lost. No errors were propagated to the client. The test is still running with a commit batch size of 5000, which nearly matches the number of error messages if that somehow helps.

                               

                              Have a nice weekend & Best regards,
                              Ronny

                              • 12. Re: Journaling errors
                                clebert.suconic

                                I will try to replicate it... are you still using the same parameters and config to your test? If not how you're running it now.

                                 

                                 

                                BTW: There are still PageTransactions leaking.. as I don't want to fix it until we fix compacting. (one for every TX you commit and used more than one page during commit).

                                • 13. Re: Journaling errors
                                  ronnys

                                  Hi Clebert,

                                   

                                  I will try to replicate it... are you still using the same parameters and config to your test? If not how you're running it now.

                                   

                                  Thanks a lot. Sorry, forgot to add this. I enhanced the load test tool a bit over time, new version is attached. The parameters for this test were

                                   

                                  org.jnp.interfaces.NamingContextFactory jnp://server:port ConnectionFactory queuename 22822861464256533 241200000 2500000 25000 5 25 10 2048 5000 1000000 true

                                  BTW: There are still PageTransactions leaking.. as I don't want to fix it until we fix compacting. (one for every TX you commit and used more than one page during commit).

                                   

                                  Ok. There were still a lot of errors thrown by the test I ran over the weekend and the test actually failed after 22h. HornetQ log is attached; don't know if the errors are being caused by this known leak.

                                   

                                  Best regards,

                                  Ronny

                                   

                                  • 14. Re: Journaling errors
                                    clebert.suconic

                                    I can't make it to fail...

                                     

                                     

                                    how long it takes for you?

                                     

                                    Are you starting the test with a clean directory?

                                     

                                     

                                    Are you sure you updated trunk? (you should make try a build.sh clean and ./build.sh distro)

                                    1 2 3 Previous Next