1 2 3 4 5 Previous Next 68 Replies Latest reply on Sep 13, 2010 9:09 AM by clebert.suconic Go to original post Branched to a new discussion.
      • 45. Re: Journaling errors
        clebert.suconic

        I fixed an error that would cause duplicates during compacting (When doing the investigations about this thread), but that was fixed after r9522.

         

        I will run your tests also. Give me an hour or so and I will let you know.

         

        "I'm planning to do some tests with kill SIGINT/SIGKILL as well; looks like HornetQ detects SIGINT and shuts down; don't know what happens to the journal etc. on SIGKILL. Is it built to automatically recover from hard shutdowns?"

         

        Yes.. it's built to recover from hard shutdowns.

        • 46. Re: Journaling errors
          clebert.suconic

          Your test is currently invalid.

           

          You would need to play with XA and things like that to make it valid.

           

          When you call session.commit();

           

          You could have a failure at the time where the server was supposed to answer the commit.

           

          You would need to use XA and play with recover to make sure the commit was accepted by the server. (The same scenario woud apply to kill)

           

           

          BTW: I'm not suggestion you to use XAResource directly. That would be job for a TransactionManager. But I will see if i can change your test.

           

           

          The bug I fixed wouldn't show up at your test. (It would be a rare occasion where the ACK was rolled back, compacting cleared this record and you restarted the server right after compacting was finished).

          • 47. Re: Journaling errors
            clebert.suconic

            That would be an interesting test to do, but if you want to validate message receiving over shutdowns / kills, the test has even more basic stuff broken.

             

            For example, when you reconnect, you're recreating the consumer, hence the messages will be received again, and you're not clearing the batch messages on consumer.

             

            I have tried to change a couple of things on the test.. but I couldn't do finish today.

             

            I will attach it here to give you an idea of what I wanted to change.

            • 48. Re: Journaling errors
              ronnys

              Hi Clebert,

               

              thanks a lot for finding this out and for the changes.

               

              I wasn't aware that I'd need to use XA transactions for this. This is somehow out of scope for the planned use case. I'll create another test program that will handle errors and duplicates on the client side; this might be sufficient.

               

              Best regards,

              Ronny

              • 49. Re: Journaling errors
                ronnys

                Hi Clebert,

                 

                just ran some tests with the new client. It is as said handling errors while sending/receiving/committing batches by its own by simply retrying and resending the failed batch on the sender side and handling duplicates on the receiving side. Nothing special. Hope you agree this is ok. Received duplicates are ok (and expected as generated by the client) and could be filtered, lost messages are not ok. Code is attached, tool can be started like this (with the same JNDI properties files the older tool is using):

                 

                MultiThreadedLoadTest sender.properties receiver.properties stopfile 1750000 2250000 10 2048 5000 true

                 

                When running it against HornetQ 2.1.2.final started/stopped constantly with the previously posted script (with just a minor modification of keeping the server up for 60s instead of 15s to give it more time to start up in case it is doing something else in the background) I observed the following:

                 

                The first few restarts usually happen without any issues. After a few restarts I get different journaling log entries in the HornetQ logs (INFO to SEVERE; some NPEs): attached. Don't know if they're expected despite of the clean shutdown. After a while (1-2 hours) and a bunch of restarts later the client starts to report missing messages ... It looks like the issue is building up over time; could it be that the server takes longer and longer to perform some journaling cleanups after it signalled its finished startup and gets finally interrupted by the stop request at an inconvenient time? I saw even cases, where the client did not receive any messages while the server as up and running; the next server restart fixed this.

                 

                Hope this helps,

                Best regards,

                Ronny

                • 50. Re: Journaling errors
                  clebert.suconic

                  this seems a different issue than what we have been working before.

                   

                  I couldn't replicate the issue with your test, as my disk was set to sync, what limited rate from the test...

                   

                  I will work on it next week.

                  • 51. Re: Journaling errors
                    ronnys

                    Hi Clebert,

                     

                    ok, thanks a lot. Btw, I see the same NPEs reported yesterday

                     

                    [JournalImpl::CompactorExecutor] 15:05:45,964 WARNING [org.hornetq.core.journal.impl.JournalCompactor]  Error replaying pending commands after compacting
                    java.lang.NullPointerException
                        at org.hornetq.core.journal.impl.JournalCompactor$DeleteCompactCommand.execute(JournalCompactor.java:476)
                        at org.hornetq.core.journal.impl.JournalCompactor.replayPendingCommands(JournalCompactor.java:234)
                        at org.hornetq.core.journal.impl.JournalImpl.compact(JournalImpl.java:1630)
                        at org.hornetq.core.journal.impl.JournalImpl$6.run(JournalImpl.java:2396)
                        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
                        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
                        at java.lang.Thread.run(Thread.java:619)

                     

                    when running the multithreaded client against the divert. In order to run the test you'd need to run

                     

                    MultiThreadedLoadTest sender.properties receiver1.properties,receiver2.properties stopfile 1750000 2250000 10 2048 5000 true

                     

                    where the sender and receiver properties files are the ones attached here:

                     

                    http://community.jboss.org/message/555362#555362

                     

                    However, as far as I can see, there is no immediate impact on the client. It does not report any lost or duplicate messages.

                     

                    Best regards,

                    Ronny

                    • 52. Re: Journaling errors
                      timfox

                      Clebert Suconic wrote:

                       


                      I couldn't replicate the issue with your test, as my disk was set to sync,


                      If the user is running with disk write cache enabled, then testing for data loss is pointless.

                       

                      Data may be lost in event of crash, even if HornetQ is doing everything properly! This is discussed in the user manual, persistence chapter.

                      • 53. Re: Journaling errors
                        timfox

                        I also noticed the user's test program is not doing any duplicate detection on either send or consume transactions.

                         

                        If the server fails after a commit is sent from client but before it has finished processing on the server, the tx won't be committed and the client will have no way of determining whether the tx is committed or not. This is a standard (non HornetQ specific) issue with any RPC protocol.

                         

                        HornetQ implements automatic duplicate detection for sends (see user manual). For consumes you will have to code it yourself.

                        • 54. Re: Journaling errors
                          ronnys

                          Tim Fox wrote:

                           

                          If the user is running with disk write cache enabled, then testing for data loss is pointless.

                           

                          Data may be lost in event of crash, even if HornetQ is doing everything properly! This is discussed in the user manual, persistence chapter.

                           

                          I'm starting/stopping HornetQ using its run.sh/stop.sh scripts. HornetQ is neither being killed nor do I restart the machine, it is (hopefully) a clean shutdown from HornetQ's point of view. No data should be lost under these circumstances.

                          Tim Fox wrote:

                           

                          I also noticed the user's test program is not doing any duplicate detection on either send or consume transactions.

                           

                          If the server fails after a commit is sent from client but before it has finished processing on the server, the tx won't be committed and the client will have no way of determining whether the tx is committed or not. This is a standard (non HornetQ specific) issue with any RPC protocol.

                           

                          HornetQ implements automatic duplicate detection for sends (see user manual). For consumes you will have to code it yourself.

                           

                          The second client (attached to this posting: http://community.jboss.org/message/557900#557900) performs custom duplication checking to handle the failure-on-commit scenario you mentioned.

                          • 55. Re: Journaling errors
                            timfox

                            Ronny Schuetz wrote:

                             

                            I'm starting/stopping HornetQ using its run.sh/stop.sh scripts. HornetQ is neither being killed nor do I restart the machine, it is (hopefully) a clean shutdown from HornetQ's point of view. No data should be lost under these circumstances.

                            That's right. I assumed you were killing the server.

                            • 56. Re: Journaling errors
                              ronnys

                              Hi again,

                              Ronny Schuetz wrote:

                              when running the multithreaded client against the divert. In order to run the test you'd need to run

                               

                              MultiThreadedLoadTest sender.properties receiver1.properties,receiver2.properties stopfile 1750000 2250000 10 2048 5000 true

                               

                              where the sender and receiver properties files are the ones attached here:

                               

                              http://community.jboss.org/message/555362#555362

                               

                              However, as far as I can see, there is no immediate impact on the client. It does not report any lost or duplicate messages.

                               

                              The same test dumps warnings like that (no stacktrace) if HornetQ is being configured to use 100k journaling files (I actually wanted to test that ...)

                               

                              [JournalImpl::CompactorExecutor] 11:44:44,041 WARNING [org.hornetq.core.journal.impl.JournalImpl]  Couldn't find tx=9317246 to merge after compacting

                               

                              Best regards,

                              Ronny

                               

                              Message was edited by: Ronny Schuetz; note about the 100k journal files added ... forgot that ...

                              • 57. Re: Journaling errors
                                clebert.suconic

                                I have debugged this warning and it is harmless. It only happened with the small file size, and it happened because a file was cleared up, but the transaction that's been complained would have been already cleared.

                                 

                                but I will eliminate the condition that let this message happen on the next version.

                                • 58. Re: Journaling errors
                                  ronnys

                                  Hi Clebert,

                                   

                                  great! One more I got with the same test (i.e. without server restarts; just 100k journaling files and the multithreaded loadtest testing diverts) running overnight:

                                   

                                  [Old I/O server worker (parentId: 1299509292, channelId: 653327464, null => ttwsrvdev02.dev.ttw/10.115.151.40:61445)] 22:42:28,177 SEVERE [org.hornetq.core.protocol.core.ServerSessionPacketHandler]  Caught unexpected exception
                                  java.lang.IllegalStateException: 1187155943 Could not find reference on consumerID=0, messageId = 42297410 queue = jms.queue.msgduplicate_divert_outbound_2 closed = true
                                          at org.hornetq.core.server.impl.ServerConsumerImpl.acknowledge(ServerConsumerImpl.java:512)
                                          at org.hornetq.core.server.impl.ServerSessionImpl.acknowledge(ServerSessionImpl.java:513)
                                          at org.hornetq.core.protocol.core.ServerSessionPacketHandler.handlePacket(ServerSessionPacketHandler.java:290)
                                          at org.hornetq.core.protocol.core.impl.ChannelImpl.handlePacket(ChannelImpl.java:471)
                                          at org.hornetq.core.protocol.core.impl.RemotingConnectionImpl.doBufferReceived(RemotingConnectionImpl.java:451)
                                          at org.hornetq.core.protocol.core.impl.RemotingConnectionImpl.bufferReceived(RemotingConnectionImpl.java:412)
                                          at org.hornetq.core.remoting.server.impl.RemotingServiceImpl$DelegatingBufferHandler.bufferReceived(RemotingServiceImpl.java:459)
                                          at org.hornetq.core.remoting.impl.netty.HornetQChannelHandler.messageReceived(HornetQChannelHandler.java:67)
                                          at org.jboss.netty.channel.SimpleChannelHandler.handleUpstream(SimpleChannelHandler.java:100)
                                          at org.jboss.netty.channel.StaticChannelPipeline.sendUpstream(StaticChannelPipeline.java:362)
                                          at org.jboss.netty.channel.StaticChannelPipeline$StaticChannelHandlerContext.sendUpstream(StaticChannelPipeline.java:514)
                                          at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:287)
                                          at org.hornetq.core.remoting.impl.netty.HornetQFrameDecoder2.decode(HornetQFrameDecoder2.java:169)
                                          at org.hornetq.core.remoting.impl.netty.HornetQFrameDecoder2.messageReceived(HornetQFrameDecoder2.java:134)
                                          at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:80)
                                          at org.jboss.netty.channel.StaticChannelPipeline.sendUpstream(StaticChannelPipeline.java:362)
                                          at org.jboss.netty.channel.StaticChannelPipeline.sendUpstream(StaticChannelPipeline.java:357)
                                          at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:274)
                                          at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:261)
                                          at org.jboss.netty.channel.socket.oio.OioWorker.run(OioWorker.java:90)
                                          at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
                                          at org.jboss.netty.util.internal.IoWorkerRunnable.run(IoWorkerRunnable.java:46)
                                          at org.jboss.netty.util.VirtualExecutorService$ChildExecutorRunnable.run(VirtualExecutorService.java:181)
                                          at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
                                          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
                                          at java.lang.Thread.run(Thread.java:619)
                                  [Old I/O server worker (parentId: 1299509292, channelId: 653327464, null => ttwsrvdev02.dev.ttw/10.115.151.40:61445)] 22:42:28,194 SEVERE [org.hornetq.core.server.impl.ServerSessionImpl]  There is no consumer with id 0
                                  [hornetq-failure-check-thread] 22:43:30,724 WARNING [org.hornetq.core.protocol.core.impl.RemotingConnectionImpl]  Connection failure has been detected: Did not receive ping from /10.115.151.40:64116. It is likely the client has exited or crashed without closing its connection, or the network between the server and client has failed. The connection will now be closed. [code=3]

                                   

                                  The client recorded connectivity issues (... org.hornetq.core.logging.impl.JULLogDelegate ... WARNING: Connection failure has been detected: Did not receive data from server for org.hornetq.core.remoting.impl.netty.NettyConnection@70f12110[...] [code=3]) 15s before the exception above (client and server are running on the same box), the client thread connected to jms.queue.msgduplicate_divert_outbound_2 received duplicates at the time of the exception and failed to commit received messages a few milliseconds later. But besides of this, the whole client obviously recovered afterwards and is still running and working.

                                   

                                  Best regards,

                                  Ronny

                                  • 59. Re: Journaling errors
                                    clebert.suconic

                                    Ronny, I'm working full time on this as you know..

                                     

                                    And I'm having a hard time to replicate the exact scenario you're having.

                                     

                                     

                                    I know what could cause the issue, so I would be able to replicate it.. but since you already did it.. what property files were you using? still the ones from the divert case, and are you using divert configuration on the server to make this happen?

                                     

                                     

                                    I ran your test already for a long period without getting issues.