17 Replies. Latest reply on May 12, 2011 8:18 AM by clebert.suconic

    commit timeout but data lost

    hughbragg

      When committing a large transaction while the system was under load, I received a timeout:

       

      javax.jms.JMSException: Timed out waiting for response when sending packet 43

              at org.hornetq.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:277)

              at org.hornetq.core.client.impl.ClientSessionImpl.commit(ClientSessionImpl.java:514)

              at org.hornetq.core.client.impl.DelegatingSession.commit(DelegatingSession.java:156)

              at org.hornetq.jms.client.HornetQSession.commit(HornetQSession.java:229)

              at com.agilityapplications.resources.jms.JMSQ.commit(JMSQ.java:293)

              at com.agilityapplications.adapt.simpsons.Homer.acknowledgeBatch(Homer.java:86)

              at com.agilityapplications.adapt.simpsons.Homer.myTask(Homer.java:52)

              at com.agilityapplications.adapt.simpsons.Simpson.runMe(Simpson.java:53)

              at com.agilityapplications.resources.runtime.AgilityThread.run(AgilityThread.java:52)

      Caused by: HornetQException[errorCode=3 message=Timed out waiting for response when sending packet 43]

              ... 9 more

       

       

      I thought this wouldn't be a problem, as the server should send these messages again.

      I was surprised when the messages were not sent again and had been removed from the queue.

       

      I was under the impression that the commit wasn't permanent until the client had completed the handshake with an ACK once the server acknowledged the commit batch.

       

      Now these records are lost. They cannot be recovered unless the client still has them, and they cannot be resequenced unless the client catches the timeout and distinguishes this JMSException from other JMSExceptions.
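One way to tell this exception apart from others is to walk the cause chain and look for the timeout text shown in the stack trace above. This is only a minimal sketch using plain `Throwable`: the class and method names are made up for illustration, and matching on message text is an assumption that may not hold across HornetQ versions.

```java
import java.util.Optional;

/** Walks an exception's cause chain looking for the call-timeout message. */
final class CommitTimeoutCheck {

    // Assumption: this text is copied from the stack trace in this post.
    // Matching on message text is fragile and may differ between versions.
    static final String TIMEOUT_TEXT = "Timed out waiting for response";

    /** Returns the first throwable in the chain whose message mentions the timeout. */
    static Optional<Throwable> findTimeout(Throwable t) {
        // Bound the walk so a cyclic cause chain cannot loop forever.
        for (int depth = 0; t != null && depth < 32; depth++, t = t.getCause()) {
            String msg = t.getMessage();
            if (msg != null && msg.contains(TIMEOUT_TEXT)) {
                return Optional.of(t);
            }
        }
        return Optional.empty();
    }

    /** True when the commit outcome is unknown rather than definitely failed. */
    static boolean isOutcomeUnknown(Throwable t) {
        return findTimeout(t).isPresent();
    }
}
```

A client could call `CommitTimeoutCheck.isOutcomeUnknown(e)` in its `catch` block and treat a `true` result as "the batch may or may not have been committed" instead of assuming a rollback.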

       

      Could someone please advise on HornetQ's implementation of the JMS standard concerning commit timeouts? My reading was that the standard isn't explicit on this point, though I would have thought this to be crucial.

       

      Specifically, what can be done to prevent this happening in the future?

       

      Is there a way to guarantee delivery regardless of network problems, assuming the network works correctly most of the time?

       

      Cheers

        • 1. commit timeout but data lost
          ataylor

          > This wouldn't be a problem I thought as it should send these messages again.

          > I was surprised when the messages were not sent again and had been removed from the queue.

          Why would you think that they should be sent again? There's nothing in the JMS spec to specify what should happen on connection failure. A transaction guarantees you all or nothing, so if you commit a tx you get ACID semantics. If there's an underlying connection problem then the server rolls back the tx, so any sent messages are disregarded and any consumed messages are placed back on the queue.

           

          > I was under the impression that the commit wasn't permanent until the client had completed the handshake with an ACK once the server acknowledged the commit batch.

          No, it's possible for the commit to succeed and a failure to occur before the commit response has returned to the client. This is the same with XA as well, although with XA there is more info on committed txs.

           

          > Now these records are lost and cannot be recovered unless the client still has them and cannot be resequenced unless the client catches the timeout and distinguishes this JMSException from other JMSExceptions.

          They're not lost, they have been rolled back.

           

          > Could someone please advise on HornetQ implementation of the JMS standard concerning commit timeouts. My reading was that the standard isn't explicit on this point, though I would have thought this to be crucial.

          There is no timeout on the commit itself (on the server, that is); we are the only resource and the commit just writes to our journal. If you're talking about the timeout on the client commit call, this is slightly different: the call itself can time out (although this is configurable), but this has nothing to do with whether the actual commit on the server has taken place.

           

          > Specifically, what can be done to prevent this happening in the future?

          > Is there a way to guarantee delivery regardless of network problems assuming the network works correctly most of the time?

          Well, I'm not 100% sure what is happening in your case. We would need more info (is the server becoming overloaded, has there been a network issue, etc.), i.e. server and client logs. However, you can either configure reconnect or maybe use XA, as XA txs are not bound to the client's lifespan.

          • 2. commit timeout but data lost
            ataylor

            By the way, there's a JIRA for dealing with large transactions; there's an issue where the server could run out of memory.

            • 3. Re: commit timeout but data lost
              clebert.suconic

              > Could someone please advise on HornetQ implementation of the JMS standard concerning commit timeouts. My reading was that the standard isn't explicit on this point, though I would have thought this to be crucial.

               

              > Specifically, what can be done to prevent this happening in the future?

               

              You should probably be using XA transactions.

               

              If you are using the latest version, the XA protocol will send a retry, and you will be able to either complete all the branches or rollback all the branches.

               

              Basically, there's no way to achieve a 100% guarantee of commit without XA.

              • 4. Re: commit timeout but data lost
                clebert.suconic

                What I meant by that was:

                 

                 

                You sent the commit to the server containing ACKs.

                 

                You could have an edge case where the communication failed when the packet response was being sent to the client, and the commit was already performed.

                 

                 

                When you restart the server, the commit was already done with these ACKs, even though you rolled back your other branch (it could be a DB, etc.).

                • 5. Re: commit timeout but data lost
                  clebert.suconic

                  Please ignore my messages. I thought you were doing ACKs, and the ACKs were being committed.

                   

                   

                  In your case the commit wasn't accepted. As Andy said, there was no message loss; your commit wasn't accepted, which is expected.

                  • 6. Re: commit timeout but data lost
                    hughbragg

                    Thanks for your responses. I checked my code again and found that I simply tried to restart the connection after the JMSException was thrown.

                    I changed it to try to roll back first, but that had no effect.

                     

                    It's true, I should be using XA transactions, but the reduction in throughput would likely hurt.

                     

                    Anyway, it seems to be fairly easily reproducible. Just pump 100,000 records onto a queue as fast as possible, one at a time, while consuming the queue in batches at the same time, and the client usually throws this exception at some point. Most of the time the records are gone, as the server has already committed.

                     

                    I'll do some more testing, but we don't normally see those kinds of loads, just during my load testing. I'll see if the timeout settings can make any difference.

                    • 7. Re: commit timeout but data lost
                      hughbragg

                      It doesn't seem right that I need to use XA transactions. A commit is a commit and should always work fully or not work. XA transactions are for multiple systems which this is not. I just do a single commit and the server commits while the client fails. This is a bug for sure.

                       

                      I forgot to mention: I'm retrieving 500 messages per transaction commit.

                       

                      It seems that retrieving fewer messages and increasing the timeout does go some way to alleviating the problem.

                       

                      By the way, the server keeps logging messages like this:

                      [hornetq-failure-check-thread] 13:09:11,189 WARNING [org.hornetq.core.protocol.core.impl.RemotingConnectionImpl]  Connection failure has been detected: Did not receive ping from /192.168.196.121:49686. It is likely the client has exited or crashed without closing its connection, or the network between the server and client has failed. The connection will now be closed. [code=3]

                      [hornetq-failure-check-thread] 13:09:11,698 WARNING [org.hornetq.core.server.impl.ServerSessionImpl]  Client connection failed, clearing up resources for session f0885f06-79e7-11e0-aa1c-001fc6ffeb61

                      [hornetq-failure-check-thread] 13:09:11,773 WARNING [org.hornetq.core.server.impl.ServerSessionImpl]  Cleared up resources for session f0885f06-79e7-11e0-aa1c-001fc6ffeb61

                      [hornetq-failure-check-thread] 13:09:11,774 WARNING [org.hornetq.core.protocol.core.ServerSessionPacketHandler]  Client connection failed, clearing up resources for session f0885f06-79e7-11e0-aa1c-001fc6ffeb61

                      [hornetq-failure-check-thread] 13:09:11,847 WARNING [org.hornetq.core.protocol.core.ServerSessionPacketHandler]  Cleared up resources for session f0885f06-79e7-11e0-aa1c-001fc6ffeb61

                      [hornetq-failure-check-thread] 13:09:12,461 WARNING [org.hornetq.core.protocol.core.impl.RemotingConnectionImpl]  Connection failure has been detected: Did not receive ping from /192.168.196.121:49687. It is likely the client has exited or crashed without closing its connection, or the network between the server and client has failed. The connection will now be closed. [code=3]

                      [hornetq-failure-check-thread] 13:09:12,462 WARNING [org.hornetq.core.server.impl.ServerSessionImpl]  Client connection failed, clearing up resources for session f2b06f29-79e7-11e0-aa1c-001fc6ffeb61

                      [hornetq-failure-check-thread] 13:09:12,467 WARNING [org.hornetq.core.server.impl.ServerSessionImpl]  Cleared up resources for session f2b06f29-79e7-11e0-aa1c-001fc6ffeb61

                      [hornetq-failure-check-thread] 13:09:12,468 WARNING [org.hornetq.core.protocol.core.ServerSessionPacketHandler]  Client connection failed, clearing up resources for session f2b06f29-79e7-11e0-aa1c-001fc6ffeb61

                      [hornetq-failure-check-thread] 13:09:12,715 WARNING [org.hornetq.core.protocol.core.ServerSessionPacketHandler]  Cleared up resources for session f2b06f29-79e7-11e0-aa1c-001fc6ffeb61

                      [hornetq-failure-check-thread] 13:09:12,787 WARNING [org.hornetq.core.protocol.core.impl.RemotingConnectionImpl]  Connection failure has been detected: Did not receive ping from /192.168.196.121:49857. It is likely the client has exited or crashed without closing its connection, or the network between the server and client has failed. The connection will now be closed. [code=3]

                      [hornetq-failure-check-thread] 13:09:12,788 WARNING [org.hornetq.core.protocol.core.ServerSessionPacketHandler]  Client connection failed, clearing up resources for session a6d86df0-79e8-11e0-aa1c-001fc6ffeb61

                      [hornetq-failure-check-thread] 13:09:12,834 WARNING [org.hornetq.core.protocol.core.ServerSessionPacketHandler]  Cleared up resources for session a6d86df0-79e8-11e0-aa1c-001fc6ffeb61

                      [hornetq-failure-check-thread] 13:09:12,835 WARNING [org.hornetq.core.protocol.core.ServerSessionPacketHandler]  Client connection failed, clearing up resources for session a6e0ab51-79e8-11e0-aa1c-001fc6ffeb61

                      [hornetq-failure-check-thread] 13:09:12,837 WARNING [org.hornetq.core.protocol.core.ServerSessionPacketHandler]  Cleared up resources for session a6e0ab51-79e8-11e0-aa1c-001fc6ffeb61

                      [hornetq-failure-check-thread] 13:09:12,844 WARNING [org.hornetq.core.protocol.core.impl.RemotingConnectionImpl]  Connection failure has been detected: Did not receive ping from /192.168.196.121:49683. It is likely the client has exited or crashed without closing its connection, or the network between the server and client has failed. The connection will now be closed. [code=3]

                      [hornetq-failure-check-thread] 13:09:12,845 WARNING [org.hornetq.core.server.impl.ServerSessionImpl]  Client connection failed, clearing up resources for session eeede8e3-79e7-11e0-aa1c-001fc6ffeb61

                      [hornetq-failure-check-thread] 13:09:12,846 WARNING [org.hornetq.core.server.impl.ServerSessionImpl]  Cleared up resources for session eeede8e3-79e7-11e0-aa1c-001fc6ffeb61

                      [hornetq-failure-check-thread] 13:09:12,846 WARNING [org.hornetq.core.protocol.core.ServerSessionPacketHandler]  Client connection failed, clearing up resources for session eeede8e3-79e7-11e0-aa1c-001fc6ffeb61

                      [hornetq-failure-check-thread] 13:09:12,849 WARNING [org.hornetq.core.protocol.core.ServerSessionPacketHandler]  Cleared up resources for session eeede8e3-79e7-11e0-aa1c-001fc6ffeb61

                      [hornetq-failure-check-thread] 13:10:12,900 WARNING [org.hornetq.core.protocol.core.impl.RemotingConnectionImpl]  Connection failure has been detected: Did not receive ping from /192.168.196.121:49907. It is likely the client has exited or crashed without closing its connection, or the network between the server and client has failed. The connection will now be closed. [code=3]

                      [hornetq-failure-check-thread] 13:10:14,903 WARNING [org.hornetq.core.protocol.core.impl.RemotingConnectionImpl]  Connection failure has been detected: Did not receive ping from /192.168.196.121:49909. It is likely the client has exited or crashed without closing its connection, or the network between the server and client has failed. The connection will now be closed. [code=3]

                      [hornetq-failure-check-thread] 13:10:14,937 WARNING [org.hornetq.core.protocol.core.impl.RemotingConnectionImpl]  Connection failure has been detected: Did not receive ping from /192.168.196.121:49908. It is likely the client has exited or crashed without closing its connection, or the network between the server and client has failed. The connection will now be closed. [code=3]

                       

                      So the server is obviously very confused even though it is causing the problem and the client does a full disconnect/reconnect.

                      • 8. Re: commit timeout but data lost
                        ataylor
                        > This is a bug for sure

                        Don't be so certain; we don't have enough info to say whether it is or not. However, from what you have told us, one of the following may be happening.

                         

                        1. The commit takes longer to complete than the call timeout. This is not a bug: even though the call times out, the commit still occurs. Solution: increase the call timeout.

                         

                        2. The server runs out of memory because of the size of the transaction, this is covered in https://issues.jboss.org/browse/HORNETQ-103

                         

                        3. The actual commit itself is not working. This is unlikely, since it basically only writes a commit record.
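Case 1 can be illustrated without a broker. The sketch below is a plain-JDK simulation, not HornetQ code: a single-threaded executor stands in for the server, `commitMillis` for the time the commit takes, and `callTimeoutMillis` for the client call timeout. All names are made up for illustration.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

/** Simulates a commit that outlasts the client's call timeout. */
final class SlowCommitDemo {

    /** What the client saw versus what the simulated server did. */
    static final class Outcome {
        final boolean clientTimedOut;
        final boolean serverCommitted;

        Outcome(boolean clientTimedOut, boolean serverCommitted) {
            this.clientTimedOut = clientTimedOut;
            this.serverCommitted = serverCommitted;
        }
    }

    static Outcome run(long commitMillis, long callTimeoutMillis) {
        ExecutorService server = Executors.newSingleThreadExecutor();
        AtomicBoolean committed = new AtomicBoolean(false);
        try {
            // The "server" takes commitMillis to write its commit record.
            Future<?> reply = server.submit(() -> {
                try {
                    Thread.sleep(commitMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                committed.set(true);
            });

            boolean timedOut = false;
            try {
                // The "client" waits only callTimeoutMillis for the response.
                reply.get(callTimeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                timedOut = true; // the client gives up, the server keeps working
            }

            // Wait for the server to finish so the final state can be observed.
            reply.get();
            return new Outcome(timedOut, committed.get());
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            server.shutdown();
        }
    }
}
```

With `SlowCommitDemo.run(300, 50)` the client times out while the simulated commit still completes, which is the situation described in case 1. Raising the second argument above the first removes the timeout.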

                         

                        If you could provide either more information or even better a standalone unit test then maybe we can take a look.

                        • 9. Re: commit timeout but data lost
                          hughbragg

                          That's a good response and helps me to clearly see what my options are.

                           

                          I'm sure that this problem is caused by number 1, as increasing the timeout pretty much fixes it.

                           

                          It does leave me in a quandary, as raising the timeout causes the service to be unresponsive to shutdown signals.

                           

                          The real problem as I see it is that the client needs to be certain whether the commit was successful. Since the client is controlling the workflow, its accurate operation is crucial to the overall process, and its next decision after a commit is dependent on the response it receives from the server. How is it possible to design a reliable workflow mechanism without reliable information?

                           

                          My interpretation of commit is all or nothing and in this case it isn't always all and it's certainly not nothing. Perhaps the method should be called maybeCommit instead of commit.

                           

                          I accept your response but I don't see it as an answer to the question and I still see this as a bug, even if it is by design.

                           

                          There needs to be a "reliable" mechanism as specified by the JMS PERSISTENT CLIENT_ACKNOWLEDGE mode.

                           

                          Perhaps there could be a method to call to check if a batch was committed. That way, when I get a JMSException during commit, I can check whether that batch was rolled back or what happened, and then make an informed decision as to how to proceed.

                           

                          This may be the only way to operate where there is a very unreliable connection.

                          • 10. Re: commit timeout but data lost
                            clebert.suconic

                            It's all or nothing, but in your case you have two resources you are committing. You can't commit the second until the first resource is confirmed, hence you need XA.

                             

                            > Perhaps there could be a method to call to check if a batch was committed.

                             

                             

                            That's the point: there is something like that in XA.

                            • 11. Re: commit timeout but data lost
                              hughbragg

                              That's true Clebert, except that's not what I'm asking about.

                               

                              I'd be happy if I could reliably commit and have the client know 100% that the commit was/wasn't successful.

                               

                              Consider an application that just consumes messages without storing them elsewhere. Say it just dumps them while it looks for some specific message. If there is a lot of noise on the connection from time to time, messages might be lost because the client gets a JMSException during commit, but the server has committed. The client can only assume that the commit was unsuccessful and so expects the messages to be resent. How does XA apply here?

                               

                              XA transactions will slow throughput significantly when there is a reliable connection. I just want some callback I can use after a JMSException is thrown during commit which tells me if the batch was committed.

                              • 12. Re: commit timeout but data lost
                                clebert.suconic

                                With XA you can query the server to find out whether the XA transaction is still available there.

                                 

                                The issue with XA is that you will have extra syncs to validate if the commit was accepted or not.

                                 

                                 

                                Anything beyond that would be a new feature.

                                 

                                 

                                You could maybe use XA directly. It wouldn't be very difficult.

                                 

                                start / end / prepare / commit

                                 

                                If you get a failure at any point, you can look up the XID. If you can't find it after a commit, then the XID is already committed.

                                 

                                If you can find it... then you decide what to do.
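The sequence above could look roughly like the sketch below. It uses only the standard `javax.transaction.xa` interfaces from the JDK. `SimpleXid` and `FakeResource` are hypothetical in-memory stand-ins written for this example, not HornetQ classes; with a real broker the `XAResource` would come from a JMS `XASession` via `getXAResource()`. The lookup relies on `XAResource.recover`, which returns the transactions that are still in the prepared state.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.transaction.xa.XAException;
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;

final class XaOutcomeCheck {

    enum State { NO_LONGER_PREPARED, STILL_PREPARED }

    /** Minimal Xid, for illustration only. */
    static final class SimpleXid implements Xid {
        private final byte[] gtrid;
        SimpleXid(String id) { gtrid = id.getBytes(StandardCharsets.UTF_8); }
        public int getFormatId() { return 1; }
        public byte[] getGlobalTransactionId() { return gtrid.clone(); }
        public byte[] getBranchQualifier() { return new byte[] {1}; }
    }

    /** In-memory stand-in for a broker's XAResource (hypothetical, not HornetQ code). */
    static final class FakeResource implements XAResource {
        private final List<Xid> prepared = new ArrayList<>();
        private final boolean commitApplied;
        FakeResource(boolean commitApplied) { this.commitApplied = commitApplied; }
        public void start(Xid x, int flags) { }
        public void end(Xid x, int flags) { }
        public int prepare(Xid x) { prepared.add(x); return XA_OK; }
        public void commit(Xid x, boolean onePhase) throws XAException {
            if (commitApplied) prepared.remove(x);
            // Simulate the reply being lost whether or not the commit took effect.
            throw new XAException(XAException.XAER_RMFAIL);
        }
        public void rollback(Xid x) { prepared.remove(x); }
        public Xid[] recover(int flag) { return prepared.toArray(new Xid[0]); }
        public void forget(Xid x) { }
        public boolean isSameRM(XAResource r) { return r == this; }
        public int getTransactionTimeout() { return 0; }
        public boolean setTransactionTimeout(int seconds) { return false; }
    }

    static boolean sameXid(Xid a, Xid b) {
        return a.getFormatId() == b.getFormatId()
            && Arrays.equals(a.getGlobalTransactionId(), b.getGlobalTransactionId())
            && Arrays.equals(a.getBranchQualifier(), b.getBranchQualifier());
    }

    /** Looks the XID up among the transactions the resource still holds as prepared. */
    static State lookup(XAResource res, Xid xid) throws XAException {
        for (Xid p : res.recover(XAResource.TMSTARTRSCAN | XAResource.TMENDRSCAN)) {
            if (sameXid(p, xid)) return State.STILL_PREPARED;
        }
        return State.NO_LONGER_PREPARED; // not found after a commit attempt
    }

    /** start / end / prepare / commit, with an XID lookup if the commit call fails. */
    static State commitWithLookup(XAResource res, Xid xid, Runnable work) {
        try {
            res.start(xid, XAResource.TMNOFLAGS);
            work.run(); // consume or send messages here
            res.end(xid, XAResource.TMSUCCESS);
            if (res.prepare(xid) == XAResource.XA_RDONLY) {
                return State.NO_LONGER_PREPARED; // nothing to commit
            }
            try {
                res.commit(xid, false);
                return State.NO_LONGER_PREPARED;
            } catch (XAException commitFailure) {
                return lookup(res, xid); // outcome unknown, so ask the resource
            }
        } catch (XAException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A `STILL_PREPARED` result means the caller still has to decide between retrying the commit and rolling back. `NO_LONGER_PREPARED` after a commit attempt is the "already committed" case described above.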

                                • 13. Re: commit timeout but data lost
                                  ataylor

                                  Clebert is correct: what you are asking for is why XA txs are there; they are not bounded by the lifetime of the session. Any non-XA transaction system would be the same: there is no way of knowing if a commit was accepted or not on a connection failure without keeping track of all transactions. This is what XA gives you.

                                  • 14. Re: commit timeout but data lost
                                    hughbragg

                                    I agree with Clebert's assessment as well and I will definitely give his suggestion a try. It sounds interesting.

                                    All I'm saying is that commit() is pretty lame from a developer's perspective if it's not 100% reliable, and so I suggested a way to make it so.
