29 Replies, latest reply on Aug 3, 2011 11:57 AM by ataylor

    Cluster connection on failover

    huma

      Hi forum,

       

      I've recently started getting an error in a production environment saying "errorCode=2 message=Channel disconnected", which leaves the cluster in an inconsistent state, and the server that was disconnected stops receiving messages.

       

      During my research I was able to replicate the problem, and it seems to me that the bridge between the nodes doesn't get "destroyed", so the live node keeps trying to send messages to the disconnected server.

       

      To replicate this problem, do the following (use the configurations attached in the zip file):

       

      1) Start two HornetQ instances (server1, server2).

      2) Kill the JVM of server2 (kill -9 PID).

      3) Execute the Java code that will send 10000 messages to server1.

       

      I would expect that if server2 leaves the cluster, server1 would receive all the messages, as with a normal shutdown, but nothing like that is happening.

       

      - Immediately after the JVM kill, the first server can't even send a message.

      - After server1 receives a "Timed out waiting for response when sending packet 71", it can send the 10000 messages, but it only receives half of them.

       

      I'm using the stand-alone clustered configuration that ships in the HornetQ zip, changed accordingly.

       

      I've attached the Java example and all the configurations used in the tgz file.

       

      Do I need more configuration? Is this a bug?

       

      Thank you in advance.

       

      Hugo Marcelino

        • 1. Re: Cluster connection on failover
          ataylor

          Yes, this is expected behaviour: the messages are held in a storeandforwardqueue waiting for the second server (or its backup) to come back. Since you have forward-when-no-consumers=true, messages will still be load balanced.
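For anyone following along, here is a minimal sketch of why only half of the 10000 messages arrive. It is a toy model, not HornetQ code: the class name, the method name and the strict even/odd round-robin are all assumptions made for illustration. It only mimics the idea described above, that messages routed to the remote node are parked in a store-and-forward queue and stay there while that node is down.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model only, not HornetQ code: round-robin load balancing between a
// local consumer and a store-and-forward queue feeding a remote node.
class StoreAndForwardSketch {

    /** Returns {deliveredLocally, deliveredRemotely, heldInStoreAndForward}. */
    static int[] distribute(int messageCount, boolean remoteNodeAlive) {
        Deque<Integer> storeAndForward = new ArrayDeque<>();
        int deliveredLocally = 0;
        int deliveredRemotely = 0;

        for (int id = 0; id < messageCount; id++) {
            if (id % 2 == 0) {
                // Even ids: routed to the consumer on the local node.
                deliveredLocally++;
            } else {
                // Odd ids: routed to the remote binding, so they are parked
                // in the store-and-forward queue first.
                storeAndForward.add(id);
            }
            // Assumption: the bridge only drains the queue while the
            // remote node is reachable.
            while (remoteNodeAlive && !storeAndForward.isEmpty()) {
                storeAndForward.poll();
                deliveredRemotely++;
            }
        }
        return new int[] {deliveredLocally, deliveredRemotely, storeAndForward.size()};
    }

    public static void main(String[] args) {
        int[] result = distribute(10000, false);
        System.out.println("local=" + result[0] + " remote=" + result[1] + " held=" + result[2]);
    }
}
```

With the remote node marked as dead, the model delivers 5000 messages locally and leaves 5000 held, which matches the "only receives half" symptom from the first post.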

          • 2. Re: Cluster connection on failover
            jnelas

            Just to clarify, on a cluster, if one of the machines goes down, the messages that would normally be delivered to that machine will wait indefinitely for that machine to recover, the other machines will not receive them, correct?

             

            Will this behaviour change as the discussion on http://community.jboss.org/message/584510#584510 seems to imply?

            • 3. Re: Cluster connection on failover
              ataylor

              jnelas wrote:

              Just to clarify, on a cluster, if one of the machines goes down, the messages that would normally be delivered to that machine will wait indefinitely for that machine to recover, the other machines will not receive them, correct?

              Correct, but as I said, if forward-when-no-consumers=false then load balancing would stop.

              jnelas wrote:

              Will this behaviour change as the discussion on http://community.jboss.org/message/584510#584510 seems to imply?

              I'm not sure I see where it's implied.

              • 4. Re: Cluster connection on failover
                jnelas


                Andy Taylor wrote:

                Will this behaviour change as the discussion on http://community.jboss.org/message/584510#584510 seems to imply?

                I'm not sure I see where it's implied.

                If the cluster is considered dead after a while, the messages should be treated by the remaining server.

                • 5. Re: Cluster connection on failover
                  huma

                  Hi Andy,

                  Andy Taylor wrote:

                  Correct, but as I said, if forward-when-no-consumers=false then load balancing would stop.

                  I've tried this but the behaviour stays the same. The messages are still stuck.

                  The client only receives half of the messages.

                  • 6. Re: Cluster connection on failover
                    ataylor

                    That's probably because they have already been load balanced. Just bring back the server and delivery will continue.

                    • 7. Re: Cluster connection on failover
                      ataylor

                      jnelas wrote:

                      If the cluster is considered dead after a while, the messages should be treated by the remaining server.

                      Are you sure you're pointing me at the right link? I can't see where this is mentioned at all.

                      • 8. Re: Cluster connection on failover
                        jnelas

                        Andy Taylor wrote:

                         

                        If the cluster is considered dead after a while, the messages should be treated by the remaining server.

                        Are you sure you're pointing me at the right link? I can't see where this is mentioned at all.

                        In the last message on that thread, Clebert Suconic says:

                         

                        "However this is being changed now. I'm introducing a reconnect attempt# on the cluster connection. That means the cluster will be considered dead and the remote binding will be removed on that case."

                         

                        The conclusion that the messages would be treated by the other machines is mine.

                        • 9. Re: Cluster connection on failover
                          huma

                          Andy Taylor wrote:

                           

                          That's probably because they have already been load balanced. Just bring back the server and delivery will continue.

                           

                          In the test that I did, the messages were only sent after I killed the second server. They still got stuck.

                          • 10. Re: Cluster connection on failover
                            ataylor

                            The binding and the queue are different things: once the binding is removed, the queue is no longer available to forward messages to; however, the queue will still exist and may have messages in it. I'm not saying that we won't at some point change this to let a user configure what happens to messages in a queue, but arbitrarily sending them back to the server to redistribute is something we won't do. The reason we have a redistribute setting on queues is that when you do this you can (and probably will) break ordering guarantees.
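To make the ordering concern concrete, here is a small sketch. Again this is a toy model and not HornetQ code: the class, the method names and the "append held messages at the end" redistribution step are hypothetical, chosen only to show how handing parked messages back to the surviving node can deliver them out of send order.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only, not HornetQ code: shows how handing held messages back
// to the surviving node can reorder delivery.
class RedistributionOrderSketch {

    /** Returns the order in which the single local consumer sees messages. */
    static List<Integer> deliveryOrder(int messageCount) {
        List<Integer> seenByConsumer = new ArrayList<>();
        List<Integer> heldForDeadNode = new ArrayList<>();

        for (int id = 1; id <= messageCount; id++) {
            if (id % 2 == 1) {
                seenByConsumer.add(id);   // delivered locally straight away
            } else {
                heldForDeadNode.add(id);  // parked for the dead node
            }
        }
        // Hypothetical redistribution step: held messages are sent back and
        // arrive after everything the consumer has already received.
        seenByConsumer.addAll(heldForDeadNode);
        return seenByConsumer;
    }

    /** True if the ids never go backwards, i.e. they arrive in send order. */
    static boolean inSendOrder(List<Integer> ids) {
        for (int i = 1; i < ids.size(); i++) {
            if (ids.get(i) < ids.get(i - 1)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Integer> order = deliveryOrder(6);
        System.out.println(order + " inSendOrder=" + inSendOrder(order));
    }
}
```

For six messages the consumer in this model sees 1, 3, 5, 2, 4, 6, so a consumer that relies on send order would be surprised by the late arrivals.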

                            • 11. Re: Cluster connection on failover
                              jnelas

                              But will the redistribute setting apply to messages that are on the storeandforwardqueue for delivery to a dead cluster member?

                              • 12. Re: Cluster connection on failover
                                ataylor

                                jnelas wrote:

                                But will the redistribute setting apply to messages that are on the storeandforwardqueue for delivery to a dead cluster member?

                                The redistribute setting is on live nodes that have no consumers; I was just using it to illustrate a scenario.

                                 

                                If you don't have a consistent cluster, i.e. where there is a possibility that servers will crash, then you should configure backup servers. Remember, if the messages were sent to the server just before it crashed, they are still lost.

                                • 13. Re: Cluster connection on failover
                                  jnelas

                                  Andy Taylor wrote:

                                  If you don't have a consistent cluster, i.e. where there is a possibility that servers will crash, then you should configure backup servers. Remember, if the messages were sent to the server just before it crashed, they are still lost.

                                   

                                  True, I get your point.

                                   

                                  In my specific case the servers didn't crash; for some unknown reason the cluster appears to get disconnected at some point and the messages get stuck on the storeandforwardqueue. The cluster eventually reconnects, but the messages are stuck forever.

                                  Newer messages pass through the bridge, but the old ones never get delivered to either of the servers.

                                   

                                  I realize that the connection problems between the cluster are a different issue, but one that would be minimized if there was a way to deal with the messages that are on the storeandforwardqueue.

                                  • 14. Re: Cluster connection on failover
                                    ataylor

                                    jnelas wrote:

                                    In my specific case the servers didn't crash; for some unknown reason the cluster appears to get disconnected at some point and the messages get stuck on the storeandforwardqueue. The cluster eventually reconnects, but the messages are stuck forever.

                                    Newer messages pass through the bridge, but the old ones never get delivered to either of the servers.

                                    OK, that's slightly different from what you were saying. If the server reconnects then the messages should be sent. Can you verify that they are in the store and forward queue and that the cluster connection is remade?

                                     

                                    Also, what version are you using? If there is an issue, it may well have been fixed in the last release.
