11 Replies Latest reply on Sep 24, 2013 11:51 AM by manu_1185

    Failover works, but on fail-back, client doesn't reconnect

    manu_1185

      We have a messaging application that uses HornetQ 2.3.0.Final. While testing the HornetQ fail-over and fail-back setup for high availability (one live and one backup server), I have run into an issue for which I cannot find a solution.

       

      I start the live and backup servers. I can see the "backup announced" log in the backup server logs. I start a client application which connects to the live server, creates consumers and producers, and sends and receives messages. When I kill the live server, failover works fine (with automatic client fail-over to the backup node). The backup server becomes live, the client application connects to the backup server, and messaging works fine. However, once the backup server is up, I can see the following exception in the backup server logs:

       

      08:37:14,668 INFO  [org.hornetq.core.server] HQ221020: Started Netty Acceptor version 3.6.2.Final-c0d783c 10.0.1.6:6455 for CORE protocol

      08:37:14,672 INFO  [org.hornetq.core.server] HQ221020: Started Netty Acceptor version 3.6.2.Final-c0d783c 10.0.1.6:6445 for CORE protocol

      08:37:14,679 WARN  [org.hornetq.core.client] HQ212028: error starting server locator: HornetQException[errorType=ILLEGAL_STATE message=null]

              at org.hornetq.core.client.impl.ServerLocatorImpl.initialise(ServerLocatorImpl.java:371) [hornetq-core-client.jar:]

              at org.hornetq.core.client.impl.ServerLocatorImpl.start(ServerLocatorImpl.java:566) [hornetq-core-client.jar:]

              at org.hornetq.core.client.impl.ServerLocatorImpl$StaticConnector$1.connectionFailed(ServerLocatorImpl.java:1773) [hornetq-core-client.jar:]

              at org.hornetq.core.protocol.core.impl.RemotingConnectionImpl.callFailureListeners(RemotingConnectionImpl.java:570) [hornetq-core-client.jar:]

              at org.hornetq.core.protocol.core.impl.RemotingConnectionImpl.fail(RemotingConnectionImpl.java:341) [hornetq-core-client.jar:]

              at org.hornetq.core.client.impl.ClientSessionFactoryImpl$CloseRunnable.run(ClientSessionFactoryImpl.java:1631) [hornetq-core-client.jar:]

              at org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run(OrderedExecutorFactory.java:106) [hornetq-core-client.jar:]

              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_15]

              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_15]

              at java.lang.Thread.run(Thread.java:722) [rt.jar:1.7.0_15]

       

      When I bring the live server up again, fail-back happens and the backup server shuts down (though I now see the above exception in the live server logs). At this point, automatic client fail-back doesn't happen. The client application keeps waiting and doesn't reconnect to the live server. If I then start the backup server again and kill the live server again (causing the backup server to become live again), the client application reconnects to the backup server (which is now live) and messaging starts to work again. My guess is that the client application had kept trying to reconnect to the backup server, when it should have reconnected to the live server after fail-back.

       

      Is this behavior expected, or am I doing something wrong? Restarting the clients solves everything, but I need clients to fail back automatically.

       

      Attached are my configuration files. Any help would be appreciated.

       

      Thanks,

      Manu

        • 1. Re: Failover works, but on fail-back, client doesn't reconnect
          clebert.suconic

          I would add server1-connector to the other server.

           

           

          The backup is able to announce itself to the live, but the live, when acting as backup, is not able to announce itself.

          • 2. Re: Failover works, but on fail-back, client doesn't reconnect
            manu_1185

            I have added server1-connector in server0 (live server) and server0-connector in server1 (backup server). I have also configured cluster connections on both servers to be able to connect to each other. I have also verified that the live server, when acting as backup, is able to announce itself. I can see "backup announced" in live server logs after it is restarted. After that, the backup server shuts down (for live to take over and failback to happen).

             

            Is there any issue in the configuration files I have attached which can cause this problem? I want the clients to re-connect automatically to live server after fail-back happens and backup server shuts down.

            • 3. Re: Failover works, but on fail-back, client doesn't reconnect
              clebert.suconic

              You added it, and what happened?

              • 4. Re: Failover works, but on fail-back, client doesn't reconnect
                manu_1185

                What I meant was that it was already added in my initial configuration files (which I had attached in my first post).

                • 5. Re: Failover works, but on fail-back, client doesn't reconnect
                  ataylor

                  Does the failback example work for you?

                  Also, do you see the live announce itself as a backup when it restarts?

                  • 6. Re: Failover works, but on fail-back, client doesn't reconnect
                    manu_1185

                    Yes, the live announces itself as backup after restarting. After that, the backup shuts down and failback happens (the live becomes live again). I then see the exception from my first post in the live server logs, but clients do not reconnect to the live server and keep waiting. FYI: I am using static-connectors in my configuration.

                     

                    I will try the example (with static-connectors) and let you know if it works.

                    • 7. Re: Failover works, but on fail-back, client doesn't reconnect
                      manu_1185

                      I tested using the example. I have the live server on one machine (10.0.1.4) and the backup server on another (10.0.1.6). When I run the client on the machine where the live server is hosted (10.0.1.4), everything works fine. But when I run clients on a third machine, failback doesn't work (although failover works).

                       

                      I have checked that the ports are open, so it shouldn't be a network issue. Any ideas what might be wrong?

                      • 8. Re: Failover works, but on fail-back, client doesn't reconnect
                        clebert.suconic

                        I would try changing the example to use your IPs, and then compare it with your changes. I suspect you have something wrong with the host announce, as I posted earlier.

                        • 9. Re: Failover works, but on fail-back, client doesn't reconnect
                          manu_1185

                          I tried playing with the example. First I started both the live and backup nodes on the same machine (10.0.1.4) on different ports. Clients were also on the same machine and everything went fine. Then I moved the backup node to another machine (10.0.1.6) while the live node and clients stayed on the same machine as before (10.0.1.4). This time, too, everything was fine.

                           

                          In the third step, I moved the clients to a new machine (10.0.0.183). After this step, the problem occurred: failover worked fine while failback didn't. I enabled debug logs on the client to find out what might be wrong. I can see the following logs on the client when FAILOVER happens (these logs did not appear when clients were on the same machine as the live server, i.e. 10.0.1.4, but they do appear when clients are moved to a different machine, i.e. 10.0.0.183):

                           

                          Update uniqueEvent=1380017101867, nodeId=385cc707-24e2-11e3-b92b-33950045aeb3, memberInput=TopologyMember[name = null, connector=Pair[a=TransportConfiguration(name=netty, factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=6445&host=10-0-1-6, b=null]] being rejected as there was a delete done after that

                           

                          Similarly, during FAILBACK, the following logs appear:

                           

                          Update uniqueEvent=1380017129168, nodeId=385cc707-24e2-11e3-b92b-33950045aeb3, memberInput=TopologyMember[name = null, connector=Pair[a=TransportConfiguration(name=netty, factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=6445&host=10-0-1-6, b=TransportConfiguration(name=netty, factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=5445&host=10-0-1-4]] being rejected as there was a delete done after that

                           

                          After that, I constantly see the following logs (note that the client only tries to connect to the backup server, i.e. 10.0.1.6, which is shut down now that the live has come up). Also, I can always see "backup announced" (initially on the backup server, and on the live server when it is restarted), so the host announce looks fine.

                           

                          2013-09-24 10:51:10 DEBUG client:609 - Remote destination: /10.0.1.6:6445

                          2013-09-24 10:51:10 DEBUG client:1229 - Main server is not up. Hopefully there's a backup configured now!

                          2013-09-24 10:51:10 DEBUG client:1229 - Main server is not up. Hopefully there's a backup configured now!

                          Send message: Message [46] to destination [/queue/hike_durable_dest_1], ts1 [1380019863459]

                          2013-09-24 10:51:15 DEBUG client:1066 - Trying reconnection attempt 5/-1

                          2013-09-24 10:51:15 DEBUG client:1207 - Trying to connect with connector = org.hornetq.core.remoting.impl.netty.NettyConnectorFactory@7b41fab6, parameters = {port=6445, host=10.0.1.6} connector = NettyConnector [host=10.0.1.6, port=6445, httpEnabled=false, useServlet=false, servletPath=/messaging/HornetQServlet, sslEnabled=false, useNio=false]

                          2013-09-24 10:51:15 DEBUG client:516 - Started Netty Connector version 3.6.2.Final-c0d783c

                          2013-09-24 10:51:15 DEBUG client:1220 - Trying to connect at the main server using connector :TransportConfiguration(name=netty, factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=6445&host=10-0-1-6

                          2013-09-24 10:51:15 DEBUG client:609 - Remote destination: /10.0.1.6:6445

                          2013-09-24 10:51:15 DEBUG client:1066 - Trying reconnection attempt 5/-1

                          2013-09-24 10:51:15 DEBUG client:1207 - Trying to connect with connector = org.hornetq.core.remoting.impl.netty.NettyConnectorFactory@67ecd78, parameters = {port=6445, host=10.0.1.6} connector = NettyConnector [host=10.0.1.6, port=6445, httpEnabled=false, useServlet=false, servletPath=/messaging/HornetQServlet, sslEnabled=false, useNio=false]

                          2013-09-24 10:51:15 DEBUG client:516 - Started Netty Connector version 3.6.2.Final-c0d783c

                          2013-09-24 10:51:15 DEBUG client:1220 - Trying to connect at the main server using connector :TransportConfiguration(name=netty, factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=6445&host=10-0-1-6

                          • 10. Re: Failover works, but on fail-back, client doesn't reconnect
                            jbertram

                            Are the clocks on all the machines involved in the test synchronized?

                            • 11. Re: Failover works, but on fail-back, client doesn't reconnect
                              manu_1185

                              The issue was because the clocks were not in sync. Thanks a lot.
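
For readers wondering why unsynchronized clocks produce the "being rejected as there was a delete done after that" messages quoted in reply 9: the `uniqueEvent` values in those logs look like wall-clock millisecond timestamps, and the message suggests that an update older than the last recorded delete for the same node is discarded. The sketch below is a simplified, hypothetical model of that ordering rule in plain Java. It is not HornetQ code; the class name `TopologySketch`, its methods, the 30-second skew, and the assumption about which host's clock runs ahead are all illustrative.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified model of timestamp-ordered topology events. Not HornetQ code. */
final class TopologySketch {
    // nodeId -> event id (sender's wall-clock millis) of the last recorded delete
    private final Map<String, Long> lastDelete = new HashMap<>();
    // nodeId -> connector the client currently believes it should use
    private final Map<String, String> members = new HashMap<>();

    /** Records that a node left the topology, stamped with the sender's clock. */
    void removeMember(long eventId, String nodeId) {
        members.remove(nodeId);
        lastDelete.put(nodeId, eventId);
    }

    /** Accepts an update only if it is not older than the last delete for that node. */
    boolean updateMember(long eventId, String nodeId, String connector) {
        Long deletedAt = lastDelete.get(nodeId);
        if (deletedAt != null && eventId < deletedAt) {
            return false; // the "rejected as there was a delete done after that" case
        }
        members.put(nodeId, connector);
        return true;
    }

    String connectorFor(String nodeId) {
        return members.get(nodeId);
    }

    public static void main(String[] args) {
        String nodeId = "385cc707-24e2-11e3-b92b-33950045aeb3";
        long realTime = 1_000_000L; // hypothetical "true" time in ms
        long skew = 30_000L;        // assumption: one host's clock runs 30 s ahead

        // Skewed clocks: the delete is stamped by the fast clock, the later
        // announcement by the correct clock, so the announcement looks stale.
        TopologySketch skewed = new TopologySketch();
        skewed.removeMember(realTime + skew, nodeId);
        boolean acceptedSkewed = skewed.updateMember(realTime + 5_000L, nodeId, "10.0.1.4:5445");
        System.out.println("skewed clocks, update accepted: " + acceptedSkewed);

        // Synchronized clocks: the same sequence of events is ordered correctly.
        TopologySketch inSync = new TopologySketch();
        inSync.removeMember(realTime, nodeId);
        boolean acceptedInSync = inSync.updateMember(realTime + 5_000L, nodeId, "10.0.1.4:5445");
        System.out.println("synchronized clocks, update accepted: " + acceptedInSync);
    }
}
```

Under this model, a client that drops the live server's announcement never learns the new connector and keeps retrying the old address, which would match the repeated attempts against 10.0.1.6:6445 in the debug log. Keeping all hosts synchronized (for example with NTP) avoids the misordering.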