2 Replies Latest reply on Mar 13, 2012 8:53 AM by cwong15

    HA: errors on resurrected live server after failover

    cwong15

      Hi. I am testing HA/failover on HornetQ 2.2.5 based on the examples. I am puzzled about connection error messages that come up during the failover scenario. This is the sequence:

      1. Run 2 HornetQ instances, configured as a live/backup HA pair using discovery.
      2. Shut down the live instance (failover-on-shutdown is true).
      3. The backup becomes live, as expected.
      4. Start up the original live instance (allow-failback is true).
      5. The newly resurrected instance starts logging reconnection error messages every 2 seconds. Everything seems to work otherwise.

       

      This is what the error messages look like:

      03-12-2012 16:12:51 DEBUG impl.ClientSessionFactoryImpl: Trying reconnection attempt 21
      03-12-2012 16:12:51 DEBUG netty.NettyConnector: Started Netty Connector version 3.2.3.Final-r${buildNumber}
      03-12-2012 16:12:51 DEBUG impl.ClientSessionFactoryImpl: Trying to connect at the main server using connector :org-hornetq-core-remoting-impl-netty-NettyConnectorFactory?port=5446&host=172-17-172-5&tcp-send-buffer-size=262144&tcp-no-delay=true&tcp-receive-buffer-size=262144
      03-12-2012 16:12:51 DEBUG impl.ClientSessionFactoryImpl: Main server is not up. Hopefully there's a backup configured now!

      This seems to be the stack trace where this is happening:

      "Thread-1 (group:HornetQ-client-global-threads-1119552518)" daemon prio=10 tid=0x00007f292c00f000 nid=0x76e1 in Object.wait() [0x00007f2940b9e000]
         java.lang.Thread.State: TIMED_WAITING (on object monitor)
              at java.lang.Object.wait(Native Method)
              - waiting on <0x00000007b1b55178> (a java.lang.Object)
              at org.hornetq.core.client.impl.ClientSessionFactoryImpl.getConnectionWithRetry(ClientSessionFactoryImpl.java:916)
              - locked <0x00000007b1b55178> (a java.lang.Object)
              at org.hornetq.core.client.impl.ClientSessionFactoryImpl.reconnectSessions(ClientSessionFactoryImpl.java:840)
              at org.hornetq.core.client.impl.ClientSessionFactoryImpl.failoverOrReconnect(ClientSessionFactoryImpl.java:588)
              - locked <0x00000007b1b549b8> (a java.lang.Object)
              at org.hornetq.core.client.impl.ClientSessionFactoryImpl.handleConnectionFailure(ClientSessionFactoryImpl.java:482)
              at org.hornetq.core.client.impl.ClientSessionFactoryImpl.access$800(ClientSessionFactoryImpl.java:78)
              at org.hornetq.core.client.impl.ClientSessionFactoryImpl$DelegatingFailureListener.connectionFailed(ClientSessionFactoryImpl.java:1318)
              at org.hornetq.core.protocol.core.impl.RemotingConnectionImpl.callFailureListeners(RemotingConnectionImpl.java:528)
              at org.hornetq.core.protocol.core.impl.RemotingConnectionImpl.fail(RemotingConnectionImpl.java:298)
              at org.hornetq.core.client.impl.ClientSessionFactoryImpl$Channel0Handler$1.run(ClientSessionFactoryImpl.java:1262)
              at org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run(OrderedExecutorFactory.java:100)
              at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
              at java.lang.Thread.run(Thread.java:662)

      The reconnection attempts are failing because the resurrected server has become the live server again (failback), but this same live server is trying to connect to the server that has reverted to backup mode. What puzzles me is that I do not have any retries set on my connection factories, so they should not be attempting continuously to reconnect. Where is this connection activity coming from, and is it benign?
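      For context on the "no retries set" point: client-side retry behavior is configured per connection factory in hornetq-jms.xml, and when the elements are omitted HornetQ falls back to its defaults (retry-interval 2000 ms, reconnect-attempts 0), so a plain client factory should not loop like this. A sketch of where those knobs would live if set explicitly (element names from the hornetq-jms.xml schema; the values shown are just the documented defaults, for illustration):

      ```xml
      <!-- Illustrative only: making the default retry settings explicit on a
           connection factory. With reconnect-attempts 0, the client gives up
           after the initial connection failure rather than looping. -->
      <connection-factory name="hornetqConnectionFactory">
          <connectors>
              <connector-ref connector-name="netty-connector"/>
          </connectors>
          <entries>
              <entry name="/hornetqConnectionFactory"/>
          </entries>
          <ha>true</ha>
          <retry-interval>2000</retry-interval>      <!-- ms between attempts; default 2000 -->
          <reconnect-attempts>0</reconnect-attempts> <!-- default 0: no client-side retry loop -->
      </connection-factory>
      ```

      The 2-second cadence of the logged attempts matches that default retry-interval, which is part of why I suspect the retry loop is not coming from my own factories.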

       

      For what it's worth, these connection factories are configured in my hornetq-jms.xml:

      <connection-factory name="hornetqConnectionFactory">
              <xa>false</xa>
              <connectors>
                  <connector-ref connector-name="netty-connector"/>
              </connectors>
              <entries>
                  <entry name="/hornetqConnectionFactory"/>
              </entries>
              <ha>true</ha>
              <use-global-pools>false</use-global-pools>
      </connection-factory>

      <connection-factory name="hornetqXaConnectionFactory">
              <xa>true</xa>
              <connectors>
                  <connector-ref connector-name="netty-connector"/>
              </connectors>
              <entries>
                  <entry name="/hornetqXaConnectionFactory"/>
              </entries>
              <ha>true</ha>
              <use-global-pools>false</use-global-pools>
      </connection-factory>

      Thanks in advance for any insight.

        • 1. Re: HA: errors on resurrected live server after failover
          gaohoward

          Can you attach your configuration files for both the live and backup nodes? The connection activity you are seeing is probably the live node trying to establish internal communication with the backup.

           

          Howard

          • 2. Re: HA: errors on resurrected live server after failover
            cwong15

            Thanks for responding. I have attached my hornetq-configuration.xml file, which is identical between the live and backup servers; the ${hornetq.ha_backup} parameter is set to true or false to select live or backup status. The hornetq-jms.xml file is likewise identical between the two.

             

            You may be right that this is an internal communication issue. I don't believe this behavior is limited to HA live/backup failbacks. I just tested a cluster configuration and found the same error messages being logged if one of the nodes is shut down, until that downed node is restarted. The same reconnect stack trace is also present in the thread dump. The difference with the HA configuration is that the live node will log those reconnect errors "forever" because the backup node will never become live again under normal circumstances.
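             If the source of the loop really is the server-side cluster connection (the stack trace shows a ClientSessionFactory owned by the server, not by my application), then the relevant retry settings would be the ones on the cluster-connection in hornetq-configuration.xml, where the server's default is to retry forever. A sketch of that block; whether `<reconnect-attempts>` is accepted here depends on the HornetQ version's schema, and the name "my-cluster" and discovery group are placeholders, not my actual config:

             ```xml
             <!-- Hypothetical sketch: server-side retry settings on a cluster connection.
                  -1 (the default) means retry forever, which would explain the
                  never-ending reconnect messages when a peer stays down. -->
             <cluster-connections>
                 <cluster-connection name="my-cluster">
                     <address>jms</address>
                     <connector-ref>netty-connector</connector-ref>
                     <retry-interval>2000</retry-interval>
                     <reconnect-attempts>-1</reconnect-attempts>
                     <use-duplicate-detection>true</use-duplicate-detection>
                     <discovery-group-ref discovery-group-name="dg-group1"/>
                 </cluster-connection>
             </cluster-connections>
             ```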