2 Replies Latest reply on Mar 20, 2012 1:14 PM by gdegremont

    HornetQ HA tests, fails to go back to normal state

    gdegremont

      Hello,

       

      I am faced with a strange behavior.

      I have 2 hornetq nodes on 2 hosts, 1 live (HQ1), 1 backup (HQ2)

       

      My test case is the following:

       

      HQ1 is up and live, HQ2 is up and backup.

      I have a message producer (MP) and a message consumer (MC).

       

      HQ1 is killed (through a kill -9)

      HQ2 becomes the live server.

       

      MP loses connection, and timeouts before HQ2 is up as new live.

      MC, loses connection, then gets it again and empties the queue.

       

      HQ1 is restarted, fails to become backup and successfully becomes the new live node.

      ***********************************************************************************

      java -Dcom.sun.management.jmxremote.port=1105 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -XX:+UseParallelGC -XX:+AggressiveOpts -XX:+UseFastAccessorMethods -Xms512M -Xmx2048M -Dhornetq.config.dir=/etc/local/bomapps/messaging-bus/config/bestofmedia/ -Djava.util.logging.config.file=/etc/local/bomapps/messaging-bus/config/bestofmedia//logging.properties -Djava.library.path=. -classpath ../lib/twitter4j-core.jar:../lib/netty.jar:../lib/jnpserver.jar:../lib/jnp-client.jar:../lib/jboss-mc.jar:../lib/jboss-jms-api.jar:../lib/hornetq-twitter-integration.jar:../lib/hornetq-spring-integration.jar:../lib/hornetq-logging.jar:../lib/hornetq-jms.jar:../lib/hornetq-jms-client-java5.jar:../lib/hornetq-jms-client.jar:../lib/hornetq-jboss-as-integration.jar:../lib/hornetq-core.jar:../lib/hornetq-core-client-java5.jar:../lib/hornetq-core-client.jar:../lib/hornetq-bootstrap.jar:/etc/local/bomapps/messaging-bus/config/bestofmedia/:../schemas/ org.hornetq.integration.bootstrap.HornetQBootstrapServer hornetq-beans.xml

      ***********************************************************************************

      [main] 15:10:40,387 INFO [org.hornetq.integration.bootstrap.HornetQBootstrapServer]  Starting HornetQ Server

      [main] 15:10:41,259 WARNING [org.hornetq.core.deployers.impl.FileConfigurationParser]  AIO wasn't located on this platform, it will fall back to using pure Java NIO. If your platform is Linux, install LibAIO to enable the AIO journal

      [main] 15:10:41,325 INFO [org.hornetq.core.server.impl.HornetQServerImpl]  live server is starting with configuration HornetQ Configuration (clustered=true,backup=false,sharedStore=true,journalDirectory=/home/sites/messaging-bus/shared/data/journal,bindingsDirectory=/home/sites/messaging-bus/shared/data/bindings,largeMessagesDirectory=/home/sites/messaging-bus/shared/data/large-messages,pagingDirectory=/home/sites/messaging-bus/shared/data/paging)

      [main] 15:10:41,326 INFO [org.hornetq.core.server.impl.HornetQServerImpl]  Waiting to obtain live lock

      [main] 15:10:41,355 INFO [org.hornetq.core.persistence.impl.journal.JournalStorageManager]  Using NIO Journal

      [main] 15:10:41,372 WARNING [org.hornetq.core.server.impl.HornetQServerImpl]  Security risk! It has been detected that the cluster admin user and password have not been changed from the installation default. Please see the HornetQ user guide, cluster chapter, for instructions on how to do this.

      [main] 15:10:41,573 INFO [org.hornetq.core.server.cluster.impl.ClusterManagerImpl]  announcing backup

      [Thread-0 (group:HornetQ-server-threads20327011-24080548)] 15:10:42,068 WARNING [org.hornetq.core.client.impl.ClientSessionFactoryImpl]  Tried 1 times to connect. Now giving up on reconnecting it.

      [Thread-0 (group:HornetQ-server-threads20327011-24080548)] 15:10:42,069 WARNING [org.hornetq.core.server.cluster.impl.ClusterManagerImpl]  Unable to announce backup

      HornetQException[errorCode=2 message=Cannot connect to server(s). Tried with all available servers.]

                at org.hornetq.core.client.impl.ServerLocatorImpl.createSessionFactory(ServerLocatorImpl.java:619)

                at org.hornetq.core.client.impl.ServerLocatorImpl.connect(ServerLocatorImpl.java:503)

                at org.hornetq.core.server.cluster.impl.ClusterManagerImpl$1.run(ClusterManagerImpl.java:860)

                at org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run(OrderedExecutorFactory.java:100)

                at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

                at java.lang.Thread.run(Thread.java:662)

      [main] 15:10:46,687 INFO [org.hornetq.core.server.impl.FileLockNodeManager]  Waiting to obtain live lock

      [main] 15:10:57,191 INFO [org.hornetq.core.server.impl.FileLockNodeManager]  Live Server Obtained live lock

      [main] 15:10:58,200 INFO [org.hornetq.core.server.impl.HornetQServerImpl]  trying to deploy queue jms.queue.monitoringHealthCheck

      [main] 15:10:58,206 INFO [org.hornetq.core.server.impl.HornetQServerImpl]  trying to deploy queue jms.topic.testing

      [main] 15:10:58,294 INFO [org.hornetq.core.remoting.impl.netty.NettyAcceptor]  Started Netty Acceptor version 3.2.3.Final-r${buildNumber} 0.0.0.0:61613 for STOMP protocol

      [main] 15:10:58,295 INFO [org.hornetq.core.remoting.impl.netty.NettyAcceptor]  Started Netty Acceptor version 3.2.3.Final-r${buildNumber} 0.0.0.0:5445 for CORE protocol

      [main] 15:10:58,304 INFO [org.hornetq.core.server.impl.HornetQServerImpl]  Server is now live

      [main] 15:10:58,304 INFO [org.hornetq.core.server.impl.HornetQServerImpl]  HornetQ Server version 2.2.5.Final (HQ_2_2_5_FINAL_AS7, 121) [60dcd3f9-728f-11e1-bb31-f04da2a29853] started

      [hornetq-discovery-group-thread-dg-group1] 15:10:59,299 WARNING [org.hornetq.core.cluster.impl.DiscoveryGroupImpl]  There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=60dcd3f9-728f-11e1-bb31-f04da2a29853

       

       

      HQ2 transfers live to HQ1 but fails to become backup:

      [Thread-0 (group:HornetQ-scheduled-threads-22745659)] 15:10:47,107 INFO [org.hornetq.core.server.impl.HornetQServerImpl]  live server wants to restart, restarting server in backup

      [Thread-2] 15:10:57,111 WARNING [org.hornetq.core.cluster.impl.DiscoveryGroupImpl]  Timed out waiting to stop discovery thread

      [Thread-2] 15:10:57,128 WARNING [org.hornetq.core.server.management.impl.ManagementServiceImpl]  On ManagementService stop, there are 18 unexpected registered MBeans: [jms.queue.autonomyIndexing, core.divert.reputation-divert, jms.server, jms.queue.monitoringHealthCheck, jms.queue.reputationDLQ, jms.queue.moderationDLQ, jms.queue.reputationQueue, jms.queue.moderationQueue, jms.queue.asynchronousIndexing_copy, core.divert.moderation-divert, jms.topic.junkDetection, jms.topic.testing, jms.queue.asynchronousIndexing, jms.topic.userEvents, jms.topic.tagsManagement, jms.connectionfactory.ConnectionFactory, jms.queue.results, jms.topic.notifications]

      [Thread-2] 15:10:57,184 INFO [org.hornetq.core.server.impl.HornetQServerImpl]  HornetQ Server version 2.2.5.Final (HQ_2_2_5_FINAL_AS7, 121) [60dcd3f9-728f-11e1-bb31-f04da2a29853] stopped

      [Thread-2] 15:10:57,185 INFO [org.hornetq.core.server.impl.HornetQServerImpl]  backup server is starting with configuration HornetQ Configuration (clustered=true,backup=true,sharedStore=true,journalDirectory=/home/sites/messaging-bus/shared/data/journal,bindingsDirectory=/home/sites/messaging-bus/shared/data/bindings,largeMessagesDirectory=/home/sites/messaging-bus/shared/data/large-messages,pagingDirectory=/home/sites/messaging-bus/shared/data/paging)

      [Thread-3] 15:10:57,185 INFO [org.hornetq.core.server.impl.FileLockNodeManager]  Waiting to become backup node

      [Thread-3] 15:10:57,185 INFO [org.hornetq.core.server.impl.FileLockNodeManager]  ** got backup lock

      [Thread-3] 15:10:57,186 INFO [org.hornetq.core.persistence.impl.journal.JournalStorageManager]  Using NIO Journal

      [Thread-3] 15:10:57,186 WARNING [org.hornetq.core.server.impl.HornetQServerImpl]  Security risk! It has been detected that the cluster admin user and password have not been changed from the installation default. Please see the HornetQ user guide, cluster chapter, for instructions on how to do this.

      [Thread-3] 15:10:57,266 INFO [org.hornetq.core.server.cluster.impl.ClusterManagerImpl]  announcing backup

      [Thread-3] 15:10:57,266 INFO [org.hornetq.core.server.impl.HornetQServerImpl]  HornetQ Backup Server version 2.2.5.Final (HQ_2_2_5_FINAL_AS7, 121) [60dcd3f9-728f-11e1-bb31-f04da2a29853] started, waiting live to fail before it gets active

      [Thread-0 (group:HornetQ-server-threads32319460-2862744)] 15:10:58,019 WARNING [org.hornetq.core.client.impl.ClientSessionFactoryImpl]  Tried 1 times to connect. Now giving up on reconnecting it.

      [Thread-0 (group:HornetQ-server-threads32319460-2862744)] 15:10:58,019 WARNING [org.hornetq.core.server.cluster.impl.ClusterManagerImpl]  Unable to announce backup

      HornetQException[errorCode=2 message=Cannot connect to server(s). Tried with all available servers.]

                at org.hornetq.core.client.impl.ServerLocatorImpl.createSessionFactory(ServerLocatorImpl.java:619)

                at org.hornetq.core.client.impl.ServerLocatorImpl.connect(ServerLocatorImpl.java:503)

                at org.hornetq.core.server.cluster.impl.ClusterManagerImpl$1.run(ClusterManagerImpl.java:860)

                at org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run(OrderedExecutorFactory.java:100)

                at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

                at java.lang.Thread.run(Thread.java:662)

       

      I tried to kill again HQ1, and HQ2 is really not backup and does not detect it should become live.

       

      Any idea why my second node fails to become backup again ?

       

      I am using hornetQ 2.2.5 final, with sun java 6 on linux debian 6 servers

      I attach my configuration files for both servers.

      Both host share a common directory, used to share the journal and the queues.

       

      Thanks in advance.

      Guillaume D.

        • 1. Re: HornetQ HA tests, fails to go back to normal state
          ataylor

          is the backup announcing itself properly on initial startup?

           

          also try using propert ip addreses rather than hessaging-02 and 0.0.0.0 to see if that works

          • 2. Re: HornetQ HA tests, fails to go back to normal state
            gdegremont

            Yes, on initial startup, the backup annonced itself correctly:

            [Thread-1] 18:06:12,431 INFO [org.hornetq.core.server.cluster.impl.ClusterManagerImpl]  announcing backup

            [Thread-1] 18:06:12,433 INFO [org.hornetq.core.server.impl.HornetQServerImpl]  HornetQ Backup Server version 2.2.5.Final (HQ_2_2_5_FINAL_AS7, 121) [60dcd3f9-728f-11e1-bb31-f04da2a29853] started, waiting live to fail before it gets active

            [Thread-0 (group:HornetQ-server-threads8032804-26760685)] 18:06:12,763 INFO [org.hornetq.core.server.cluster.impl.ClusterManagerImpl]  backup announced

             

            I will re-run the test with ip addresses instead of dns names