5 Replies Latest reply on Apr 6, 2012 10:35 AM by perltom

    Automatic fail-over is not working

    Newbie

      I have configured a symmetric cluster with two live nodes and two back-up nodes. The live and back-up servers share a file system. Sending and consuming messages in the cluster works fine, but fail-over in the cluster is not working: when a live server is shut down, the clients do not try to connect either to the back-up server that becomes live or to the second live node in the cluster.

       

      The attached files show the configuration of one of the live nodes. I have checked the documentation and the examples many times to confirm these settings but it looks like I might have missed one.

       

      I am using a standalone HornetQ Server, version 2.2.5.Final (HQ_2_2_5_FINAL_AS7, 121), on RHEL 5.7 with 64-bit Java 1.6.0_29.

       

      The client throws the following error when the live server is shut down and it is not able to recover:

       

      javax.jms.IllegalStateException: Producer is closed

              at org.hornetq.jms.client.HornetQMessageProducer.checkClosed(HornetQMessageProducer.java:463)

              at org.hornetq.jms.client.HornetQMessageProducer.send(HornetQMessageProducer.java:193)

       

       

      The clients use /ConnectionFactory to connect to the server.

       

      Thanks,

       

      Dimitar

        • 1. Re: Automatic fail-over is not working
          Andy Taylor Master

          Do you see the backup server announce itself, with something like "backup announced" in its log?

           

          Also, if they are on the same machine, make sure you have a loopback address configured.

          • 2. Re: Automatic fail-over is not working
            Newbie

            Hi Andy,

             

            Yes, to both questions:

            1. The live and the back-up servers are on the same host.
            2. The back-up server announces itself correctly.

             

            The following is the output of the log files.

             

            Live server's log

            ==================================

            [main] 09:34:47,557 INFO [org.hornetq.integration.bootstrap.HornetQBootstrapServer]  Starting HornetQ Server

            [main] 09:34:48,768 INFO [org.hornetq.core.server.impl.HornetQServerImpl]  live server is starting with configuration HornetQ Configuration (clustered=true,backup=false,sharedStore=true,journalDirectory=/jms02/journal,bindingsDirectory=/jms02/bindings,largeMessagesDirectory=/jms02/large-messages,pagingDirectory=/jms02/paging)

            [main] 09:34:48,769 INFO [org.hornetq.core.server.impl.HornetQServerImpl]  Waiting to obtain live lock

            [main] 09:34:48,804 INFO [org.hornetq.core.persistence.impl.journal.JournalStorageManager]  Using AIO Journal

            [main] 09:34:49,007 INFO [org.hornetq.core.server.impl.AIOFileLockNodeManager]  Waiting to obtain live lock

            [main] 09:34:49,008 INFO [org.hornetq.core.server.impl.AIOFileLockNodeManager]  Live Server Obtained live lock

            [main] 09:34:56,863 INFO [org.hornetq.core.server.impl.HornetQServerImpl]  trying to deploy queue jms.queue.DLQ

            [main] 09:34:56,887 INFO [org.hornetq.core.server.impl.HornetQServerImpl]  trying to deploy queue jms.queue.ExpiryQueue

            [main] 09:34:56,893 INFO [org.hornetq.core.server.impl.HornetQServerImpl]  trying to deploy queue jms.queue.ExampleQueue

            [main] 09:34:56,898 INFO [org.hornetq.core.server.impl.HornetQServerImpl]  trying to deploy queue jms.topic.CacheMQTopic

            [main] 09:34:56,907 INFO [org.hornetq.core.server.impl.HornetQServerImpl]  trying to deploy queue jms.topic.exampleTopic

            [main] 09:34:57,008 INFO [org.hornetq.core.remoting.impl.netty.NettyAcceptor]  Started Netty Acceptor version 3.2.3.Final-r${buildNumber} wpqajms02.wiley.com:10500 for CORE protocol

            [main] 09:34:57,011 INFO [org.hornetq.core.remoting.impl.netty.NettyAcceptor]  Started Netty Acceptor version 3.2.3.Final-r${buildNumber} wpqajms02.wiley.com:10400 for CORE protocol

            [main] 09:34:57,028 INFO [org.hornetq.core.server.impl.HornetQServerImpl]  Server is now live

            [main] 09:34:57,028 INFO [org.hornetq.core.server.impl.HornetQServerImpl]  HornetQ Server version 2.2.5.Final (HQ_2_2_5_FINAL_AS7, 121) [0fe2b036-55ed-11e1-be2b-001a64664c6a] started

            [Thread-29 (group:HornetQ-server-threads1173925084-435456241)] 09:34:57,076 INFO [org.hornetq.core.server.cluster.impl.BridgeImpl]  Connecting bridge sf.wileyplus-cluster.3f1be2a6-54f8-11e1-860b-0015172ce7cd to its destination [0fe2b036-55ed-11e1-be2b-001a64664c6a]

            [Thread-0 (group:HornetQ-server-threads1173925084-435456241)] 09:34:57,108 INFO [org.hornetq.core.server.cluster.impl.BridgeImpl]  Connecting bridge sf.wileyplus-cluster.4c7983db-54f8-11e1-885c-001a64664c6a to its destination [0fe2b036-55ed-11e1-be2b-001a64664c6a]

            [Thread-0 (group:HornetQ-server-threads1173925084-435456241)] 09:34:57,149 INFO [org.hornetq.core.server.cluster.impl.BridgeImpl]  Bridge sf.wileyplus-cluster.4c7983db-54f8-11e1-885c-001a64664c6a is connected [0fe2b036-55ed-11e1-be2b-001a64664c6a-> sf.wileyplus-cluster.4c7983db-54f8-11e1-885c-001a64664c6a]

            [Thread-29 (group:HornetQ-server-threads1173925084-435456241)] 09:34:57,157 INFO [org.hornetq.core.server.cluster.impl.BridgeImpl]  Bridge sf.wileyplus-cluster.3f1be2a6-54f8-11e1-860b-0015172ce7cd is connected [0fe2b036-55ed-11e1-be2b-001a64664c6a-> sf.wileyplus-cluster.3f1be2a6-54f8-11e1-860b-0015172ce7cd]

             

             

            Back-up server's log

            ======================================

            [main] 09:34:52,759 INFO [org.hornetq.integration.bootstrap.HornetQBootstrapServer]  Starting HornetQ Server

            [main] 09:34:53,837 INFO [org.hornetq.core.server.impl.HornetQServerImpl]  backup server is starting with configuration HornetQ Configuration (clustered=true,backup=true,sharedStore=true,journalDirectory=/jms02/journal,bindingsDirectory=/jms02/bindings,largeMessagesDirectory=/jms02/large-messages,pagingDirectory=/jms02/paging)

            [Thread-1] 09:34:53,840 INFO [org.hornetq.core.server.impl.AIOFileLockNodeManager]  Waiting to become backup node

            [Thread-1] 09:34:53,841 INFO [org.hornetq.core.server.impl.AIOFileLockNodeManager]  ** got backup lock

            [Thread-1] 09:34:53,871 INFO [org.hornetq.core.persistence.impl.journal.JournalStorageManager]  Using AIO Journal

            [Thread-1] 09:34:54,028 INFO [org.hornetq.core.server.cluster.impl.ClusterManagerImpl]  announcing backup

            [Thread-1] 09:34:54,029 INFO [org.hornetq.core.server.impl.HornetQServerImpl]  HornetQ Backup Server version 2.2.5.Final (HQ_2_2_5_FINAL_AS7, 121) [0fe2b036-55ed-11e1-be2b-001a64664c6a] started, waiting live to fail before it gets active

            [Thread-0 (group:HornetQ-server-threads452794384-1921072065)] 09:34:55,085 INFO [org.hornetq.core.server.cluster.impl.ClusterManagerImpl]  backup announced

             

            Live server goes down

            ====================================

            [hornetq-shutdown-thread] 09:39:05,551 INFO [org.hornetq.integration.bootstrap.HornetQBootstrapServer]  Stopping HornetQ Server...

             

            Back-up becomes live

            ====================================

            [Thread-1] 09:39:14,077 INFO [org.hornetq.core.server.impl.HornetQServerImpl]  trying to deploy queue jms.queue.DLQ

            [Thread-1] 09:39:14,099 INFO [org.hornetq.core.server.impl.HornetQServerImpl]  trying to deploy queue jms.queue.ExpiryQueue

            [Thread-1] 09:39:14,105 INFO [org.hornetq.core.server.impl.HornetQServerImpl]  trying to deploy queue jms.queue.ExampleQueue

            [Thread-1] 09:39:14,111 INFO [org.hornetq.core.server.impl.HornetQServerImpl]  trying to deploy queue jms.topic.CacheMQTopic

            [Thread-1] 09:39:14,174 INFO [org.hornetq.core.remoting.impl.netty.NettyAcceptor]  Started Netty Acceptor version 3.2.3.Final-r${buildNumber} wpqajms02.wiley.com:11400 for CORE protocol

            [Thread-1] 09:39:14,177 INFO [org.hornetq.core.remoting.impl.netty.NettyAcceptor]  Started Netty Acceptor version 3.2.3.Final-r${buildNumber} wpqajms02.wiley.com:11500 for CORE protocol

            [Thread-1] 09:39:14,238 INFO [org.hornetq.core.server.impl.HornetQServerImpl]  Backup Server is now live

            [Thread-8 (group:HornetQ-server-threads452794384-1921072065)] 09:39:14,448 INFO [org.hornetq.core.server.cluster.impl.BridgeImpl]  Connecting bridge sf.wileyplus-cluster.3f1be2a6-54f8-11e1-860b-0015172ce7cd to its destination [0fe2b036-55ed-11e1-be2b-001a64664c6a]

            [Thread-2 (group:HornetQ-server-threads452794384-1921072065)] 09:39:14,448 INFO [org.hornetq.core.server.cluster.impl.BridgeImpl]  Connecting bridge sf.wileyplus-cluster.4c7983db-54f8-11e1-885c-001a64664c6a to its destination [0fe2b036-55ed-11e1-be2b-001a64664c6a]

            [Thread-8 (group:HornetQ-server-threads452794384-1921072065)] 09:39:14,490 INFO [org.hornetq.core.server.cluster.impl.BridgeImpl]  Bridge sf.wileyplus-cluster.3f1be2a6-54f8-11e1-860b-0015172ce7cd is connected [0fe2b036-55ed-11e1-be2b-001a64664c6a-> sf.wileyplus-cluster.3f1be2a6-54f8-11e1-860b-0015172ce7cd]

             

            The following settings are used to set up both servers. They are read by the run.sh script.

             

            # Cluster settings

            # a1

            data_dir_a1=/jms01

            jnp_port_a1=10000

            jnp_rmiPort_a1=10100

            jmx_port_a1=10200

            hq_host_a1=myhost

            remoting_netty_port_a1=10400

            remoting_netty_batch_port_a1=10500

             

             

            # a2

            data_dir_a2=/jms01

            jnp_port_a2=11000

            jnp_rmiPort_a2=11100

            jmx_port_a2=11200

            hq_host_a2=myhost

            remoting_netty_port_a2=11400

            remoting_netty_batch_port_a2=11500

             

            Each node in the cluster consists of a live and a back-up server that both access a SAN shared file system, /jms01.

             

            I hope this helped.

             

            thanks

             

            Dimitar

            • 3. Re: Automatic fail-over is not working
              gerhard kopps Newbie

              I'm having a similar issue.

              I'm using a spring-jms based client, which fails to redirect to the backup server. Instead, it tries to create a new JNDI connection factory from the previous live node.

              I successfully ran the jms/non-transaction-failover example coming along with version 2.2.5-Final, which uses plain JMS, so I guess my problem is either spring-jms doesn't work properly or I'm not using it correctly.

               

              Are you using spring-jms?

               

              Cheers.

              • 4. Re: Automatic fail-over is not working
                Newbie

                The primary application that will use the cluster is based on EJB 2.1, but it uses the HornetQ JMS API to connect to the server. At the moment, this application can neither re-connect to the server nor fail over to another node in the cluster. However, the next version of this application (still in beta) is based on Spring, and it looks like the developers working on it were successful in implementing all HA features: re-attach, re-connect and fail-over. I plan to verify this in the next few days and post my findings in this thread.

                 

                I should also mention that a Java test client was developed to help us test the performance of the HornetQ stand-alone server and cluster. It cannot fail over either. The client was developed in Java 6, uses the HornetQ JMS API, and is intended to run in Grinder-style tests.

                • 5. Re: Automatic fail-over is not working
                  Newbie

                  While working on this issue I realized that its description needs to be broken into two parts:

                  1. Client session re-connection
                  2. Client fail-over

                   

                  #1 was addressed by adding the following setting to the connection factory (connection-factory element) in the hornetq-jms.xml configuration file:

                    <confirmation-window-size>1048676</confirmation-window-size>

                   

                  This enables buffering of messages on the client side and sets the buffer size to roughly 1 MB (the value is in bytes). The default value is -1, which disables the buffering.

                  Tuning the connection-ttl and client-failure-check-period settings also made sure that connections are not dropped by either the HornetQ server or the clients during peak days.
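
                  For illustration, a connection-factory entry that carries all three settings might look like the sketch below. The connector name "netty" is a placeholder, and the connection-ttl and client-failure-check-period values shown are the documented defaults in milliseconds, not the tuned values used in this cluster. The element order may need adjusting to match the hornetq-jms.xml schema.

                  ```xml
                  <!-- hornetq-jms.xml: sketch only; connector name and both timeout values are assumptions -->
                  <connection-factory name="ConnectionFactory">
                     <connectors>
                        <!-- "netty" is a placeholder; reference the connector defined in hornetq-configuration.xml -->
                        <connector-ref connector-name="netty"/>
                     </connectors>
                     <entries>
                        <entry name="/ConnectionFactory"/>
                     </entries>

                     <!-- How long (ms) the client waits without data from the server before treating the connection as failed -->
                     <client-failure-check-period>30000</client-failure-check-period>

                     <!-- How long (ms) the server keeps a connection alive without receiving data from the client -->
                     <connection-ttl>60000</connection-ttl>

                     <!-- Client-side buffer (bytes) that allows a session to re-attach; -1 disables it -->
                     <confirmation-window-size>1048676</confirmation-window-size>
                  </connection-factory>
                  ```

                  The client-failure-check-period is normally kept well below connection-ttl, so that clients ping often enough for the server not to time the connection out.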

                   

                  I still need to test #2 with these changes, but at this point we are addressing the client fail-over requirements by implementing listeners on the client side that notify the client when a connection fails. According to the documentation, client fail-over should work out of the box, but at this point I am not sure this is the case. I guess this feature needs further research and a review of the requirements.
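
                  For comparison, the fail-over examples that ship with HornetQ 2.2.x (such as jms/non-transaction-failover, mentioned earlier in this thread) enable fail-over on the connection factory itself. A minimal sketch of that style of configuration follows; the connector name "netty" is a placeholder, the retry values are illustrative, and none of this has been verified against the cluster described here.

                  ```xml
                  <!-- hornetq-jms.xml: sketch of client-side fail-over settings; values are illustrative -->
                  <connection-factory name="ConnectionFactory">
                     <connectors>
                        <!-- "netty" is a placeholder connector name -->
                        <connector-ref connector-name="netty"/>
                     </connectors>
                     <entries>
                        <entry name="/ConnectionFactory"/>
                     </entries>

                     <!-- Ask the client to track live/back-up pairs so that it can fail over -->
                     <ha>true</ha>

                     <!-- Pause 1 second between reconnection attempts -->
                     <retry-interval>1000</retry-interval>

                     <!-- Keep the pause constant (a value above 1.0 would give a back-off) -->
                     <retry-interval-multiplier>1.0</retry-interval-multiplier>

                     <!-- -1 retries indefinitely; the default of 0 means no reconnection and no fail-over -->
                     <reconnect-attempts>-1</reconnect-attempts>
                  </connection-factory>
                  ```

                  If reconnect-attempts is left at its default, a client would be expected to give up as soon as the live server goes away, which would be consistent with the "Producer is closed" error reported at the start of this thread.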

                   

                  In any case, HornetQ works really well in a production environment. At about 4K msgs/min, the CPU and memory utilization of the HornetQ process is really low.

                   

                  More updates to come...