11 Replies Latest reply on Mar 27, 2008 2:17 PM by jssacristan

    How to configure a cluster with fault tolerance

    jssacristan

      Hi all,

       

      I'm trying to configure a ServiceMix cluster with fault tolerance, failover recovery, high availability and message persistence. I tried both ServiceMix and FUSE ESB, but I can't get it to work.

       

      I've built a scenario that consists of:

      • A web service which takes a name and returns the string "Hello <name>"

      • A service assembly deployed in ServiceMix with an HTTP BC with two endpoints (one provider and one consumer) and a JSR181 SE that inserts a 10-second wait before returning the external web service's response to the client.

      • A synchronous Axis client which connects to the cluster.

       

       

      My goal is to shut down one node of the cluster while it is processing a request (during the inserted wait) and have the other node continue executing that request. Is that possible?

       

      I'm using Master/Slave mode in ActiveMQ. The documentation says this is the mode that provides the high availability I need for my proof of concept.

       

      I don't know whether I'm doing something wrong or whether this is simply not viable, so I need some help configuring a cluster with these features.

       

       

      Regards,

       

      Jorge

       

      Edited by: jssacristan on Mar 14, 2008 8:29 AM

        • 1. Re: How to configure a cluster with fault tolerance
          martinmurphy

          Hi Jorge,

           

          What version of Fuse ESB are you using? Have you had a look at the examples\cluster demo? You can use JCAFlow to provide clustered transactional persistence.

           

          - Martin

          • 2. Re: How to configure a cluster with fault tolerance
            jssacristan

            Thank you Martin, I'm using Fuse ESB 3.3.0.8.

             

            The cluster demo works OK, but I don't think it is the topology I need.

             

            I'm using JMSFlow. What is the difference? It should work with both JMS and JCA... Sorry, I'm still a newbie with this technology.

             

             

            Jorge

            • 3. Re: How to configure a cluster with fault tolerance
              martinmurphy

              Sorry it took a while to get back. Basically, JCAFlow supports transactions, so a message won't be removed from the queue until the flow has completed. With JMSFlow there is still a danger that the message could be lost if the broker dies while the message is being processed in a component.

              • 4. Re: How to configure a cluster with fault tolerance
                bsnyder

                 

                My goal is to shut down one node of the cluster while it is processing a request (during the inserted wait) and have the other node continue executing that request. Is that possible?

                 

                 

                To achieve fault tolerance and high availability (which is essentially what you are describing) you will need to configure failover and message persistence on the ActiveMQ broker (as I see you've already done) since ActiveMQ is used by the NMR to communicate with the JBI components.
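
                On the client side of the broker connection, ActiveMQ failover is usually expressed with the failover: transport URI, which lists the brokers to try. The snippet below is only a minimal sketch of how such a URI is assembled; the class name, the method name and the host names master-host and slave-host are placeholders that do not come from this thread, and whether your flow configuration accepts the resulting URI is an assumption you should check against your ServiceMix version.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch: builds an ActiveMQ failover transport URI from a list of broker URIs. */
final class FailoverUriSketch {

    /**
     * Joins the broker URIs into "failover:(uri1,uri2)?randomize=false".
     * randomize=false asks the client to try the brokers in the listed order,
     * so the master is attempted before the slave.
     */
    static String failoverUri(List<String> brokerUris) {
        if (brokerUris.isEmpty()) {
            throw new IllegalArgumentException("at least one broker URI is required");
        }
        return "failover:(" + String.join(",", brokerUris) + ")?randomize=false";
    }

    public static void main(String[] args) {
        // master-host and slave-host are hypothetical names; 61616 is the openwire port.
        System.out.println(failoverUri(Arrays.asList(
                "tcp://master-host:61616",
                "tcp://slave-host:61616")));
    }
}
```

                Listing the master first with randomize=false fits a pure master/slave pair, where only one broker accepts client connections at a time.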

                 

                You say that you want to shut down one node in the cluster while processing is in flight, allowing another node to continue the processing. For this to happen, you will also need to:

                 

                1) Have more than one ServiceMix instance running

                2) Have the ServiceMix instances networked via the ActiveMQ configuration

                3) Deploy the same JSR-181 service to more than one ServiceMix instance

                 

                This way there will be more than one instance of the service running so that if one deployment of the service becomes unreachable, the NMR can route the message to another deployment of the service. Other items that will need to be done include:

                 

                1) Set persistent=true on the container element in the conf/servicemix.xml file. This has the effect of telling the ActiveMQ broker to persist messages as they flow through the NMR so that messages are not lost in the event of a failure.

                2) Set the MessageExchange.JTA_TRANSACTION_PROPERTY_NAME property on the message exchange. This can only be achieved via Java code and the best place for it is at the earliest point where the message exchange is created, i.e., in a marshaler on the servicemix-http component. This property affects the quality of service and therefore the flow that the NMR chooses to handle the message exchange.

                 

                The best thing to do is start creating it all and come here with your questions and we'll help you as much as we can.

                 

                Bruce

                • 5. Re: How to configure a cluster with fault tolerance
                  jssacristan

                  Thank you Bruce and Martin

                   

                  Well, in fact Master/Slave doesn't work. I run two instances of FUSE Message Broker 5.0.0.9, one as master and one as slave (see below for configurations). Sometimes I get this error in the slave:

                   

                  ERROR Service                        - Async error occurred: java.lang.IllegalStateException: Cannot remove session that had not been registered: ID:vmplatina1-36131-1206320334896-2:1:-1

                   

                  And whenever I shut down the master I get this error in the slave:

                   

                  ERROR MasterConnector                - Network connection between vm://broker2#0 and tcp:///192.168.205.141:61616 shutdown: null

                  java.io.EOFException

                  WARN  BrokerService                  - Master Failed - starting all connectors

                  ERROR BrokerService                  - Failed to startAllConnectors

                  INFO  TransportConnector             - Connector vm://broker2 Stopped

                   

                  On the other hand, Bruce said:

                   

                  This way there will be more than one instance of the service running so that if one deployment of the service becomes unreachable, the NMR can route the message to another deployment of the service. Other items that will need to be done include:

                  But if I shut down both ServiceMix and ActiveMQ, how can the other node return the response to a client that is waiting on the service that is down?

                   

                  -


                  ActiveMQ master:

                   

                  <beans
                    xmlns="http://www.springframework.org/schema/beans"
                    xmlns:amq="http://activemq.org/config/1.0"
                    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                    xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-2.0.xsd
                    http://activemq.org/config/1.0 http://activemq.apache.org/schema/activemq-core.xsd
                    http://activemq.apache.org/camel/schema/spring http://activemq.apache.org/camel/schema/spring/camel-spring.xsd">

                    <!-- Allows us to use system properties as variables in this configuration file -->
                    <bean class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer"/>

                    <broker xmlns="http://activemq.org/config/1.0" brokerName="broker1" dataDirectory="${activemq.base}/data" persistent="true">

                      <!-- Destination specific policies using destination names or wildcards -->
                      <destinationPolicy>
                        <policyMap>
                          <policyEntries>
                            <policyEntry topic="FOO.>" producerFlowControl="false" memoryLimit="1mb">
                              <dispatchPolicy>
                                <strictOrderDispatchPolicy/>
                              </dispatchPolicy>
                              <subscriptionRecoveryPolicy>
                                <lastImageSubscriptionRecoveryPolicy/>
                              </subscriptionRecoveryPolicy>
                            </policyEntry>
                          </policyEntries>
                        </policyMap>
                      </destinationPolicy>

                      <!-- The transport connectors ActiveMQ will listen to -->
                      <transportConnectors>
                        <transportConnector name="openwire" uri="tcp://localhost:61616" discoveryUri="multicast://default"/>
                        <transportConnector name="ssl"     uri="ssl://localhost:61617"/>
                        <transportConnector name="stomp"   uri="stomp://localhost:61613"/>
                        <transportConnector name="xmpp"    uri="xmpp://localhost:61222"/>
                      </transportConnectors>

                      <!-- The store and forward broker networks ActiveMQ will listen to -->
                      <networkConnectors>
                      </networkConnectors>

                    </broker>

                  </beans>
                  

                  -


                  ActiveMQ slave:

                   

                  <beans
                    xmlns="http://www.springframework.org/schema/beans"
                    xmlns:amq="http://activemq.org/config/1.0"
                    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                    xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-2.0.xsd
                    http://activemq.org/config/1.0 http://activemq.apache.org/schema/activemq-core.xsd
                    http://activemq.apache.org/camel/schema/spring http://activemq.apache.org/camel/schema/spring/camel-spring.xsd">

                    <!-- Allows us to use system properties as variables in this configuration file -->
                    <bean class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer"/>

                    <broker xmlns="http://activemq.org/config/1.0" brokerName="broker2" dataDirectory="${activemq.base}/data" masterConnectorURI="tcp://192.168.205.141:61616" shutdownOnMasterFailure="false" persistent="true">

                      <!-- Destination specific policies using destination names or wildcards -->
                      <destinationPolicy>
                        <policyMap>
                          <policyEntries>
                            <policyEntry topic="FOO.>" producerFlowControl="false" memoryLimit="1mb">
                              <dispatchPolicy>
                                <strictOrderDispatchPolicy/>
                              </dispatchPolicy>
                              <subscriptionRecoveryPolicy>
                                <lastImageSubscriptionRecoveryPolicy/>
                              </subscriptionRecoveryPolicy>
                            </policyEntry>
                          </policyEntries>
                        </policyMap>
                      </destinationPolicy>

                      <!-- The transport connectors ActiveMQ will listen to -->
                      <transportConnectors>
                        <transportConnector name="openwire" uri="tcp://localhost:61616" discoveryUri="multicast://default"/>
                        <transportConnector name="ssl"     uri="ssl://localhost:61617"/>
                        <transportConnector name="stomp"   uri="stomp://localhost:61613"/>
                        <transportConnector name="xmpp"    uri="xmpp://localhost:61222"/>
                      </transportConnectors>

                      <!-- The store and forward broker networks ActiveMQ will listen to -->
                      <networkConnectors>
                      </networkConnectors>

                    </broker>

                  </beans>

                   

                  • 6. Re: How to configure a cluster with fault tolerance
                    bsnyder

                     

                    Well, in fact Master/Slave doesn't work. I run two instances of FUSE Message Broker 5.0.0.9, one as master and one as slave (see below for configurations). Sometimes I get this error in the slave:

                     

                    ERROR Service - Async error occurred: java.lang.IllegalStateException: Cannot remove session that had not been registered: ID:vmplatina1-36131-1206320334896-2:1:-1

                     

                     

                    The error above appears to be a known issue, currently tracked as AMQ-1464.

                     

                     

                    And whenever I shutdown master I get this error in the slave:

                     

                    ERROR MasterConnector - Network connection between vm://broker2#0 and tcp:///192.168.205.141:61616 shutdown: null

                    java.io.EOFException

                    WARN BrokerService - Master Failed - starting all connectors

                    ERROR BrokerService - Failed to startAllConnectors

                    INFO TransportConnector - Connector vm://broker2 Stopped

                     

                     

                    I'm not sure exactly why the slave is unable to start its connectors. Could the master and slave brokers be out of sync? Take a look at the following steps for manually synchronizing a master and slave:

                     

                    http://activemq.apache.org/masterslave.html#MasterSlave-RecoveryingaMasterSlavetopology

                     

                    Bruce

                    • 7. Re: How to configure a cluster with fault tolerance
                      jssacristan

                      I followed the steps to resync the brokers. This is what I ran:

                       

                      vmplatina1:/opt/iona/fuse-message-broker-5.0.0.9# scp -r root@vmplatina2:/opt/iona/fuse-message-broker-5.0.0.9/data .

                       

                      It copies the data directory from the slave to the master.

                       

                      I've also tried deleting both data directories, but that doesn't work either.

                       

                      If I delete the data directories, I get this error in the slave, but no errors in the master, when I shut down the master:

                       

                      ERROR MasterConnector                - Network connection between vm://broker2#0 and tcp:///192.168.205.141:61616 shutdown: null

                      java.io.EOFException

                      at java.io.DataInputStream.readInt(DataInputStream.java:375)

                      ... ...

                      ERROR BrokerService                  - Failed to startAllConnectors

                      INFO  TransportConnector             - Connector vm://broker2 Stopped

                       

                      If I resync, I get this when I shut down the master:

                       

                      //MASTER//

                       

                      INFO  BrokerService                  - ActiveMQ Message Broker (broker1, ID:vmplatina1-45019-1206400757923-0:0) is shutting down

                      WARN  ActiveMQConnection             - Async exception with no exception listener: java.io.EOFException

                      java.io.EOFException

                      at java.io.DataInputStream.readInt(DataInputStream.java:375)

                      ... ...

                      ERROR efaultMessageListenerContainer - Setup of JMS message listener invoker failed - trying to recover

                      javax.jms.IllegalStateException: The Consumer is closed

                      at org.apache.activemq.ActiveMQMessageConsumer.checkClosed(ActiveMQMessageConsumer.java:681)

                      ... ...

                      INFO  TransportConnector             - Connector openwire Stopped

                      INFO  TransportConnector             - Connector ssl Stopped

                      INFO  TransportConnector             - Connector stomp Stopped

                      INFO  TransportConnector             - Connector xmpp Stopped

                      INFO  BrokerService                  - ActiveMQ JMS Message Broker (broker1, ID:vmplatina1-45019-1206400757923-0:0) stopped

                       

                       

                      //SLAVE//

                       

                      ERROR Service                        - Async error occurred: java.lang.IllegalStateException: Cannot remove session that had not been registered: ID:vmplatina1-45019-1206400757923-2:1:-1

                      java.lang.IllegalStateException: Cannot remove session that had not been registered: ID:vmplatina1-45019-1206400757923-2:1:-1

                      at org.apache.activemq.broker.TransportConnection.processRemoveSession(TransportConnection.java:576)

                      ... ...

                      ERROR Service                        - Async error occurred: java.lang.IllegalStateException: Cannot remove session that had not been registered: ID:vmplatina1-45019-1206400757923-2:1:1

                      java.lang.IllegalStateException: Cannot remove session that had not been registered: ID:vmplatina1-45019-1206400757923-2:1:1

                      at org.apache.activemq.broker.TransportConnection.processRemoveSession(TransportConnection.java:576)

                      WARN  MasterConnector                - The Master has shutdown

                      WARN  BrokerService                  - Master Failed - starting all connectors

                      ERROR BrokerService                  - Failed to startAllConnectors

                      INFO  TransportConnector             - Connector vm://broker2 Stopped

                      • 8. Re: How to configure a cluster with fault tolerance
                        jssacristan

                        I tried the ActiveMQ-5.1-SNAPSHOT build from yesterday (Mar 24) and didn't get the first error I mentioned (Cannot remove session that had not been registered).

                         

                        I tried to send requests to ServiceMix, but sometimes I get this error:

                         

                        ERROR MasterBroker                   - Slave Failed

                        javax.jms.JMSException: Slave broker out of sync with master: Acknowledgment (MessageAck {commandId = 234, responseRequired = true, ackType = 2, consumerId = ID:vmplatina1-33180-1206410706631-0:0:20:1, firstMessageId = null, lastMessageId = ID:vmplatina1-33180-1206410706631-0:0:28:1:2, destination = queue://org.apache.servicemix.jms.{http://ejemplos.ws.lawebsemantica.com}SaludoService:Saludo, transactionId = null, messageCount = 1}) was not in the dispatch list: []

                        at org.apache.activemq.broker.region.PrefetchSubscription.acknowledge(PrefetchSubscription.java:344)

                        ... ...

                         

                         

                        Moreover, the slave broker still fails when I shut down the master.

                        • 9. Re: How to configure a cluster with fault tolerance
                          jssacristan

                          Does anybody have an idea, please?

                           

                          I don't know whether I am synchronizing the brokers correctly. I use this:

                           

                          vmplatina1:/opt/iona/fuse-message-broker-5.0.0.9# scp -r root@vmplatina2:/opt/iona/fuse-message-broker-5.0.0.9/data .

                           

                          It copies the data directory from the slave to the master. I've also tried deleting both data directories, but that doesn't work either.

                          • 10. Re: How to configure a cluster with fault tolerance
                            davestanley

                            Hi Jorge,

                             

                            It sounds like you need a hot standby configuration (to give you HA).

                             

                            You can do this in a few different ways, but start with the simplest case and set up both processes running on a single node. You would have esb1_host1 (master) and esb2_host1 (slave). If esb1 goes down, esb2 will come up and listen on the same hostname and port, so the failover will be transparent to the client.

                             

                            In order to achieve this, you will need to have both ESB instances pointing to the same data directory. You will need to change the JMX ports in the servicemix.properties file for each instance, but other than that they should have exactly the same config and can use exactly the same install. Enable the amq:persistenceAdapter in activemq.xml as follows:

                             

                             <amq:persistenceAdapter>
                                  <amq:journaledJDBC journalLogFiles="5" dataDirectory="./data/amq"></amq:journaledJDBC>
                             </amq:persistenceAdapter>
                            

                             

                            The first process started will be the master. When you start the second instance, the underlying ActiveMQ broker will try to lock its persistent store. As the master already holds the lock, the slave will go into standby mode and wait for the lock to be released. In standby mode, it will not listen on any ports until it can acquire the lock, so effectively it's just standing by, waiting on the master.

                             

                            If you control-C the master, you should see the slave take over transparently and come up listening on its configured ports (which will be the exact same as the master).
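
                            The takeover described above comes down to two processes competing for a single exclusive lock. The following is a toy in-memory model of that behaviour, not the broker's actual locking code (the real lock lives in the persistent store); the class and method names are made up for illustration.

```java
/** Toy model: two ESB instances competing for one exclusive lock on a shared store. */
final class SharedStoreLockSketch {

    // Name of the instance currently holding the lock, or null when the lock is free.
    private String owner;

    /** Returns true if the caller got the lock and may start its connectors as master. */
    synchronized boolean tryAcquire(String instance) {
        if (owner != null) {
            return false; // someone else is master; the caller stays in standby
        }
        owner = instance;
        return true;
    }

    /** Releases the lock only if the caller holds it (models the master being stopped). */
    synchronized boolean release(String instance) {
        if (!instance.equals(owner)) {
            return false;
        }
        owner = null;
        return true;
    }

    /** Name of the current master, or null if nobody holds the lock. */
    synchronized String master() {
        return owner;
    }

    public static void main(String[] args) {
        SharedStoreLockSketch store = new SharedStoreLockSketch();
        System.out.println("esb1 becomes master: " + store.tryAcquire("esb1"));
        System.out.println("esb2 becomes master: " + store.tryAcquire("esb2"));
        store.release("esb1"); // control-C on the master
        System.out.println("esb2 becomes master: " + store.tryAcquire("esb2"));
    }
}
```

                            In the real setup the standby keeps retrying the lock on its own, which is why no client-visible port is open until the takeover has happened.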

                             

                            Having both processes run on a single node is the simplest scenario. If you want the standby on a separate node, you can - but make sure they both use the same DB (using NFS or whatever).

                             

                            Up to this point, we have just discussed HA and failover. It's also possible to set up a cluster of ESB containers where you have more than one live listening process. In order to do this you need to establish NetworkConnectors between the live ESB instances so they are aware of each other. So theoretically an HA/clustered configuration can look something like this:

                             

                            esb1_host1(master) and esb2_host1(slave)

                            esb1_host2(master) and esb2_host2(slave)

                             

                            Where esb1_host1 and esb1_host2 are servicing requests but esb2_host1 & esb2_host2 are just waiting to take over.

                             

                            The second part of your question seems to be about achieving reliability for the HTTP BC. As you are using HTTP, when the master goes down you are going to lose your connection to the server, so you need to be able to handle this error condition gracefully in your Axis client, i.e. expect failures. The best you can do is to retry the request.
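
                            A client-side retry can be as small as the sketch below. It is not Axis-specific: the Callable stands in for whatever stub call your client makes, the class and method names are invented for this example, and the attempt count and pause are arbitrary values. It retries only on IOException, on the assumption that connection failures surface as that type or a subclass of it.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Sketch: retries a remote call a fixed number of times, pausing between attempts. */
final class RetrySketch {

    static <T> T callWithRetry(Callable<T> call, int maxAttempts, long pauseMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (IOException e) {
                // Connection-level failure: remember it and try again after a pause.
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMillis);
                }
            }
        }
        throw last; // every attempt failed
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Stand-in for the web service call: fails twice, as if the master had just gone down.
        String reply = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("connection lost");
            }
            return "Hello Jorge";
        }, 5, 100);
        System.out.println(reply + " (attempts: " + calls[0] + ")");
    }
}
```

                            Keep in mind that a retried request is executed again from the start on the surviving node, so this only suits operations that are safe to repeat.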

                             

                            If it's possible to use JMS rather than HTTP, this will give you a more robust solution, as the JMS consumer can persist the message so that the slave can recover it when it comes up.

                             

                            Another alternative may be to use the CXF BC instead of the HTTP BC/JSR181, with CXF on the client side and WS-RM enabled. This might give you the reliability you're looking for by handling the retries under the covers for you.

                             

                            Hope this helps,

                            /Dave

                            • 11. Re: How to configure a cluster with fault tolerance
                              jssacristan

                              Thank you very much, Dave! Your answer is very useful and interesting to me; now I know how to move forward. As soon as I can, I'll try everything you have suggested and post the results in the forum.

                               

                              Regards,

                               

                              Jorge