10 Replies Latest reply on Jul 7, 2009 11:25 AM by gaohoward

    Clustered FailOver doesn't seem to work correctly with Bridge

    artp

      I have a cluster, which I'll call Cluster A, with a Bridge that moves messages from a Cluster A queue to a queue on another cluster (Cluster B). Cluster B contains two nodes (B0, B1). Initially, when I start all the nodes, all is well: messages are moved from Cluster A to one node (e.g. B0) in Cluster B.

      When I take down B0, I get failover messages on B1 saying that failover is happening and that it has completed. Naturally, I'd expect further messages from Cluster A to be processed by B1. Sometimes failover to B1 works, but most of the time it doesn't. When it doesn't, I see the MessageCount and DeliveryCount grow as I send more messages from Cluster A. The Bridge appears to be doing its job, since I can see the message count in the JMX console. Failover appears to cause something to go wrong once messages get to B1. From that point on, messages are stuck on B1.

      To get the messages out of the stuck state, I have to bring nodes up and down (e.g. bring B0 up, then take down B1). Eventually failover kicks in and one node processes the messages. Once that happens, messages are processed.

      Is there something I'm missing?

      Environment:
      JBoss 5.1, JBM 1.4.4 built from source. I modified JMSRemoteConnection to compile against Remoting 2.5.1 so that we could have the Bridge fixes, etc.

        • 1. Re: Clustered FailOver doesn't seem to work correctly with Bri
          gaohoward

          Hi,

          Can you please post your bridge service configuration here?

          Thanks
          Howard

          • 2. Re: Clustered FailOver doesn't seem to work correctly with Bri
            artp

            I forgot to mention that we have the bridge running as an HASingleton on Cluster A. The JMSProvider has two nodes (i.e. Cluster B nodes B0 and B1) configured in the provider URL.

            <mbean code="org.jboss.jms.server.bridge.BridgeService"
             name="jboss.messaging:service=Bridge,name=OTSPackageBridge"
             xmbean-dd="xmdesc/Bridge-xmbean.xml">
            
             <!-- The JMS provider loader that is used to lookup the source destination -->
             <depends optional-attribute-name="SourceProviderLoader">
             jboss.messaging:service=JMSProviderLoader,name=HAJNDIJMSProvider</depends>
            
             <!-- The JMS provider loader that is used to lookup the target destination -->
             <depends optional-attribute-name="TargetProviderLoader">
             jboss.messaging:service=JMSProviderLoader,name=OTS-HAJNDIJMSProvider</depends>
            
             <!-- The JNDI lookup for the source destination -->
             <attribute name="SourceDestinationLookup">com.company.wop.ots.event.queue</attribute>
            
             <!-- The JNDI lookup for the target destination -->
             <attribute name="TargetDestinationLookup">com.company.wop.ots.event.queue</attribute>
             <attribute name="QualityOfServiceMode">0</attribute>
            
             <!-- The maximum number of messages to consume from the source
             before sending to the target -->
             <attribute name="MaxBatchSize">5</attribute>
            
             <!-- The maximum time to wait (in ms) before sending a batch to the target
             even if MaxBatchSize is not exceeded.
             -1 means wait forever -->
             <attribute name="MaxBatchTime">5000</attribute>
             <!-- The number of ms to wait between connection retries in the event connections
             to source or target fail -->
             <attribute name="FailureRetryInterval">5000</attribute>
            
             <!-- The maximum number of connection retries to make in case of failure,
             before giving up. -1 means try forever -->
             <attribute name="MaxRetries">-1</attribute>
            
             <!-- If true then the message id of the message before bridging will be added
             as a header to the message so it is available to the receiver. Can then be
             sent as correlation id to correlate in a distributed request-response -->
             <attribute name="AddMessageIDInHeader">false</attribute>
            
            </mbean>
            


            Here is a snippet of our hajndi-jms-ds.xml

             <!-- The JMS provider loader -->
             <mbean code="org.jboss.jms.jndi.JMSProviderLoader"
             name="jboss.messaging:service=JMSProviderLoader,name=HAJNDIJMSProvider">
             <attribute name="ProviderName">DefaultJMSProvider</attribute>
             <attribute name="ProviderAdapterClass">
             org.jboss.jms.jndi.JNDIProviderAdapter
             </attribute>
             <!-- The combined connection factory -->
             <attribute name="FactoryRef">XAConnectionFactory</attribute>
             <!-- The queue connection factory -->
             <attribute name="QueueFactoryRef">XAConnectionFactory</attribute>
             <!-- The topic factory -->
             <attribute name="TopicFactoryRef">XAConnectionFactory</attribute>
             <!-- Access JMS via HAJNDI -->
             <attribute name="Properties">
             java.naming.factory.initial=org.jnp.interfaces.NamingContextFactory
             java.naming.factory.url.pkgs=org.jboss.naming:org.jnp.interfaces
             java.naming.provider.url=${jboss.bind.address:localhost}:1100
             jnp.disableDiscovery=false
             jnp.partitionName=${jboss.partition.name:DefaultPartition}
             jnp.discoveryGroup=${jboss.partition.udpGroup:230.0.0.4}
             jnp.discoveryPort=1102
             jnp.discoveryTTL=16
             jnp.discoveryTimeout=5000
             jnp.maxRetries=1
             </attribute>
             </mbean>
            

            Note: We changed the connection provider to ConnectionFactory, since ClusteredConnectionFactory didn't work.

            <mbean code="org.jboss.jms.jndi.JMSProviderLoader"
             name="jboss.messaging:service=JMSProviderLoader,name=OTS-HAJNDIJMSProvider">
             <attribute name="ProviderName">OTSJMSProvider</attribute>
             <attribute name="ProviderAdapterClass">
             org.jboss.jms.jndi.JNDIProviderAdapter
             </attribute>
             <!-- The combined connection factory -->
             <attribute name="FactoryRef">ConnectionFactory</attribute>
             <!-- The queue connection factory -->
             <attribute name="QueueFactoryRef">ConnectionFactory</attribute>
             <!-- The topic factory -->
             <attribute name="TopicFactoryRef">ConnectionFactory</attribute>
             <!-- Access JMS via HAJNDI -->
             <attribute name="Properties">
             java.naming.factory.initial=org.jnp.interfaces.NamingContextFactory
             java.naming.factory.url.pkgs=org.jboss.naming:org.jnp.interfaces
             java.naming.provider.url=dev-ots01.test.company.com:1100,dev-ots02.test.company.com:1100
             jnp.disableDiscovery=false
             jnp.partitionName=OTS-dev
             jnp.discoveryGroup=228.1.2.21
             jnp.discoveryPort=1102
             jnp.discoveryTTL=16
             jnp.discoveryTimeout=5000
             jnp.maxRetries=1
             </attribute>
             </mbean>
            


            • 3. Re: Clustered FailOver doesn't seem to work correctly with Bri
              artp

              We're using PostgreSQL to persist messages. In postgresql-persistence-service.xml, I configured the PostOffice with:

               <attribute name="Clustered">true</attribute>
               <attribute name="FailoverOnNodeLeave">true</attribute>
              


              Is this correct?


              • 4. Re: Clustered FailOver doesn't seem to work correctly with Bri
                clebert.suconic

                 


                jboss 5.1, JBM 1.4.4 src, I modified JMSRemoteConnection to compile against Remoting 2.5.1, so we could have the Bridge fixes, etc.



                I believe some of the changes Howard made on trunk will also require some changes on Remoting 2.5.X, but I don't think the required changes have been made yet.

                Howard: do you have more details about this?


                artp: as you are running from source, did you run the complete testsuite? You may have bugs in your code that we haven't tested yet (the pending changes in Remoting may be an issue for you).

                • 5. Re: Clustered FailOver doesn't seem to work correctly with Bri
                  clebert.suconic

                   

                  I believe some of the changes Howard made on trunk


                  Typo, old habit. I meant Branch_1_4.

                  • 6. Re: Clustered FailOver doesn't seem to work correctly with Bri
                    timfox

                    If you want a client in JBM 1.x to fail over automatically from one node to another, you need to set SupportsFailover to true in the descriptor where the connection factory is deployed.
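
                    For illustration, a minimal sketch of such a descriptor entry, modeled on the layout of the stock JBM 1.4 connection-factories-service.xml. The mbean name FailoverConnectionFactory and its JNDI binding are hypothetical, and the Connector and PostOffice object names are assumed to match a default install; check them against your own deployment before using this.

                    ```xml
                    <!-- Hypothetical descriptor: the mbean name and JNDI binding are made up for illustration -->
                    <mbean code="org.jboss.jms.server.connectionfactory.ConnectionFactory"
                       name="jboss.messaging.connectionfactory:service=FailoverConnectionFactory"
                       xmbean-dd="xmdesc/ConnectionFactory-xmbean.xml">
                       <!-- Object names below are assumed to match a default JBM install -->
                       <depends optional-attribute-name="ServerPeer">jboss.messaging:service=ServerPeer</depends>
                       <depends optional-attribute-name="Connector">jboss.messaging:service=Connector,transport=bisocket</depends>
                       <depends>jboss.messaging:service=PostOffice</depends>

                       <!-- Assumed JNDI name; a JMSProviderLoader would reference it in FactoryRef -->
                       <attribute name="JNDIBindings">
                          <bindings>
                             <binding>/FailoverConnectionFactory</binding>
                          </bindings>
                       </attribute>

                       <!-- Lets clients created from this factory fail over to another node -->
                       <attribute name="SupportsFailover">true</attribute>
                       <!-- Load balancing left off in this sketch; an assumption, not a requirement -->
                       <attribute name="SupportsLoadBalancing">false</attribute>
                    </mbean>
                    ```

                    With a factory like this deployed on the target cluster, the FactoryRef, QueueFactoryRef and TopicFactoryRef attributes of the target JMSProviderLoader would point at the assumed name FailoverConnectionFactory instead of ConnectionFactory.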

                    • 7. Re: Clustered FailOver doesn't seem to work correctly with Bri
                      artp

                      Ideally, I should be using ClusteredConnectionFactory, but I haven't gotten that to work. Messages get stuck on Cluster B when using ClusteredConnectionFactory. It seems like there are issues with ClusteredConnectionFactory. I'll try creating a custom ConnectionFactory with only SupportsFailover set to true to see if it helps.

                      • 8. Re: Clustered FailOver doesn't seem to work correctly with Bri
                        timfox

                        It depends on what you want to do. When you originally said failover, I assumed you really meant reconnection, in which case you don't need a clustered connection factory.

                        • 9. Re: Clustered FailOver doesn't seem to work correctly with Bri
                          artp

                          I tried setting SupportsFailover to true on the Connection. Then I tested failover by taking down node B0, and failover worked: messages went to B1. After a period of time I brought up B0 and took down B1. Messages never got delivered to B0 and piled up in the queue, with the message and delivery counts growing. I brought up B1, and still nothing happened until I took down B0; then failover occurred and B1 processed the backlog of messages. Next I took down B1, and once again I ended up in the stuck-messages situation.

                          It seems to me that failover works the first time I take down node B0, but after that it doesn't work properly. Is there anything else I can try?



                          • 10. Re: Clustered FailOver doesn't seem to work correctly with Bri
                            gaohoward

                            Hi,

                            Just for your info, using Remoting 2.5.1 cannot solve the stuck-message issue. You probably need to wait for the 2.5.2 release, and then preferably wait for a new JBM release for AS 5.

                            Thanks.