9 Replies Latest reply on May 19, 2014 10:12 AM by jbertram

    Replication FailOver Not Working

    yairogen

      I am trying out the replication failover but can't seem to get it to work.

       

      I am using the core API and building the session factory using:

      HornetQClient.createServerLocatorWithHA(transportConfigurationsArray);
      
      

       

      I've set up 2 nodes in the same cluster, one primary and one backup. I first thought I only needed to supply a single transport configuration object with the IP of the primary node. Failover didn't work and I got connection errors in the client. I then tried to provide two transports, one for the primary and the other for the backup, and I see strange behavior. When both nodes were up I saw active/active-like behavior. When I stopped the primary I again saw the same connection errors in the client log. Below are the configuration XML files for both primary and backup, copied from the /config/stand-alone/clustered folder on both nodes. I start the servers by running "./run.sh /config/stand-alone/clustered".
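
      For reference, this is roughly how I build the transport configurations and the locator for the two-transport attempt (simplified sketch - the hosts and ports are the ones from the configs below, the variable names are mine, and this runs inside a method that can throw Exception):

      // assumed imports: java.util.HashMap, java.util.Map,
      // org.hornetq.api.core.TransportConfiguration,
      // org.hornetq.api.core.client.*,
      // org.hornetq.core.remoting.impl.netty.NettyConnectorFactory

      Map<String, Object> primaryParams = new HashMap<String, Object>();
      primaryParams.put("host", "10.45.37.122");
      primaryParams.put("port", 5445);

      Map<String, Object> backupParams = new HashMap<String, Object>();
      backupParams.put("host", "10.45.37.123");
      backupParams.put("port", 5445);

      TransportConfiguration[] transportConfigurationsArray = new TransportConfiguration[] {
            new TransportConfiguration(NettyConnectorFactory.class.getName(), primaryParams),
            new TransportConfiguration(NettyConnectorFactory.class.getName(), backupParams) };

      ServerLocator serverLocator = HornetQClient.createServerLocatorWithHA(transportConfigurationsArray);
      ClientSessionFactory sessionFactory = serverLocator.createSessionFactory();
      ClientSession session = sessionFactory.createSession();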

       

      Error seen in client when shutting down primary is:

       

      Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
        at org.hornetq.core.client.impl.ServerLocatorImpl.selectConnector(ServerLocatorImpl.java:603)
        at org.hornetq.core.client.impl.ServerLocatorImpl.createSessionFactory(ServerLocatorImpl.java:853)
      

       

      Primary:

      <configuration xmlns="urn:hornetq"
                     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                     xsi:schemaLocation="urn:hornetq /schema/hornetq-configuration.xsd">
        
         <paging-directory>${data.dir:../data}/paging</paging-directory>
        
         <bindings-directory>${data.dir:../data}/bindings</bindings-directory>
        
         <journal-directory>${data.dir:../data}/journal</journal-directory>
        
         <journal-min-files>10</journal-min-files>
        
         <large-messages-directory>${data.dir:../data}/large-messages</large-messages-directory>
      
      
         <connectors>     
           
         <connector name="netty">
               <factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory</factory-class>
               <param key="host"  value="${hornetq.remoting.netty.host:10.45.37.122}"/>
               <param key="port"  value="${hornetq.remoting.netty.port:5445}"/>
            </connector> 
           
            <connector name="netty-throughput">
               <factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory</factory-class>
               <param key="host"  value="${hornetq.remoting.netty.host:10.45.37.122}"/>
               <param key="port"  value="${hornetq.remoting.netty.batch.port:5455}"/>
               <param key="batch-delay" value="50"/>
            </connector>
         </connectors>
      
      
         <acceptors>
            <acceptor name="netty">
               <factory-class>org.hornetq.core.remoting.impl.netty.NettyAcceptorFactory</factory-class>
               <param key="host"  value="${hornetq.remoting.netty.host:10.45.37.122}"/>
               <param key="port"  value="${hornetq.remoting.netty.port:5445}"/>
            </acceptor>
           
            <acceptor name="netty-throughput">
               <factory-class>org.hornetq.core.remoting.impl.netty.NettyAcceptorFactory</factory-class>
               <param key="host"  value="${hornetq.remoting.netty.host:10.45.37.122}"/>
               <param key="port"  value="${hornetq.remoting.netty.batch.port:5455}"/>
               <param key="batch-delay" value="50"/>
               <param key="direct-deliver" value="false"/>
            </acceptor>
         </acceptors>
      
      
         <broadcast-groups>
            <broadcast-group name="bg-group1">
               <group-address>231.7.7.7</group-address>
               <group-port>9876</group-port>
               <broadcast-period>5000</broadcast-period>
               <connector-ref>netty</connector-ref>
            </broadcast-group>
         </broadcast-groups>
      
      
         <discovery-groups>
            <discovery-group name="dg-group1">
               <group-address>231.7.7.7</group-address>
               <group-port>9876</group-port>
               <refresh-timeout>10000</refresh-timeout>
            </discovery-group>
         </discovery-groups>
      
      
         <shared-store>false</shared-store>
        
         <cluster-connections>
            <cluster-connection name="my-cluster">
               <address>foundation</address> 
               <connector-ref>netty</connector-ref>
            <discovery-group-ref discovery-group-name="dg-group1"/>
        <!--
        <static-connectors>
            <connector-ref>netty</connector-ref>
            <connector-ref>netty-2</connector-ref>
         </static-connectors>
         -->
            </cluster-connection>
         </cluster-connections>
        
         <security-enabled>false</security-enabled>
        
         <security-settings>
            <security-setting match="#">
               <permission type="createNonDurableQueue" roles="guest"/>
               <permission type="deleteNonDurableQueue" roles="guest"/>
               <permission type="consume" roles="guest"/>
               <permission type="send" roles="guest"/>
            </security-setting>
         </security-settings>
      
      
         <address-settings>
            <!--default for catch all-->
            <address-setting match="#">
               <dead-letter-address>jms.queue.DLQ</dead-letter-address>
               <expiry-address>jms.queue.ExpiryQueue</expiry-address>
               <redelivery-delay>0</redelivery-delay>
               <max-size-bytes>10485760</max-size-bytes>      
               <message-counter-history-day-limit>10</message-counter-history-day-limit>
               <address-full-policy>BLOCK</address-full-policy>
            </address-setting>
         </address-settings>
      
      
        
      
      
      </configuration>
      
      

       

      Backup:

       

      <configuration xmlns="urn:hornetq"
                     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                     xsi:schemaLocation="urn:hornetq /schema/hornetq-configuration.xsd">
        
         <paging-directory>${data.dir:../data}/paging</paging-directory>
        
         <bindings-directory>${data.dir:../data}/bindings</bindings-directory>
        
         <journal-directory>${data.dir:../data}/journal</journal-directory>
        
         <journal-min-files>10</journal-min-files>
        
         <large-messages-directory>${data.dir:../data}/large-messages</large-messages-directory>
      
      
         <connectors>     
           
         <connector name="netty">
               <factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory</factory-class>
               <param key="host"  value="${hornetq.remoting.netty.host:10.45.37.123}"/>
               <param key="port"  value="${hornetq.remoting.netty.port:5445}"/>
            </connector>
         <connector name="netty-2">
               <factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory</factory-class>
               <param key="host"  value="${hornetq.remoting.netty.host:10.45.37.123}"/>
               <param key="port"  value="${hornetq.remoting.netty.port:6445}"/>
            </connector>
           
            <connector name="netty-throughput">
               <factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory</factory-class>
               <param key="host"  value="${hornetq.remoting.netty.host:10.45.37.123}"/>
               <param key="port"  value="${hornetq.remoting.netty.batch.port:5455}"/>
               <param key="batch-delay" value="50"/>
            </connector>
         </connectors>
      
      
         <acceptors>
            <acceptor name="netty">
               <factory-class>org.hornetq.core.remoting.impl.netty.NettyAcceptorFactory</factory-class>
               <param key="host"  value="${hornetq.remoting.netty.host:10.45.37.123}"/>
               <param key="port"  value="${hornetq.remoting.netty.port:5445}"/>
            </acceptor>
           
            <acceptor name="netty-throughput">
               <factory-class>org.hornetq.core.remoting.impl.netty.NettyAcceptorFactory</factory-class>
               <param key="host"  value="${hornetq.remoting.netty.host:10.45.37.123}"/>
               <param key="port"  value="${hornetq.remoting.netty.batch.port:5455}"/>
               <param key="batch-delay" value="50"/>
               <param key="direct-deliver" value="false"/>
            </acceptor>
         </acceptors>
      
      
         <broadcast-groups>
            <broadcast-group name="bg-group1">
               <group-address>231.7.7.7</group-address>
               <group-port>9876</group-port>
               <broadcast-period>5000</broadcast-period>
               <connector-ref>netty</connector-ref>
            </broadcast-group>
         </broadcast-groups>
      
      
         <discovery-groups>
            <discovery-group name="dg-group1">
               <group-address>231.7.7.7</group-address>
               <group-port>9876</group-port>
               <refresh-timeout>10000</refresh-timeout>
            </discovery-group>
         </discovery-groups>
      
      
         <shared-store>false</shared-store>
      
      
         <backup>true</backup>
        
         <cluster-connections>
            <cluster-connection name="my-cluster">
               <address>foundation</address> 
               <connector-ref>netty</connector-ref>
            <discovery-group-ref discovery-group-name="dg-group1"/>
        <!--
        <static-connectors>
            <connector-ref>netty</connector-ref>
            <connector-ref>netty-2</connector-ref>
         </static-connectors>
         -->
            </cluster-connection>
         </cluster-connections>
        
         <security-enabled>false</security-enabled>
        
         <security-settings>
            <security-setting match="#">
               <permission type="createNonDurableQueue" roles="guest"/>
               <permission type="deleteNonDurableQueue" roles="guest"/>
               <permission type="consume" roles="guest"/>
               <permission type="send" roles="guest"/>
            </security-setting>
         </security-settings>
      
      
         <address-settings>
            <!--default for catch all-->
            <address-setting match="#">
               <dead-letter-address>jms.queue.DLQ</dead-letter-address>
               <expiry-address>jms.queue.ExpiryQueue</expiry-address>
               <redelivery-delay>0</redelivery-delay>
               <max-size-bytes>10485760</max-size-bytes>      
               <message-counter-history-day-limit>10</message-counter-history-day-limit>
               <address-full-policy>BLOCK</address-full-policy>
            </address-setting>
         </address-settings>
      
      
        
      
      
      </configuration>
      
      
        • 1. Re: Replication FailOver Not Working
          jbertram

          Take a look at the replication examples we ship with HornetQ.  Using those you can ensure your configuration is correct.

          • 2. Re: Replication FailOver Not Working
            yairogen

            I did. The examples are JMS based, not core based (like most of the examples). I think my configuration is solid, but I still can't understand what's going wrong.

             

            Also - I am not sure if I need to set anything other than the defaults on the server locator instance I'm using. From the manual it looks like failover should be automatic and transparent.

             

            Any help is appreciated.

            • 3. Re: Replication FailOver Not Working
              jbertram

              So... alter the most relevant example to use the core API and try to reproduce the problem.  If you can reproduce it, attach it here.

              • 4. Re: Re: Replication FailOver Not Working
                yairogen

                Re-reviewing the JMS example, I think I missed some connection tuning properties. I've added the following:

                 

                                serverLocator.setRetryInterval(1000);           // wait 1 second between reconnection attempts
                                serverLocator.setRetryIntervalMultiplier(1);    // keep the interval constant (no back-off)
                                serverLocator.setReconnectAttempts(-1);         // keep retrying indefinitely
                

                 

                And I now do see failover.

                 

                1. Is this indeed mandatory for the client to fail over? From the manual it seems the API defaults should also trigger failover. Is that not the case?
                2. I noticed that some messages were handled twice in my listeners. Is it expected that some messages are re-delivered during replication? I acknowledge all messages.
                3. When I restart the primary I see it has messages in the queue, although I suspect they were already handled by listeners running against the backup. If the client is also restarted it will process these messages, although it shouldn't. Note: the message count I see on the active node is the same as the number of messages that were processed twice. Please advise.
                • 5. Re: Re: Replication FailOver Not Working
                  jbertram

                  Is this indeed mandatory for the client to fail over?

                  Technically only a non-zero reconnect-attempts value is required, since retry-interval defaults to 2000 and retry-interval-multiplier defaults to 1.0.  See Chapter 34, Client Reconnection and Session Reattachment.
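
                  For example, with the core API the only strictly required change on your locator would be along these lines (illustrative):

                  serverLocator.setReconnectAttempts(-1); // any non-zero value enables failover; -1 retries indefinitely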

                   

                  From the manual it seems the API defaults should also trigger failover. Is that not the case?

                  To which bit of documentation are you referring?  Section 39.2.1 states, "To enable automatic client failover, the client must be configured to allow non-zero reconnection attempts (as explained in Chapter 34, Client Reconnection and Session Reattachment)."

                   

                  I noticed that some messages were handled twice in my listeners. Is it expected that some messages are re-delivered during replication? I acknowledge all messages.

                  I would need more information on your use-case to comment further.  A reproducible test would be ideal.

                   

                  When I restart the primary I see it has messages in the queue, although I suspect they were already handled by listeners running against the backup. If the client is also restarted it will process these messages, although it shouldn't. Note: the message count I see on the active node is the same as the number of messages that were processed twice. Please advise.

                  I don't really understand the full use-case here.  Can you explain more?  This might be better on a new thread.  As always, a reproducible test would be ideal.

                  • 6. Re: Re: Replication FailOver Not Working
                    yairogen
                    • reconnect attempts - any reason why -1 was used in the samples?
                    • I missed chapter 34, thanks.
                    • regarding my tests: I have a listener that adds the string payload of each message it handles into a HashSet. If "set.add(String payload)" returns false, I know for sure that this listener already got a message with this payload; payloads are unique. Now, I counted the number of times I got "duplicate" messages, and it is equal to the number of messages left on each node. I.e. if I counted 500 "duplicate" messages, I see 250 messages on each node that I haven't yet read. My tests don't read more than the number I sent. So I sent 10000 and received 10000, but 500 of those were duplicates (why?), which means there are 500 messages (probably unique, real messages) that I didn't get. A rough sketch of the listener is shown below.
                    • basically, point 3 above illustrates that many messages are delivered twice although acknowledged, and I don't understand why.
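
                    Roughly what the listener does (simplified sketch of my test code - "consumer" is the core ClientConsumer created from the session, and the names here are mine):

                    // assumed imports: java.util.*, java.util.concurrent.atomic.AtomicInteger,
                    // org.hornetq.api.core.HornetQException, org.hornetq.api.core.client.*

                    final Set<String> seenPayloads = Collections.synchronizedSet(new HashSet<String>());
                    final AtomicInteger duplicates = new AtomicInteger();

                    consumer.setMessageHandler(new MessageHandler()
                    {
                       @Override
                       public void onMessage(ClientMessage message)
                       {
                          try
                          {
                             String payload = message.getBodyBuffer().readString();
                             if (!seenPayloads.add(payload))
                             {
                                // add() returned false: this payload was already handled once before
                                duplicates.incrementAndGet();
                             }
                             message.acknowledge();
                          }
                          catch (HornetQException e)
                          {
                             e.printStackTrace();
                          }
                       }
                    });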
                    • 7. Re: Replication FailOver Not Working
                      jbertram

                      I believe -1 was chosen for reconnect-attempts so that no matter how long the example took to run on a user's machine, the client would still reconnect.

                       

                      Regarding your use-case, I'm not sure what could be going on there.  I would need a way to reproduce the issue to investigate further.

                      • 8. Re: Replication FailOver Not Working
                        yairogen

                        Can we connect privately so I can give you my GitHub access and you can easily see what I'm doing?

                         

                        yairogen AT gmail DOT com

                        • 9. Re: Replication FailOver Not Working
                          jbertram

                          At this point it would probably be best for you to investigate purchasing a support subscription from Red Hat.  That's likely the only way you'll get the kind of support you need. 

                           

                          In any kind of support situation it's typically best to make it as easy as possible to receive the help you want.  Usually the most effective way to do that is with code that's easy for one of the developers to take and run and see exactly the problem you're facing.  That will, of course, require some work on your end to distill the problem to its simplest form and write something that's easy to run.  However, HornetQ has a rich test-suite that you can use for this as well as a lot of examples that could serve as a template.