8 Replies Latest reply on Sep 25, 2013 2:17 PM by brenuart

    Core Bridge with target on live/backup doesn't failover

    brenuart

      Hello everybody,

       

      I am currently setting up the following HornetQ deployment:

      • two remote sites connected over a WAN (two different networks)
      • site A runs a pair of standalone HornetQ instances configured as live/backup with multicast discovery (call them A-Live/172.16.1.6 and A-Backup/172.16.1.7)
      • site B runs a standalone HornetQ instance (call it B)
      • same version of HornetQ everywhere
      • a bridge is configured at site B to forward the content of queue Q from site B to site A; the bridge definition refers to a static connector pointing to A-Live only (see the connector sketch just after this list)
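 
      For reference, the connector referenced by the bridge is defined at site B roughly like this. This is only a trimmed sketch (name, host and port taken from the log output further down); the complete files are in the attachments.
 
      <!-- site B: connector used by the bridge, pointing to A-Live only -->
      <connectors>
         <connector name="central1-connector">
            <factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory</factory-class>
            <param key="host" value="172.16.1.6"/>
            <param key="port" value="5455"/>
         </connector>
      </connectors>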

       

      Problem

       

      The core bridge doesn't fail over to the backup server after the live server is killed.

       

      The problem seems to be related to the issue https://issues.jboss.org/browse/HORNETQ-1218

      This issue appears to be fixed in versions 2.3.3.Final and 2.4.0.Alpha1, so I tested the following versions but couldn't get it to work:

      • HornetQ 2.3.0.Final (affected by the issue so it shouldn't work)
      • HornetQ 2.3.8.Final (issue fixed - should work but doesn't)
      • HornetQ 2.4.0.Beta1 (issue fixed - should work but doesn't)

       

      I suppose I did something wrong in my configuration but can't find what :-(

       

       

      Scenario

       

      The live/backup configuration seems to work: the backup takes over when the live is killed. Remote consumer/producer clients, running on site A or B, are transparently redirected to A-Backup (they use JNDI to look up the ConnectionFactory and the queue). When A-Live is restarted, it discovers A-Backup, synchronises its content and asks it to shut down (as per the configuration).

       

      The bridge configuration seems to work as well: messages posted on queue Q at site B are properly (and transparently) forwarded to queue Q on A-Live.

      However, the bridge fails to reconnect to A-Backup after A-Live is killed.

       

      When running with logging at DEBUG level, one can see the following messages at site B (the bridge source) when the topology changes at site A. The example below shows what happens when A-Backup is started and ready (the log message is split across lines for readability):

       

      DEBUG [org.hornetq.core.client] ClientSessionFactoryImpl received backup update for live/backup pair =
      TransportConfiguration(name=netty, factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=5445&host=172-16-1-6 /
      TransportConfiguration(name=netty, factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=5445&host=172-16-1-7
      but it didn't belong to
      TransportConfiguration(name=central-connector, factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=5455&host=172-16-1-6
      
      

       

      As far as I can tell, the bridge is notified of the new cluster topology (in this case the addition of A-Backup with IP 172.16.1.7) but refuses to consider it because it doesn't belong to its configuration. If A-Live is killed, here is what happens to the bridge:

       

      DEBUG [org.hornetq.core.client] calling cleanup on ClientSessionImpl [name=4fd16b4f-1fd0-11e3-be41-8151284bb877, username=HORNETQ.CLUSTER.ADMIN.USER, closed=false, factory = ClientSessionFactoryImpl [serverLocator=ServerLocatorImpl (identity=Bridge my-bridge) [initialConnectors=[TransportConfiguration(name=central-connector, factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=5455&host=172-16-1-6], discoveryGroupConfiguration=null], connectorConfig=TransportConfiguration(name=central-connector, factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=5455&host=172-16-1-6, backupConfig=null], metaData=()]@544e732e
      DEBUG [org.hornetq.core.client] Trying reconnection attempt 0/0
      DEBUG [org.hornetq.core.client] Trying to connect with connector = org.hornetq.core.remoting.impl.netty.NettyConnectorFactory@51b17cc0, parameters = {port=5455, host=172.16.1.6} connector = NettyConnector [host=172.16.1.6, port=5455, httpEnabled=false, useServlet=false, servletPath=/messaging/HornetQServlet, sslEnabled=false, useNio=false]
      DEBUG [org.hornetq.core.client] Started Netty Connector version 3.6.6.Final-90e1eb2
      DEBUG [org.hornetq.core.client] Trying to connect at the main server using connector :TransportConfiguration(name=central-connector, factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=5455&host=172-16-1-6
      DEBUG [org.hornetq.core.client] Remote destination: /172.16.1.6:5455
      DEBUG [org.hornetq.core.client] Main server is not up. Hopefully there's a backup configured now!
      DEBUG [org.hornetq.core.client] Could not connect to any server. Didn't have reconnection configured on the ClientSessionFactory
      DEBUG [org.hornetq.core.client] Trying reconnection attempt 0/0
      DEBUG [org.hornetq.core.client] Trying to connect with connector = org.hornetq.core.remoting.impl.netty.NettyConnectorFactory@7198dab2, parameters = {port=5455, host=172.16.1.6} connector = NettyConnector [host=172.16.1.6, port=5455, httpEnabled=false, useServlet=false, servletPath=/messaging/HornetQServlet, sslEnabled=false, useNio=false]
      DEBUG [org.hornetq.core.client] Started Netty Connector version 3.6.6.Final-90e1eb2
      DEBUG [org.hornetq.core.client] Trying to connect at the main server using connector :TransportConfiguration(name=central1-connector, factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=5455&host=172-16-1-6
      
      

       

      The bridge will keep trying to reconnect to the original A-Live server forever, as if there were no backup.

       

       

      Configuration

       

      Live/Backup

      A-Live and A-Backup are running the same configuration.

      Bind address and ports are given at startup by the run.sh script.

      A-Backup is started with -Dhornetq.backup=true.

      Configuration files available in the "SiteA (live/backup).zip" attachment.
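 
      The HA-related part of that shared configuration boils down to something like the sketch below (trimmed, with assumed names, and assuming the backup flag is picked up from the hornetq.backup system property; only the attached files are authoritative):
 
      <!-- hornetq-configuration.xml shared by A-Live and A-Backup (sketch) -->
      <backup>${hornetq.backup:false}</backup>            <!-- true on A-Backup, set via -Dhornetq.backup=true -->
      <shared-store>false</shared-store>                  <!-- replication: the backup synchronises over the network -->
      <allow-failback>true</allow-failback>               <!-- backup hands control back when the live returns -->
      <check-for-live-server>true</check-for-live-server> <!-- restarted live synchronises before taking over again -->
 
      <cluster-connections>
         <cluster-connection name="my-cluster">
            <address>jms</address>
            <connector-ref>netty</connector-ref>
            <discovery-group-ref discovery-group-name="dg-group1"/> <!-- multicast discovery -->
         </cluster-connection>
      </cluster-connections>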

       

      Bridge

      Configuration files available in the "SiteB (bridge).zip" attachment.

        • 1. Re: Core Bridge with target on live/backup doesn't failover
          clebert.suconic

          Can you, as a test, set the reconnect attempts on the cluster connection? Right now it's configured to retry forever.
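 
          Something along these lines, for example (just a sketch; the cluster-connection name and the rest have to match whatever is in your config):
 
          <cluster-connection name="my-cluster">
             <address>jms</address>
             <connector-ref>netty</connector-ref>
             <reconnect-attempts>5</reconnect-attempts>   <!-- instead of -1 (retry forever) -->
             <discovery-group-ref discovery-group-name="dg-group1"/>
          </cluster-connection>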

           

           

          Is there any way you could set up a test?

          • 2. Re: Re: Core Bridge with target on live/backup doesn't failover
            brenuart

            I changed the "bridge" configuration to set a maximum of 2 reconnection attempts.

            Here is what the bridge configuration looks like:

             

            <bridge name="my-bridge">
                <queue-name>jms.queue.testQueue</queue-name>
                <forwarding-address>jms.queue.testQueue</forwarding-address>
                <ha>true</ha>
                <reconnect-attempts>2</reconnect-attempts> 
                <failover-on-server-shutdown>false</failover-on-server-shutdown>
                <static-connectors>
                   <connector-ref>central1-connector</connector-ref>
                </static-connectors>
            </bridge>
            

             

            Unfortunately, it doesn't help: the bridge retries to connect to the cluster 2 times and then gives up. It never attempts to connect to the backup server.

            Here is what happens:

             

            Event: the backup server (A-Backup) is started. The bridge is notified but seems to "discard" the information.

             

            12:13:48,249 DEBUG [org.hornetq.core.client] ClientSessionFactoryImpl received backup update for live/backup pair = TransportConfiguration(name=netty, factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=5445&host=172-16-1-6 / TransportConfiguration(name=netty, factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=5445&host=172-16-1-7 but it didn't belong to TransportConfiguration(name=central1-connector, factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=5455&host=172-16-1-6
            12:13:50,120 DEBUG [org.hornetq.core.client] ClientSessionFactoryImpl received backup update for live/backup pair = TransportConfiguration(name=netty, factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=5445&host=172-16-1-6 / TransportConfiguration(name=netty, factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=5445&host=172-16-1-7 but it didn't belong to TransportConfiguration(name=central1-connector, factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=5455&host=172-16-1-6
            

             

             

            Event: the live server (A-Live) is killed. The bridge connection to the cluster is closed. The bridge attempts to reconnect to A-Live (now dead) 2 times and then gives up.

             

            12:25:34,439 WARN  [org.hornetq.core.server] HQ222095: Connection failed with failedOver=false: HornetQNotConnectedException[errorType=NOT_CONNECTED message=HQ119006: Channel disconnected]
              at org.hornetq.core.client.impl.ClientSessionFactoryImpl.connectionDestroyed(ClientSessionFactoryImpl.java:422) [hornetq-core-client.jar:]
              at org.hornetq.core.remoting.impl.netty.NettyConnector$Listener$1.run(NettyConnector.java:878) [hornetq-core-client.jar:]
              at org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run(OrderedExecutorFactory.java:106) [hornetq-core-client.jar:]
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_25]
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_25]
              at java.lang.Thread.run(Thread.java:724) [rt.jar:1.7.0_25]
            
            
            12:25:34,442 DEBUG [org.hornetq.core.client] calling cleanup on ClientSessionImpl [name=92c00da6-204c-11e3-b3f4-05c1f6869b19, username=HORNETQ.CLUSTER.ADMIN.USER, closed=false, factory = ClientSessionFactoryImpl [serverLocator=ServerLocatorImpl (identity=Bridge my-bridge) [initialConnectors=[TransportConfiguration(name=central1-connector, factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=5455&host=172-16-1-6], discoveryGroupConfiguration=null], connectorConfig=TransportConfiguration(name=central1-connector, factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=5455&host=172-16-1-6, backupConfig=null], metaData=()]@30260a6d
            12:25:36,445 DEBUG [org.hornetq.core.client] Trying reconnection attempt 0/0
            12:25:36,446 DEBUG [org.hornetq.core.client] Trying to connect with connector = org.hornetq.core.remoting.impl.netty.NettyConnectorFactory@1cacdf69, parameters = {port=5455, host=172.16.1.6} connector = NettyConnector [host=172.16.1.6, port=5455, httpEnabled=false, useServlet=false, servletPath=/messaging/HornetQServlet, sslEnabled=false, useNio=false]
            12:25:36,447 DEBUG [org.hornetq.core.client] Started Netty Connector version 3.6.6.Final-90e1eb2
            12:25:36,447 DEBUG [org.hornetq.core.client] Trying to connect at the main server using connector :TransportConfiguration(name=central1-connector, factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=5455&host=172-16-1-6
            12:25:36,447 DEBUG [org.hornetq.core.client] Remote destination: /172.16.1.6:5455
            12:25:36,466 DEBUG [org.hornetq.core.client] Main server is not up. Hopefully there's a backup configured now!
            12:25:36,467 DEBUG [org.hornetq.core.client] Could not connect to any server. Didn't have reconnection configured on the ClientSessionFactory
            12:25:38,469 DEBUG [org.hornetq.core.client] Trying reconnection attempt 0/0
            12:25:38,469 DEBUG [org.hornetq.core.client] Trying to connect with connector = org.hornetq.core.remoting.impl.netty.NettyConnectorFactory@1b3977a3, parameters = {port=5455, host=172.16.1.6} connector = NettyConnector [host=172.16.1.6, port=5455, httpEnabled=false, useServlet=false, servletPath=/messaging/HornetQServlet, sslEnabled=false, useNio=false]
            12:25:38,470 DEBUG [org.hornetq.core.client] Started Netty Connector version 3.6.6.Final-90e1eb2
            12:25:38,470 DEBUG [org.hornetq.core.client] Trying to connect at the main server using connector :TransportConfiguration(name=central1-connector, factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=5455&host=172-16-1-6
            12:25:38,470 DEBUG [org.hornetq.core.client] Remote destination: /172.16.1.6:5455
            12:25:38,487 DEBUG [org.hornetq.core.client] Main server is not up. Hopefully there's a backup configured now!
            12:25:38,487 DEBUG [org.hornetq.core.client] Could not connect to any server. Didn't have reconnection configured on the ClientSessionFactory
            12:25:40,490 DEBUG [org.hornetq.core.client] Trying reconnection attempt 0/0
            12:25:40,490 DEBUG [org.hornetq.core.client] Trying to connect with connector = org.hornetq.core.remoting.impl.netty.NettyConnectorFactory@39c4d0cd, parameters = {port=5455, host=172.16.1.6} connector = NettyConnector [host=172.16.1.6, port=5455, httpEnabled=false, useServlet=false, servletPath=/messaging/HornetQServlet, sslEnabled=false, useNio=false]
            12:25:40,491 DEBUG [org.hornetq.core.client] Started Netty Connector version 3.6.6.Final-90e1eb2
            12:25:40,491 DEBUG [org.hornetq.core.client] Trying to connect at the main server using connector :TransportConfiguration(name=central1-connector, factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=5455&host=172-16-1-6
            12:25:40,492 DEBUG [org.hornetq.core.client] Remote destination: /172.16.1.6:5455
            12:25:40,512 DEBUG [org.hornetq.core.client] Main server is not up. Hopefully there's a backup configured now!
            12:25:40,513 DEBUG [org.hornetq.core.client] Could not connect to any server. Didn't have reconnection configured on the ClientSessionFactory
            12:25:40,513 WARN  [org.hornetq.core.server] HQ222101: Bridge my-bridge achieved 3 maxattempts=2 it will stop retrying to reconnect
            
            • 3. Re: Core Bridge with target on live/backup doesn't failover
              clebert.suconic

              I suspect that node is not getting updates about the topology. Can you create a test case that would work regardless of your environment (work as in: fail regardless of your UDP setup), so we could see the issue here?

               

              Could you use a static cluster instead?

              • 4. Re: Re: Core Bridge with target on live/backup doesn't failover
                brenuart

                I suspect that node is not getting updates about the topology.

                The bridge node *DOES* receive the topology updates but *REFUSES* to consider them.

                This is at least what I understand from the following log lines produced by the bridge node when the backup server comes up on the other side:

                12:13:48,249 DEBUG [org.hornetq.core.client] ClientSessionFactoryImpl received backup update for live/backup pair = TransportConfiguration(name=netty, factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=5445&host=172-16-1-6 / TransportConfiguration(name=netty, factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=5445&host=172-16-1-7 but it didn't belong to TransportConfiguration(name=central1-connector, factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=5455&host=172-16-1-6 
                12:13:50,120 DEBUG [org.hornetq.core.client] ClientSessionFactoryImpl received backup update for live/backup pair = TransportConfiguration(name=netty, factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=5445&host=172-16-1-6 / TransportConfiguration(name=netty, factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=5445&host=172-16-1-7 but it didn't belong to TransportConfiguration(name=central1-connector, factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=5455&host=172-16-1-6 
                

                 

                What does "... but it didn't belong to Transport..." actually mean?

                 

                You could use static cluster instead?

                Static what? Static live/backup? You mean configuring them so that UDP discovery isn't used anymore? Yes, I can do that, but I don't see how it would help. As I said, a standard remote Java client connecting to the live using JNDI properly fails over to the backup when the live is killed. And that remote client is not running on the same network and therefore doesn't see the broadcast announcements (but is still aware of the topology change).
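 
                Those clients simply look up the ConnectionFactory exposed by the A servers; roughly speaking it is an HA factory defined in hornetq-jms.xml along these lines (trimmed sketch, exact names and settings may differ):
 
                <connection-factory name="ConnectionFactory">
                   <connectors>
                      <connector-ref connector-name="netty"/>
                   </connectors>
                   <entries>
                      <entry name="/ConnectionFactory"/>
                   </entries>
                   <ha>true</ha>                               <!-- lets clients follow the live/backup topology -->
                   <reconnect-attempts>-1</reconnect-attempts> <!-- keep retrying / fail over on connection failure -->
                </connection-factory>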

                 

                A workaround is to configure the bridge with two static connectors, one for the live and one for the backup server. This approach works with HQ 2.3.8.Final but not with 2.3.0.Final: in the latter, HQ refuses to start if the backup node is not up and running. AFAIK that issue has already been identified and fixed; I don't remember in which version, but it works in 2.3.8.
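 
                Concretely, the workaround looks like this on the site B side (sketch only: the second connector is the new part, its name is made up and its host/port must of course match A-Backup's actual acceptor):
 
                <connectors>
                   <connector name="central1-connector">
                      <factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory</factory-class>
                      <param key="host" value="172.16.1.6"/>
                      <param key="port" value="5455"/>
                   </connector>
                   <connector name="central2-connector"> <!-- new: points to A-Backup -->
                      <factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory</factory-class>
                      <param key="host" value="172.16.1.7"/>
                      <param key="port" value="5455"/>
                   </connector>
                </connectors>
 
                <bridge name="my-bridge">
                   <queue-name>jms.queue.testQueue</queue-name>
                   <forwarding-address>jms.queue.testQueue</forwarding-address>
                   <ha>true</ha>
                   <static-connectors>
                      <connector-ref>central1-connector</connector-ref>
                      <connector-ref>central2-connector</connector-ref>
                   </static-connectors>
                </bridge>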

                Unfortunately, this approach works against dynamic discovery of nodes, since the bridge has to know all of them in advance :-(

                 

                Can you create a testcase that would work regardless of your environment?

                Sure. How do you want me to "write" such a test case? Do you need only the set of configuration files and the scenario to reproduce the problem, or should it be "automated" in some way?

                • 5. Re: Core Bridge with target on live/backup doesn't failover
                  clebert.suconic

                  > Sure. How do you want me to "write" such a test case? Do you need only the set of configuration files and the scenario to reproduce the problem, or should it be "automated" in some way?

                   

                   

                  Anything where we can easily replicate your issue (without any third-party dependencies such as Oracle, MySQL, etc.; something simple).

                   

                   

                  An isolated unit test case would be great, but anything where we can follow instructions is fine. Please include the versions you're using, etc.:

                   

                  Example:

                   

                  I - start server 1 using these config files, with these arguments, etc., WildFly version X.Y.X (or if it's standalone, start node A using config X, etc.)

                  II - start server2 with IP X...

                   

                   

                  *Try* to make it so we can run it on a single box. If it only happens with two boxes we can try, but that usually means it's environmental.

                   

                   

                   

                   

                  Having you do this for us will help us identify what peculiarity of your usage is making it happen. We have QE running tests for this and no issues about it have been raised.

                  • 6. Re: Core Bridge with target on live/backup doesn't failover
                    brenuart

                    Sorry for the late answer...

                    I have been writing test cases and the problem is now... GONE. Everything works as expected.

                    The bad news is we still don't know why it failed before. We checked all the configurations we used, again and again, and couldn't find the reason.

                     

                    We still have a couple of remarks/questions; I will open a separate discussion to handle them.

                    I will also post (soon) a summary of our working configuration in this thread, for "documentation" purposes and to help anyone who reaches this thread while looking for answers to similar difficulties.

                     

                    In the meantime, thanks for your help and patience ;-)

                    • 7. Re: Core Bridge with target on live/backup doesn't failover
                      clebert.suconic

                      But did you make any changes to the configs... or did it simply start to work?

                       

                      If you post the diffs, maybe we can identify why, just to see if we can improve things, or at least document it if that was an expected case.

                       

                       

                      I will look for your next post as well

                      • 8. Re: Core Bridge with target on live/backup doesn't failover
                        brenuart

                        It didn't simply start to work by magic ;-)

                        We just wiped everything from our test environment and restarted from scratch with the configuration posted above. And it worked... as expected, without any problem. It looks like we had something else in our environment (like env vars or modified startup scripts)...

                         

                        Anyway, thanks for your patience.

                         

                        /Bertrand