5 Replies Latest reply on Apr 5, 2016 11:19 AM by Amit Sinha

    Wildfly 8.2 -Hornetq live-backup High availability server configuration is not working as expected on network failure.

    Abhishek Abhishek Newbie

      Hi,

       

      I am configuring a JMS cluster environment in Wildfly 8.2 with HornetQ.

      The journals are in replicated mode between the active and backup servers, with <shared-store>false</shared-store>.
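      For reference, the relevant part of our messaging subsystem looks roughly like this (a sketch only; the backup-group-name value is illustrative, and the element names follow the messaging 2.0 schema shipped with Wildfly 8.2):

```xml
<!-- standalone-full-ha.xml, messaging subsystem (sketch only) -->
<subsystem xmlns="urn:jboss:domain:messaging:2.0">
    <hornetq-server>
        <!-- replicate the journal over the network instead of sharing it -->
        <shared-store>false</shared-store>
        <!-- set to true on the backup node -->
        <backup>false</backup>
        <!-- pairs this live server with its dedicated backup -->
        <backup-group-name>group-1</backup-group-name>
    </hornetq-server>
</subsystem>
```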

       

      If we terminate or shut down the active server, the backup server becomes live and registers the queues and topics as expected.

      But when we simulate a network failure on the active server (say, for 1 minute), we face the following issue:

      ************

                With the default configurations for "connection-ttl" and "check-period" on the "cluster-connections" node, the backup server detects the network failure with the primary server and becomes live.

                Now, when the network for the active server comes back up, both the active and the backup server broadcast messages with the same node id.

                We can see the logs with following strings:

      HQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node.

      *********

      We understand that for replicated journals, when the backup server becomes live it uses the same node id to broadcast messages, which is why the above message is logged continuously.

      Please suggest whether we can handle this through some configuration to keep only one server active.

       

      My expectation is that something similar to the "allow-failback" configuration should kick in and send a notification to the backup server to go back into backup mode.

        • 1. Re: Wildfly 8.2 -Hornetq live-backup High availability server configuration is not working as expected on network failure.
          Justin Bertram Master

          What are you specifically doing to simulate a network failure?

           

          You should configure your connection-ttl and check-period to deal with any network interruptions, and you should ensure that the network connection between your replicated live and backup servers is extremely stable.  If a live and its backup are separated from each other via some kind of network failure, then once the connection-ttl elapses the backup will become live, as you have observed.  At this point you've got a "split brain" situation where clients could be interacting with each live server independently, which means the data between the servers will no longer be synchronized.  Rectifying this situation requires administrative intervention to decide which server is the "real" live server at that point.  The "fake" live server would then need to be restarted.
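          For example, on the cluster-connection itself (a sketch; the connector name is a placeholder, "dg-group1" is taken from your log, and the values are only a starting point to tune against your network):

```xml
<cluster-connections>
    <cluster-connection name="my-cluster">
        <address>jms</address>
        <connector-ref>http-connector</connector-ref>
        <!-- how often (in ms) the connection is checked -->
        <check-period>2000</check-period>
        <!-- how long (in ms) a silent connection is kept alive before it is
             considered dead; raise this above your worst expected outage -->
        <connection-ttl>60000</connection-ttl>
        <discovery-group-ref discovery-group-name="dg-group1"/>
    </cluster-connection>
</cluster-connections>
```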

          • 2. Re: Wildfly 8.2 -Hornetq live-backup High availability server configuration is not working as expected on network failure.
            Abhishek Abhishek Newbie

            Hi Justin,

             

            Thanks for the response.

            You got it correct, we are trying to simulate network failure.

             

            As of now, we are running the servers in standalone mode.

            Do you think that if we run these servers in domain mode (on different machines), we can avoid the split-brain situation?


            We are afraid that keeping a dependency on manual intervention will not be a reliable solution.

            Please suggest if there is any way, through some APIs, to detect the split-brain situation and either push one JMS server into backup mode or restart it.

            • 3. Re: Wildfly 8.2 -Hornetq live-backup High availability server configuration is not working as expected on network failure.
              Justin Bertram Master

              You got it correct, we are trying to simulate network failure.

              Yes, I know.  You said as much in your previous comment.  I asked what specifically you were doing in order to simulate the network failure.

               

              As of now, we are running the servers in standalone mode.

              Do you think that if we run these servers in domain mode (on different machines), we can avoid the split-brain situation?

              Wildfly domain mode has nothing to do with HornetQ, so I don't think using domain mode would help you avoid a split-brain situation.

               

              We are afraid that keeping a dependency on manual intervention will not be a reliable solution.

              Please suggest if there is any way, through some APIs, to detect the split-brain situation and either push one JMS server into backup mode or restart it.

              There is no HornetQ API to detect and deal with a split brain situation because there is no way for HornetQ to know which server should actually be the "real" live server.  That decision has to be made by someone (or something) that has knowledge about the data that's been changed on each of the servers and which server has the data that the application needs to function appropriately.

               

              You can mitigate the split brain situation by increasing the size of your cluster as discussed in the documentation (see the last paragraph in the "Data Replication" section).  The smaller the cluster the more likely a split brain situation becomes.

              • 4. Re: Wildfly 8.2 -Hornetq live-backup High availability server configuration is not working as expected on network failure.
                Abhishek Abhishek Newbie

                We are using the following commands on Windows to simulate a network outage (%1 is the network interface name and %2 is the time in seconds):

                @start netsh interface set interface %1 DISABLED
                REM wait for the requested number of seconds
                timeout /t %2
                @start netsh interface set interface %1 ENABLED

                This ensures that the network on the designated machine stays down for the specified time interval.

                 

                Thanks for your response.

                • 5. Re: Wildfly 8.2 -Hornetq live-backup High availability server configuration is not working as expected on network failure.
                  Amit Sinha Newbie

                  Hi,

                   

                  What needs to be done to deal with the following scenario:

                   

                  1. Two nodes running a 'live' and 'backup' server each

                  2. Node 1 is dead (power outage etc)

                  3. The 'backup' server on node 2 kicks in (as expected and all well up to now).

                  4. Node 1 is started back up and detects that there already is another live server.

                   

                  This surely is the case where there is already another backup server that took over as live when node 1 went down. The question is: why is the 'backup' on node 2 not relinquishing control as live? Is there some specific configuration to control this behavior? I have 'check-for-live-server' as well as 'allow-failback' set to 'true', but the failback does not seem to be happening.
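                  For reference, this is roughly how the two failback-related elements are set on each hornetq-server (a sketch only; check-for-live-server matters on the restarted original live, allow-failback on the backup that became live):

```xml
<hornetq-server>
    <shared-store>false</shared-store>
    <!-- restarted live: check whether another server with our node id
         is already live before activating -->
    <check-for-live-server>true</check-for-live-server>
    <!-- backup that became live: step down when the original live returns -->
    <allow-failback>true</allow-failback>
</hornetq-server>
```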

                   

                  <<
                  08:15:45,702 WARN  [org.hornetq.core.client] (hornetq-discovery-group-thread-dg-group1) HQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=539a3c67-faa6-11e5-ac0d-99ef87c31954
                  >>