5 Replies Latest reply on Mar 3, 2014 5:04 PM by andrew.morgan

Clustered replicated live-backup not failing back on live restart

andrew.morgan Feb 13, 2014 2:44 PM

I'm using JBoss EAP 6.2 (JBoss 7.3 and HornetQ 2.3.12_FINAL). I've configured a live-backup pair on two separate servers using the attached configurations. All multicast addresses have been changed to the same one, because our backend is configured to allow only one. I've confirmed multicast communication between the two servers.

The problem is the following:

1. Start live server - no problems

2. Start backup server - no problems, logs show replication from live to backup

3. Kill (or shut down) live server - no problems, logs show HornetQ starting up and JMS objects being bound to JNDI

4. Restart live server: backup HornetQ server does not stop, "HQ212034: There are more than one servers on the network broadcasting the same node id" messages appear in the backup server log every 2 - 5 seconds until I shut down the server.
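A quick way to see whether the warning in step 4 is still being logged is to count its occurrences in the backup's server log between checks. This is a minimal sketch: the function name is hypothetical, and the log path is left as an argument because it depends on the installation.

```shell
#!/bin/sh
# Count lines containing the HQ212034 warning in a server log.
# $1 is the path to the log file (installation-specific, so not hard-coded).
count_duplicate_node_warnings() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # there are no matches, so "|| true" keeps this usable under "set -e".
    grep -c 'HQ212034' "$1" || true
}
```

Running `count_duplicate_node_warnings /path/to/server.log` twice a minute apart shows whether the count is still growing after the live server has been restarted.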

I've looked at others' examples of working configurations and can't see anything missing or significantly different in mine.

standalone-full-ha.backup.xml 27.9 KB
standalone-full-ha.live.xml 27.8 KB

1. Re: Clustered replicated live-backup not failing back on live restart

jbertram Feb 13, 2014 3:37 PM (in response to andrew.morgan)
The server you're configuring as your "live" needs to have this:

<check-for-live-server>true</check-for-live-server>

See Chapter 39. High Availability and Failover.

2. Re: Re: Clustered replicated live-backup not failing back on live restart

andrew.morgan Feb 13, 2014 4:34 PM (in response to jbertram)

Thanks for the quick response. My apologies, the uploaded file was not the only configuration I tried.

The first attempt was as follows:

Live

           <hornetq-server>
                <backup>false</backup>
                <check-for-live-server>true</check-for-live-server>
                <shared-store>false</shared-store>
                <failover-on-shutdown>true</failover-on-shutdown>
                <backup-group-name>TestingBackupGroup</backup-group-name>

Backup

            <hornetq-server>
                <backup>true</backup>
                <shared-store>false</shared-store>
                <failover-on-shutdown>true</failover-on-shutdown>
                <allow-failback>true</allow-failback>
                <backup-group-name>TestingBackupGroup</backup-group-name>

I've also tried with <allow-failback>true</allow-failback> on both nodes (as well as <check-for-live-server> on both).

3. Re: Re: Re: Clustered replicated live-backup not failing back on live restart

jbertram Feb 13, 2014 5:32 PM (in response to andrew.morgan)

As a sanity test, I checked this myself just now using EAP 6.2. I created 2 local instances starting with standalone-full-ha.xml on both. Here's what I added or modified:

Live:

                <shared-store>false</shared-store>
                <check-for-live-server>true</check-for-live-server>
                <cluster-password>${jboss.messaging.cluster.password:secret}</cluster-password>

Backup:

                <shared-store>false</shared-store>
                <backup>true</backup>
                <allow-failback>true</allow-failback>
                <max-saved-replicated-journal-size>10</max-saved-replicated-journal-size>
                <cluster-password>${jboss.messaging.cluster.password:secret}</cluster-password>

Then I:

1. Started the live.
2. Started the backup.
3. Killed the live (i.e. via kill -9 <pid>).
4. Observed the backup take over.
5. Started the live.
6. Observed the backup cede to the live and become a backup again.

I repeated steps 3-6 several times. I did not observe any problems.

Do these steps work for you?

4. Re: Re: Re: Clustered replicated live-backup not failing back on live restart

andrew.morgan Feb 14, 2014 10:36 AM (in response to jbertram)
Yes, that works perfectly, both when running two instances on my workstation (starting without a -b parameter) and when running one on my workstation and another on my co-worker's workstation (using -b $HOSTNAME).

The differences between these tests are several:
The workstation tests ran on physical machines; the original tests ran on VMs.
The workstations have a single physical interface and VMs have 3 virtual ones.
The workstations are connected to a physical switch and the VMs are bridged on the host (I think).
I did a second test between the two workstations with all of the multicast addresses set to the same value (231.7.7.7) to eliminate that as a problem with the VM configuration.
I'll talk to the sysadmin who set up the VMs to see how they are connected and try to determine if the problem is there.
5. Re: Re: Re: Clustered replicated live-backup not failing back on live restart

andrew.morgan Mar 3, 2014 5:04 PM (in response to andrew.morgan)
Back from my vacation and a few hours of testing, and the problem is resolved.
I ran tests with tcpdump to examine the heartbeats passing between the servers. The UDP messages between live and backup appeared to be the same, and fragmented, no matter which direction they were travelling in, even though one HornetQ server could receive them and the other couldn't.

In addition to disabling multicast snooping on the HOST server as described in https://bugzilla.redhat.com/show_bug.cgi?id=880035,
echo 0 > /sys/class/net/virbr0/bridge/multicast_snooping
turning off udp-fragmentation-offload on the interface of the GUEST virtual machines allowed either server to see the other's heartbeat:
ethtool -K eth0 ufo off

Because the error was asymmetrical and the servers should have been identically configured, we will attempt the test again with freshly deployed VMs to determine if this is the case for all setups or if some undocumented configuration change on one of the two servers was causing the problem.
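When repeating the test on fresh VMs, it may help to record the two settings before changing anything, so an asymmetry between the servers is visible. The sketch below is an assumption-laden helper, not part of the original fix: the function names are hypothetical, and `virbr0` and `eth0` are simply the names used in the commands above.

```shell
#!/bin/sh
# Report the UFO state from `ethtool -k <iface>` output read on stdin.
# Prints "on" or "off", or "unknown" if the feature line is absent
# (some kernels may not list the feature at all).
ufo_state() {
    awk -F': *' '
        $1 ~ /udp-fragmentation-offload/ { split($2, a, " "); print a[1]; found = 1 }
        END { if (!found) print "unknown" }
    '
}

# Print the multicast snooping flag of a bridge on the HOST.
# $1 is the bridge name; "0" means snooping is disabled, "1" enabled.
snooping_state() {
    cat "/sys/class/net/$1/bridge/multicast_snooping"
}
```

On each guest, `ethtool -k eth0 | ufo_state` shows the current offload state; on the host, `snooping_state virbr0` shows the bridge flag. Neither the `echo` nor the `ethtool -K` change above persists across a reboot by itself, so the values are worth rechecking after the VMs or the host restart.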