-
1. Re: Clustered replicated live-backup not failing back on live restart
jbertram Feb 13, 2014 3:37 PM (in response to andrew.morgan)
The server you're configuring as your "live" needs to have this:
<check-for-live-server>true</check-for-live-server>
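You can confirm the element is present in the live instance's profile with a quick grep (the path assumes the default standalone layout):
# adjust $JBOSS_HOME and the profile name for your installation
grep -n 'check-for-live-server' $JBOSS_HOME/standalone/configuration/standalone-full-ha.xml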
-
2. Re: Re: Clustered replicated live-backup not failing back on live restart
andrew.morgan Feb 13, 2014 4:34 PM (in response to jbertram)
Thanks for the quick response. My apologies; the uploaded file was not the only configuration I tried.
The first attempt was as follows:
Live:
<hornetq-server>
   <backup>false</backup>
   <check-for-live-server>true</check-for-live-server>
   <shared-store>false</shared-store>
   <failover-on-shutdown>true</failover-on-shutdown>
   <backup-group-name>TestingBackupGroup</backup-group-name>
Backup:
<hornetq-server>
   <backup>true</backup>
   <shared-store>false</shared-store>
   <failover-on-shutdown>true</failover-on-shutdown>
   <allow-failback>true</allow-failback>
   <backup-group-name>TestingBackupGroup</backup-group-name>
I've also tried with <allow-failback>true</allow-failback> on both nodes (as well as <check-for-live-server> on both).
-
3. Re: Re: Re: Clustered replicated live-backup not failing back on live restart
jbertram Feb 13, 2014 5:32 PM (in response to andrew.morgan)
As a sanity test, I checked this myself just now using EAP 6.2. I created two local instances, both starting from standalone-full-ha.xml. Here's what I added or modified:
Live:
<shared-store>false</shared-store>
<check-for-live-server>true</check-for-live-server>
<cluster-password>${jboss.messaging.cluster.password:secret}</cluster-password>
Backup:
<shared-store>false</shared-store>
<backup>true</backup>
<allow-failback>true</allow-failback>
<max-saved-replicated-journal-size>10</max-saved-replicated-journal-size>
<cluster-password>${jboss.messaging.cluster.password:secret}</cluster-password>
Then I:
- Started the live.
- Started the backup.
- Killed the live (i.e. via kill -9 <pid>; see the sketch below).
- Observed the backup take over.
- Started the live.
- Observed the backup cede to the live and become a backup again.
I repeated steps 3-6 several times. I did not observe any problems.
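For the kill in step 3, something along these lines works; the pattern 'live' is just a stand-in for whatever is unique to the live instance's command line (e.g. its node directory):
# hypothetical: match something unique to the live instance and kill it uncleanly
kill -9 $(pgrep -f 'live')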
Do these steps work for you?
-
4. Re: Re: Re: Clustered replicated live-backup not failing back on live restart
andrew.morgan Feb 14, 2014 10:36 AM (in response to jbertram)
Yes, that works perfectly, both running two instances on my workstation (starting without a -b parameter) and running one on my workstation and another on my co-worker's workstation (using -b $HOSTNAME).
There are several differences between these tests and the original ones:
- The workstation tests ran on physical devices; the original tests ran on VMs.
- The workstations have a single physical interface; the VMs have three virtual ones.
- The workstations are connected to a physical switch; the VMs are bridged on the host (I think).
I did a second test between the two workstations with all of the multicast addresses set to the same value (231.7.7.7) to rule that out as the problem in the VM configuration.
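Roughly, the startup on each workstation looked like this; the property names assume the stock socket bindings in standalone-full-ha.xml, so adjust if yours differ:
# bind to the workstation's hostname and put both the jgroups traffic (-u) and
# the messaging broadcast/discovery groups on the same multicast address
./bin/standalone.sh -c standalone-full-ha.xml -b $HOSTNAME -u 231.7.7.7 -Djboss.messaging.group.address=231.7.7.7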
I'll talk to the sysadmin who set up the VMs to see how they are connected and try to determine if the problem is there.
-
5. Re: Re: Re: Clustered replicated live-backup not failing back on live restart
andrew.morgan Mar 3, 2014 5:04 PM (in response to andrew.morgan)
I'm back from my vacation, and after a few hours of testing the problem is resolved.
I ran tests with tcpdump to examine the heartbeats passing between the servers. The UDP messages between live and backup appeared identical, and were fragmented, in both directions, even though one HornetQ instance could receive them and the other couldn't.
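The capture was along these lines; the interface name and the default messaging-group multicast address from standalone-full-ha.xml (231.7.7.7) are assumptions to adapt to your setup:
# filtering by host rather than UDP port also catches the trailing IP fragments,
# which carry no UDP header of their own
tcpdump -n -i eth0 host 231.7.7.7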
The fix had two parts. First, we disabled multicast snooping on the HOST server, as described in https://bugzilla.redhat.com/show_bug.cgi?id=880035:
echo 0 > /sys/class/net/virbr0/bridge/multicast_snooping
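To confirm the setting took (0 means snooping is disabled):
# note: sysfs settings like this do not survive a host reboot
cat /sys/class/net/virbr0/bridge/multicast_snooping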
Second, turning off udp-fragmentation-offload on the interface of each GUEST virtual machine allowed each server to see the other's heartbeats:
ethtool -K eth0 ufo off
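To verify the offload flag on the guests (lowercase -k lists the current offload settings):
# expect a line reading "udp-fragmentation-offload: off"
ethtool -k eth0 | grep udp-fragmentation-offload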
Because the error was asymmetrical even though the servers should have been identically configured, we will attempt the test again with freshly deployed VMs to determine whether every such setup needs these changes or whether some undocumented configuration change on one of the two servers was causing the problem.