-
1. Re: Wildfly 8.2 - HornetQ live-backup High availability server configuration is not working as expected on network failure.
jbertram Aug 18, 2015 4:58 PM (in response to meabhi007)
What are you specifically doing to simulate a network failure?
You should configure your connection-ttl and check-period to deal with any network interruptions, and you should ensure that the network connection between your replicated live and backup servers is extremely stable. If a live server and its backup are separated from each other by some kind of network failure, then once the connection-ttl elapses the backup will become live, as you have observed. At this point you've got a "split brain" situation where clients could be interacting with each live server independently, which means the data between the servers will no longer be synchronized. Rectifying this situation requires administrative intervention to decide which server is the "real" live server at that point. The "fake" live server would then need to be restarted.
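For what it's worth, both values can be tuned on the cluster connection between the servers as well as on the connection factory used by clients. Below is only a minimal sketch against the WildFly 8.x messaging subsystem (the element names come from that schema, the millisecond values are illustrative assumptions, and element order must follow the schema):
<cluster-connection name="my-cluster">
   <address>jms</address>
   <connector-ref>http-connector</connector-ref>
   <!-- how often connection health is checked, in ms (illustrative value) -->
   <check-period>30000</check-period>
   <!-- how long a connection may be silent before it is considered dead, in ms -->
   <connection-ttl>60000</connection-ttl>
   <discovery-group-ref discovery-group-name="dg-group1"/>
</cluster-connection>
<connection-factory name="RemoteConnectionFactory">
   <connectors>
      <connector-ref connector-name="http-connector"/>
   </connectors>
   <entries>
      <entry name="java:jboss/exported/jms/RemoteConnectionFactory"/>
   </entries>
   <!-- client-side equivalents of the settings above (illustrative values) -->
   <client-failure-check-period>30000</client-failure-check-period>
   <connection-ttl>60000</connection-ttl>
</connection-factory>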
-
2. Re: Wildfly 8.2 - HornetQ live-backup High availability server configuration is not working as expected on network failure.
meabhi007 Aug 18, 2015 5:52 PM (in response to jbertram)
Hi Justin,
Thanks for the response.
You got it correct, we are trying to simulate a network failure.
As of now, we are running the servers in standalone mode.
Do you think that if we run these servers in domain mode (on different machines), we can avoid the split-brain situation?
We are afraid that depending on manual intervention will not be a reliable solution.
Please suggest if there is any way, through some API, that we can detect the split-brain situation and either push one JMS server into backup mode or restart it.
-
3. Re: Wildfly 8.2 - HornetQ live-backup High availability server configuration is not working as expected on network failure.
jbertram Aug 18, 2015 6:10 PM (in response to meabhi007)
You got it correct, we are trying to simulate a network failure.
Yes, I know. You said as much in your previous comment. I asked what specifically you were doing in order to simulate the network failure.
As of now, we are running the servers in standalone mode.
Do you think that if we run these servers in domain mode (on different machines), we can avoid the split-brain situation?
WildFly domain mode has nothing to do with HornetQ, so I don't think using domain mode would help you avoid a split-brain situation.
We are afraid that depending on manual intervention will not be a reliable solution.
Please suggest if there is any way, through some API, that we can detect the split-brain situation and either push one JMS server into backup mode or restart it.
There is no HornetQ API to detect and deal with a split-brain situation because there is no way for HornetQ to know which server should actually be the "real" live server. That decision has to be made by someone (or something) that has knowledge of the data that has changed on each of the servers and of which server has the data that the application needs to function appropriately.
You can mitigate the split-brain situation by increasing the size of your cluster, as discussed in the documentation (see the last paragraph of the "Data Replication" section). The smaller the cluster, the more likely a split-brain situation becomes.
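For example, any additional nodes only need to join the same cluster that the backup consults when deciding whether its live server is really gone. In the default WildFly 8.x standalone-full-ha.xml, that membership comes from elements along these lines (a sketch; the names bg-group1, dg-group1, and my-cluster match the shipped defaults):
<broadcast-groups>
   <broadcast-group name="bg-group1">
      <socket-binding>messaging-group</socket-binding>
      <connector-ref>http-connector</connector-ref>
   </broadcast-group>
</broadcast-groups>
<discovery-groups>
   <discovery-group name="dg-group1">
      <socket-binding>messaging-group</socket-binding>
      <refresh-timeout>10000</refresh-timeout>
   </discovery-group>
</discovery-groups>
<cluster-connections>
   <cluster-connection name="my-cluster">
      <address>jms</address>
      <connector-ref>http-connector</connector-ref>
      <discovery-group-ref discovery-group-name="dg-group1"/>
   </cluster-connection>
</cluster-connections>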
-
4. Re: Wildfly 8.2 - HornetQ live-backup High availability server configuration is not working as expected on network failure.
meabhi007 Aug 18, 2015 6:23 PM (in response to jbertram)
We are using the following commands in Windows to simulate a network outage (%1 = network interface name, %2 = time in seconds):
REM Disable the network interface named in %1 (e.g. "Local Area Connection")
@start netsh interface set interface %1 DISABLED
REM Wait %2 seconds with the network down
timeout /t %2
REM Re-enable the interface to restore the network
@start netsh interface set interface %1 ENABLED
This ensures that the network on the designated machine is down for the specified time interval.
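For example, if the script were saved as netdown.bat (a hypothetical name), taking the "Local Area Connection" adapter down for 60 seconds would look like:
netdown.bat "Local Area Connection" 60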
Thanks for your response.
-
5. Re: Wildfly 8.2 - HornetQ live-backup High availability server configuration is not working as expected on network failure.
amsinha Apr 5, 2016 11:19 AM (in response to jbertram)
Hi,
What needs to be done to deal with the following scenario:
1. Two nodes, each running a 'live' and a 'backup' server
2. Node 1 dies (power outage, etc.)
3. The 'backup' server on node 2 kicks in (as expected, and all is well up to now).
4. Node 1 is started back up and detects that there is already another live server.
This is surely the case where the backup server took over as live when node 1 went down. The question is: why is the 'backup' on node 2 not relinquishing control as live? Is there some specific configuration to control this behavior? I have 'check-for-live-server' as well as 'allow-failback' set to 'true', but the failback does not seem to be happening.
<<
08:15:45,702 WARN [org.hornetq.core.client] (hornetq-discovery-group-thread-dg-group1) HQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=539a3c67-faa6-11e5-ac0d-99ef87c31954
>>
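For reference, the relevant HA elements on each hornetq-server look something like the sketch below (WildFly 8.x messaging subsystem; element names come from that schema, the backup-group-name value is an illustrative assumption, and element order must follow the schema):
<hornetq-server>
   <!-- false on the live server; true on its backup -->
   <backup>false</backup>
   <!-- false = replication rather than a shared store -->
   <shared-store>false</shared-store>
   <!-- on restart, the old live checks the cluster for a server that took over its node id -->
   <check-for-live-server>true</check-for-live-server>
   <!-- the backup that became live yields back when the original live returns -->
   <allow-failback>true</allow-failback>
   <!-- pairs a live server with its intended backup; "pair-a" is a made-up name -->
   <backup-group-name>pair-a</backup-group-name>
   ...
</hornetq-server>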