4 Replies Latest reply on Jul 7, 2016 2:15 PM by wdfink

Incorrect cluster re-balance by reconnecting node after network outage

sjaiswal Jun 30, 2016 2:10 PM

Hi All,

I have been working on a complex system of embedded caches, and we are seeing multiple issues with incorrect re-balancing and split-brain scenarios.

Below are the details of how the application currently works.

infinispan version 8.1

wildfly version 10

java version 1.8

jgroups transport 3.6

Let's assume there are n nodes.

1. Only one node can write data to the caches through the application.

2. num_owners for the caches is left at the default of 2.

3. The caches are distributed so that they can serve more clients (clients are listening to the caches all the time).

4. Writes and updates to the caches can happen at any point in time.

5. Some of the caches are created as tree caches programmatically, while others are not.

6. The tree caches use pessimistic transactions with auto-commit and batch mode.

7. Locking uses serializable isolation with default settings otherwise.

8. Versioning is set to simple for all the caches.

Description of issue

Example 1.

Conditions: one of the nodes had a network outage for some time. Meanwhile, the caches were updated by the writer node and were re-balanced across the remaining nodes successfully.

Issue: the node that had the network outage comes back online.

Result 1: the cluster view ends up split, with incorrect data on more than one node, or

Result 2: the re-joining node is able to join the original cluster again but ends up corrupting data on one or more nodes.

Expected behavior: the re-joining node should be able to join, update itself to the latest data on the original cluster, and serve clients the latest cache entries.

Can you please suggest how we can control this situation? We have tried almost all of the configurations. From what we see, a restart resolves the cluster split, but restarting is not an option for us.

I can provide configurations if needed. Please let me know.

Any suggestions are welcome.

1. Re: Incorrect cluster re-balance by reconnecting node after network outage

rvansa Jul 1, 2016 8:06 AM (in response to sjaiswal)

Do you have partition handling enabled? With this setting, the node should wipe out its data when re-joining the cluster (when it sees that it has been removed from the cluster as inactive).
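
For context on which side counts as "removed", here is a minimal sketch of the check a partition makes when a split is detected. It assumes the rule as the Infinispan user guide describes it (worth verifying against the 8.1 documentation): a partition goes degraded if it may have lost every owner of some segment, or if it does not hold a simple majority of the last stable topology. This is a toy model, not Infinispan code: `SplitCheck`, `afterSplit` and the node counts are hypothetical, and the real check looks at segment ownership rather than a plain node count.

```java
// Toy model of the split-time availability check; not Infinispan code.
final class SplitCheck {
    enum Mode { AVAILABLE, DEGRADED }

    /**
     * @param stableSize    members in the last stable topology
     * @param partitionSize members this partition can still see
     * @param numOwners     number of copies kept of each entry
     */
    static Mode afterSplit(int stableSize, int partitionSize, int numOwners) {
        int lost = stableSize - partitionSize;
        // Simplification (assumption): numOwners or more nodes gone is treated
        // as "some segment may have lost all of its owners".
        boolean mayHaveLostData = lost >= numOwners;
        // Simple majority of the last stable topology: floor(n / 2) + 1.
        boolean hasMajority = partitionSize >= stableSize / 2 + 1;
        return (!mayHaveLostData && hasMajority) ? Mode.AVAILABLE : Mode.DEGRADED;
    }

    public static void main(String[] args) {
        // Five nodes, num_owners = 2, one node loses its network link.
        System.out.println("majority side: " + afterSplit(5, 4, 2));
        System.out.println("isolated node: " + afterSplit(5, 1, 2));
    }
}
```

Under these assumptions, with five nodes and num_owners = 2 the four-node side stays available and the isolated node goes degraded, so only the isolated node would be wiped on re-join. With only two nodes, neither side holds a majority and both go degraded.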
2. Re: Incorrect cluster re-balance by reconnecting node after network outage

sjaiswal Jul 1, 2016 11:26 AM (in response to rvansa)

I have tried that too, Radim.

The behavior was that, after the node rejoined the main cluster, data was also wiped on the original members.
This was checked without versioning. I will update again with the behavior once both simple versioning and partition handling are enabled.

Thanks for the lead.
3. Re: Incorrect cluster re-balance by reconnecting node after network outage

rvansa Jul 7, 2016 4:54 AM (in response to sjaiswal)

Data is wiped only on the non-available partition when it joins the available partition. If you see different behaviour, it's a bug. Could you write a reproducer? PartitionHappeningTest.java in the infinispan/infinispan repository on GitHub (at commit 7fa26cef1c9189c36da7694fb3f0ff0acf3bfd82) should give you some clues on how to simulate a partition.
4. Re: Incorrect cluster re-balance by reconnecting node after network outage

wdfink Jul 7, 2016 2:15 PM (in response to sjaiswal)

With PartitionHandling enabled it should not wipe data from an AVAILABLE partition. If there is no such partition (all are DEGRADED), none of the partitions will be wiped out; all data will be merged.
This is according to the PartitionHandling definition.
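
The merge rule just described can be sketched as a small model. This is only an illustration of the rule as stated in this thread, not Infinispan's implementation: `MergeModel`, `Partition` and `merge` are hypothetical names, and the way conflicting keys are resolved when all partitions are DEGRADED is an assumption, since the thread does not say which value wins.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the merge rule described in this thread; not Infinispan code.
final class MergeModel {
    enum Mode { AVAILABLE, DEGRADED }

    static final class Partition {
        final Mode mode;
        final Map<String, String> data;

        Partition(Mode mode, Map<String, String> data) {
            this.mode = mode;
            this.data = new HashMap<>(data); // defensive copy
        }
    }

    /** Returns the data every member holds once the partitions have merged. */
    static Map<String, String> merge(List<Partition> partitions) {
        for (Partition p : partitions) {
            if (p.mode == Mode.AVAILABLE) {
                // Members of the other partitions drop their own entries
                // and take over the AVAILABLE partition's state.
                return new HashMap<>(p.data);
            }
        }
        // No AVAILABLE partition: nothing is wiped, all entries are combined.
        // Assumption: for a key present on several sides, later partitions
        // in the list overwrite earlier ones.
        Map<String, String> merged = new HashMap<>();
        for (Partition p : partitions) {
            merged.putAll(p.data);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> main = new HashMap<>();
        main.put("k", "new");
        Map<String, String> isolated = new HashMap<>();
        isolated.put("k", "old");

        List<Partition> parts = new ArrayList<>();
        parts.add(new Partition(Mode.DEGRADED, isolated));
        parts.add(new Partition(Mode.AVAILABLE, main));
        System.out.println(merge(parts));
    }
}
```

In this model, an isolated node that comes back as DEGRADED never overwrites entries held by the AVAILABLE side, which matches the expected behavior in the original post.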

If there is something unexpected, please show the case and provide log files.