3 Replies Latest reply on Jun 17, 2004 6:19 PM by belaban

Cache is not synchronized in Cluster on reconnect

silviomatthes Jun 17, 2004 11:27 AM

We're testing the clustering capabilities of TreeCache.

The cache is not resynchronized on a node that lost its network connection and then reconnects to the cluster.

- We have a cluster of 2 Linux machines running JBoss 3.2.4 (final) with the TreeCache that shipped with JBoss 3.2.4.
- Cache is configured as SYNCRONIZED and REPL_SYNC

This is our scenario:
1.) Both machines (say A and B) are connected to each other.
2.) A Tx is started on machine A that puts 3 members in the cache, and it is committed.
3.) The printdetails-function called on each machine shows the same values. Fine.
4.) We unplug the network on machine B.
5.) Both machines recognize the disconnect (viewAccepted():...)
6.) We start a Tx on machine A similar to the one in 2.), which does a 'put' on 2 of the 3 members in the cache (so we modify 2 members), and commit the Tx.
7.) We reconnect machine B to the network.
8.) Both machines recognize each other and form a cluster again.

--> the printdetails-function called on every machine shows DIFFERENT values!
So the cache is not re-synchronized to machine B! Why?

We noticed 2 WARNing messages:

[NAKACK] [<machine B>] discarded message from non-member <machine A>
[NAKACK] [<machine B>] discarded message from non-member <machine A>

These messages appear on machine B after the network reconnect and BEFORE the "viewAccepted()" message that says the cluster has been rebuilt.
Maybe these messages should be processed in order to resynchronize the cluster?
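
To make the failure mode concrete, steps 1.) to 8.) can be modelled with a toy in-memory simulation. This is only a sketch: `ToyNode` and `PartitionDemo` are hypothetical names, the code does not use the TreeCache or JGroups APIs, and it assumes that rejoining only restores the member list without transferring any state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of one cache node. Hypothetical; NOT the TreeCache API. */
class ToyNode {
    final String name;
    final Map<String, String> data = new HashMap<>();
    final List<ToyNode> view = new ArrayList<>(); // peers currently reachable

    ToyNode(String name) { this.name = name; }

    /** Synchronous replication: apply locally, then on every peer in the current view. */
    void put(String key, String value) {
        data.put(key, value);
        for (ToyNode peer : view) {
            peer.data.put(key, value);
        }
    }
}

class PartitionDemo {
    static void connect(ToyNode a, ToyNode b) { a.view.add(b); b.view.add(a); }
    static void disconnect(ToyNode a, ToyNode b) { a.view.remove(b); b.view.remove(a); }

    /** Runs the scenario and returns the two nodes as {A, B}. */
    static ToyNode[] runScenario() {
        ToyNode a = new ToyNode("A");
        ToyNode b = new ToyNode("B");
        connect(a, b);            // step 1: both nodes see each other
        a.put("m1", "v1");        // step 2: three members written on A
        a.put("m2", "v1");
        a.put("m3", "v1");
        disconnect(a, b);         // steps 4-5: network unplugged, views shrink
        a.put("m1", "v2");        // step 6: two members modified on A only
        a.put("m2", "v2");
        connect(a, b);            // steps 7-8: views merge, but no state is transferred
        return new ToyNode[] { a, b };
    }

    public static void main(String[] args) {
        ToyNode[] nodes = runScenario();
        System.out.println("A: " + nodes[0].data);
        System.out.println("B: " + nodes[1].data);
    }
}
```

In this model B keeps the old values for the two modified members because nothing replays the updates it missed while it was outside A's view.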

1. Re: Cache is not synchronized in Cluster on reconnect

ben.wang Jun 17, 2004 11:47 AM (in response to silviomatthes)

Since the cache has already been started on Machine B, it won't initiate another state transfer from the other members when it rejoins the group, because a state transfer can be an expensive operation. If you stop and start the cache on Machine B (say, from the JMX console via the MBean service), then it should sync up.

But I will discuss with Bela whether we can add this as an option.

Thanks,

-Ben
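
A stop/start cycle like the one described above can also be triggered programmatically through JMX instead of the console. The following is a minimal sketch under stated assumptions: the object name `example.demo:service=DemoCache`, the `Lifecycle` interface and the `DemoCache` class are hypothetical stand-ins so that the example runs on its own. For a real deployment, the cache MBean's actual object name and operation names have to be taken from the service configuration.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.StandardMBean;

class CacheRestartSketch {
    /** Hypothetical management interface with lifecycle operations. */
    public interface Lifecycle {
        void start();
        void stop();
        int getStartCount();
    }

    /** Stand-in for the cache service; it only counts how often it was started. */
    static class DemoCache implements Lifecycle {
        private int startCount;
        public void start() { startCount++; }
        public void stop() { /* a real cache would leave the group here */ }
        public int getStartCount() { return startCount; }
    }

    /** Invokes stop and then start on the MBean registered under the given name. */
    static void restart(MBeanServer server, ObjectName name) throws Exception {
        server.invoke(name, "stop", new Object[0], new String[0]);
        server.invoke(name, "start", new Object[0], new String[0]);
    }

    /** Registers the stand-in, starts it, restarts it once, and returns the start count. */
    static int demo() {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ObjectName name = new ObjectName("example.demo:service=DemoCache"); // assumed name
            if (server.isRegistered(name)) {
                server.unregisterMBean(name);
            }
            server.registerMBean(new StandardMBean(new DemoCache(), Lifecycle.class), name);
            server.invoke(name, "start", new Object[0], new String[0]);
            restart(server, name);
            int count = (Integer) server.getAttribute(name, "StartCount");
            server.unregisterMBean(name);
            return count;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Only `restart` matters for the workaround. The rest exists so that there is something registered to call it on.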
2. Re: Cache is not synchronized in Cluster on reconnect

silviomatthes Jun 17, 2004 11:54 AM (in response to silviomatthes)

Hi,
thanks for your answer. It would be nice to have such an option, because otherwise we will run into problems with data inconsistency.

To implement a workaround that stops and starts the cache automatically in such cases, we first need to know when a machine has lost its connection to the cluster so that we can react to it.
Is there some kind of callback that is triggered when the member list of the cluster changes (I mean when the viewAccepted() message is displayed)?

Thanks in advance,

Silvio
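
The automatic workaround asked about here could be driven from a membership callback. JGroups reports view changes through `MembershipListener.viewAccepted(View)`; how the TreeCache shipped with 3.2.4 exposes that to application code should be checked against its Javadoc. The sketch below therefore stays self-contained: `RejoinDetector` and `Restartable` are hypothetical names, the member list is passed in as plain strings, and the expected cluster size is assumed to be known up front.

```java
import java.util.List;

class RejoinDetector {
    /** Hypothetical hook that stops and starts the local cache. */
    public interface Restartable {
        void restart();
    }

    private final int expectedSize;   // assumed: number of nodes in a healthy cluster
    private final Restartable cache;
    private boolean partitioned;

    RejoinDetector(int expectedSize, Restartable cache) {
        this.expectedSize = expectedSize;
        this.cache = cache;
    }

    /** Call this from the view-change callback with the current member list. */
    void onViewChange(List<String> members) {
        if (members.size() < expectedSize) {
            // At least one member is missing: remember that we were cut off.
            partitioned = true;
        } else if (partitioned) {
            // The view is complete again after a partition: resync once.
            partitioned = false;
            cache.restart();
        }
    }
}
```

Note that every node sees the shrunken view, so each node running this detector would restart after the rejoin. Which side should discard its state is an application decision that this sketch does not make.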
3. Re: Cache is not synchronized in Cluster on reconnect

belaban Jun 17, 2004 6:19 PM (in response to silviomatthes)

What you essentially want is a state-merge function after e.g. a network partition. This is actually on the roadmap, but it involves asking you (the application) how to merge 2 (potentially) different substates back into one. We *cannot* just take the union of the 2 substates, because an application may want to do this differently.
The final solution will definitely involve a callback into the application to resolve this, and we will probably also ship some default strategies.

Bela
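
To illustrate what such an application callback might look like, here is a minimal sketch. It is not the planned API: `MergePolicy`, `PREFER_LEFT` and `merge` are hypothetical names, and the substates are modelled as flat string maps rather than a tree.

```java
import java.util.HashMap;
import java.util.Map;

class StateMergeSketch {
    /** Hypothetical application callback, asked only when both sides hold different values. */
    public interface MergePolicy {
        String resolve(String key, String left, String right);
    }

    /** One possible default strategy: the left substate wins every conflict. */
    static final MergePolicy PREFER_LEFT = (key, left, right) -> left;

    /** Merges two substates, delegating each conflicting key to the policy. */
    static Map<String, String> merge(Map<String, String> left,
                                     Map<String, String> right,
                                     MergePolicy policy) {
        Map<String, String> merged = new HashMap<>(left);
        for (Map.Entry<String, String> e : right.entrySet()) {
            String key = e.getKey();
            String l = left.get(key);
            String r = e.getValue();
            if (l == null || l.equals(r)) {
                merged.put(key, r);                          // no conflict
            } else {
                merged.put(key, policy.resolve(key, l, r));  // the application decides
            }
        }
        return merged;
    }
}
```

Keys present on only one side are simply kept here, which is exactly the kind of union behaviour that may be wrong for some applications, for example when a key was deliberately removed on one side during the partition. A fuller policy would need to see those cases too.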