3 Replies Latest reply on Jan 26, 2007 3:54 AM by manik

Behaviour of the cache in case of a node failure

lucdew Jan 24, 2007 4:50 PM

Hi,

I have a newbie question. I'd like to know how the Pojo cache (1.4.1) behaves when a node fails in a 2-node cluster.

Our Pojo cache is configured to perform synchronous replication and to be transactional (the JTA transaction manager of our application server is used).

When a node is down and we call transactional methods that update objects in the cache, everything works fine without any error.

So I guess there's some node state detection implemented as a multicast heartbeat or something equivalent. But:
- What would be the behavior if an update of a cached object is done on a node and, at the same time, the other node crashes or cannot be accessed? Reading the docs, it says that an exception is thrown (of which type?) and the transaction is rolled back.
- How does JBoss Cache tell the difference between a node that was shut down and a node that cannot be reached (so synchronisation cannot be performed) due to a network failure or a JVM freeze?

Thanks in advance.

1. Re: Behaviour of the cache in case of a node failure

manik Jan 25, 2007 11:51 AM (in response to lucdew)

If a node fails during the 2-phase commit protocol, an exception is thrown internally by the comms layer. This is trapped and the transaction is marked for rollback, and the TM sees this and initiates a rollback.

If the node dies before the 2-pc protocol commences, then the tx does not fail; it just excludes the dead node from the cluster.

There is no difference between a shutdown and a node crashing as far as the 2-pc protocol is concerned. If the node dies/is shut down in the middle of the 2-pc protocol, it will cause a rollback on the initiator of the commit. Else, it will be excluded from the cluster before the 2-pc commences.
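To make the two cases concrete, here is a minimal sketch in plain Java. It is only an illustration of the behaviour described above, not JBoss Cache or JGroups code: `Member`, `TwoPhaseCommitSketch` and `replicate` are hypothetical names, and the `IllegalStateException` merely stands in for whatever the comms layer actually throws.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a cluster member; not a JBoss Cache or JGroups type.
class Member {
    final String name;
    boolean alive = true;
    String committed = null; // last committed value
    String pending = null;   // value staged during the prepare phase

    Member(String name) { this.name = name; }

    // Phase 1: stage the value. An unreachable member surfaces as an exception.
    void prepare(String value) {
        if (!alive) throw new IllegalStateException(name + " unreachable");
        pending = value;
    }

    // Phase 2: make the staged value visible.
    void commit() { committed = pending; pending = null; }

    // Discard the staged value, leaving the committed one untouched.
    void rollback() { pending = null; }
}

class TwoPhaseCommitSketch {
    // Runs prepare/commit over the current view.
    // Returns true if the update committed, false if it was rolled back.
    static boolean replicate(List<Member> view, String value) {
        List<Member> prepared = new ArrayList<>();
        try {
            for (Member m : view) {
                m.prepare(value);
                prepared.add(m);
            }
        } catch (IllegalStateException e) {
            // A member failed mid-protocol: undo what was staged so far.
            for (Member m : prepared) m.rollback();
            return false;
        }
        for (Member m : view) m.commit();
        return true;
    }

    public static void main(String[] args) {
        Member a = new Member("A");
        Member b = new Member("B");
        b.alive = false;
        // Case 1: B dies while still in the view, so the update rolls back.
        System.out.println("B still in view, committed: " + replicate(List.of(a, b), "v1"));
        // Case 2: B was already excluded from the view, so the update commits.
        System.out.println("B excluded, committed: " + replicate(List.of(a), "v2"));
    }
}
```

The only difference between the two calls is whether the dead member is still part of the view when the protocol starts, which is the distinction drawn above.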

We detect dead nodes using JGroups' failure detection protocols - FD and FD_SOCK. See JGroups docs for details on this.
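As a rough picture of what heartbeat-style failure detection does, here is a generic sketch in plain Java. It is not JGroups code and does not reproduce the FD or FD_SOCK implementations; `HeartbeatDetector`, its methods and the timeout value are all assumptions made up for the illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Generic heartbeat-timeout detector; a hypothetical illustration, not JGroups code.
class HeartbeatDetector {
    private final long timeoutMillis;
    private final Map<String, Long> lastHeard = new HashMap<>();

    HeartbeatDetector(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }

    // Record that a heartbeat from this member arrived at the given time.
    void heartbeat(String member, long nowMillis) {
        lastHeard.put(member, nowMillis);
    }

    // Members that have been silent for longer than the timeout are suspected.
    // From this side, a crash, a frozen JVM and a network cut all look the same.
    Set<String> suspected(long nowMillis) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Long> e : lastHeard.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) result.add(e.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        HeartbeatDetector detector = new HeartbeatDetector(1000);
        detector.heartbeat("A", 0);
        detector.heartbeat("B", 0);
        detector.heartbeat("A", 900); // B stays silent from here on
        System.out.println("suspected at t=1500: " + detector.suspected(1500));
    }
}
```

Timestamps are passed in explicitly so the sketch stays deterministic; a real detector would read a clock and run on a timer.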
2. Re: Behaviour of the cache in case of a node failure

lucdew Jan 25, 2007 6:15 PM (in response to lucdew)

Thanks a lot for the explanations.
Reading the Junit test cases also helped.

I also have a question regarding (re)synchronization.
For a cluster of 2 cache members A & B, imagine that:
- A network failure occurs and members A & B can't "see" each other.
- a transaction starts and updates the node "/node1" on member A and commits
- a transaction starts and updates the exact same node on member B and commits.

Then the network comes back and the members can see each other again. What would happen regarding data consistency? I guess node versioning or timestamping is used to determine which node should prevail when synchronizing members, but what if the same node is updated on both members? Is it a "last one wins" rule?
3. Re: Behaviour of the cache in case of a node failure

manik Jan 26, 2007 3:54 AM (in response to lucdew)

This will actually have an indeterminate outcome.

It will depend on a number of factors, including how the network heals. JGroups again has a lot of documentation on merging. The problem with the 'split brain' scenario you describe is that there is no way of knowing which version of the data is the more up-to-date one. Timestamps are of relatively little value (unless you have synchronized clocks, but even these could have drifted out of sync during the disconnect).

The only thing you can be sure of after such an event is that a subsequent write to the node (on either member) will bring them both in sync again.
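The scenario can be walked through with a small simulation in plain Java. This is only a sketch under simplifying assumptions: the two members are modelled as in-memory maps, replication is a direct copy, and nothing is reconciled when the partition heals. `SplitBrainSketch` and its methods are hypothetical names, not JBoss Cache API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical two-member model; maps stand in for each member's cache contents.
class SplitBrainSketch {
    final Map<String, String> memberA = new HashMap<>();
    final Map<String, String> memberB = new HashMap<>();
    boolean partitioned = false;

    // A write is applied locally and replicated only if the peer is reachable.
    void writeOnA(String fqn, String value) {
        memberA.put(fqn, value);
        if (!partitioned) memberB.put(fqn, value);
    }

    void writeOnB(String fqn, String value) {
        memberB.put(fqn, value);
        if (!partitioned) memberA.put(fqn, value);
    }

    // True when both members hold the same value for the node.
    boolean inSync(String fqn) {
        return Objects.equals(memberA.get(fqn), memberB.get(fqn));
    }

    public static void main(String[] args) {
        SplitBrainSketch cluster = new SplitBrainSketch();
        cluster.partitioned = true;                  // network failure
        cluster.writeOnA("/node1", "written on A");  // commits locally on A
        cluster.writeOnB("/node1", "written on B");  // commits locally on B
        cluster.partitioned = false;                 // network heals
        System.out.println("in sync after heal: " + cluster.inSync("/node1"));
        cluster.writeOnA("/node1", "new value");     // subsequent write
        System.out.println("in sync after next write: " + cluster.inSync("/node1"));
    }
}
```

Neither conflicting value is picked as the winner here; the members only converge because the later write is replicated to both sides.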