Buddy replication should not be used in a situation where you're expecting multiple servers to be concurrently modifying the same node. It's meant for use cases where one server owns the data.
Buddy replication combined with INVALIDATION doesn't make sense. Invalidation means, "I have the latest data; you may be out of date, so throw away your data." Sending such a message to a limited subset of the cluster doesn't make sense.
Is it for efficiency or correctness reasons?
I can imagine put() with the auto-gravitation option on as an atomic operation consisting of:
get() resulting in gravitation of the data
put() performed 'locally'
Then, as I understand it, the get() has to remove the node from the other servers when dataGravitationRemoveOnFind is set to true - that's where the question about the difference between INVALIDATION and REPLICATION came from.
Yet what if I invoke get() concurrently on two or more servers? Do I have any guarantee that at the end there will be only one main copy of the node?
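For context, the options mentioned above live in the buddy replication configuration. A sketch of the relevant fragment (element names follow the JBoss Cache 1.4-era BuddyReplicationConfig; verify them against your release):

```xml
<!-- Illustrative fragment only; check element names against your JBoss Cache version -->
<attribute name="BuddyReplicationConfig">
  <config>
    <buddyReplicationEnabled>true</buddyReplicationEnabled>
    <!-- gravitate automatically on a local miss, as in the put() scenario above -->
    <autoDataGravitation>true</autoDataGravitation>
    <!-- remove the node from the old owner (and its buddies) once found -->
    <dataGravitationRemoveOnFind>true</dataGravitationRemoveOnFind>
    <dataGravitationSearchBackupTrees>true</dataGravitationSearchBackupTrees>
  </config>
</attribute>
```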
If you do a put with the local option, it won't replicate to anyone, so the server that did the put will be out of sync with its buddies.
As to multiple servers simultaneously doing a put on the same node, here's what happens. I'm assuming the node already exists.
Assume no tx is running. The data in question is stored on server0 and its buddy group.
1) You do a put() on server1; simultaneously, a put() on server2.
2) DataGravitatorInterceptor.1 and DataGravitatorInterceptor.2 both see that the node doesn't exist locally; each fetches the node's data from across the cluster.
3) DataGravitatorInterceptor.1 and .2 take the data and do a put (not local). This replicates the data to each server's buddies. No tx, so no lock is held on the node. At this point there are three copies of the data -- the server0 group's, the server1 group's and the server2 group's.
4) DataGravitatorInterceptor.1 and .2 send a cleanup call to the cluster. Any copy of the data not associated with the sending server's buddy group is removed.
5) The original puts go through.
The end result here will very much depend on how things get interleaved. With REPL_SYNC you could end up with a TimeoutException in step 4 as server1 and server2 tell each other to remove the data and deadlock. Or server1 completes steps 3-5 and then server2 executes steps 3-5, in which case server2's change wins. Or both complete step 3, then server1 completes step 4 (so the server0 and server2 copies are gone), then server2 completes step 4 (so the server1 copy is gone). Then they both complete step 5, resulting in two sets of data, each of which only has the key/value pair included in the put.
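That last interleaving can be replayed deterministically with a toy model, where each buddy group's copy of the node is just a map and "replication" is a write into that map. The group names and method names here are illustrative, not JBoss Cache API:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the no-tx race in steps 3-5. Each buddy group's copy of the
// node is a map; removing a map entry models removing that group's copy.
public class GravitationRace {
    // buddy-group name -> that group's copy of the node (absent = no copy)
    static Map<String, Map<String, String>> groups = new HashMap<>();

    // Step 3: gravitate the data and put it (replicated to own buddy group).
    static void gravitateAndPut(String self) {
        Map<String, String> data = new HashMap<>(groups.get("group0")); // fetch from owner
        groups.put(self, data); // an extra copy now exists
    }

    // Step 4: cleanup call -- every copy outside my own buddy group is removed.
    static void cleanup(String self) {
        groups.keySet().removeIf(g -> !g.equals(self));
    }

    // Step 5: the original put() finally goes through (recreates node if gone).
    static void originalPut(String self, String key, String val) {
        groups.computeIfAbsent(self, g -> new HashMap<>()).put(key, val);
    }

    public static void main(String[] args) {
        groups.put("group0", new HashMap<>(Map.of("orig", "v"))); // server0 owns the node
        gravitateAndPut("group1");          // server1, step 3
        gravitateAndPut("group2");          // server2, step 3
        cleanup("group1");                  // server1, step 4: group0 and group2 copies gone
        cleanup("group2");                  // server2, step 4: group1 copy gone
        originalPut("group1", "k1", "v1");  // server1, step 5
        originalPut("group2", "k2", "v2");  // server2, step 5
        System.out.println(groups);
        // Two divergent copies survive, each holding only its own put's
        // key/value pair; the original "orig" entry is lost entirely.
    }
}
```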
Now, if there is a tx in place:
The put() in step 3 is done in a tx, so a write lock will be held on the node on each server until the tx commits. The put will not replicate until the tx commits.
The removes in step 4 will also not be broadcast until the tx commits.
The put in step 5 will not be replicated until the tx commits.
The fact that the write lock from step 3 is held should make steps 3-5 atomic. If it's REPL_SYNC, you have two servers trying to write to the same node, so it's possible that when the tx tries to commit you'll get a TimeoutException due to a lock conflict. With REPL_ASYNC, the later tx will win; the step 5 put from the earlier tx will be lost.
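The write-lock behavior in the tx case can be mimicked with plain java.util.concurrent primitives; this is a stand-in for the cache's per-node lock, not the JBoss Cache locking code itself:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Toy illustration of the tx case: the write lock taken by the step-3 put is
// held until the tx commits, so a second tx writing the same node blocks and,
// under REPL_SYNC, can time out (the cache surfaces this as a TimeoutException).
public class TxWriteLock {
    static final ReentrantLock nodeWriteLock = new ReentrantLock();
    static boolean secondTxGotLock;

    public static void main(String[] args) throws InterruptedException {
        nodeWriteLock.lock(); // tx1: step-3 put acquires the write lock, holds until commit
        Thread tx2 = new Thread(() -> {
            try {
                // tx2 tries to write the same node; waits up to its lock timeout
                secondTxGotLock = nodeWriteLock.tryLock(100, TimeUnit.MILLISECONDS);
            } catch (InterruptedException ignored) { }
        });
        tx2.start();
        tx2.join();
        System.out.println("second tx acquired lock: " + secondTxGotLock);
        // prints false -- analogous to the lock-conflict TimeoutException above
        nodeWriteLock.unlock(); // tx1 commits, releasing the write lock
    }
}
```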
But... while writing this I'm pretty sure I've spotted a bug in the tx case. The step 4 cleanup call gets bundled together with the other tx changes and therefore only gets replicated to the server's buddies, not to the whole cluster.