      • 15. Re: JBAS-2142 and JBCACHE-273 Discussion Thread
        starksm64

         

        TreeCache doesn't need a notification; it just has to expose a method that RpcDispatcher can call (although if someone can think of a use case for a notification I'd feel better about the call).


        The case I am thinking about is a CacheLoader that is listening for notifications while the node is inactive so it knows whether or not it is out of sync. Maybe this is already handled via another API.

        Maybe JGroups should just have a no-op MethodCall, signified by a null method or something, to support this short-circuit behavior.
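
        For reference, a minimal sketch of the kind of exposed-method call being discussed, assuming the JGroups 2.x RpcDispatcher/MethodCall API; the _activateRegion name and signature are hypothetical, not an existing TreeCache method.

            import org.jgroups.blocks.GroupRequest;
            import org.jgroups.blocks.MethodCall;
            import org.jgroups.blocks.RpcDispatcher;

            // Hypothetical sketch: the cache exposes a plain method, and other members
            // invoke it through RpcDispatcher.  No notification is required on the
            // receiving side; the dispatcher simply finds the method by name/signature.
            public class RegionCallSketch {
                public void _activateRegion(String fqn) {
                    // cache-side handler invoked by RpcDispatcher
                }

                public static void callOnAllMembers(RpcDispatcher disp, String fqn) {
                    MethodCall call = new MethodCall("_activateRegion",
                                                     new Object[] { fqn },
                                                     new Class[] { String.class });
                    // null destination list = all members; GET_ALL waits for every reply
                    disp.callRemoteMethods(null, call, GroupRequest.GET_ALL, 10000);
                }
            }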


        • 16. Re: JBAS-2142 and JBCACHE-273 Discussion Thread
          manik

           

          "bstansberry@jboss.com" wrote:

          Not sure exactly what JBCACHE-60 means; is the idea that different nodes in the cluster are (solely??) responsible for different portions of the tree? This seems like it could have overlap with JBCACHE-273.

          When I look at JBCACHE-61, I think its focus is on limiting normal replication traffic to just a subset of the cluster nodes, but between those nodes the entire tree would be kept consistent. Correct? We'd certainly need to think this through if we also allow nodes to inactivate portions of the tree -- if a node's buddy inactivated part of the tree, there would then be no backup of that part of the tree.


          Thinking about it in greater detail, there should be no overlap here except, as you pointed out, where a node decides to inactivate data for which it is acting as a primary or backup source.

          It is probably not something that affects the semantics or design of partial state transfer, though -- just something I'd need to keep in mind when designing buddy replication/partitioned caches. I will make a note of it on ( http://www.jboss.com/index.html?module=bb&op=viewtopic&p=3892211#3892211 ) for now.

          Cheers,
          Manik





          • 17. Re: JBAS-2142 and JBCACHE-273 Discussion Thread
            brian.stansberry

             Currently, the payload of a state transfer includes a byte[] of the transient state and a byte[] of the persistent state. The internal formats of these two byte[]s differ. I was asked to look into whether it makes any sense to unify the binary formats of the transient state and persistent state.

            I believe it makes sense to leave things as they are.

             The transient state byte[] is created by marshalling the root DataNode of the in-memory tree. When unmarshalled, the root node and all descendant nodes are already in a tree structure. Clearly this is efficient for the transient state -- the node providing state just makes a single marshal call, and on the other side the unmarshalled tree structure can easily be inserted into the tree.

             The persistent state byte[] is created by marshalling individual nodes from the tree, adding them one after the other to the byte[] with no attempt to link them together. This is efficient because the only code that handles the byte[] is the CacheLoaders on the sending and receiving ends. CacheLoaders naturally work with individual nodes (e.g. storing/retrieving a single node's attribute map in/from a database record or file) and use the node's Fqn to figure out parent-child relationships. Looking at both FileCacheLoader and JDBCCacheLoader, I see no way in which they could benefit from organizing the nodes into a tree structure.
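
             For illustration, a minimal sketch of that flat, node-by-node layout, using a hypothetical helper rather than actual JBossCache code: each entry is just an Fqn string followed by that node's attribute map, and the receiver can rebuild parent-child relationships from the Fqns alone.

                 import java.io.ByteArrayOutputStream;
                 import java.io.IOException;
                 import java.io.ObjectOutputStream;
                 import java.util.Map;

                 // Hypothetical sketch of the flat persistent-state format: a count
                 // followed by (fqn, attribute map) pairs, with no links between nodes.
                 public class FlatStateSketch {
                     public static byte[] marshalNodes(Map<String, Map<String, Object>> nodesByFqn)
                             throws IOException {
                         ByteArrayOutputStream baos = new ByteArrayOutputStream();
                         ObjectOutputStream out = new ObjectOutputStream(baos);
                         out.writeInt(nodesByFqn.size());
                         for (Map.Entry<String, Map<String, Object>> e : nodesByFqn.entrySet()) {
                             out.writeUTF(e.getKey());        // the node's Fqn as a string
                             out.writeObject(e.getValue());   // the node's attribute map
                         }
                         out.close();
                         return baos.toByteArray();
                     }
                 }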

             With respect to partial state transfer, the existing mechanism should work well; the only difference with a partial transfer is that the root node from which work begins can be something other than "/".

            • 18. Re: JBAS-2142 and JBCACHE-273 Discussion Thread
              belaban

              Don't we have to change the state transfer anyway because of the classloader issues Ben's been having?
              Otherwise, at some point we should get rid of Nodes implementing Externalizable and do our own marshalling. This would make the state to be transferred smaller, and marshalling/unmarshalling faster.
              I measured a 40% size reduction going from Externalizable to Streamable in JGroups (2.2.7 --> 2.2.8).
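
              As a rough illustration of the difference, a sketch of a node marshalled by hand in the Streamable style (writeTo/readFrom) rather than via Externalizable; the NodeData class and its fields are hypothetical.

                  import java.io.DataInputStream;
                  import java.io.DataOutputStream;
                  import java.io.IOException;

                  // Hypothetical sketch of hand-rolled marshalling in the Streamable style:
                  // only the raw field data is written, without the stream-header and
                  // class-descriptor overhead of default Java serialization.
                  public class NodeData {
                      String fqn;
                      byte[] attributes;   // pre-serialized attribute map

                      public void writeTo(DataOutputStream out) throws IOException {
                          out.writeUTF(fqn);
                          out.writeInt(attributes != null ? attributes.length : 0);
                          if (attributes != null) out.write(attributes);
                      }

                      public void readFrom(DataInputStream in) throws IOException {
                          fqn = in.readUTF();
                          int len = in.readInt();
                          attributes = new byte[len];
                          in.readFully(attributes);
                      }
                  }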

              • 19. Re: JBAS-2142 and JBCACHE-273 Discussion Thread
                brian.stansberry

                Locking Issues

                I've been working to resolve deadlock issues that arise from my attempts to ensure that the state a cluster node holds after a partial state transfer is consistent, even though other cluster nodes may be concurrently modifying the subtree.

                The simple approach I took to this was, on the node requesting state:

                1) Create the root node of the subtree.
                2) Lock it with a write lock.
                3) Inform the TreeCacheMarshaller that it can begin responding to messages for the subtree. Any messages that come in will block, as the node is locked.
                4) Request the state transfer and integrate the response.
                5) Release the write lock; any pending messages will go through.

                This should allow consistent state, as any puts/removes that come in while the state transfer is in progress can be correctly applied.

                The problem is that this is prone to deadlock -- the node requesting the state transfer holds a write lock on the subtree, while the node providing the state holds a write lock on some node in the subtree.

                When deadlock happens, the state transfer eventually succeeds, as in the default config the state transfer timeout is longer than the replication message timeout. But whenever this happens, the whole subtree is frozen until the deadlock resolves. In the case of HTTP session replication, that would mean locking a webapp for roughly 10 seconds.

                I'm thinking of three possible approaches to mitigating this:

                1) Use the current approach, but tighten the code to reduce the time the lock is held during state transfer. The underlying problem will still exist, but the odds of occurrence will be reduced. Not very satisfying.

                2) Use the approach I originally discussed -- i.e. create and lock the node, do the state transfer, enable the marshaller, unlock the node. There will be no deadlocks this way, but any events sent between the time the sending node locks its subtree and the time the marshaller is enabled on the recipient will be lost to the recipient.

                3) Add a queuing mechanism to TreeCache, i.e.
                a) Create subtree root.
                b) Notify TreeCacheMarshaller to begin passing incoming messages to the TreeCache's queuing method.
                c) Do the state transfer.
                d) Process any enqueued messages.
                e) Notify marshaller to begin normal message handling.

                This approach should allow consistent state, but essentially imposes asynchronous semantics on the enqueued messages. I hesitantly think that's OK, as the node receiving partial state is not really fully active for the subtree until normal message handling begins.
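
                A minimal sketch of what the queuing in option 3 might look like; the class and method names are hypothetical rather than existing TreeCache/TreeCacheMarshaller API, and only the JGroups MethodCall type is real.

                    import java.util.LinkedList;
                    import java.util.List;
                    import org.jgroups.blocks.MethodCall;

                    // Hypothetical sketch of approach #3: while a region's partial state
                    // transfer is in flight, incoming replication calls are queued; once
                    // the state is integrated the queue is drained and normal handling resumes.
                    public class RegionQueueSketch {
                        private final List<MethodCall> queue = new LinkedList<MethodCall>();
                        private boolean queueing = false;

                        public synchronized void startQueueing() {
                            queueing = true;
                        }

                        // Called by the marshaller for each incoming call targeting the region;
                        // returns false if the caller should dispatch the call normally.
                        public synchronized boolean offer(MethodCall call) {
                            if (!queueing) return false;
                            queue.add(call);
                            return true;
                        }

                        // After the transferred state is integrated: replay the queued calls
                        // (harmless if already reflected in the state, since they are
                        // idempotent), then resume normal message handling.
                        public synchronized void drain(Object cache) throws Throwable {
                            for (MethodCall call : queue) {
                                call.invoke(cache);
                            }
                            queue.clear();
                            queueing = false;
                        }
                    }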

                I'm inclined to go with the 3rd approach. Any thoughts on this?

                • 20. Re: JBAS-2142 and JBCACHE-273 Discussion Thread
                  belaban

                  I think #3 is correct. However, what is the rationale for acquiring a write lock? Can't you use a read lock? Because if someone wants to use READ_UNCOMMITTED, they should not block on state transfer.

                  • 21. Re: JBAS-2142 and JBCACHE-273 Discussion Thread
                    brian.stansberry

                     

                    I think #3 is correct


                    I implemented it and it works well (i.e. my unit tests that do transfers under concurrent load now pass).

                    I think the key thing with TreeCache is that the RPC calls are idempotent, so if you queue up some calls that turn out to have already been applied to the state you were transferred, there is no harm in reapplying them. You just don't want to miss any calls or get them out of order. This is much easier than partial state transfer in JGroups, where messages that change state may not be idempotent.

                    However, what is the rationale for acquiring a write lock? Can't you use a read lock? Because if someone wants to use READ_UNCOMMITTED, they should not block on state transfer.


                    Good point. I used a write lock because I was writing to the node, but a read lock is sufficient.

                    • 22. Re: JBAS-2142 and JBCACHE-273 Discussion Thread
                      brian.stansberry

                      Issues with Transactions

                      I found some interesting issues related to transactions that I'll briefly comment on here to add to the design record.

                      1) Commit message replication. TreeCacheMarshaller was not writing an Fqn at the start of the byte[] for commit or rollback method calls, because the Fqn to write was not readily available from the method call itself.

                      This caused a problem for receiving nodes with inactive regions: without the Fqn, the TreeCacheMarshaller would allow the commit/rollback method call to execute after having earlier rejected the prepare call, which would lead to exceptions and errors in the log.

                      The fix was to keep a map in TreeCacheMarshaller of GTX --> Fqn. An entry is added to the map for each prepare call and removed for each commit/rollback. The map makes it possible to write the Fqn to the byte[] for commit/rollback calls, from which only the GTX is available.
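
                      A small sketch of that bookkeeping, with hypothetical class and method names (this is not the actual TreeCacheMarshaller code):

                          import java.util.Collections;
                          import java.util.HashMap;
                          import java.util.Map;

                          // Hypothetical sketch: remember which region Fqn each transaction's
                          // prepare call targeted, so the later commit/rollback (which only
                          // carries the GTX) can be prefixed with the same Fqn.
                          public class GtxFqnMapSketch {
                              private final Map<Object, Object> fqnByGtx =
                                      Collections.synchronizedMap(new HashMap<Object, Object>());

                              public void onPrepare(Object gtx, Object fqn) {
                                  fqnByGtx.put(gtx, fqn);
                              }

                              // Look up and remove on commit/rollback so the map does not leak entries.
                              public Object onCompletion(Object gtx) {
                                  return fqnByGtx.remove(gtx);
                              }
                          }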

                      2) A node is activating a region. Before this starts, it discards a prepare message for the region. It then enqueues the commit. An exception is thrown when it processes the queue, as the ReplicationInterceptor can't find the local_tx from the prepare.

                      The fix was, when processing the queue, to keep track of the GTX from any prepare call. If a commit/rollback is found whose GTX wasn't registered, go ahead and invoke the method, but catch and discard the exception. I invoke the commit/rollback on the very remote chance that the prepare was properly executed before queueing began, but the more I think about it the more I think I'll just discard the method call.

                      3) TreeCacheAop issue. A node activates region /A. Before any region is activated, TreeCacheAop checks whether the __JBoss_Internal__ region is active; if not, it activates it. This ensures that the state needed to support shared object references is available.

                      Another node does some puts in region /B. These are ignored by the first node, as /B is not active on that node.

                      Now the node activates region /B. Partial state for /B is transferred, but not for __JBoss_Internal__, which was already activated.

                      The problem is that the __JBoss_Internal__ state associated with /B was included in various prepare() messages for /B that were discarded. Now __JBoss_Internal__ has incomplete state.

                      The solution was to add a 3rd inner byte[] to the state transfer data -- transient state, persistent state, and a new one, associated state. Two methods were added to TreeCache:

                      protected byte[] _getAssociatedState(Fqn fqn)
                      protected void _setAssociatedState(Fqn fqn, byte[] state)


                      These are called by the _getState and _setState methods respectively to handle the 3rd byte[].

                      In TreeCache, these methods are no-ops. In TreeCacheAop they are overridden to transfer and integrate the relevant data from the __JBossInternal__ area.
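
                      A sketch of how the hook splits between the two classes; the bodies below are illustrative only, not the actual TreeCache/TreeCacheAop implementation.

                          import org.jboss.cache.Fqn;

                          // Hypothetical sketch: the base cache's hooks do nothing, while the
                          // AOP subclass overrides them to ship and integrate the internal-region
                          // data tied to the transferred Fqn.
                          public class AssociatedStateSketch {

                              protected byte[] _getAssociatedState(Fqn fqn) throws Exception {
                                  return null;   // plain cache: nothing extra to send
                              }

                              protected void _setAssociatedState(Fqn fqn, byte[] state) throws Exception {
                                  // plain cache: nothing extra to integrate
                              }

                              public static class AopSketch extends AssociatedStateSketch {
                                  protected byte[] _getAssociatedState(Fqn fqn) throws Exception {
                                      // marshal the internal-region nodes associated with fqn
                                      return new byte[0];
                                  }

                                  protected void _setAssociatedState(Fqn fqn, byte[] state) throws Exception {
                                      // unmarshal and merge the data back into the internal region
                                  }
                              }
                          }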

                      • 23. Re: JBAS-2142 and JBCACHE-273 Discussion Thread
                        brian.stansberry

                        Inactive Regions

                        When a portion of the tree has been inactivated via a call to inactivateRegion(), all this does is prevent execution of any remote replication method calls on that region. It does not prevent local users of the cache from putting objects into the inactive region, and any such puts would also be replicated to the other nodes in the cluster.

                        This raises a couple of issues:

                        1) Should we prevent local activity on an inactive region? This could be done by adding an interceptor. I'm not sure it's worth the overhead. If we did add such an interceptor, it would probably need to be in 1.3.

                        2) If we don't prevent local activity, what should we do when the user activates the region? I originally coded activateRegion() to check for a non-empty region and throw an exception if one was found, but now I'm not so sure. No matter what, when a region is activated, we do a state transfer for that region from another node and completely replace any existing state with the transferred state. So what's the harm if there was already state there? I'm now thinking that the presence of existing state justifies logging a WARN, not throwing an exception.

                        The WARN vs. exception question in #2 probably comes down to what we want to do in the future re: #1. If in 1.3 we are going to prevent local activity on an inactive region, leaving the exception in activateRegion() in 1.2.4 makes sense; it will discourage people from writing apps that do something we will prevent in 1.3.
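
                        A tiny sketch of the decision point with hypothetical names (not the real activateRegion() code); the strict flag stands in for the exception-vs-WARN choice discussed above.

                            import org.apache.commons.logging.Log;
                            import org.apache.commons.logging.LogFactory;

                            // Hypothetical sketch of the check discussed in #2: on activateRegion(),
                            // either reject pre-existing local state or warn and let the transferred
                            // state replace it.
                            public class ActivationCheckSketch {
                                private static final Log log = LogFactory.getLog(ActivationCheckSketch.class);

                                void checkExistingState(String fqn, boolean regionHasLocalState, boolean strict) {
                                    if (!regionHasLocalState) return;
                                    if (strict) {
                                        throw new IllegalStateException("Region " + fqn + " already contains local state");
                                    }
                                    log.warn("Region " + fqn + " already contains local state; it will be replaced by the transferred state");
                                }
                            }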

                        • 24. Re: JBAS-2142 and JBCACHE-273 Discussion Thread

                          This is somewhat outside of this topic, but so far in JBossCache these are the aspects that are not totally orthogonal: eviction, cache loader, (passivation), and state transfer. There may be more on the way as the cache feature set is expanded. Potentially this can become a problem down the road, with too much inter-dependency.

                          • 25. Re: JBAS-2142 and JBCACHE-273 Discussion Thread

                            Brian, what you suggest with the "inactive region" implies that a region should have its own lifecycle. In 2.0, we will elevate the region to a first-class construct. This may make sense then?

                            • 26. Re: JBAS-2142 and JBCACHE-273 Discussion Thread
                              brian.stansberry

                              Adding an interceptor (or some other approach to preventing local code modifying an inactive region) as part of the general 2.0 work on regions makes sense.

                              As things stand now, the existing activateRegion()/inactivateRegion() methods effectively give the relevant subtree its own lifecycle, particularly since inactivateRegion() evicts all nodes in the subtree from the cache and activateRegion() will throw an exception if it finds existing data in the subtree being activated. For now these two facts should be enough to discourage people from writing apps that try to locally modify an inactive region, so I see no rush to add something that formally prevents it; doing it in 2.0 sounds fine.
