17 Replies. Latest reply on Jul 2, 2012 7:14 PM by dex80526

cache put stucks forever in a 3 node cluster

dex80526 Jun 22, 2012 6:48 PM

I have been banging my head against the wall these days and cannot see what is going on with Infinispan replication.

Here is my problem:

I have a 3 node cluster in (ASYNC) replication mode using TCP with ISPN 5.1.4.Final.

I have a cache configured with a JDBC cache store using Derby. I made the cache "NON_TRANSACTIONAL".

During a test, I am loading 10K entries into the cache from one node (they should be stored to the cache store and replicated to the other 2 nodes in ASYNC mode). I found that the load gets stuck after a few hundred entries (the number varies) and never completes, and there is no Exception/Error in the log on either the loading node or the other 2 nodes. I turned on trace and did not see any obvious problems.

However, with the same code and the same ISPN configuration but a 2 node cluster, the load completes in less than one minute.

I tried using the "SKIP_LOCKING" and/or "PUT_FOR_EXTERNAL_READ" flags and putAsync(), and did not see much difference.

My testing code looks like this:

....

try {
    ObjectInputStream objectInputStream = new ObjectInputStream(inputStream);
    Object readObject;
    // readObject() signals end of stream with EOFException; it returns null only if null was written
    while ((readObject = objectInputStream.readObject()) != null) {
        if (readObject instanceof UserProfileData) {
            UserProfileData userProfileData = (UserProfileData) readObject;
            cache.put(userProfileData.getUserId(), userProfileData);
            importExportResults.messages.append("Imported ID: " + userProfileData.getUserId() + "\n");
            importExportResults.numSuccessfulItems++;
        } else {
            // not a UserProfileData object: count it and keep going
            importExportResults.numWarnings++;
        }
    }
} catch (EOFException e) {
    importExportResults.messages.append("Got EOF, assuming finished\n");
} catch (Exception e) {
    importExportResults.numErrors++;
}

Here is a snippet of the trace logs from the node that is doing the load:

....

2012-06-22/14:12:14.722/EDT [CoalescedAsyncStore-48] TRACE org.infinispan.loaders.jdbc.connectionfactory.ManagedConnectionFactory[99] - Connection checked out: ProxyConnection[PooledConnection[org.apache.derby.impl.jdbc.EmbedConnection40@1739265564 (XID = 163), (SESSIONID = 9), (DATABASE = /usr/local/test/cacheData/user/UserProfileDB), (DRDAID = null) ]]

2012-06-22/14:12:14.722/EDT [CoalescedAsyncStore-48] TRACE org.infinispan.loaders.jdbc.connectionfactory.ManagedConnectionFactory[99] - Connection checked out: ProxyConnection[PooledConnection[org.apache.derby.impl.jdbc.EmbedConnection40@1739265564 (XID = 163), (SESSIONID = 9), (DATABASE = /usr/local/test/cacheData/user/UserProfileDB), (DRDAID = null) ]]

2012-06-22/14:12:14.723/EDT [CoalescedAsyncStore-55] TRACE org.infinispan.util.concurrent.locks.StripedLock[131] - WL released for '9314e7b09e3623e128a5496e88feb36bf42db087'

2012-06-22/14:12:14.723/EDT [CoalescedAsyncStore-55] TRACE org.infinispan.loaders.LockSupportCacheStore[212] - exit store(ImmortalCacheEntry{key=9314e7b09e3623e128a5496e88feb36bf42db087, value=ImmortalCacheValue {value=com.test.service.shared.cache.cacheables.UserProfileData@58376d08}})

2012-06-22/14:12:14.723/EDT [CoalescedAsyncStore-55] TRACE org.infinispan.loaders.decorators.AsyncStore[427] - Release lock for key 9314e7b09e3623e128a5496e88feb36bf42db087

2012-06-22/14:12:14.724/EDT [CoalescedAsyncStore-48] TRACE org.infinispan.util.concurrent.locks.StripedLock[131] - WL released for 'c415505dca69be631ca5d391b3ccd2b44b52d017'

2012-06-22/14:12:14.724/EDT [CoalescedAsyncStore-48] TRACE org.infinispan.loaders.LockSupportCacheStore[212] - exit store(ImmortalCacheEntry{key=c415505dca69be631ca5d391b3ccd2b44b52d017, value=ImmortalCacheValue {value=com.test.service.shared.cache.cacheables.UserProfileData@4c86380a}})

2012-06-22/14:12:14.724/EDT [CoalescedAsyncStore-48] TRACE org.infinispan.loaders.decorators.AsyncStore[427] - Release lock for key de0230a7787926f4d061dd366c132af3a9ee33f6

2012-06-22/14:12:14.724/EDT [CoalescedAsyncStore-48] TRACE org.infinispan.loaders.decorators.AsyncStore[427] - Release lock for key c415505dca69be631ca5d391b3ccd2b44b52d017

2012-06-22/14:12:14.724/EDT [CoalescedAsyncStore-48] TRACE org.infinispan.loaders.decorators.AsyncStore[427] - Release lock for key 01901175a7e99c387566ac191743e0d6156c0d34

2012-06-22/14:12:15.466/EDT [CacheViewTrigger,portal3.testme.com-12013] TRACE org.infinispan.cacheviews.CacheViewsManagerImpl[817] - Woke up, shouldRecoverViews=false

2012-06-22/14:12:16.471/EDT [CacheViewTrigger,portal3.testme.com-12013] TRACE org.infinispan.cacheviews.CacheViewsManagerImpl[817] - Woke up, shouldRecoverViews=false

...

After this, the put does not return any more. I did not see a timeout either. jstack does not detect a "DEADLOCK", but I saw a lot of threads (CoalescedAsyncStore...) in WAITING.

Any ideas? What else should I try?

1. Re: cache put stucks forever in a 3 node cluster

dex80526 Jun 26, 2012 11:40 AM (in response to dex80526)

Does anyone have any suggestions or tips? I have run out of things to try right now. It seems to me Infinispan just cannot scale beyond a 2 node cluster (at least in replication mode). It is unthinkable.
2. Re: cache put stucks forever in a 3 node cluster

darrellburgan Jun 26, 2012 12:48 PM (in response to dex80526)

One suggestion is to put a sleep at the top of your tight loop and see if the problem goes away or the behavior changes. If it does, that would tend to indicate that there is some time-based race condition going on inside Infinispan ...
3. Re: cache put stucks forever in a 3 node cluster

dex80526 Jun 26, 2012 3:40 PM (in response to darrellburgan)

Thanks, Darrell, for your tip. We can work around things in our code, but I just cannot understand why the put() method does not return and does not give any errors (it sounds like a deadlock to me). Earlier, we ran into performance/scaling issues with our session replication on a 3 node cluster, and we had to disable session replication for heavily loaded 3 node clusters. That really stinks.

I tried tuning some jgroups-tcp.xml parameters. I made changes to the following parameters and they made a significant difference, although I do not know if they have any other (negative) side effects:

increase max_bundle_size from 64k to 128k
increase max_bundle_timeout from 30 to 60

increase thread_pool.max_threads from 30 to 60
set oob_thread_pool.queue_enabled="true"

increase UFC max_credits from 200k to 300k
increase MFC max_credits from 200k to 300k
increase FRAG2 frag_size from 60k to 120k

Could someone who knows the details of these parameters and their effects explain how to tune them? I feel there are other parameters that may need to be adjusted.
4. Re: cache put stucks forever in a 3 node cluster

darrellburgan Jun 26, 2012 4:24 PM (in response to dex80526)

I would like to echo that it would be really great if there was a document that explained how to tune JGroups specifically for Infinispan. The JGroups documentation goes into great detail, but it necessarily is generic documentation that any program using JGroups could use. It's very hard to take that general information and apply it to Infinispan without understanding the guts of how Infinispan is coded.

So, I'd really love to see a document that explains, specific to Infinispan, which JGroups tags do what, and provides guidance as to how to tune the stack at the JGroups level ...
5. Re: cache put stucks forever in a 3 node cluster

dan.berindei Jun 27, 2012 9:26 AM (in response to darrellburgan)

@Darrell: Yes, a tuning document for JGroups with Infinispan would be great. However, it would have to be pretty generic as well: the best JGroups configuration depends both on how Infinispan is configured and on how it is used.

@dex, this might be a problem in the async cache store code... If you can share some runnable code that reproduces the problem, that would be great. Otherwise, can you post some information like how big your keys/values are, your full Infinispan/JGroups config and a thread stack dump of all 3 nodes?

I don't think any of the JGroups configuration changes you made actually improve things, except maybe the UFC/MFC changes (but on the other hand, increasing those is bad for state transfer). oob_thread_pool.queue_enabled=true would be pretty bad if you had sync replication, but with async replication it doesn't really matter.
Your FRAG2 frag_size is way too high, you should stick to the 60k default if you're using UDP (can't send >64k datagrams anyway) or remove FRAG2 completely if you're using TCP.
6. Re: cache put stucks forever in a 3 node cluster

dex80526 Jun 27, 2012 10:09 AM (in response to dan.berindei)

Thanks, Dan!

The whole code is big and complicated since we use Infinispan in embedded mode. I captured stack dumps on all 3 nodes, and they are large. Is it possible to send them to you directly?

Each cache entry is very small (<200K). The cache uses a JDBC cache store (Derby) in async mode. I am using TCP for JGroups.

For some reason, after changing those parameters, the same code completes reliably on the same 3 node cluster. Before the change, the load got stuck almost every time.

Tuning is hit and miss right now since I do not have any reference docs on those parameters and what they do.

Are you saying that FRAG2 is only applicable to UDP?

What are the bundle.size and bundle.timeout for?
7. Re: cache put stucks forever in a 3 node cluster

dex80526 Jun 27, 2012 11:21 AM (in response to dan.berindei)
Dan: I am attaching the jstacks from all 3 nodes. Please let me know if you need other info. Thanks for looking into this. This is the last hurdle for us to release the code right now.

jstack.zip 23.2 KB
8. Re: cache put stucks forever in a 3 node cluster

darrellburgan Jun 27, 2012 12:59 PM (in response to dan.berindei)

Dan Berindei wrote:

Your FRAG2 frag_size is way too high, you should stick to the 60k default if you're using UDP (can't send >64k datagrams anyway) or remove FRAG2 completely if you're using TCP.

This is exactly the kind of JGroups configuration advice I need. We are using TCPGOSSIP and have a FRAG2 set to 60K. I will remove it ....
9. Re: cache put stucks forever in a 3 node cluster

dan.berindei Jun 27, 2012 1:06 PM (in response to dex80526)
dex chen wrote:

The whole code is big and complicated since we use Infinispan in embedded mode. I captured stack dumps on all 3 nodes, and they are large. Is it possible to send them to you directly?

Thanks, I saw them in the forum post.
I think I can see the problem in the stack dump of portal2: since you're in async mode, all commands are replicated using a single thread: "Scheduled-replicationQueue-thread-0". Because of MFC, this thread got stuck waiting for more credits from the other nodes. Eventually the replication queue filled up and the user thread started blocking as well.
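To make that chain of events concrete, here is a minimal, self-contained sketch. It is not Infinispan or JGroups code: `CreditStallDemo`, the one-credit-per-message cost and the queue size are all made up for illustration, and credits are never replenished, which stands in for the other nodes not sending credits back. A single sender thread drains a bounded queue until the credits run out, after which the queue fills up and the caller can no longer enqueue.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the replication queue plus credit-based flow control.
class CreditStallDemo {

    // Returns how many put() calls were accepted before the caller would have blocked.
    static int acceptedPuts(int credits, int queueCapacity) {
        final Semaphore creditPool = new Semaphore(credits); // never replenished in this sketch
        final BlockingQueue<Integer> replicationQueue =
                new ArrayBlockingQueue<Integer>(queueCapacity);

        // Single sender thread, playing the role of "Scheduled-replicationQueue-thread-0".
        Thread sender = new Thread(new Runnable() {
            public void run() {
                try {
                    while (true) {
                        creditPool.acquire();     // parks here once the credits are used up
                        replicationQueue.take();  // "send" one queued command
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        sender.setDaemon(true);
        sender.start();

        int accepted = 0;
        try {
            // A blocking put would wait indefinitely; offer() with a timeout lets the demo end.
            while (replicationQueue.offer(accepted, 500, TimeUnit.MILLISECONDS)) {
                accepted++;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        sender.interrupt();
        return accepted;
    }

    public static void main(String[] args) {
        System.out.println("puts accepted before stalling: " + acceptedPuts(5, 3));
    }
}
```

With 5 credits and a queue of 3, the caller gets 8 puts through (5 sent plus 3 queued) and then stalls, with no exception anywhere, which matches the symptom of a load that stops after a few hundred entries.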

dex chen wrote:

Each cache entry is very small (<200K). The cache uses JDBC cache store (Derby) in async mode. I am using TCP for jgroups.

That's not so small... I was expecting something like 10k

For some reason, after changing those parameters, the same code completes reliably on the same 3 node cluster. Before the change, the load got stuck almost every time.

I think increasing the MFC max_credits is probably responsible for your code starting to work. Unfortunately I don't know exactly why it wouldn't work before, but it might be because the FRAG2 fragment size was greater than the MFC min_threshold*max_credits (60k > 0.2 * 200k). So the receiver thought the sender still had enough credits, but the sender saw it didn't have enough credits to send another fragment.

But 120k > 300k * 0.2 as well, so I'm not sure why you're not seeing the problem anymore. I'll see if I can write a small test for this, but in the meantime I'd suggest keeping MFC.max_credits higher than FRAG2.frag_size / MFC.min_threshold (and same for UFC).
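The arithmetic behind that rule of thumb can be checked with a few lines of Java. This is only a sketch of the inequality discussed above, not anything JGroups does internally; the class and method names are invented, "k" is assumed to mean 1000 bytes, and the threshold is passed as a percentage (0.2 becomes 20) so everything stays in integer arithmetic.

```java
// Sketch of the rule of thumb: max_credits > frag_size / min_threshold.
class CreditThresholdCheck {

    // True when one fragment is larger than min_threshold * max_credits.
    static boolean fragmentExceedsThreshold(long fragSizeBytes, long maxCreditsBytes,
                                            int minThresholdPercent) {
        return fragSizeBytes * 100 > maxCreditsBytes * minThresholdPercent;
    }

    // Smallest max_credits (in bytes) that is strictly above frag_size / min_threshold.
    static long minSafeMaxCredits(long fragSizeBytes, int minThresholdPercent) {
        return fragSizeBytes * 100 / minThresholdPercent + 1;
    }

    public static void main(String[] args) {
        // Original settings: frag_size 60k, max_credits 200k, min_threshold 0.2
        System.out.println(fragmentExceedsThreshold(60_000, 200_000, 20));
        // Changed settings: frag_size 120k, max_credits 300k, min_threshold 0.2
        System.out.println(fragmentExceedsThreshold(120_000, 300_000, 20));
        // max_credits needed to satisfy the rule with the 60k fragment size
        System.out.println(minSafeMaxCredits(60_000, 20));
    }
}
```

Both configurations from this thread violate the rule, and with a 60k fragment size and a 0.2 threshold max_credits would have to be above 300k to satisfy it.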

Tuning is hit and miss right now since I do not have any reference docs on those parameters and what they do.

Yeah, even the JGroups manual doesn't describe everything... I think Bela was working on a presentation on tuning for JBossWorld, I hope he's going to put that online as well.

Are you saying that FRAG2 is only applicable to UDP?

I was thinking that way, yes, but apparently it is needed for UFC/MFC to work properly as well - even with TCP.

What are the bundle.size and bundle.timeout for?

These are used when you have the opposite of your case - lots of small messages. JGroups will try to bundle more than one message in the same IP packet, to make it more efficient. Something like Nagle's algorithm in TCP.
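As a rough illustration of the bundling idea, the sketch below packs message sizes into bundles that stay within a maximum size. It is a simplified, size-only model with invented names (`BundlingSketch`, `bundle`), not the JGroups bundler: the timeout that would also flush a partly filled bundle is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, size-only illustration of message bundling; not the JGroups implementation.
class BundlingSketch {

    // Groups message sizes (bytes) into bundles whose total stays within maxBundleSize.
    static List<List<Integer>> bundle(int[] messageSizes, int maxBundleSize) {
        List<List<Integer>> bundles = new ArrayList<List<Integer>>();
        List<Integer> current = new ArrayList<Integer>();
        int currentBytes = 0;
        for (int size : messageSizes) {
            // Flush the pending bundle if this message would overflow it.
            if (!current.isEmpty() && currentBytes + size > maxBundleSize) {
                bundles.add(current);
                current = new ArrayList<Integer>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            bundles.add(current); // a real bundler would also flush here when a timeout expires
        }
        return bundles;
    }

    public static void main(String[] args) {
        int[] sizes = {1000, 1000, 1000, 1000, 1000};
        System.out.println(bundle(sizes, 64000).size() + " packet(s) instead of " + sizes.length);
    }
}
```

With five 1000-byte messages and a 64000-byte limit, everything fits in one bundle, so raising the limit further would not change anything for messages of that size.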
10. Re: cache put stucks forever in a 3 node cluster

dex80526 Jun 27, 2012 1:50 PM (in response to dan.berindei)

Dan, thanks for looking into this.

Let me correct myself here. The average size of each cache entry is around 1000 bytes (1 kB).

What I do not understand is why the replication/put neither timed out nor recovered after it got stuck. In addition, since I am using TCP, shouldn't UFC be used rather than MFC? Is there a way to use a pool of Scheduled-replicationQueue threads?

I need to mention that with the parameters in my previous post I see long startup times on some nodes, and a lot of cache view prepare timeout exceptions.

I tried some parameters again in my test environment (this time, I only changed 2 parameters: the thread pool sizes), and it worked.

keep max_bundle_size at 64k
keep max_bundle_timeout at 30

increase thread_pool.max_threads from 30 to 60
set oob_thread_pool.queue_enabled="false"
increase oob_thread_pool.max_threads from 30 to 60
keep UFC max_credits at 200k
keep MFC max_credits at 200k
keep FRAG2 frag_size at 60k

In other words, I just doubled the thread pool sizes from my original configuration, and the new configuration allows me to complete the import, and the startup time is good too.

It seems that increasing max_credits has negative effects on the startup (cluster forming).

The scaling (up) to more nodes (5-7 nodes) with Infinispan (with replication) appears less and less feasible with the experience we had so far. I may have to re-think our approach.

Infinispan (cache and cache store/loader along with replication mode) really provides a good alternative for us to build a share-nothing cluster. But it will be too limited to deploy if we cannot scale up in terms of cluster size (up to a dozen nodes) and performance.
11. Re: cache put stucks forever in a 3 node cluster

dan.berindei Jun 28, 2012 6:47 AM (in response to dex80526)
dex chen wrote:

Let me correct myself here. The average size of each cache entry is around 1000 bytes (1 kB).

Ok, then my explanation with FRAG2.frag_size > MFC.max_credits * MFC.min_threshold is certainly off the track...

What I do not understand is why the replication/put neither timed out nor recovered after it got stuck. In addition, since I am using TCP, shouldn't UFC be used rather than MFC? Is there a way to use a pool of Scheduled-replicationQueue threads?

The fact that it didn't time out is a problem with our async replication code - we kind of assume that we'll always be able to send stuff down the wire without waiting too long, and that's not always true (with UFC/MFC or even with just TCP, when there is real congestion).

UFC/MFC are higher-level protocols, so which one is used depends only on how you send the message: unicast or broadcast to the entire cluster. In replicated mode we tend to use broadcast all the time.

Using multiple replication queue threads wouldn't help - if the supply of "credits" in UFC/MFC is gone, no thread can send anymore.

I need to mention that with the parameters in my previous post I see long startup times on some nodes, and a lot of cache view prepare timeout exceptions.

I tried some parameters again in my test environment (this time, I only changed 2 parameters: the thread pool sizes), and it worked.

keep max_bundle_size at 64k
keep max_bundle_timeout at 30

increase thread_pool.max_threads from 30 to 60
set oob_thread_pool.queue_enabled="false"
increase oob_thread_pool.max_threads from 30 to 60
keep UFC max_credits at 200k
keep MFC max_credits at 200k
keep FRAG2 frag_size at 60k

In other words, I just doubled the thread pool sizes from my original configuration, and the new configuration allows me to complete the import, and the startup time is good too.

It seems that increasing max_credits has negative effects on the startup (cluster forming).

The weird thing is that the stack dumps you sent show only two threads in JGroups' regular thread pool on each node, and none of them is busy. So I'm not sure how changing the maximum from 30 to 60 could change anything...

About max_credits: yes, we have noticed in our testing as well that a big max_credits value can increase the time it takes to transfer state, because all nodes send to the joiner at the same time. We never saw cache view prepare timeouts with just 3 nodes, though...

The scaling (up) to more nodes (5-7 nodes) with Infinispan (with replication) appears less and less feasible with the experience we had so far. I may have to re-think our approach.

Infinispan (cache and cache store/loader along with replication mode) really provides a good alternative for us to build a share-nothing cluster. But it will be too limited to deploy if we cannot scale up in terms of cluster size (up to a dozen nodes) and performance.

I'm pretty sure we do run tests with 8 nodes in replicated mode, so it could be something particular about your environment...

Could you revert to the default TCP configuration and, after the put blocks, check the MFC credits on each node (via JMX)? I have another suspicion now, that somehow the receiver thinks the sender has enough credits, but the sender thinks it doesn't.
If the numbers don't match, run again with TRACE enabled for org.jgroups only and post the logs here. I'm sure we can get to the bottom of this.
12. Re: cache put stucks forever in a 3 node cluster

dan.berindei Jun 28, 2012 7:02 AM (in response to darrellburgan)

Darrell Burgan wrote:

Dan Berindei wrote:

Your FRAG2 frag_size is way too high, you should stick to the 60k default if you're using UDP (can't send >64k datagrams anyway) or remove FRAG2 completely if you're using TCP.

This is exactly the kind of JGroups configuration advice I need. We are using TCPGOSSIP and have a FRAG2 set to 60K. I will remove it ....

Sorry Darrell, I was wrong - as long as you have UFC/MFC in the stack you need to keep FRAG2 as well (see https://issues.jboss.org/browse/JGRP-590).
Until we have something better, you can take the stock jgroups-udp.xml and jgroups-tcp.xml in infinispan-core.jar as our "guidelines", even if they're not explained very well. You can certainly diverge from those defaults, but unless you see a definite improvement with the changed parameters I recommend reverting to the defaults. On the other hand, if you do see improvements, we'd like to hear about that as well, and maybe even change the defaults.

By the way, TCPGOSSIP is only the discovery protocol, which doesn't have anything to do with regular messages (either unicast or broadcast). I was talking about the transport protocol, which can be either UDP or TCP (there are a few other options, but they're not widely used).
13. Re: cache put stucks forever in a 3 node cluster

dex80526 Jun 28, 2012 12:14 PM (in response to dan.berindei)

Hi Dan: Thanks for reading through this. I turned off JMX in my configuration because I saw some issues. Is there any way (e.g., a Java API call) to see the credits?

Now, it puzzles me after reading your comments on credits, since my latest change uses the same MFC/UFC configuration as before. How could that change the behaviour?

Is it possible to share how your 8-node replication cluster is configured and what the performance looks like?
Does your configuration have cache stores?

We definitely see that the overall performance of a 3 node cluster is SIGNIFICANTLY degraded compared to a 2 node cluster.

For example, with the import test mentioned here we do not see any issue on a 2 node cluster, and it takes less than half the time the 3 node cluster takes to do the same.

Maybe we did not tune the configuration right. That once again shows the importance of having a performance tuning guide.

Thanks again for your help.
14. Re: cache put stucks forever in a 3 node cluster

dan.berindei Jun 29, 2012 2:31 AM (in response to dex80526)
dex chen wrote:

Hi Dan: Thanks for reading through this. I turned off JMX in my configuration because I saw some issues. Is there any way (e.g., a Java API call) to see the credits?

Now, it puzzles me after reading your comments on credits, since my latest change uses the same MFC/UFC configuration as before. How could that change the behaviour?

Is it possible to share how your 8-node replication cluster is configured and what the performance looks like?
Does your configuration have cache stores?

We definitely see that the overall performance of a 3 node cluster is SIGNIFICANTLY degraded compared to a 2 node cluster.

For example, with the import test mentioned here we do not see any issue on a 2 node cluster, and it takes less than half the time the 3 node cluster takes to do the same.

Maybe we did not tune the configuration right. That once again shows the importance of having a performance tuning guide.

Thanks again for your help.

You can get a hold of any protocol from the cache manager:

MFC mfc = (MFC) ((JGroupsTransport)cacheManager.getTransport()).getChannel().getProtocolStack().findProtocol(MFC.class);

I really have no explanation as to why your cluster stopped hanging, but I do know that in your thread dump I found the replication queue thread stuck because it didn't have enough credits to send the next message:

"Scheduled-replicationQueue-thread-0" daemon prio=10 tid=0x000000000dcd0800 nid=0x35d0 waiting on condition [0x000000004321f000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x0000000775799ce8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2176)
        at org.jgroups.util.CreditMap.decrement(CreditMap.java:157)
        at org.jgroups.protocols.MFC.handleDownMessage(MFC.java:104)
        at org.jgroups.protocols.FlowControl.down(FlowControl.java:341)
        at org.jgroups.protocols.FRAG2.down(FRAG2.java:147)
        at org.jgroups.protocols.RSVP.down(RSVP.java:150)
        at org.jgroups.stack.ProtocolStack.down(ProtocolStack.java:1033)
        at org.jgroups.JChannel.down(JChannel.java:730)

So your problem is definitely related to UFC/MFC... and it could be that the thread pool size change didn't completely eliminate it, it just made it harder to reproduce.

I checked and apparently our regular performance tests only go up to 4 nodes in replication mode (distribution mode tests do go up to 8 nodes). They don't use any cache store, and they use the stock jgroups-udp.xml that ships with Infinispan (actually the stock standalone.xml that ships with JDG, but they're equivalent).

Actually this is something that I completely missed in your original post: you're using TCP, but for replicated caches we recommend using UDP, which supports multicasting, because we always send commands to the entire cluster.

If you're using TCP, every put command is going to get sent (clusterSize - 1) times. Since your replication is async, you're not going to see this cost until the replication queue fills up, but on average I'm not so surprised that the throughput of your test (doing puts on just one node and replicating them to the entire cluster) dropped in half after adding the 3rd node. However, if you had a test that ran put() calls on all the cluster nodes, I'm sure the total throughput wouldn't have changed that much when you added the 3rd node.
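The "dropped in half" estimate follows directly from the (clusterSize - 1) fan-out. Here is a back-of-the-envelope sketch of that arithmetic; the class and method names are invented, and it assumes the writer's outbound sends are the only bottleneck, which is a simplification.

```java
// Back-of-the-envelope model of unicast fan-out for a single writer node.
class FanOutCost {

    // With a unicast transport such as TCP, one put is sent once per other member.
    static int sendsPerPut(int clusterSize) {
        return clusterSize - 1;
    }

    // Single-writer throughput relative to a 2 node cluster, assuming sends are the bottleneck.
    static double relativeThroughput(int clusterSize) {
        return (double) sendsPerPut(2) / sendsPerPut(clusterSize);
    }

    public static void main(String[] args) {
        for (int nodes = 2; nodes <= 4; nodes++) {
            System.out.println(nodes + " nodes: " + sendsPerPut(nodes)
                    + " send(s) per put, relative throughput " + relativeThroughput(nodes));
        }
    }
}
```

Going from 2 to 3 nodes doubles the sends per put (1 to 2), so under this model a single writer ends up at half the throughput, which is in line with the import taking roughly twice as long.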

I'd recommend switching to UDP (using the stock jgroups-udp.xml), that should improve your performance. I'm still interested in why you're seeing the hang with your initial config, so if you can post here full TRACE logs of org.jgroups I'd appreciate it.
