4 Replies Latest reply on Apr 30, 2010 6:42 AM by manik

A few questions on performance

yelin666 Apr 27, 2010 11:52 PM

I ran some performance tests with data relevant to our use case. (Note: useLockStriping is disabled in my cache configuration, and numOwners for distributed mode is 2.) In each test I load a base data set into the cache. The load is spread evenly across all participating nodes, and each node uses its own key set so there are no collisions. For example, for a total data set of 1M entries on 10 nodes, each node initializes 100k entries with its own keys: node1's keys start with "node1_...", node2's keys start with "node2_...", and so on. After initialization, all nodes run a read test at the same time, and then an update test at the same time. During the update test, each node updates a subset of the data it initialized itself. In the example above, if each node updates 1% of its data, that is 1k entries per node and 10k entries across the 10 nodes. The results raised the following questions.
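The key layout described above can be sketched as follows. This is only an illustration of the partitioning scheme, not the actual test harness: the function names are hypothetical, and the exact key format after the "nodeN_" prefix is an assumption.

```python
def keys_for_node(node_index, total_entries, num_nodes):
    """Return the keys one node initializes.

    The "node<N>_" prefix keeps the key sets of different nodes disjoint.
    The numeric suffix is an assumed format; the post only gives the prefix.
    """
    per_node = total_entries // num_nodes
    return [f"node{node_index}_{i}" for i in range(per_node)]


def keys_to_update(own_keys, percent):
    """Return the leading `percent` percent of a node's own keys."""
    return own_keys[: len(own_keys) * percent // 100]


if __name__ == "__main__":
    own = keys_for_node(1, 1_000_000, 10)
    print(len(own))                     # entries initialized by node1
    print(len(keys_to_update(own, 1)))  # entries node1 updates at 1%
```

With 1M entries on 10 nodes and a 1% update ratio this gives 100k initialized and 1k updated entries per node, matching the numbers in the example.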

1. When the base data set grew from 100k to 1M entries (the stored values are our custom objects), updating the same total amount of data took significantly longer. For example, updating 10k entries in total, with each of 10 nodes updating 1k at the same time, took around 70 milliseconds on average with 100k base entries but nearly 600 milliseconds with 1M base entries. This difference makes me concerned about scalability. Could you please suggest which factors contribute to it?

2. I ran tests with the same base data set and the same total amount of updated data, but with different numbers of nodes. I expected 10 nodes to take less time than 5 nodes, since each node gets a smaller update load. However, when the stored values are Floats, 10 nodes are only slightly faster, and when the stored values are our custom objects, 10 nodes are actually slower. Any suggestions on what is causing this?

3. With the same base data set and the same amount of updated data on 10 nodes, Distributed mode with L1 performed about the same as Replicated mode, while I expected Distributed with L1 to outperform it. Is 10 nodes too few to show the difference, or is there another reason?

4. I tested TCP vs. UDP communication, and for the same scenario their performance is very close. I am surprised by how well TCP performs, but I also wonder why UDP didn't do better.

I would appreciate your responses.

Lin

1. Re: A few questions on performance

manik Apr 28, 2010 6:10 AM (in response to yelin666)
Hi Lin.

1. When the base data set grew from 100k to 1M entries (the stored values are our custom objects), updating the same total amount of data took significantly longer. For example, updating 10k entries in total, with each of 10 nodes updating 1k at the same time, took around 70 milliseconds on average with 100k base entries but nearly 600 milliseconds with 1M base entries. This difference makes me concerned about scalability. Could you please suggest which factors contribute to it?
Some things that may affect this:
Eviction. Do you have this enabled? And if so, what eviction policy and maxEntries have you set? If the eviction thread needs to constantly scan entries to decide which to evict, this could get more expensive as cache size increases (at least in 4.0.0; 4.1.0 uses a better algorithm that is not affected as much).
Concurrency level. The core data container is very similar to a JDK ConcurrentHashMap, making use of lockable segments, etc. By default this is set to use a concurrency level of 1000, but if you have a particularly large cache (1M entries) with a large number of threads then you want to increase this to increase concurrency in accessing the core container. How many threads does your test use?
GC. With that many objects around, perhaps you are seeing GC churn? Have you run both cases (100k vs. 1M entries) under a profiler to help pinpoint where hotspots may be?

2. I ran tests with the same base data set and the same total amount of updated data, but with different numbers of nodes. I expected 10 nodes to take less time than 5 nodes, since each node gets a smaller update load. However, when the stored values are Floats, 10 nodes are only slightly faster, and when the stored values are our custom objects, 10 nodes are actually slower. Any suggestions on what is causing this?
Remote calls? With 10 nodes, it is more likely that both copies reside remotely (1/10 chance that one copy is local). With 5 nodes the probability is twice as high (1/5) so the number of remote calls is fewer. Also with Floats you don't see this so much since Floats are cheap to serialize and hence the RPC cost isn't that high - but this may not be the case for your custom objects?

3. With the same base data set and the same amount of updated data on 10 nodes, Distributed mode with L1 performed about the same as Replicated mode, while I expected Distributed with L1 to outperform it. Is 10 nodes too few to show the difference, or is there another reason?
This depends on your network. Usually once you exceed 5 nodes Distributed starts to outperform Replicated, but if you have a particularly fast network (10G ether or RDMA) this threshold could be a lot higher. Also, if you are using L1 caching, L1 needs UDP to be able to multicast invalidations to perform well. If you are using TCP, these L1 invalidations will have almost the same cost as a replicated setup.

Of course speed isn't everything: distributed mode gives you access to much more addressable space - (NUM_JVMS x JVM_SIZE) / NUM_OWNERS for distributed, but just JVM_SIZE for replicated (regardless of cluster size). Further, distributed scales linearly (adding more nodes) while with replicated, performance will degrade as you add more nodes to the cluster.
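The addressable-space formula above can be worked through with a small sketch. The function name and the 4 GB heap size are hypothetical values chosen for illustration, and the calculation ignores per-entry and JVM overhead.

```python
def addressable_space_gb(num_jvms, jvm_size_gb, num_owners, distributed):
    """Rough upper bound on the data the cluster can hold, in GB.

    Distributed: (NUM_JVMS x JVM_SIZE) / NUM_OWNERS
    Replicated:  JVM_SIZE, regardless of cluster size
    """
    if distributed:
        return num_jvms * jvm_size_gb / num_owners
    return jvm_size_gb


if __name__ == "__main__":
    # Assumed figures: 10 JVMs with 4 GB each, numOwners = 2.
    print(addressable_space_gb(10, 4, 2, distributed=True))   # 20.0
    print(addressable_space_gb(10, 4, 2, distributed=False))  # 4
```

Under these assumed figures, adding nodes grows the distributed capacity while the replicated capacity stays at the size of a single JVM.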

4. I tested TCP vs. UDP communication, and for the same scenario their performance is very close. I am surprised by how well TCP performs, but I also wonder why UDP didn't do better.
This depends on scale. 10 nodes is still relatively small. UDP starts to perform much better as the number of nodes increases, but this also depends on how you tune your JGroups stack and your operating system. Many things can be tuned here, including MTU, bundling and NAKACK timeouts. Running the tests under a profiler would help you pinpoint what needs to be tuned.

Hope this helps!
Manik
2. Re: A few questions on performance

yelin666 Apr 28, 2010 11:04 PM (in response to manik)
Regarding item 1, I don't think the 3 factors you mentioned matter in my testing, for the following reasons:
I don't have eviction enabled.
I have only one thread running the update on each instance, so the total thread number is 10 and a lot less than the default concurrency level.
Although the base data sizes are significantly different, the base data resides in the cache the whole time and I didn't remove any of it. So when I update the same amount of data in both cases, I don't expect a significantly different load on GC.

Any suggestion on what else to look at?
3. Re: A few questions on performance

sannegrinovero Apr 29, 2010 2:47 PM (in response to manik)

Remote calls? With 10 nodes, it is more likely that both copies reside remotely (1/10 chance that one copy is local).
Manik, please correct me if I'm wrong, but this should be 2/10 since he's using numOwners=2, right? And of course 2/10 is way lower than 2/5.
In both cases there is a 100% chance that an insert operation needs remoting, and sending a packet to one or two buddies shouldn't make a great difference. So when the data is of a size that 5 nodes can handle easily, and it's a read-mostly situation, would it make sense to increase numOwners a bit when increasing the number of nodes? It seems to me a low-cost operation on insertion and a potential improvement in reading. (Manik, do you see theoretical advantages in this compared to L1 caching?)

Lin, are you sure you have enough memory to load this many objects? A Float may take far less memory than your custom objects, and if you didn't enable any sort of passivation you are keeping all those objects in the VM's memory. You should consider both the GC pressure from the high number of references to scan and the absolute number of bytes needed to store it all.
4. Re: A few questions on performance

manik Apr 30, 2010 6:42 AM (in response to sannegrinovero)

Sanne Grinovero wrote:

Remote calls? With 10 nodes, it is more likely that both copies reside remotely (1/10 chance that one copy is local).
Manik, please correct me if I'm wrong, but this should be 2/10 since he's using numOwners=2, right? And of course 2/10 is way lower than 2/5.
In both cases there is a 100% chance that an insert operation needs remoting, and sending a packet to one or two buddies shouldn't make a great difference. So when the data is of a size that 5 nodes can handle easily, and it's a read-mostly situation, would it make sense to increase numOwners a bit when increasing the number of nodes? It seems to me a low-cost operation on insertion and a potential improvement in reading. (Manik, do you see theoretical advantages in this compared to L1 caching?)
Sorry, yes, you are correct: the probability of needing just 1 remote call (instead of 2 remote calls) is 2/10 with 10 nodes and 2/5 with 5 nodes.
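As a rough worked example of these probabilities, the sketch below assumes that the owners of a key are spread uniformly over the cluster, which is a simplification of how the hash function actually places entries. The function names are hypothetical.

```python
from fractions import Fraction


def prob_writer_is_owner(num_nodes, num_owners):
    """Chance that the writing node is one of the key's owners (uniform assumption)."""
    return Fraction(num_owners, num_nodes)


def expected_remote_calls(num_nodes, num_owners):
    """Expected remote calls per write.

    If the writer owns a copy, it needs num_owners - 1 remote calls,
    otherwise num_owners. That averages out to num_owners - p.
    """
    return num_owners - prob_writer_is_owner(num_nodes, num_owners)


if __name__ == "__main__":
    for nodes in (5, 10):
        print(nodes, prob_writer_is_owner(nodes, 2), expected_remote_calls(nodes, 2))
```

With numOwners=2 this gives 2/5 and 8/5 expected remote calls for 5 nodes, against 1/5 (that is, 2/10) and 9/5 for 10 nodes, so each write is slightly more likely to need two remote calls on the larger cluster.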

The remote calls are still expensive though. Even though marshalling happens once (the same byte array is reused), the remote calls are synchronous (even though they happen in parallel), so you need to wait for all recipients to ack. Which is why doing just 1 remote call is really cheaper than 2.
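To illustrate why waiting for two acks tends to cost more than waiting for one even when the calls run in parallel, here is a small simulation. It is not a model of JGroups: the exponentially distributed ack times with a mean of 1 ms are purely an assumption, and the function names are hypothetical.

```python
import random


def sync_write_latency_ms(ack_latencies_ms):
    """Parallel synchronous calls finish when the slowest recipient has acked."""
    return max(ack_latencies_ms)


def mean_latency_ms(num_recipients, trials=10_000, seed=42):
    """Average write latency over many simulated writes."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Assumed distribution: independent ack times with a 1 ms mean.
        acks = [rng.expovariate(1.0) for _ in range(num_recipients)]
        total += sync_write_latency_ms(acks)
    return total / trials


if __name__ == "__main__":
    print(mean_latency_ms(1))  # one remote owner
    print(mean_latency_ms(2))  # two remote owners
```

Because the write waits for the maximum of the ack times rather than their sum, the second call is cheaper than a sequential one would be, but the average still rises with each extra recipient.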

Regarding the comparison with L1, one drawback with too many numOwners is that it consumes space, whether you need those copies or not. L1 at least is lazy and on-demand, so you only maintain the extra copies *if* you need them.