Infinispan scalability consulting| JBoss.org Content Archive (Read Only)

15. Re: Infinispan scalability consulting

darrellburgan Jun 12, 2012 12:13 PM (in response to belaban)

Bela Ban wrote:

I'll let the Infinispan folks reply to the part about lockless operation...

You mentioned that you use either async invalidation or replication. If you use replication, you're never going to scale to a large cluster or large data set, as the heap size used is a function of the cluster size and the average data set size of every node. If you have 10 nodes, and every node has ca. 100MB of data, then every node needs to have 1GB of free heap to accommodate the data. Plus some more for Infinispan and JGroups metadata etc.

Of course, you could always use a cache loader to overflow data to disk, but this will slow down your system (depending on the cache hit/miss ratio).

We found that DIST scales almost linearly; the cost for a read or write is always constant, regardless of the cluster size. I'm going to talk about this at JBossWorld in 3 weeks. I don't know your architecture, but DIST (perhaps in combination with an L1 cache) could be some thing worth looking into...

Yeah I need to do an experiment with converting all of our replication caches to distribution caches. They're not appropriate as invalidation caches, because invalidation caches seem to have the behavior of ensuring that a given cache entry exists only in one node in the cluster. I.e. if node A has entry X in a cache, and node B puts X into its local cache, X gets evicted from node A's cache. So I think distribution, probably with an L1 cache, is what I'll have to use instead of replication.

Free memory is not an issue AFAIK. We have 16GB heap sizes configured right now, and could easily expand them to 32GB with a restart. The only issue with memory we really have is gc collection times, which can cause Infinispan real grief.

I'll try some experiements with distribution instead of replication and see what I get.

Do you have thoughts about using TCP v. UDP/multicast at the JGroups level? We're currently using TCP, but we are strongly considering going to UDP/multicast to lighten the network load somewhat. Is this a good idea or a bad idea?

16. Re: Infinispan scalability consulting

darrellburgan Jun 12, 2012 12:15 PM (in response to belaban)

Bela Ban wrote:

When you mention that your production volume is going up, are you referring to the number of entries in the cache, or the number of accesses to them, or both ?

Let me stress again, that replication is not a good solution if either one or both or them increase.

Re using TCP: I don't recommend TCP with replication if your cluster size increases, as a message M to the entire cluster (of N nodes) will have to be sent N-1 times. So for a cluster of 10, everyone sending a message leads to 81 message being on the network at the same time. Even worse, we'll have a mesh of 81 TCP connections. For replication, I definitely recommend UDP over TCP.

If you use distribution (DIST), then either UDP or TCP can be used, as we're sending fewer messages (only to the backups, not the entire cluster).

Doh, ignore my question in my previous response, I see now that you already answered it. Sounds like UDP/multicast would be beneficial.

By 'volume going up', I mean that the number of database transactions per second is going up, with Infinispan cache get(), put() and delete() calls going up at the same rate.

17. Re: Infinispan scalability consulting

darrellburgan Jun 12, 2012 12:16 PM (in response to dex80526)

dex chen wrote:

We saw scaling issues in our load testing. I'll post it in a seprate thread in detail late. But, for people following this thread, the issue we see could be summrized this:

We have 2 node cluster in replcation (in 2 node cluster "DIST" should be similar to "REPLICATION"), and have dozen loader drivers (similar to JMETER) to drive the requests to the cluster through a load balancer.

We found the performance is singnificat lower in a 2 node cluster configuration than a single node with the same physical capacity. We saw no errors/exceptions in single node run, but saw Transaction timeout.

I am going to do the same test with data replication disabled to see if the data replication is the bottle neck. But, to me the big suprise is that the 2 node cluster can not even perform at the same level as the single node.

Please do share the results of your tests; I'm very interested in them. I'll likewise share whatever I learn as I'm moving along ...

18. Re: Infinispan scalability consulting

prabhat.jha Jun 12, 2012 12:18 PM (in response to dex80526)

When you go from 1 node to 2 node, you are adding a network communication overhead so even in theory 2 node cluster is supposed to have less performance than 1 node if you have configured your cluster to use REPL SYNC, no?

19. Re: Infinispan scalability consulting

dex80526 Jun 12, 2012 4:15 PM (in response to prabhat.jha)

That's what we saw so far. But, you think about the following conceptually, we expect the overall performance of 2 node cluster should be biger than single node:

Just for sake of discussion:

1) we have 1 loader driver to drive m requests/s to a single node, and we get the performance from single node is m transactions/s.

2) with 2 node cluster in replication ASYNC, we have 2 loader drivers to drive the load (m transaction/s for each node), ideally without any overhead, we expect total performance of 2 node cluster is 2xm transactions/s.

With replication overhead, we should still expect the overall performance > m transactions/s. But, in our tests, we see the overall performance is less than m transactions/s. This is hard to explain to me.

Even with REPL SYNC, to me, the overall performance should not less than single node in the above settings unless the replication takes longer than real transaction (without replication).

I am wondering if Infinispan is trying to send all data in cache to another node when it replicates data from one node to other, instead of seding only the change (delta) from last commit.

Do we have scalability testing data of Infinispan somewhere? I am interested in learning what kind test configuration and scalability look like. In my case, I am just need to scale up 3-5 nodes right now.

20. Re: Infinispan scalability consulting

belaban Jun 13, 2012 3:26 AM (in response to dex80526)

Yes, in a 2 node cluster, replication and dist (with numOwners=2) are pretty much the same.

A 2 node cluster should perform at the same level (or higher) than a single node if you use *asynchronous* replication (or dist). If you use *sync* replication, then every request not only has to be serialized, and get sent to the remote node, but the thread making the modification also blocks until the response has been received. This is all done on the critical path, so I doubt a 2-node cluster will be faster than a single node.

A 2 node cluster might increase the total number of client requests, but it will certainly be slower on a *per-client basis*. The former only applies if you have many clients; if you only have a few, then overall performance (with sync repl) will be slower than in a single node scenario.

21. Re: Infinispan scalability consulting

belaban Jun 13, 2012 3:34 AM (in response to darrellburgan)

Darrell Burgan wrote:

We found that DIST scales almost linearly; the cost for a read or write is always constant, regardless of the cluster size. I'm going to talk about this at JBossWorld in 3 weeks. I don't know your architecture, but DIST (perhaps in combination with an L1 cache) could be some thing worth looking into...

Yeah I need to do an experiment with converting all of our replication caches to distribution caches. They're not appropriate as invalidation caches, because invalidation caches seem to have the behavior of ensuring that a given cache entry exists only in one node in the cluster. I.e. if node A has entry X in a cache, and node B puts X into its local cache, X gets evicted from node A's cache. So I think distribution, probably with an L1 cache, is what I'll have to use instead of replication.

Free memory is not an issue AFAIK. We have 16GB heap sizes configured right now, and could easily expand them to 32GB with a restart. The only issue with memory we really have is gc collection times, which can cause Infinispan real grief.

I'll try some experiements with distribution instead of replication and see what I get.

Do you have thoughts about using TCP v. UDP/multicast at the JGroups level? We're currently using TCP, but we are strongly considering going to UDP/multicast to lighten the network load somewhat. Is this a good idea or a bad idea?

Note that if you use session replication, an L1 cache is not needed; as a matter of fact it'll slow things down due to invalidation traffic !

Re having 16GB or even 32GB heap: here you really need to tune your JVM options, as the current rule of thumb has 1 sec of full GC per GB of memory, so in the latter case you could have up to 30s of GC pauses.

After that FC tuning, I suggest measure your highest GC cycle (e.g. for a couple of days) and then you can adjust your JGroups and Infinispan configuration. For instance, if you have FD_ALL (or FD) in your stack, make sure its timeout is higher than the highest GC cycle that can occur. This will reduce the number of times a cluster excludes a GC'ing member, only to merge it back later.

My suggestion is always to combine FD_SOCK with FD_ALL (UDP) or FD (TCP) and set the timeout in the latter to a high value.

Infinispan tuning would then include setting the blocking RPC timeouts accordingly.

22. Re: Infinispan scalability consulting

darrellburgan Jun 13, 2012 11:49 AM (in response to belaban)

Bela Ban wrote:

Note that if you use session replication, an L1 cache is not needed; as a matter of fact it'll slow things down due to invalidation traffic !

Sorry I'm a bit confused. I'm talking about the L1 option for Infinispan distribution caches, i.e. configurationBuilder.clustering().l1(). The version of our system under discussion does not use JPA or Hibernate, so there is no "level 1" cache in the Hibernate sense.

Given that, I'm not understanding what you mean by session replication ... ?

Also, am I to understand that Infinispan is smart enough to invalidate entries in the clustering().l1() "level 1" cache? I.e. if I am using distribution, and I do get(X) which loads X from another node in the cluster, and I have Infinspan l1() configured for 5 secs, if another node invalidates or changes X, will I automatically see the change? Or will it take up to 5 seconds before I see it?

23. Re: Infinispan scalability consulting

belaban Jun 13, 2012 11:56 AM (in response to darrellburgan)

I'm referring to the L1 used with DIST inside of an AS7 instance, in conjunction with HTTP session replication (distribution). Sorry for the confusion, this probably doesn't apply to you

24. Re: Infinispan scalability consulting

dex80526 Jun 13, 2012 12:15 PM (in response to belaban)

Bela, thanks for the tip on turnning the jgroups config. We have the similar situation here. Our JVM heap sapce could be up to 16 GB. Do we need to increase VERIFY_SUSPECT timeout along with FD's timeout?

25. Re: Infinispan scalability consulting

darrellburgan Jun 13, 2012 2:42 PM (in response to belaban)

After that FC tuning, I suggest measure your highest GC cycle (e.g. for a couple of days) and then you can adjust your JGroups and Infinispan configuration. For instance, if you have FD_ALL (or FD) in your stack, make sure its timeout is higher than the highest GC cycle that can occur. This will reduce the number of times a cluster excludes a GC'ing member, only to merge it back later.
My suggestion is always to combine FD_SOCK with FD_ALL (UDP) or FD (TCP) and set the timeout in the latter to a high value.
Infinispan tuning would then include setting the blocking RPC timeouts accordingly.

Great suggestion, I will try that. Are there any other timeouts I should look at to increase beyond the longest gc pause?

26. Re: Infinispan scalability consulting

prabhat.jha Jun 13, 2012 2:55 PM (in response to darrellburgan)

Current Infinsipan configuration XSD is at https://github.com/infinispan/infinispan/blob/master/core/src/main/resources/schema/infinispan-config-5.2.xsd Going through different attributes would help tune it as well.

27. Re: Infinispan scalability consulting

darrellburgan Jun 13, 2012 4:53 PM (in response to darrellburgan)

Okay some load testing results. Here are the changes that I made:

Upgraded from 5.0.1.FINAL to 5.1.5.FINAL
Changed lock concurrency from 256 to 2048
Changed <FD> timeout to 25 secs from 2 secs
Changed all our replication caches to be distribution caches with a 5 sec <L1>

On a heavy 4-node load test, I've discovered that distribution caches perform better than replication caches, but still have serious trouble keeping up with the load. I've clocked distribution cache invalidations taking up to 38 seconds under this test, and distribution puts taking up to 45 seconds under this test.

Invalidation caches, however, stand up very well, with almost no hiccups under extremely heavy load and only a few invalidations/updates taking more than 5 secs.

My conclusion is that we should be using invalidation caches across the board. There is one crucial question I have to answer, though, for this to be a solution, which relates to whether invalidation caches allow the same cache entry to be present in the local caches of multiple cluster nodes. I already posed this question to the community here:

https://community.jboss.org/thread/201086

So it seems that it comes down to that question. If the answer is that invalidation only allows a given entry to be present in a single node in the cluster, then invalidation is not appropriate for our caches that are heavily referenced but change very infrequently. If that is the case, then it means Infinispan does not have a solution to that part of our problem, and we'll be forced to implement something different.

Does anyone have the answer to that question?

28. Re: Infinispan scalability consulting

belaban Jun 14, 2012 2:30 AM (in response to dex80526)

Hi Dex,

VERIFY_SUSPECT is used to double-check a SUSPECT(P) event where P is suspected, e.g. because we haven't received a heartbeat from P for a while. While anyone can suspect P, the verification is only done by the *coordinator*, so in most cases - before a member is excluded - it is suspected by 2 members.

Therefore the purpose of VERIFY_SUSPECT is to decrease the chances of a *false suspicion* (where P is suspected and excluded from the cluster although it is still alive).

Yes, the values in VERIFY_SUSPECT can be increased, too, e.g. timeout="10000" num_msgs="2". This means we'll do 2 double-checks within the time frame of 10s.

Note that if you use FD_ALL, you could also enable msg_counts_as_heartbeat, which means that a member P's time stamp is updated when you either receive a heartbeat *or a message from P*. Note though that updating the time stamp requires a hashmap lookup on every message, so this is somewhat costly...

29. Re: Infinispan scalability consulting

belaban Jun 14, 2012 3:17 AM (in response to darrellburgan)

Darrell Burgan wrote:

Okay some load testing results. Here are the changes that I made:

Upgraded from 5.0.1.FINAL to 5.1.5.FINAL
Changed lock concurrency from 256 to 2048
Changed <FD> timeout to 25 secs from 2 secs
Changed all our replication caches to be distribution caches with a 5 sec <L1>

On a heavy 4-node load test, I've discovered that distribution caches perform better than replication caches, but still have serious trouble keeping up with the load. I've clocked distribution cache invalidations taking up to 38 seconds under this test, and distribution puts taking up to 45 seconds under this test

I'll let someone from the Infinispan team answer your question re invalidation caches (I for one don't think that an element in an invalidation cache is present on one node only).

Can you describe a bit more how you run your perf test, e.g.:

- Access rate, how many reads, how many writes per sec ?

- How many clients ?

- How much of the access is going to the *same* data (write conflicts) ?

- What's your read/write ratio ? Infinispan, like other caches, is designed for high read access

- What the average data size in the cache, how many elements ?

I recently benched EAP 6, with a 2-16 node cluster. I used session replication, and 400 clients, doing a total of 100'000 requests. Every session had 10 attributes of 1000 bytes each.

With DIST-SYNC (L1 disabled), I got 25'000 requests / sec, with DIST_ASYNC (no L1) I got 30'000 requests / sec.

I'm actually going to talk about this at JBossWorld, the slides will be posted after JBW if you're interested.

Having said this, increasing the cluster size to 16 increased the numbers slightly, but at some point I hit a limit on the client driver. That limit was the same as when fetching a static web page from httpd (index.html).

So, what I'm trying to say is that benchmarking distribution yielded excellent results, which were very close to accessing a static web page.

My advise would be don't yet lose faith in distribution, perhaps you need to

- Change configuration (Infinispan, JGroups)

- Use transactions or batching to reduce cluster traffic

- Use AtomicHashMaps, etc.

A consultant should be able to dig into this and fix your issues.