3 Replies Latest reply on Sep 13, 2011 6:01 AM by galder.zamarreno

load test problems

dmitry123 Aug 29, 2011 6:11 PM

Hi guys,

I ran a load test on a cluster with a replicated cache spread over two machines with each machine hosting 2 or 3 nodes(4 or 5 nodes total in the cluster). One node is a dedicated listener and after a startup it doesn't send messages to the cache. The other nodes( 3 or 4 of them) in the cluster write non overlapping keys to the cache every 1000ms. This works well when i start the coordinator node(the one that doesn't write) first and then bring up the other nodes(that write at 1000ms intervals) every 40 seconds. Problems begin when I bring up another node that writes at 1000ms intervals. The symptoms are dropped messages, and the coordinator(non writing node) locks up and becomes a SUSPECT. Is this normal behavior? Doesn't seem like 1 listener node + 4 write nodes of mutually exclusive data at 1000ms intervals should cause problems. And the symptoms are worse when the write interval is less than 1000ms in each node. Many thanks for looking at this.

jdk 1.6_24

infinispan 5.0.Final

JGroups version: 2.12.1.1.Final

Here's my config and snippet of code:

jgroups-tcp.xml

**********************************************************

<config xmlns="urn:org:jgroups"

xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

xsi:schemaLocation="urn:org:jgroups file:schema/JGroups-2.8.xsd">

<TCP

bind_addr="${jgroups.tcp.address:123.12.101.42}"

bind_port="${jgroups.tcp.port:46522}"

loopback="true"

port_range="30"

recv_buf_size="20000000"

send_buf_size="640000"

discard_incompatible_packets="true"

max_bundle_size="64000"

max_bundle_timeout="30"

enable_bundling="true"

use_send_queues="true"

sock_conn_timeout="300"

enable_diagnostics="false"

thread_pool.enabled="true"

thread_pool.min_threads="2"

thread_pool.max_threads="30"

thread_pool.keep_alive_time="5000"

thread_pool.queue_enabled="false"

thread_pool.queue_max_size="100"

thread_pool.rejection_policy="Discard"

oob_thread_pool.enabled="true"

oob_thread_pool.min_threads="2"

oob_thread_pool.max_threads="30"

oob_thread_pool.keep_alive_time="5000"

oob_thread_pool.queue_enabled="false"

oob_thread_pool.queue_max_size="100"

oob_thread_pool.rejection_policy="Discard"

<MPING bind_addr="${jgroups.bind_addr:123.12.101.42}" break_on_coord_rsp="true"

mcast_addr="${jgroups.udp.mcast_addr:228.6.7.8}" mcast_port="${jgroups.udp.mcast_port:46655}" ip_ttl="${jgroups.udp.ip_ttl:2}"

num_initial_members="3"/>

<MERGE2 max_interval="30000"

min_interval="10000"/>

<FD_SOCK/>

<VERIFY_SUSPECT timeout="5000"/>

<pbcast.NAKACK

use_mcast_xmit="false" gc_lag="0"

retransmit_timeout="300,600,1200,2400,4800"

discard_delivered_msgs="false"/>

<pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"

max_bytes="400000" />

<pbcast.GMS print_local_addr="true" join_timeout="10000" view_bundling="true"

merge_timeout="20000" print_physical_addrs="true"/>

<pbcast.STREAMING_STATE_TRANSFER/>

</config>

cache.xml

********************************************************

<?xml version="1.0" encoding="UTF-8"?>

</properties>

</transport>

</global>

<transaction

transactionManagerLookupClass="org.infinispan.transaction.lookup.JBossStandaloneJTAManagerLookup"

syncRollbackPhase="false"

syncCommitPhase="false"

useEagerLocking="true"/>

</clustering>

</properties>

</indexing>

</default>

</namedCache>

</infinispan>

relevant code

*************************************************

public static void populateCache(){

p = new Thread(new Runnable(){

public void run() {

for (int i = startIndex; i < startIndex+ 1000; i++) {

Book b = getUniqueBook();

cache.putAsync(i, b);

try {

Thread.sleep(1000);

} catch (InterruptedException e) {

// TODO Auto-generated catch block

e.printStackTrace();

}

});

}

1. Re: load test valid?

dmitry123 Aug 29, 2011 5:10 PM (in response to dmitry123)

i changed userEagerLocking to false and now i'm up to 5 write nodes + 1 quiet node with the same test. on the startup of the 6th write node that node freezes after the state is loaded. the last message in the log is:

2011-08-29 16:37:28,178 DEBUG [StateTransferManagerImpl] (Incoming-2,infinispan-test-cluster,abc-36821) State applied, closing object stream
2011-08-29 16:37:28,181 INFO [RpcManagerImpl] (main) ISPN000076: Successfully retrieved and applied state from xyz-45780

coordinator log says:
2011-08-29 16:37:21,964 DEBUG [JGroupsTransport] (Incoming-8,infinispan-test-cluster,abc-45780) New view accepted: [abc-45780|6] [abc-45780, abc-40668, abc-59389, xyz-20952, xyz-4484, xyz-1992, xyz-36821]
2011-08-29 16:37:21,964 INFO [JGroupsTransport] (Incoming-8,infinispan-test-cluster,abc-45780) ISPN000094: Received new cluster view: [abc-45780|6] [abc-45780, abc-40668, abc-59389, xyz-20952, xyz-4484, xyz-1992, xyz-36821]
2011-08-29 16:37:22,512 DEBUG [StateTransferManagerImpl] (STREAMING_STATE_TRANSFER-sender-5,infinispan-test-cluster,abc-45780) Generating state. Can provide? true
2011-08-29 16:37:22,514 DEBUG [StateTransferManagerImpl] (STREAMING_STATE_TRANSFER-sender-5,infinispan-test-cluster,abc-45780) Writing 495 StoredEntries to stream
2011-08-29 16:37:22,567 DEBUG [StateTransferManagerImpl] (STREAMING_STATE_TRANSFER-sender-5,infinispan-test-cluster,abc-45780) State generated, closing object stream

and then coordinator goes into this state:
2011-08-29 16:38:04,095 WARN [FD] (OOB-74,infinispan-test-cluster,abc-45780) I was suspected by xyz-36821; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
2011-08-29 16:38:07,096 WARN [FD] (OOB-73,infinispan-test-cluster,abc-45780) I was suspected by xyz-36821; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
2011-08-29 16:38:10,097 WARN [FD] (OOB-77,infinispan-test-cluster,abc-45780) I was suspected by xyz-36821; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
2011-08-29 16:38:12,107 DEBUG [JGroupsTransport] (Incoming-11,infinispan-test-cluster,abc-45780) New view accepted: [abc-45780|7] [abc-45780, abc-40668, abc-59389, xyz-20952, xyz-4484, xyz-1992]
2011-08-29 16:38:12,108 INFO [JGroupsTransport] (Incoming-11,infinispan-test-cluster,abc-45780) ISPN000094: Received new cluster view: [abc-45780|7] [abc-45780, abc-40668, abc-59389, xyz-20952, xyz-4484, xyz-1992]

this results in cache being in inconsistent state with the 6th node having more cache entries then the other 6 nodes(5 write + 1 quiet) and some messages get dropped all together. any ideas?
Actions
2. Re: load test valid?

dmitry123 Aug 29, 2011 7:02 PM (in response to dmitry123)

a couple of changes got me to 7 stable write nodes + 1 quiet node.

i changed failure detection to FD_ALL and increased join_timeout in GMS to 15000ms. on the addition of the 8th write node it starts to have trouble with accepting messages because the 8th node is not in the view table. but this could be my test's fault. will post if i find the answer.
Actions
3. Re: load test valid?

galder.zamarreno Sep 13, 2011 6:01 AM (in response to dmitry123)

Dmitry, maybe GC is kicking in and that's causing false suspicions. Remember that if a node is doing a lot of GC, it won't be able to respond to other nodes. So, the usual stuff applies here:

- get some GC logs
- see if ParNewGC and/or UseConcMarkSweepGC options could help
- use large pages if heaps are big (above 4gb or so)

If not and you're having still issues, get some thread dumps to see what is going on
Actions

Go to original post