Issues Scaling Hotrod Cluster
markaddy Jul 4, 2011 8:57 AM
Hi,
When trying to add additional nodes to a running hotrod server cluster under load I am seeing the problems described below.
We are running 5.0.0.CR7 in distributed async mode, the hotrod server configuration file is attached.
The issue can be reproduced as follows:
1) Start 2 hotrod servers
2) Begin putting entries into the cache at a sustained rate
3) Start another hotrod server
When the third server starts, a new cluster view is created and a rehash of the data commences on the first two nodes. However, the following stack trace is seen in the logs of the original two servers:
{code}2011-07-02 21:14:36,097 ERROR [org.infinispan.remoting.rpc.RpcManagerImpl] (AsyncSerializationThread-1) ISPN00073: Unexpected error while replicating
org.infinispan.manager.NamedCacheNotFoundException: Cannot process command RehashControlCommand{type=APPLY_STATE, sender=HOTROD1-57202(SITE1), viewId=2, state=2477, oldConsistentHash=TopologyAwareConsistentHash {1115: HOTROD1-57202(SITE1), 6597: HOTROD2-49193(SITE1)}, consistentHash=TopologyAwareConsistentHash {93: HOTROD3-971(SITE1), 1115: HOTROD1-57202(SITE1), 6597: HOTROD2-49193(SITE1)}} on node HOTROD3-971(SITE1) Cache: testCache
at org.infinispan.remoting.InboundInvocationHandlerImpl.handle(InboundInvocationHandlerImpl.java:153)
at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.executeCommand(CommandAwareRpcDispatcher.java:159)
at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.handle(CommandAwareRpcDispatcher.java:144){code}
On the joining node these corresponding messages are seen in the logs:
{code}2011-07-02 21:14:04,442 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (InfinispanServer-Main) ISPN00094: Received new cluster view: [HOTROD1-57202(SITE1)|2] [HOTROD1-57202(SITE1), HOTROD2-49193(SITE1), HOTROD3-971(SITE1)]
2011-07-02 21:14:05,220 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (InfinispanServer-Main) ISPN00079: Cache local address is HOTROD3-971(SITE1), physical addresses are [192.168.152.139:48528]
2011-07-02 21:14:06,017 INFO [org.infinispan.remoting.InboundInvocationHandlerImpl] (OOB-6,HOTROD3-971(SITE1)) ISPN00067: Will try and wait for the cache to start
2011-07-02 21:14:06,027 INFO [org.infinispan.remoting.InboundInvocationHandlerImpl] (OOB-2,HOTROD3-971(SITE1)) ISPN00067: Will try and wait for the cache to start
2011-07-02 21:14:06,361 INFO [org.infinispan.remoting.rpc.RpcManagerImpl] (InfinispanServer-Main) ISPN00074: Trying to fetch state from HOTROD1-57202(SITE1)
2011-07-02 21:14:36,071 INFO [org.infinispan.remoting.InboundInvocationHandlerImpl] (OOB-6,HOTROD3-971(SITE1)) ISPN00068: Cache named testCache does not exist on this cache manager!
2011-07-02 21:14:36,127 INFO [org.infinispan.remoting.InboundInvocationHandlerImpl] (OOB-2,HOTROD3-971(SITE1)) ISPN00068: Cache named testCache does not exist on this cache manager!
2011-07-02 21:15:06,445 INFO [org.infinispan.remoting.InboundInvocationHandlerImpl] (Incoming-33,HOTROD3-971(SITE1)) ISPN00067: Will try and wait for the cache to start
2011-07-02 21:15:36,453 INFO [org.infinispan.remoting.InboundInvocationHandlerImpl] (Incoming-33,HOTROD3-971(SITE1)) ISPN00068: Cache named testCache does not exist on this cache manager!{code}
The rehash is failing because startup of the joining node is blocked: creation of the hotrod topology cache "___hotRodTopologyCache" is unable to complete, which in turn prevents creation of our named cache.
I took stack traces from the joining node, which show the InfinispanServer-Main thread waiting for the ___hotRodTopologyCache state transfer to complete (joining-node-stack-trace.txt), and also from the existing node providing the state (existing-node-stack-trace.txt).
The existing node is attempting to obtain an exclusive cluster-wide lock (JGroupsDistSync.acquireProcessingLock(JGroupsDistSync.java:100)) before providing topology state to the joiner - this cannot happen while read locks are held by clients still putting entries into the cache.
Eventually the state transfer times out on the existing node, leaving the cluster unusable as well.
{code}2011-07-02 21:18:06,466 ERROR [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (STREAMING_STATE_TRANSFER-sender-2,HOTROD1-57202(SITE1)) ISPN00095: Caught while responding to state transfer request
org.infinispan.statetransfer.StateTransferException: java.util.concurrent.TimeoutException: STREAMING_STATE_TRANSFER-sender-2,HOTROD1-57202(SITE1) could not obtain exclusive processing lock after 240 seconds. Locks in question are java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock@28d51032[Read locks = 2] and java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock@3c1a578f[Unlocked]
at org.infinispan.statetransfer.StateTransferManagerImpl.generateState(StateTransferManagerImpl.java:165)
at org.infinispan.remoting.InboundInvocationHandlerImpl.generateState(InboundInvocationHandlerImpl.java:248)
at org.infinispan.remoting.transport.jgroups.JGroupsTransport.getState(JGroupsTransport.java:587)
at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.handleUpEvent(MessageDispatcher.java:690)
at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:771)
at org.jgroups.JChannel.up(JChannel.java:1484)
at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1074)
at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER$StateProviderHandler.process(STREAMING_STATE_TRANSFER.java:651)
at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER$StateProviderThreadSpawner$1.run(STREAMING_STATE_TRANSFER.java:580)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.util.concurrent.TimeoutException: STREAMING_STATE_TRANSFER-sender-2,HOTROD1-57202(SITE1) could not obtain exclusive processing lock after 240 seconds. Locks in question are java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock@28d51032[Read locks = 2] and java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock@3c1a578f[Unlocked]
at org.infinispan.remoting.transport.jgroups.JGroupsDistSync.acquireProcessingLock(JGroupsDistSync.java:100)
at org.infinispan.statetransfer.StateTransferManagerImpl.generateTransactionLog(StateTransferManagerImpl.java:192)
at org.infinispan.statetransfer.StateTransferManagerImpl.generateState(StateTransferManagerImpl.java:155){code}
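The underlying contention can be illustrated with a plain java.util.concurrent.locks.ReentrantReadWriteLock, which is the lock type named in the timeout message above. This is only a minimal sketch of the general read/write-lock semantics, not Infinispan's actual JGroupsDistSync code: a writer can never acquire the lock while any read lock is held, so sustained client puts holding read locks starve the state provider's attempt at the exclusive lock until it times out.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class WriterStarvationDemo {
    public static void main(String[] args) throws InterruptedException {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

        // Simulate an in-flight client write holding the processing read lock,
        // as the sustained puts do while state transfer is requested.
        lock.readLock().lock();

        // The state provider needs the exclusive (write) lock. It cannot be
        // granted while any read lock is outstanding, so the attempt times out,
        // analogous to the 240-second timeout in the log above.
        boolean acquired = lock.writeLock().tryLock(500, TimeUnit.MILLISECONDS);
        System.out.println("exclusive lock acquired: " + acquired);
    }
}
```

Under steady load the read-lock count never drops to zero long enough for the writer to get in, which matches the "Read locks = 2 ... WriteLock ... [Unlocked]" state reported in the exception.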
When running in hotrod server mode, should rehashing be postponed until a new topology view has been established (and therefore any named caches created), rather than commencing on a JGroups view change?
Any help appreciated, thanks.
[C2B2 Consulting|http://www.c2b2.co.uk]
Attachments:
- existing-node-stack-trace.txt.zip (878 bytes)
- infinispan-config.xml (1.8 KB)