
    Issues Scaling Hotrod Cluster

    markaddy

      Hi,

       

      When trying to add additional nodes to a running hotrod server cluster under load, I am seeing the problems described below.

      We are running 5.0.0.CR7 in distributed async mode; the hotrod server configuration file is attached.

       

      The issue can be reproduced as follows:

       

      1) Start 2 hotrod servers

      2) Begin putting entries into the cache at a sustained rate (a sketch of such a load client is shown after these steps)

      3) Start another hotrod server
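
      For step 2, something along these lines is enough to keep sustained puts going while the third server joins. This is only a sketch assuming the 5.0-era HotRod Java client API (RemoteCacheManager/RemoteCache); the server list, cache name and put rate are illustrative placeholders.

      {code}import java.util.Properties;

import org.infinispan.client.hotrod.RemoteCache;
import org.infinispan.client.hotrod.RemoteCacheManager;

// Keeps putting entries into "testCache" at a sustained rate so the rehash
// triggered by the joining node happens while writes are in flight.
public class SustainedPutClient {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Point the client at the two initial HotRod servers (placeholder addresses).
        props.setProperty("infinispan.client.hotrod.server_list", "hotrod1:11222;hotrod2:11222");
        RemoteCacheManager rcm = new RemoteCacheManager(props);
        RemoteCache<String, String> cache = rcm.getCache("testCache");
        long i = 0;
        while (true) {
            cache.put("key-" + i, "value-" + i);
            i++;
            Thread.sleep(5); // roughly 200 puts/second of sustained load
        }
    }
}{code}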

       

      When the third server starts, a cluster view is created and a rehash of the data commences on the first two nodes.  However, the following stack trace is seen in the logs of the original two servers:

       

      {code}2011-07-02 21:14:36,097 ERROR [org.infinispan.remoting.rpc.RpcManagerImpl] (AsyncSerializationThread-1) ISPN00073: Unexpected error while replicating

      org.infinispan.manager.NamedCacheNotFoundException: Cannot process command RehashControlCommand{type=APPLY_STATE, sender=HOTROD1-57202(SITE1), viewId=2, state=2477, oldConsistentHash=TopologyAwareConsistentHash {1115: HOTROD1-57202(SITE1), 6597: HOTROD2-49193(SITE1)}, consistentHash=TopologyAwareConsistentHash {93: HOTROD3-971(SITE1), 1115: HOTROD1-57202(SITE1), 6597: HOTROD2-49193(SITE1)}} on node HOTROD3-971(SITE1) Cache: testCache

              at org.infinispan.remoting.InboundInvocationHandlerImpl.handle(InboundInvocationHandlerImpl.java:153)

              at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.executeCommand(CommandAwareRpcDispatcher.java:159)

              at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.handle(CommandAwareRpcDispatcher.java:144){code}

       

      On the joining node, the corresponding messages are seen in the logs:

       

      {code}2011-07-02 21:14:04,442 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (InfinispanServer-Main) ISPN00094: Received new cluster view: [HOTROD1-57202(SITE1)|2] [HOTROD1-57202(SITE1), HOTROD2-49193(SITE1), HOTROD3-971(SITE1)]

      2011-07-02 21:14:05,220 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (InfinispanServer-Main) ISPN00079: Cache local address is HOTROD3-971(SITE1), physical addresses are [192.168.152.139:48528]

      2011-07-02 21:14:06,017 INFO  [org.infinispan.remoting.InboundInvocationHandlerImpl] (OOB-6,HOTROD3-971(SITE1)) ISPN00067: Will try and wait for the cache to start

      2011-07-02 21:14:06,027 INFO  [org.infinispan.remoting.InboundInvocationHandlerImpl] (OOB-2,HOTROD3-971(SITE1)) ISPN00067: Will try and wait for the cache to start

      2011-07-02 21:14:06,361 INFO  [org.infinispan.remoting.rpc.RpcManagerImpl] (InfinispanServer-Main) ISPN00074: Trying to fetch state from HOTROD1-57202(SITE1)

      2011-07-02 21:14:36,071 INFO  [org.infinispan.remoting.InboundInvocationHandlerImpl] (OOB-6,HOTROD3-971(SITE1)) ISPN00068: Cache named testCache does not exist on this cache manager!

      2011-07-02 21:14:36,127 INFO  [org.infinispan.remoting.InboundInvocationHandlerImpl] (OOB-2,HOTROD3-971(SITE1)) ISPN00068: Cache named testCache does not exist on this cache manager!

      2011-07-02 21:15:06,445 INFO  [org.infinispan.remoting.InboundInvocationHandlerImpl] (Incoming-33,HOTROD3-971(SITE1)) ISPN00067: Will try and wait for the cache to start

      2011-07-02 21:15:36,453 INFO  [org.infinispan.remoting.InboundInvocationHandlerImpl] (Incoming-33,HOTROD3-971(SITE1)) ISPN00068: Cache named testCache does not exist on this cache manager!{code}

       

      The rehash is failing because startup of the joining node is blocked.  Creation of the hotrod topology cache "___hotRodTopologyCache" cannot complete, which in turn prevents creation of our named cache.

       

      I took stack traces from the joining node, which show the InfinispanServer-Main thread waiting for the ___hotRodTopologyCache state transfer to complete (joining-node-stack-trace.txt), and also from the existing node providing the state (existing-node-stack-trace.txt).

       

      The existing node is attempting to obtain an exclusive cluster-wide lock (JGroupsDistSync.acquireProcessingLock, JGroupsDistSync.java:100) before providing topology state to the joiner. This cannot happen while read locks are held, and they are held as long as clients keep putting entries into the cache.
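
      To illustrate the contention with plain JDK locks (a standalone sketch, not Infinispan code; "processingLock" is only a stand-in for the lock behind JGroupsDistSync): an exclusive acquisition on a ReentrantReadWriteLock cannot proceed while any read locks are held, which is exactly the situation while the clients keep putting.

      {code}import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ExclusiveLockBlockedSketch {
    public static void main(String[] args) throws Exception {
        final ReentrantReadWriteLock processingLock = new ReentrantReadWriteLock();

        // An "incoming put" holds the lock in shared (read) mode for the whole test.
        new Thread(new Runnable() {
            public void run() {
                processingLock.readLock().lock();
                try {
                    Thread.sleep(10000);           // simulate an in-flight request
                } catch (InterruptedException ignored) {
                }
            }
        }).start();
        Thread.sleep(200);                          // let the read lock be taken

        // The state provider now needs the lock exclusively and times out,
        // just like acquireProcessingLock() does after 240 seconds.
        boolean acquired = processingLock.writeLock().tryLock(2, TimeUnit.SECONDS);
        System.out.println("exclusive processing lock acquired: " + acquired); // false
    }
}{code}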

       

      Eventually the state transfer times out on the existing node and the cluster as a whole becomes unusable.

       

      {code}2011-07-02 21:18:06,466 ERROR [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (STREAMING_STATE_TRANSFER-sender-2,HOTROD1-57202(SITE1)) ISPN00095: Caught while responding to state transfer request

      org.infinispan.statetransfer.StateTransferException: java.util.concurrent.TimeoutException: STREAMING_STATE_TRANSFER-sender-2,HOTROD1-57202(SITE1) could not obtain exclusive processing lock after 240 seconds.  Locks in question are java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock@28d51032[Read locks = 2] and java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock@3c1a578f[Unlocked]

              at org.infinispan.statetransfer.StateTransferManagerImpl.generateState(StateTransferManagerImpl.java:165)

              at org.infinispan.remoting.InboundInvocationHandlerImpl.generateState(InboundInvocationHandlerImpl.java:248)

              at org.infinispan.remoting.transport.jgroups.JGroupsTransport.getState(JGroupsTransport.java:587)

              at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.handleUpEvent(MessageDispatcher.java:690)

              at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:771)

              at org.jgroups.JChannel.up(JChannel.java:1484)

              at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1074)

              at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER$StateProviderHandler.process(STREAMING_STATE_TRANSFER.java:651)

              at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER$StateProviderThreadSpawner$1.run(STREAMING_STATE_TRANSFER.java:580)

              at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

              at java.lang.Thread.run(Thread.java:662)

      Caused by: java.util.concurrent.TimeoutException: STREAMING_STATE_TRANSFER-sender-2,HOTROD1-57202(SITE1) could not obtain exclusive processing lock after 240 seconds.  Locks in question are java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock@28d51032[Read locks = 2] and java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock@3c1a578f[Unlocked]

              at org.infinispan.remoting.transport.jgroups.JGroupsDistSync.acquireProcessingLock(JGroupsDistSync.java:100)

              at org.infinispan.statetransfer.StateTransferManagerImpl.generateTransactionLog(StateTransferManagerImpl.java:192)

              at org.infinispan.statetransfer.StateTransferManagerImpl.generateState(StateTransferManagerImpl.java:155){code}

       

      When running in hotrod server mode, should rehashing be postponed until a new topology view has been established (and therefore any named caches have been created) rather than commencing on a JGroups view change?

       

      Any help appreciated, thanks.

       

      [C2B2 Consulting|http://www.c2b2.co.uk]

        • 1. Re: Issues Scaling Hotrod Cluster
          galder.zamarreno

          Mark, thx for the information provided so far. Do you by any chance have the rest of the thread dump from the "existing-node"? Clearly, something in that JVM is holding the processing lock that is stopping the node from serving the state for the topology cache, so the rest of the thread dump should show what that is.

           

          At first glance, as you hinted, the operations on the distributed cache might be holding the processing lock, and that is preventing the state transfer for the topology cache from succeeding. I need to see the rest of the thread dump from the existing node, but that would be my first suspicion, because the topology cache is only updated when a node joins or when it leaves/crashes, and in this case the update for the joining node happens *after* the state transfer has been done, so that part is not the problem here. The key thing is that the states of these two caches are disjoint, so we'd need to see whether this can be optimised.

           

          To confirm we're on the right track, please provide the rest of the existing node's thread dump.

           

          Also, as one of my colleagues pointed out, the time waited for a cache to become available on the receiving side should potentially take into account that state transfer might be in progress for another cache, although this is somewhat subjective and varies depending on the configuration of the other caches and the order in which they are started.

           

          Cheers

          • 2. Re: Issues Scaling Hotrod Cluster
            markaddy

            Galder,

             

            Thanks for the reply, I have attached full stack traces from the joiner and existing node.

             

            Mark

             

            [C2B2 Consulting|http://www.c2b2.co.uk]

            • 3. Re: Issues Scaling Hotrod Cluster
              dan.berindei

              Mark, it looks like your initial diagnosis was correct.

               

              I was the one who suggested to Galder that you might want to decrease the timeouts for state transfer - that might have helped if the deadlock were related only to the topology cache, but in your case it won't do anything.
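
              For reference, those timeouts live roughly here in the 5.0 configuration - this is only a sketch assuming the 5.0 schema, with illustrative names and values; distributedSyncTimeout on the transport is what produces the "could not obtain exclusive processing lock after 240 seconds" message, so lowering it only makes the failure show up sooner:

              {code}<infinispan>
   <global>
      <!-- distributedSyncTimeout is the wait for the exclusive processing lock
           (the 240 s in the TimeoutException above); lowering it surfaces the
           failure earlier but does not remove the deadlock -->
      <transport clusterName="HotRodCluster" distributedSyncTimeout="60000"/>
   </global>
   <namedCache name="testCache">
      <clustering mode="distribution">
         <async/>
         <!-- rehash timeouts for the distributed cache -->
         <hash numOwners="2" rehashRpcTimeout="120000" rehashWait="60000"/>
      </clustering>
   </namedCache>
</infinispan>{code}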

               

              The sequence of events on the existing node probably went like this (a standalone sketch of the resulting lock cycle follows the list):

              1. Received the view change

              2. The rehashing thread acquired TransactionLoggerImpl.txLock for write

              3. Incoming put requests acquired JGroupsDistSync.processingLock for read

              4. Incoming put requests tried to acquire TransactionLoggerImpl.txLock for read - most incoming threads are waiting here

              5. Received the state request from the joiner

              6. State transfer closed JGroupsDistSync.flushBlockGate - there is one incoming thread waiting here too

              7. State transfer now waits to acquire JGroupsDistSync.processingLock for write, but it can't, because all the incoming threads blocked waiting for TransactionLoggerImpl.txLock have already acquired JGroupsDistSync.processingLock for read.
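
              To make the cycle concrete, here is a standalone sketch with plain JDK locks (not Infinispan code; txLock and processingLock merely stand in for TransactionLoggerImpl.txLock and JGroupsDistSync.processingLock, and the thread names are illustrative):

              {code}import java.util.concurrent.locks.ReentrantReadWriteLock;

public class RehashDeadlockSketch {
    static final ReentrantReadWriteLock txLock = new ReentrantReadWriteLock();
    static final ReentrantReadWriteLock processingLock = new ReentrantReadWriteLock();

    public static void main(String[] args) throws Exception {
        // Step 2: the rehashing thread (here, main) holds txLock for write.
        txLock.writeLock().lock();

        // Steps 3-4: an incoming put request takes processingLock for read and
        // then blocks waiting for txLock for read.
        Thread incomingPut = new Thread(new Runnable() {
            public void run() {
                processingLock.readLock().lock();
                txLock.readLock().lock();          // blocks: rehash holds the write lock
            }
        }, "incoming-put");
        incomingPut.start();
        Thread.sleep(500);                         // let the put thread reach the block

        // Steps 6-7: state transfer needs processingLock for write, but the blocked
        // put thread never releases its read lock, so this never succeeds.
        Thread stateTransfer = new Thread(new Runnable() {
            public void run() {
                processingLock.writeLock().lock(); // deadlocked
            }
        }, "state-transfer");
        stateTransfer.start();

        stateTransfer.join(3000);
        System.out.println("state transfer still blocked: " + stateTransfer.isAlive()); // true
        System.exit(0);                            // kill the stuck threads and end the demo
    }
}{code}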

               

              For 5.0 I don't think we can follow your suggestion of waiting to rehash until the joiner has finished receiving replicated state; that would be too big a change. However, I think we could stop acquiring JGroupsDistSync.processingLock altogether in InboundInvocationHandlerImpl when the target cache is distributed - that should allow the state transfer to go through, and the put requests on the distributed cache are blocked anyway. Galder, what do you think?

               

              For 5.1 I'm working on a consolidation of the state transfer code with the rehashing code, which will hopefully mean that each cache will get its own transaction lock.