Not getting a stable cluster with TCP JGroups transport
derry Jan 17, 2012 10:58 AMHello!
I am trying to establish a cluster (Infinispan 5.1.0CR3) among 8 nodes (Redhat Enterprise Linux 5.7 x86_64, Sun JDK 1.6.0_29 64bit, 24 cores, 32 GB RAM) with TCP transport (due to networking restrictions).
My problem is that I cannot get a stable cluster. When I start the nodes at first everything seems OK and they all see each other. But after a minute or so they start to SUSPECT each other and the cluster falls apart. This happens even when there only a very small amount of traffic on the cluster.
I am wondering about what can be the cause for the observed behaviour...
Greetings
Thomas
Snippet from logfile (node 2):
dataprocessing.log.1:11.01.2012 11:02:01,014 WARN [OOB-1,Infinispan-Cluster,db-cassi-p2-30143] org.jgroups.protocols.FD: I was suspected by db-cassi-p5-59983; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
dataprocessing.log.1-11.01.2012 11:02:01,618 DEBUG [Timer-3,Infinispan-Cluster,db-cassi-p2-30143] org.jgroups.protocols.FD: sending are-you-alive msg to db-cassi-p3-6504 (own address=db-cassi-p2-30143)
dataprocessing.log.1:11.01.2012 11:02:04,017 WARN [OOB-2,Infinispan-Cluster,db-cassi-p2-30143] org.jgroups.protocols.FD: I was suspected by db-cassi-p5-59983; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
Another snippet from logfile (node 2):
dataprocessing.log.1-org.infinispan.util.concurrent.TimeoutException: Unable to acquire lock after [10 seconds] on key [ByteArrayKey{data=ByteArray{size=14, hashCode=5cca2df1, array=0x033e0b4144445245..}}] for requestor [GlobalTransactio
dataprocessing.log.1- at org.infinispan.util.concurrent.locks.LockManagerImpl.lock(LockManagerImpl.java:212)
dataprocessing.log.1- at org.infinispan.util.concurrent.locks.LockManagerImpl.acquireLock(LockManagerImpl.java:186)
dataprocessing.log.1- at org.infinispan.util.concurrent.locks.LockManagerImpl.acquireLock(LockManagerImpl.java:176)
dataprocessing.log.1: at org.infinispan.interceptors.locking.AbstractTxLockingInterceptor.lockKeyAndCheckOwnership(AbstractTxLockingInterceptor.java:172)
dataprocessing.log.1: at org.infinispan.interceptors.locking.AbstractTxLockingInterceptor.lockAndRegisterBackupLock(AbstractTxLockingInterceptor.java:131)
dataprocessing.log.1: at org.infinispan.interceptors.locking.OptimisticLockingInterceptor$LockAcquisitionVisitor.visitPutKeyValueCommand(OptimisticLockingInterceptor.java:208)
dataprocessing.log.1- at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:77)
dataprocessing.log.1: at org.infinispan.interceptors.locking.OptimisticLockingInterceptor.acquireLocksVisitingCommands(OptimisticLockingInterceptor.java:298)
dataprocessing.log.1: at org.infinispan.interceptors.locking.OptimisticLockingInterceptor.visitPrepareCommand(OptimisticLockingInterceptor.java:94)
dataprocessing.log.1- at org.infinispan.commands.tx.PrepareCommand.acceptVisitor(PrepareCommand.java:129)
dataprocessing.log.1- at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)
dataprocessing.log.1- at org.infinispan.interceptors.NotificationInterceptor.visitPrepareCommand(NotificationInterceptor.java:58)
dataprocessing.log.1- at org.infinispan.commands.tx.PrepareCommand.acceptVisitor(PrepareCommand.java:129)
dataprocessing.log.1- at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)
dataprocessing.log.1- at org.infinispan.interceptors.TxInterceptor.visitPrepareCommand(TxInterceptor.java:112)
dataprocessing.log.1- at org.infinispan.commands.tx.PrepareCommand.acceptVisitor(PrepareCommand.java:129)
dataprocessing.log.1- at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)
dataprocessing.log.1- at org.infinispan.interceptors.StateTransferLockInterceptor.handleWithRetries(StateTransferLockInterceptor.java:201)
dataprocessing.log.1- at org.infinispan.interceptors.StateTransferLockInterceptor.visitPrepareCommand(StateTransferLockInterceptor.java:81)
dataprocessing.log.1- at org.infinispan.commands.tx.PrepareCommand.acceptVisitor(PrepareCommand.java:129)
dataprocessing.log.1- at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)
dataprocessing.log.1- at org.infinispan.interceptors.base.CommandInterceptor.handleDefault(CommandInterceptor.java:130)
dataprocessing.log.1- at org.infinispan.commands.AbstractVisitor.visitPrepareCommand(AbstractVisitor.java:113)
dataprocessing.log.1- at org.infinispan.commands.tx.PrepareCommand.acceptVisitor(PrepareCommand.java:129)
dataprocessing.log.1- at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)
dataprocessing.log.1- at org.infinispan.interceptors.InvocationContextInterceptor.handleAll(InvocationContextInterceptor.java:116)
dataprocessing.log.1- at org.infinispan.interceptors.InvocationContextInterceptor.handleDefault(InvocationContextInterceptor.java:76)
dataprocessing.log.1- at org.infinispan.commands.AbstractVisitor.visitPrepareCommand(AbstractVisitor.java:113)
dataprocessing.log.1- at org.infinispan.commands.tx.PrepareCommand.acceptVisitor(PrepareCommand.java:129)
dataprocessing.log.1- at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)
dataprocessing.log.1- at org.infinispan.interceptors.BatchingInterceptor.handleDefault(BatchingInterceptor.java:70)
dataprocessing.log.1- at org.infinispan.commands.AbstractVisitor.visitPrepareCommand(AbstractVisitor.java:113)
dataprocessing.log.1- at org.infinispan.commands.tx.PrepareCommand.acceptVisitor(PrepareCommand.java:129)
dataprocessing.log.1- at org.infinispan.interceptors.InterceptorChain.invoke(InterceptorChain.java:345)
dataprocessing.log.1- at org.infinispan.commands.tx.PrepareCommand.perform(PrepareCommand.java:125)
dataprocessing.log.1- at org.infinispan.remoting.InboundInvocationHandlerImpl.handleInternal(InboundInvocationHandlerImpl.java:172)
dataprocessing.log.1- at org.infinispan.remoting.InboundInvocationHandlerImpl.handleWithWaitForBlocks(InboundInvocationHandlerImpl.java:181)
dataprocessing.log.1- at org.infinispan.remoting.InboundInvocationHandlerImpl.handleWithRetry(InboundInvocationHandlerImpl.java:207)
dataprocessing.log.1- at org.infinispan.remoting.InboundInvocationHandlerImpl.handle(InboundInvocationHandlerImpl.java:158)
dataprocessing.log.1- at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.executeCommand(CommandAwareRpcDispatcher.java:159)
dataprocessing.log.1- at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.handle(CommandAwareRpcDispatcher.java:139)
dataprocessing.log.1- at org.jgroups.blocks.RequestCorrelator.handleRequest(RequestCorrelator.java:447)
dataprocessing.log.1- at org.jgroups.blocks.RequestCorrelator.receiveMessage(RequestCorrelator.java:354)
dataprocessing.log.1- at org.jgroups.blocks.RequestCorrelator.receive(RequestCorrelator.java:230)
dataprocessing.log.1- at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:556)
dataprocessing.log.1- at org.jgroups.JChannel.up(JChannel.java:718)
dataprocessing.log.1- at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1026)
dataprocessing.log.1- at org.jgroups.protocols.pbcast.STATE_TRANSFER.up(STATE_TRANSFER.java:178)
dataprocessing.log.1- at org.jgroups.protocols.FRAG2.up(FRAG2.java:181)
dataprocessing.log.1- at org.jgroups.protocols.FlowControl.up(FlowControl.java:418)
dataprocessing.log.1- at org.jgroups.protocols.FlowControl.up(FlowControl.java:418)
dataprocessing.log.1- at org.jgroups.protocols.pbcast.GMS.up(GMS.java:881)
dataprocessing.log.1- at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:244)
dataprocessing.log.1- at org.jgroups.protocols.UNICAST.handleDataReceived(UNICAST.java:618)
dataprocessing.log.1- at org.jgroups.protocols.UNICAST.up(UNICAST.java:317)
dataprocessing.log.1- at org.jgroups.protocols.pbcast.NAKACK.up(NAKACK.java:597)
dataprocessing.log.1- at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:140)
dataprocessing.log.1- at org.jgroups.protocols.FD_ALL.up(FD_ALL.java:167)
dataprocessing.log.1- at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:284)
Snippet from Logfile (node 5):
11.01.2012 12:10:51,195 DEBUG [Timer-5,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: sending are-you-alive msg to db-cassi-p2-57429 (own address=db-cassi-p5-40550)
11.01.2012 12:10:54,199 DEBUG [Timer-2,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: sending are-you-alive msg to db-cassi-p2-57429 (own address=db-cassi-p5-40550)
11.01.2012 12:10:54,200 DEBUG [Timer-2,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: heartbeat missing from db-cassi-p2-57429 (number=0)
11.01.2012 12:10:57,204 DEBUG [Timer-4,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: sending are-you-alive msg to db-cassi-p2-57429 (own address=db-cassi-p5-40550)
11.01.2012 12:10:57,205 DEBUG [Timer-4,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: heartbeat missing from db-cassi-p2-57429 (number=1)
11.01.2012 12:11:00,207 DEBUG [Timer-2,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: sending are-you-alive msg to db-cassi-p2-57429 (own address=db-cassi-p5-40550)
11.01.2012 12:11:00,208 DEBUG [Timer-2,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: [db-cassi-p5-40550]: received no heartbeat ack from db-cassi-p2-57429 for 3 times (9000 milliseconds), suspecting
11.01.2012 12:11:00,212 DEBUG [Timer-3,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: broadcasting SUSPECT message [suspected_mbrs=[db-cassi-p2-57429]] to group
11.01.2012 12:11:03,213 DEBUG [Timer-5,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: sending are-you-alive msg to db-cassi-p2-57429 (own address=db-cassi-p5-40550)
11.01.2012 12:11:03,215 DEBUG [Timer-4,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: broadcasting SUSPECT message [suspected_mbrs=[db-cassi-p2-57429]] to group
11.01.2012 12:11:03,215 DEBUG [Timer-5,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: heartbeat missing from db-cassi-p2-57429 (number=0)
11.01.2012 12:11:06,218 DEBUG [Timer-2,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: sending are-you-alive msg to db-cassi-p2-57429 (own address=db-cassi-p5-40550)
11.01.2012 12:11:06,218 DEBUG [Timer-3,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: broadcasting SUSPECT message [suspected_mbrs=[db-cassi-p2-57429]] to group
11.01.2012 12:11:06,219 DEBUG [Timer-2,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: heartbeat missing from db-cassi-p2-57429 (number=1)
11.01.2012 12:11:09,222 DEBUG [Timer-5,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: sending are-you-alive msg to db-cassi-p2-57429 (own address=db-cassi-p5-40550)
11.01.2012 12:11:09,223 DEBUG [Timer-4,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: broadcasting SUSPECT message [suspected_mbrs=[db-cassi-p2-57429]] to group
11.01.2012 12:11:09,223 DEBUG [Timer-5,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: [db-cassi-p5-40550]: received no heartbeat ack from db-cassi-p2-57429 for 3 times (9000 milliseconds), suspecting
11.01.2012 12:11:12,225 DEBUG [Timer-2,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: sending are-you-alive msg to db-cassi-p2-57429 (own address=db-cassi-p5-40550)
11.01.2012 12:11:12,226 DEBUG [Timer-3,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: broadcasting SUSPECT message [suspected_mbrs=[db-cassi-p2-57429]] to group
11.01.2012 12:11:12,226 DEBUG [Timer-2,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: heartbeat missing from db-cassi-p2-57429 (number=0)
11.01.2012 12:11:15,231 DEBUG [Timer-5,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: broadcasting SUSPECT message [suspected_mbrs=[db-cassi-p2-57429]] to group
Infinispan-Config:
<infinispan
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="urn:infinispan:config:5.0 http://www.infinispan.org/schemas/infinispan-config-5.0.xsd"
xmlns="urn:infinispan:config:5.0">
<global>
<transport>
<properties>
<property name="configurationFile" value="jgroups-config.xml" />
</properties>
</transport>
<globalJmxStatistics enabled="true"/>
</global>
<default>
<invocationBatching enabled="true" />
<clustering mode="distribution">
<l1 enabled="false" />
<hash numOwners="2"/>
<sync replTimeout="120000"/>
</clustering>
<jmxStatistics enabled="true"/>
<locking concurrencyLevel="1000"/>
</default>
</infinispan>
JGroups-Config:
<config xmlns="urn:org:jgroups"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.0.xsd">
<TCP_NIO
bind_port="7800"
use_send_queues="true"
sock_conn_timeout="300"
reader_threads="5"
writer_threads="5"
processor_threads="0"
processor_minThreads="0"
processor_maxThreads="0"
processor_queueSize="100"
processor_keepAliveTime="9223372036854775807"/>
<TCPPING initial_hosts="${jgroups.tcpping.initial_hosts:localhost[7800],localhost[7801]}"/>
<MERGE2/>
<FD_SOCK/>
<FD_ALL/>
<VERIFY_SUSPECT/>
<pbcast.NAKACK use_mcast_xmit="false"/>
<UNICAST/>
<pbcast.STABLE/>
<pbcast.GMS/>
<UFC/>
<MFC/>
<FRAG2/>
<pbcast.STATE_TRANSFER />
</config>