2 Replies Latest reply on Jan 18, 2012 3:57 AM by Thomas Peuss

    Not getting a stable cluster with TCP JGroups transport

    Thomas Peuss Newbie

      Hello!

       

      I am trying to establish a cluster (Infinispan 5.1.0CR3) among 8 nodes (Redhat Enterprise Linux 5.7 x86_64, Sun JDK 1.6.0_29 64bit, 24 cores, 32 GB RAM) with TCP transport (due to networking restrictions).

       

      My problem is that I cannot get a stable cluster. When I start the nodes at first everything seems OK and they all see each other. But after a minute or so they start to SUSPECT each other and the cluster falls apart. This happens even when there only a very small amount of traffic on the cluster.

       

      I am wondering about what can be the cause for the observed behaviour...

       

      Greetings

      Thomas

       

      Snippet from logfile (node 2):

      dataprocessing.log.1:11.01.2012 11:02:01,014 WARN  [OOB-1,Infinispan-Cluster,db-cassi-p2-30143] org.jgroups.protocols.FD: I was suspected by db-cassi-p5-59983; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      dataprocessing.log.1-11.01.2012 11:02:01,618 DEBUG [Timer-3,Infinispan-Cluster,db-cassi-p2-30143] org.jgroups.protocols.FD: sending are-you-alive msg to db-cassi-p3-6504 (own address=db-cassi-p2-30143)

      dataprocessing.log.1:11.01.2012 11:02:04,017 WARN  [OOB-2,Infinispan-Cluster,db-cassi-p2-30143] org.jgroups.protocols.FD: I was suspected by db-cassi-p5-59983; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

       

      Another snippet from logfile (node 2):

      dataprocessing.log.1-org.infinispan.util.concurrent.TimeoutException: Unable to acquire lock after [10 seconds] on key [ByteArrayKey{data=ByteArray{size=14, hashCode=5cca2df1, array=0x033e0b4144445245..}}] for requestor [GlobalTransactio

      dataprocessing.log.1-   at org.infinispan.util.concurrent.locks.LockManagerImpl.lock(LockManagerImpl.java:212)

      dataprocessing.log.1-   at org.infinispan.util.concurrent.locks.LockManagerImpl.acquireLock(LockManagerImpl.java:186)

      dataprocessing.log.1-   at org.infinispan.util.concurrent.locks.LockManagerImpl.acquireLock(LockManagerImpl.java:176)

      dataprocessing.log.1:   at org.infinispan.interceptors.locking.AbstractTxLockingInterceptor.lockKeyAndCheckOwnership(AbstractTxLockingInterceptor.java:172)

      dataprocessing.log.1:   at org.infinispan.interceptors.locking.AbstractTxLockingInterceptor.lockAndRegisterBackupLock(AbstractTxLockingInterceptor.java:131)

      dataprocessing.log.1:   at org.infinispan.interceptors.locking.OptimisticLockingInterceptor$LockAcquisitionVisitor.visitPutKeyValueCommand(OptimisticLockingInterceptor.java:208)

      dataprocessing.log.1-   at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:77)

      dataprocessing.log.1:   at org.infinispan.interceptors.locking.OptimisticLockingInterceptor.acquireLocksVisitingCommands(OptimisticLockingInterceptor.java:298)

      dataprocessing.log.1:   at org.infinispan.interceptors.locking.OptimisticLockingInterceptor.visitPrepareCommand(OptimisticLockingInterceptor.java:94)

      dataprocessing.log.1-   at org.infinispan.commands.tx.PrepareCommand.acceptVisitor(PrepareCommand.java:129)

      dataprocessing.log.1-   at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)

      dataprocessing.log.1-   at org.infinispan.interceptors.NotificationInterceptor.visitPrepareCommand(NotificationInterceptor.java:58)

      dataprocessing.log.1-   at org.infinispan.commands.tx.PrepareCommand.acceptVisitor(PrepareCommand.java:129)

      dataprocessing.log.1-   at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)

      dataprocessing.log.1-   at org.infinispan.interceptors.TxInterceptor.visitPrepareCommand(TxInterceptor.java:112)

      dataprocessing.log.1-   at org.infinispan.commands.tx.PrepareCommand.acceptVisitor(PrepareCommand.java:129)

      dataprocessing.log.1-   at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)

      dataprocessing.log.1-   at org.infinispan.interceptors.StateTransferLockInterceptor.handleWithRetries(StateTransferLockInterceptor.java:201)

      dataprocessing.log.1-   at org.infinispan.interceptors.StateTransferLockInterceptor.visitPrepareCommand(StateTransferLockInterceptor.java:81)

      dataprocessing.log.1-   at org.infinispan.commands.tx.PrepareCommand.acceptVisitor(PrepareCommand.java:129)

      dataprocessing.log.1-   at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)

      dataprocessing.log.1-   at org.infinispan.interceptors.base.CommandInterceptor.handleDefault(CommandInterceptor.java:130)

      dataprocessing.log.1-   at org.infinispan.commands.AbstractVisitor.visitPrepareCommand(AbstractVisitor.java:113)

      dataprocessing.log.1-   at org.infinispan.commands.tx.PrepareCommand.acceptVisitor(PrepareCommand.java:129)

      dataprocessing.log.1-   at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)

      dataprocessing.log.1-   at org.infinispan.interceptors.InvocationContextInterceptor.handleAll(InvocationContextInterceptor.java:116)

      dataprocessing.log.1-   at org.infinispan.interceptors.InvocationContextInterceptor.handleDefault(InvocationContextInterceptor.java:76)

      dataprocessing.log.1-   at org.infinispan.commands.AbstractVisitor.visitPrepareCommand(AbstractVisitor.java:113)

      dataprocessing.log.1-   at org.infinispan.commands.tx.PrepareCommand.acceptVisitor(PrepareCommand.java:129)

      dataprocessing.log.1-   at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)

      dataprocessing.log.1-   at org.infinispan.interceptors.BatchingInterceptor.handleDefault(BatchingInterceptor.java:70)

      dataprocessing.log.1-   at org.infinispan.commands.AbstractVisitor.visitPrepareCommand(AbstractVisitor.java:113)

      dataprocessing.log.1-   at org.infinispan.commands.tx.PrepareCommand.acceptVisitor(PrepareCommand.java:129)

      dataprocessing.log.1-   at org.infinispan.interceptors.InterceptorChain.invoke(InterceptorChain.java:345)

      dataprocessing.log.1-   at org.infinispan.commands.tx.PrepareCommand.perform(PrepareCommand.java:125)

      dataprocessing.log.1-   at org.infinispan.remoting.InboundInvocationHandlerImpl.handleInternal(InboundInvocationHandlerImpl.java:172)

      dataprocessing.log.1-   at org.infinispan.remoting.InboundInvocationHandlerImpl.handleWithWaitForBlocks(InboundInvocationHandlerImpl.java:181)

      dataprocessing.log.1-   at org.infinispan.remoting.InboundInvocationHandlerImpl.handleWithRetry(InboundInvocationHandlerImpl.java:207)

      dataprocessing.log.1-   at org.infinispan.remoting.InboundInvocationHandlerImpl.handle(InboundInvocationHandlerImpl.java:158)

      dataprocessing.log.1-   at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.executeCommand(CommandAwareRpcDispatcher.java:159)

      dataprocessing.log.1-   at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.handle(CommandAwareRpcDispatcher.java:139)

      dataprocessing.log.1-   at org.jgroups.blocks.RequestCorrelator.handleRequest(RequestCorrelator.java:447)

      dataprocessing.log.1-   at org.jgroups.blocks.RequestCorrelator.receiveMessage(RequestCorrelator.java:354)

      dataprocessing.log.1-   at org.jgroups.blocks.RequestCorrelator.receive(RequestCorrelator.java:230)

      dataprocessing.log.1-   at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:556)

      dataprocessing.log.1-   at org.jgroups.JChannel.up(JChannel.java:718)

      dataprocessing.log.1-   at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1026)

      dataprocessing.log.1-   at org.jgroups.protocols.pbcast.STATE_TRANSFER.up(STATE_TRANSFER.java:178)

      dataprocessing.log.1-   at org.jgroups.protocols.FRAG2.up(FRAG2.java:181)

      dataprocessing.log.1-   at org.jgroups.protocols.FlowControl.up(FlowControl.java:418)

      dataprocessing.log.1-   at org.jgroups.protocols.FlowControl.up(FlowControl.java:418)

      dataprocessing.log.1-   at org.jgroups.protocols.pbcast.GMS.up(GMS.java:881)

      dataprocessing.log.1-   at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:244)

      dataprocessing.log.1-   at org.jgroups.protocols.UNICAST.handleDataReceived(UNICAST.java:618)

      dataprocessing.log.1-   at org.jgroups.protocols.UNICAST.up(UNICAST.java:317)

      dataprocessing.log.1-   at org.jgroups.protocols.pbcast.NAKACK.up(NAKACK.java:597)

      dataprocessing.log.1-   at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:140)

      dataprocessing.log.1-   at org.jgroups.protocols.FD_ALL.up(FD_ALL.java:167)

      dataprocessing.log.1-   at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:284)

       

      Snippet from Logfile (node 5):

      11.01.2012 12:10:51,195 DEBUG [Timer-5,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: sending are-you-alive msg to db-cassi-p2-57429 (own address=db-cassi-p5-40550)

      11.01.2012 12:10:54,199 DEBUG [Timer-2,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: sending are-you-alive msg to db-cassi-p2-57429 (own address=db-cassi-p5-40550)

      11.01.2012 12:10:54,200 DEBUG [Timer-2,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: heartbeat missing from db-cassi-p2-57429 (number=0)

      11.01.2012 12:10:57,204 DEBUG [Timer-4,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: sending are-you-alive msg to db-cassi-p2-57429 (own address=db-cassi-p5-40550)

      11.01.2012 12:10:57,205 DEBUG [Timer-4,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: heartbeat missing from db-cassi-p2-57429 (number=1)

      11.01.2012 12:11:00,207 DEBUG [Timer-2,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: sending are-you-alive msg to db-cassi-p2-57429 (own address=db-cassi-p5-40550)

      11.01.2012 12:11:00,208 DEBUG [Timer-2,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: [db-cassi-p5-40550]: received no heartbeat ack from db-cassi-p2-57429 for 3 times (9000 milliseconds), suspecting

      11.01.2012 12:11:00,212 DEBUG [Timer-3,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: broadcasting SUSPECT message [suspected_mbrs=[db-cassi-p2-57429]] to group

      11.01.2012 12:11:03,213 DEBUG [Timer-5,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: sending are-you-alive msg to db-cassi-p2-57429 (own address=db-cassi-p5-40550)

      11.01.2012 12:11:03,215 DEBUG [Timer-4,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: broadcasting SUSPECT message [suspected_mbrs=[db-cassi-p2-57429]] to group

      11.01.2012 12:11:03,215 DEBUG [Timer-5,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: heartbeat missing from db-cassi-p2-57429 (number=0)

      11.01.2012 12:11:06,218 DEBUG [Timer-2,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: sending are-you-alive msg to db-cassi-p2-57429 (own address=db-cassi-p5-40550)

      11.01.2012 12:11:06,218 DEBUG [Timer-3,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: broadcasting SUSPECT message [suspected_mbrs=[db-cassi-p2-57429]] to group

      11.01.2012 12:11:06,219 DEBUG [Timer-2,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: heartbeat missing from db-cassi-p2-57429 (number=1)

      11.01.2012 12:11:09,222 DEBUG [Timer-5,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: sending are-you-alive msg to db-cassi-p2-57429 (own address=db-cassi-p5-40550)

      11.01.2012 12:11:09,223 DEBUG [Timer-4,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: broadcasting SUSPECT message [suspected_mbrs=[db-cassi-p2-57429]] to group

      11.01.2012 12:11:09,223 DEBUG [Timer-5,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: [db-cassi-p5-40550]: received no heartbeat ack from db-cassi-p2-57429 for 3 times (9000 milliseconds), suspecting

      11.01.2012 12:11:12,225 DEBUG [Timer-2,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: sending are-you-alive msg to db-cassi-p2-57429 (own address=db-cassi-p5-40550)

      11.01.2012 12:11:12,226 DEBUG [Timer-3,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: broadcasting SUSPECT message [suspected_mbrs=[db-cassi-p2-57429]] to group

      11.01.2012 12:11:12,226 DEBUG [Timer-2,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: heartbeat missing from db-cassi-p2-57429 (number=0)

      11.01.2012 12:11:15,231 DEBUG [Timer-5,Infinispan-Cluster,db-cassi-p5-40550] org.jgroups.protocols.FD: broadcasting SUSPECT message [suspected_mbrs=[db-cassi-p2-57429]] to group

       

      Infinispan-Config:

      <infinispan

            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

            xsi:schemaLocation="urn:infinispan:config:5.0 http://www.infinispan.org/schemas/infinispan-config-5.0.xsd"

            xmlns="urn:infinispan:config:5.0">

       

         <global>

            <transport>

               <properties>

                  <property name="configurationFile" value="jgroups-config.xml" />

               </properties>

            </transport>

            <globalJmxStatistics enabled="true"/>

         </global>

       

         <default>

            <invocationBatching enabled="true" />

            <clustering mode="distribution">

               <l1 enabled="false" />

               <hash numOwners="2"/>

               <sync replTimeout="120000"/>

            </clustering>

            <jmxStatistics enabled="true"/>

            <locking concurrencyLevel="1000"/>

         </default>

      </infinispan>

       

      JGroups-Config:

      <config xmlns="urn:org:jgroups"

              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

              xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.0.xsd">

          <TCP_NIO

                  bind_port="7800"

                  use_send_queues="true"

                  sock_conn_timeout="300"

                  reader_threads="5"

                  writer_threads="5"

                  processor_threads="0"

                  processor_minThreads="0"

                  processor_maxThreads="0"

                  processor_queueSize="100"

                  processor_keepAliveTime="9223372036854775807"/>

          <TCPPING initial_hosts="${jgroups.tcpping.initial_hosts:localhost[7800],localhost[7801]}"/>

          <MERGE2/>

          <FD_SOCK/>

          <FD_ALL/>

          <VERIFY_SUSPECT/>

          <pbcast.NAKACK use_mcast_xmit="false"/>

          <UNICAST/>

          <pbcast.STABLE/>

          <pbcast.GMS/>

          <UFC/>

          <MFC/>

          <FRAG2/>

          <pbcast.STATE_TRANSFER />

      </config>

        • 1. Re: Not getting a stable cluster with TCP JGroups transport
          Bela Ban Master
          1. Don't use TCP_NIO (not supported), use TCP instead. In general copy the config from your version of Infinispan, there is a jgroups-tcp.xml config. I suggest copy it and modify it slightly.
          2. Use a bind_addr or set -Djgroups.bind_addr=1.2.3.4 (where 1.2.3.4 is a valid IP address). Note that you can also use site_local to make JGroups grab *any* local IP address
          3. Increase the timeout in FD_ALL, e.g. timeout="25000" interval="4000"
          4. Try this with the *same* config, but only JGroups (e.g. org.jgroups.demos.Draw). Run a cluster of 8 and leave it running for a day or so. You should *not* see *any* nodes leaving, or else you might have a network setup / configuration issue...
          1 of 1 people found this helpful
          • 2. Re: Not getting a stable cluster with TCP JGroups transport
            Thomas Peuss Newbie

            Hi Bela!

             

            Thanks for your tips!

             

            I got it working with following config copied from the Infinispan-jar as you suggested:

            <config xmlns="urn:org:jgroups"

                    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

                    xsi:schemaLocation="urn:org:jgroups file:schema/JGroups-2.8.xsd">

               <TCP

                    bind_addr="${jgroups.tcp.address:127.0.0.1}"

                    bind_port="${jgroups.tcp.port:7800}"

                    loopback="true"

                    port_range="30"

                    recv_buf_size="20000000"

                    send_buf_size="640000"

                    discard_incompatible_packets="true"

                    max_bundle_size="64000"

                    max_bundle_timeout="30"

                    enable_bundling="true"

                    use_send_queues="true"

                    sock_conn_timeout="300"

                    enable_diagnostics="false"

             

                    thread_pool.enabled="true"

                    thread_pool.min_threads="2"

                    thread_pool.max_threads="30"

                    thread_pool.keep_alive_time="5000"

                    thread_pool.queue_enabled="false"

                    thread_pool.queue_max_size="100"

                    thread_pool.rejection_policy="Discard"

             

                    oob_thread_pool.enabled="true"

                    oob_thread_pool.min_threads="2"

                    oob_thread_pool.max_threads="30"

                    oob_thread_pool.keep_alive_time="5000"

                    oob_thread_pool.queue_enabled="false"

                    oob_thread_pool.queue_max_size="100"

                    oob_thread_pool.rejection_policy="Discard"       

                     />

               <TCPPING timeout="3000"

                        initial_hosts="${jgroups.tcpping.initial_hosts:localhost[7800],localhost[7801]}"

                        port_range="1"

                        num_initial_members="3"

                        ergonomics="false"/>

             

               <MERGE2 max_interval="30000"

                       min_interval="10000"/>

               <FD_SOCK/>

               <FD timeout="3000" max_tries="3"/>

               <VERIFY_SUSPECT timeout="1500"/>

               <pbcast.NAKACK

                     use_mcast_xmit="false"

                     retransmit_timeout="300,600,1200,2400,4800"

                     discard_delivered_msgs="false"/>

               <UNICAST2 timeout="300,600,1200"/>

               <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"

                              max_bytes="400000"/>

               <pbcast.GMS print_local_addr="false" join_timeout="7000" view_bundling="true"/>

               <UFC max_credits="2000000" min_threshold="0.10"/>

               <MFC max_credits="2000000" min_threshold="0.10"/>

               <FRAG2 frag_size="60000"/>

            </config>