2 Replies Latest reply on Aug 23, 2015 10:55 AM by richmayfield

    Why does incoming message processing stop?

    richmayfield

      Infinispan 7.0.3-Final

       

      This one has me stumped. We have two nodes in our Infinispan cluster. All invalidating caches - we do not distribute the content. Occasionally one node stops processing incoming messages from the other. There are no errors logged or exceptions thrown.

      The node that is no longer processing messages always ends up waiting in StateTransferLockImpl.waitForTransactionData().

       

      "Incoming-2,lfvsfcp19626-23230" #289 prio=5 os_prio=0 tid=0x00002b3761e4c000 nid=0x8941 waiting on condition [0x00002b377c704000]
         java.lang.Thread.State: TIMED_WAITING (parking)
              at sun.misc.Unsafe.park(Native Method)
              - parking to wait for  <0x00000004d887bc00> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
              at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
              at org.infinispan.statetransfer.StateTransferLockImpl.waitForTransactionData(StateTransferLockImpl.java:90)
              at org.infinispan.remoting.InboundInvocationHandlerImpl.handleWithWaitForBlocks(InboundInvocationHandlerImpl.java:206)
              at org.infinispan.remoting.InboundInvocationHandlerImpl.handle(InboundInvocationHandlerImpl.java:86)
              at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.executeCommandFromLocalCluster(CommandAwareRpcDispatcher.java:267)
              at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.handle(CommandAwareRpcDispatcher.java:211)
              at org.jgroups.blocks.RequestCorrelator.handleRequest(RequestCorrelator.java:460)
              at org.jgroups.blocks.RequestCorrelator.receiveMessage(RequestCorrelator.java:377)
              at org.jgroups.blocks.RequestCorrelator.receive(RequestCorrelator.java:250)
              at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:677)
              at org.jgroups.JChannel.up(JChannel.java:755)
              at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1033)
              at org.jgroups.protocols.RSVP.up(RSVP.java:237)
              at org.jgroups.protocols.FRAG2.up(FRAG2.java:182)
              at org.jgroups.protocols.FlowControl.up(FlowControl.java:447)
              at org.jgroups.protocols.FlowControl.up(FlowControl.java:447)
              at org.jgroups.stack.Protocol.up(Protocol.java:420)
              at org.jgroups.stack.Protocol.up(Protocol.java:420)
              at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:294)
              at org.jgroups.protocols.UNICAST3.deliverBatch(UNICAST3.java:1087)
              at org.jgroups.protocols.UNICAST3.removeAndDeliver(UNICAST3.java:886)
              at org.jgroups.protocols.UNICAST3.handleBatchReceived(UNICAST3.java:867)
              at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:517)
              at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:710)
              at org.jgroups.stack.Protocol.up(Protocol.java:420)
              at org.jgroups.protocols.FD.up(FD.java:274)
              at org.jgroups.stack.Protocol.up(Protocol.java:420)
              at org.jgroups.protocols.MERGE2.up(MERGE2.java:252)
              at org.jgroups.stack.Protocol.up(Protocol.java:420)
              at org.jgroups.protocols.TP.passBatchUp(TP.java:1605)
              at org.jgroups.protocols.TP$BatchHandler.run(TP.java:1855)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
              at java.lang.Thread.run(Thread.java:745)
      

       

      I have not configured these caches to be transactional. I cannot correlate these failures with any other activity. This is relatively infrequent - sometimes just once a day. Nonetheless, the only recovery appears to be restarting our application.

      Is this a known issue? Any ideas?

       

      Thanks so much

        • 1. Re: Why does incoming message processing stop?
          wdfink

          Could you add a bit more context, i.e. how the configuration look like and whether the problem occur if there is more load ...

          • 2. Re: Why does incoming message processing stop?
            richmayfield

            All of the caches are set up for invalidation. We configure the caches programmatically as follows:

            
            
            
            
            GlobalConfigurationBuilder globalConfigurationBuilder = GlobalConfigurationBuilder.defaultClusteredBuilder();
            
            
            builder.transport().addProperty("configurationXml", configAsString);
            
            
            ConfigurationBuilder configurationBuilder = new ConfigurationBuilder()
              .clustering().cacheMode(CacheMode.INVALIDATION_ASYNC)
              .clustering().stateTransfer().awaitInitialTransfer(false)
              .clustering().stateTransfer().fetchInMemoryState(false)
              .jmxStatistics().disable()
              .build();
            
            CacheManager cacheManager = new DefaultCacheManager(
               globalConfigurationBuilder.build(),
              configurationBuilder
            );
            

            Where the configAsString is

            
            <config xmlns="urn:org:jgroups"
                    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                    xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.4.xsd">
                <TCP
                        bind_addr="${jgroups.tcp.address:127.0.0.1}"
                        bind_port="20082"
                        loopback="true"
                        port_range="30"
                        recv_buf_size="20m"
                        send_buf_size="640k"
                        max_bundle_size="31k"
                        use_send_queues="true"
                        enable_diagnostics="false"
                        bundler_type="old"
            
                        thread_naming_pattern="pl"
            
                        thread_pool.enabled="true"
                        thread_pool.min_threads="2"
                        thread_pool.max_threads="30"
                        thread_pool.keep_alive_time="60000"
                        thread_pool.queue_enabled="true"
                        thread_pool.queue_max_size="100"
                        thread_pool.rejection_policy="Discard"
            
                        oob_thread_pool.enabled="true"
                        oob_thread_pool.min_threads="2"
                        oob_thread_pool.max_threads="30"
                        oob_thread_pool.keep_alive_time="60000"
                        oob_thread_pool.queue_enabled="false"
                        oob_thread_pool.queue_max_size="100"
                        oob_thread_pool.rejection_policy="Discard"
            
                        internal_thread_pool.enabled="true"
                        internal_thread_pool.min_threads="1"
                        internal_thread_pool.max_threads="10"
                        internal_thread_pool.keep_alive_time="60000"
                        internal_thread_pool.queue_enabled="true"
                        internal_thread_pool.queue_max_size="100"
                        internal_thread_pool.rejection_policy="Discard"
                        />
            
                <FILE_PING timeout="3000"
                        location="/home/jgroups"
                     />
            
                <MERGE2 max_interval="30000" min_interval="10000"/>
            
                <FD_SOCK/>
                <FD timeout="3000" max_tries="5"/>
                <VERIFY_SUSPECT timeout="1500"/>
            
                <pbcast.NAKACK2 use_mcast_xmit="false"
                                xmit_interval="1000"
                                xmit_table_num_rows="100"
                                xmit_table_msgs_per_row="10000"
                                xmit_table_max_compaction_time="10000"
                                max_msg_batch_size="100"/>
                <UNICAST3 xmit_interval="500"
                          xmit_table_num_rows="20"
                          xmit_table_msgs_per_row="10000"
                          xmit_table_max_compaction_time="10000"
                          max_msg_batch_size="100"
                          conn_expiry_timeout="0"/>
            
                <pbcast.STABLE stability_delay="500" desired_avg_gossip="5000" max_bytes="1m"/>
                <pbcast.GMS print_local_addr="false" join_timeout="3000" view_bundling="true"/>
                <tom.TOA/> <!-- the TOA is only needed for total order transactions-->
            
                <UFC max_credits="2m" min_threshold="0.40"/>
                <MFC max_credits="2m" min_threshold="0.40"/>
                <FRAG2 frag_size="30k"/>
                <RSVP timeout="60000" resend_interval="500" ack_on_delivery="false" />
            </config>
            

            The caches are busy all the time. I have not correlated the failures with a time that is busier than any other time.