2 Replies Latest reply on May 18, 2018 9:32 AM by anujshahwork

    JGroup 3.6.4 thread hanging when sending message

    caro82

      I'm using a JGroups channel (JGroups version 3.6.4.Final) on WildFly 9, Java 8, Solaris SPARC.

      One of my application nodes has a hanging thread with the following stack trace when it tries to send a message down through this channel.

      My channel view has 4 members.


      "default task-3736" #13343 prio=5 os_prio=64 tid=0x0000000107028000 nid=0x3191 waiting on condition [0xfffffffc749fd000]
          java.lang.Thread.State: TIMED_WAITING (parking)
           at sun.misc.Unsafe.park(Native Method)
           - parking to wait for  <0x000000055eb67980> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
           at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
           at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2163)
           at org.jgroups.util.CreditMap.decrement(CreditMap.java:146)
           at org.jgroups.protocols.MFC.handleDownMessage(MFC.java:108)
           at org.jgroups.protocols.FlowControl.down(FlowControl.java:330)
           at org.jgroups.protocols.FRAG2.down(FRAG2.java:136)
           at org.jgroups.protocols.pbcast.STATE_TRANSFER.down(STATE_TRANSFER.java:202)
           at org.jgroups.protocols.pbcast.FLUSH.down(FLUSH.java:314)
           at org.jgroups.stack.ProtocolStack.down(ProtocolStack.java:1038)
           at org.jgroups.JChannel.down(JChannel.java:791)
           at org.jgroups.JChannel.send(JChannel.java:426)
           at org.jgroups.JChannel.send(JChannel.java:431)
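
      For context, the blocked call is a plain cluster-wide multicast (MFC only throttles messages with a null destination). Stripped down to illustrate the call shape only, the send looks roughly like this; the class name is illustrative and the real payload and call site are application-specific:

        import org.jgroups.JChannel;

        public class SendExample {
            // "channel" is the JChannel whose protocol stack appears in the trace above.
            // A null destination makes this a cluster-wide multicast, which is why the
            // call descends through MFC (multicast flow control) and parks there when
            // no credits are left.
            public static void send(JChannel channel, Object payload) throws Exception {
                channel.send(null, payload);
            }
        }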


      I noticed 3 other threads parked waiting for the same object.

      "INT-4,DefaultPartition,lisachr01p-51213" #3594320 prio=5 os_prio=64 tid=0x0000000101ca6800 nid=0x36d5ed waiting on condition [0xfffffffc41cfd000]
          java.lang.Thread.State: TIMED_WAITING (parking)
           at sun.misc.Unsafe.park(Native Method)
           - parking to wait for  <0x000000055eb67980> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
           at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
           at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2163)
           at org.jgroups.util.CreditMap.decrement(CreditMap.java:146)
           at org.jgroups.protocols.MFC.handleDownMessage(MFC.java:108)
           at org.jgroups.protocols.FlowControl.down(FlowControl.java:330)
           at org.jgroups.protocols.FRAG2.down(FRAG2.java:136)
           at org.jgroups.protocols.pbcast.STATE_TRANSFER.down(STATE_TRANSFER.java:202)
           at org.jgroups.protocols.pbcast.FLUSH.onSuspend(FLUSH.java:708)
           at org.jgroups.protocols.pbcast.FLUSH.startFlush(FLUSH.java:226)
           at org.jgroups.protocols.pbcast.FLUSH.startFlush(FLUSH.java:216)
           at org.jgroups.protocols.pbcast.FLUSH.up(FLUSH.java:477)
           at org.jgroups.protocols.pbcast.STATE_TRANSFER.up(STATE_TRANSFER.java:146)
           at org.jgroups.protocols.FRAG2.up(FRAG2.java:165)
           at org.jgroups.protocols.FlowControl.up(FlowControl.java:394)
           at org.jgroups.protocols.FlowControl.up(FlowControl.java:394)
           at org.jgroups.protocols.pbcast.GMS._startFlush(GMS.java:822)
           at org.jgroups.protocols.pbcast.GMS.startFlush(GMS.java:794)
           at org.jgroups.protocols.pbcast.Merger._handleMergeRequest(Merger.java:283)
           at org.jgroups.protocols.pbcast.Merger.handleMergeRequest(Merger.java:102)
           at org.jgroups.protocols.pbcast.ServerGmsImpl.handleMergeRequest(ServerGmsImpl.java:28)
           at org.jgroups.protocols.pbcast.GMS.up(GMS.java:938)
           at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:234)
           at org.jgroups.protocols.UNICAST.handleDataReceived(UNICAST.java:639)
           at org.jgroups.protocols.UNICAST.up(UNICAST.java:394)
           at org.jgroups.protocols.pbcast.NAKACK.up(NAKACK.java:638)
           at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:155)
           at org.jgroups.protocols.FD.up(FD.java:260)
           at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:311)
           at org.jgroups.protocols.MERGE2.up(MERGE2.java:237)
           at org.jgroups.protocols.Discovery.up(Discovery.java:295)
           at org.jgroups.protocols.TP.passMessageUp(TP.java:1577)
           at org.jgroups.protocols.TP$3.run(TP.java:1511)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
           at java.lang.Thread.run(Thread.java:745)

      What could have triggered this behavior? Is this a JGroups bug? The issue has occurred twice in the past two weeks; we've been running this JGroups version for a year and hadn't encountered any problems before.

      I've attached the JGroups channel configuration file.

        • 1. Re: JGroup 3.6.4 thread hanging when sending message
          pferraro

          This looks like a flow control issue. JGroups will throttle multicast senders if receivers aren't able to cope with the message volume. See: Reliable group communication with JGroups
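
          If it is throttling rather than a deadlock, the knob that governs it is MFC's max_credits: the number of bytes a sender may have outstanding per receiver before it blocks. A minimal sketch for checking the value on a running channel, assuming the standard property accessor for max_credits (getProtocolStack and findProtocol are regular JGroups API; the wrapper class name is made up):

            import org.jgroups.JChannel;
            import org.jgroups.protocols.MFC;

            public class McastCreditsProbe {
                // Prints the multicast flow-control budget. Once a sender has used up
                // max_credits bytes without replenishment from every receiver, it parks
                // in CreditMap.decrement(), exactly as in the stack traces above.
                public static void print(JChannel channel) {
                    MFC mfc = (MFC) channel.getProtocolStack().findProtocol(MFC.class);
                    if (mfc != null)
                        System.out.println("MFC max_credits = " + mfc.getMaxCredits() + " bytes");
                }
            }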

          • 2. Re: JGroup 3.6.4 thread hanging when sending message
            anujshahwork

            Are these threads hung indefinitely? If MFC is working as intended, the thread should just be waiting until enough credits have been sent back from the other cluster members. Otherwise, I know of two race conditions that would result in a deadlock.

             

            One is the size of the internal thread pool: if all of its threads are consumed and blocked by MFC, there are no threads left to process the credits coming back. In my case the threads were stuck handling a view change message which triggered a flush, just like the stack traces above.
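
            To make that concrete, here is a JGroups-free sketch of the pattern: a fixed-size pool whose workers all block waiting for an event that only a later task on the same pool could deliver. In the real case the "event" is an MFC credit message and the pool is the transport's thread pool.

              import java.util.concurrent.CountDownLatch;
              import java.util.concurrent.ExecutorService;
              import java.util.concurrent.Executors;

              public class PoolExhaustionDemo {
                  public static void main(String[] args) {
                      ExecutorService pool = Executors.newFixedThreadPool(2);
                      CountDownLatch credits = new CountDownLatch(1);

                      // Both workers block, analogous to threads parked in CreditMap.decrement().
                      for (int i = 0; i < 2; i++) {
                          pool.submit(() -> {
                              try {
                                  credits.await();   // waits for "credits" that never arrive
                              } catch (InterruptedException e) {
                                  Thread.currentThread().interrupt();
                              }
                          });
                      }

                      // The task that would deliver the credits is queued behind the blocked
                      // workers and never runs, so the pool is deadlocked without holding a lock.
                      pool.submit(credits::countDown);
                  }
              }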

             

            The second cause was in NAKACK2, which pushes all up processing onto a single thread that may itself be stuck in the MFC protocol; again, something upstream (like FLUSH) has to trigger a down message as part of its handling, and that message then gets caught in MFC.

             

            In either case, you can manually break the deadlock by replenishing all credits in the MFC protocol; there is an unblock method for this, which is also available on the exposed MBean.
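
            For reference, a minimal sketch of doing that programmatically rather than through JMX (findProtocol is the usual way to reach a protocol instance; unblock() is the method mentioned above, the same operation the MFC MBean exposes; the wrapper class name is made up):

              import org.jgroups.JChannel;
              import org.jgroups.protocols.MFC;

              public class McastFlowControlUnblocker {
                  // Replenishes all MFC credits on a running channel, releasing senders
                  // parked in CreditMap.decrement(). Equivalent to invoking unblock() on
                  // the MFC MBean via JConsole or another JMX client.
                  public static void unblock(JChannel channel) {
                      MFC mfc = (MFC) channel.getProtocolStack().findProtocol(MFC.class);
                      if (mfc != null)
                          mfc.unblock();
                  }
              }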