
    JGroup 3.6.4 thread hanging when sending message

    Carolina Contiu Newbie

      I'm using a JGroups channel (JGroups version 3.6.4.Final), WildFly 9, Java 8, on Solaris SPARC.

      One of my application nodes has a hanging thread with the following stack when it tries to send a message down through this channel.

      My channel view has 4 members.


      "default task-3736" #13343 prio=5 os_prio=64 tid=0x0000000107028000 nid=0x3191 waiting on condition [0xfffffffc749fd000]
          java.lang.Thread.State: TIMED_WAITING (parking)
           at sun.misc.Unsafe.park(Native Method)
           - parking to wait for  <0x000000055eb67980> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
           at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
           at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2163)
           at org.jgroups.util.CreditMap.decrement(CreditMap.java:146)
           at org.jgroups.protocols.MFC.handleDownMessage(MFC.java:108)
           at org.jgroups.protocols.FlowControl.down(FlowControl.java:330)
           at org.jgroups.protocols.FRAG2.down(FRAG2.java:136)
           at org.jgroups.protocols.pbcast.STATE_TRANSFER.down(STATE_TRANSFER.java:202)
           at org.jgroups.protocols.pbcast.FLUSH.down(FLUSH.java:314)
           at org.jgroups.stack.ProtocolStack.down(ProtocolStack.java:1038)
           at org.jgroups.JChannel.down(JChannel.java:791)
           at org.jgroups.JChannel.send(JChannel.java:426)
           at org.jgroups.JChannel.send(JChannel.java:431)


      I noticed 3 other threads parked on the same object.

      "INT-4,DefaultPartition,lisachr01p-51213" #3594320 prio=5 os_prio=64 tid=0x0000000101ca6800 nid=0x36d5ed waiting on condition [0xfffffffc41cfd000]
          java.lang.Thread.State: TIMED_WAITING (parking)
           at sun.misc.Unsafe.park(Native Method)
           - parking to wait for  <0x000000055eb67980> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
           at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
           at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2163)
           at org.jgroups.util.CreditMap.decrement(CreditMap.java:146)
           at org.jgroups.protocols.MFC.handleDownMessage(MFC.java:108)
           at org.jgroups.protocols.FlowControl.down(FlowControl.java:330)
           at org.jgroups.protocols.FRAG2.down(FRAG2.java:136)
           at org.jgroups.protocols.pbcast.STATE_TRANSFER.down(STATE_TRANSFER.java:202)
           at org.jgroups.protocols.pbcast.FLUSH.onSuspend(FLUSH.java:708)
           at org.jgroups.protocols.pbcast.FLUSH.startFlush(FLUSH.java:226)
           at org.jgroups.protocols.pbcast.FLUSH.startFlush(FLUSH.java:216)
           at org.jgroups.protocols.pbcast.FLUSH.up(FLUSH.java:477)
           at org.jgroups.protocols.pbcast.STATE_TRANSFER.up(STATE_TRANSFER.java:146)
           at org.jgroups.protocols.FRAG2.up(FRAG2.java:165)
           at org.jgroups.protocols.FlowControl.up(FlowControl.java:394)
           at org.jgroups.protocols.FlowControl.up(FlowControl.java:394)
           at org.jgroups.protocols.pbcast.GMS._startFlush(GMS.java:822)
           at org.jgroups.protocols.pbcast.GMS.startFlush(GMS.java:794)
           at org.jgroups.protocols.pbcast.Merger._handleMergeRequest(Merger.java:283)
           at org.jgroups.protocols.pbcast.Merger.handleMergeRequest(Merger.java:102)
           at org.jgroups.protocols.pbcast.ServerGmsImpl.handleMergeRequest(ServerGmsImpl.java:28)
           at org.jgroups.protocols.pbcast.GMS.up(GMS.java:938)
           at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:234)
           at org.jgroups.protocols.UNICAST.handleDataReceived(UNICAST.java:639)
           at org.jgroups.protocols.UNICAST.up(UNICAST.java:394)
           at org.jgroups.protocols.pbcast.NAKACK.up(NAKACK.java:638)
           at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:155)
           at org.jgroups.protocols.FD.up(FD.java:260)
           at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:311)
           at org.jgroups.protocols.MERGE2.up(MERGE2.java:237)
           at org.jgroups.protocols.Discovery.up(Discovery.java:295)
           at org.jgroups.protocols.TP.passMessageUp(TP.java:1577)
           at org.jgroups.protocols.TP$3.run(TP.java:1511)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
           at java.lang.Thread.run(Thread.java:745)

      What could have triggered this behavior? Is this a JGroups bug? The issue has occurred twice in the past two weeks, and we had been running with this JGroups version for a year without encountering any issues before.

      I have attached the JGroups channel configuration file.
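
      For completeness, the send that hangs is just an ordinary multicast through this channel, roughly the equivalent of the sketch below (class name, configuration file name and payload are placeholders, not our actual code):

      import org.jgroups.JChannel;

      public class SendExample {
          public static void main(String[] args) throws Exception {
              // channel built from the attached XML configuration (file name is illustrative)
              JChannel channel = new JChannel("our-jgroups-config.xml");
              channel.connect("DefaultPartition");   // cluster name as seen in the thread names above
              // null destination = multicast to all members of the view; this is the
              // JChannel.send(...) call that ends up parked in MFC.handleDownMessage()
              channel.send(null, "some payload");
              channel.close();
          }
      }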

        • 1. Re: JGroup 3.6.4 thread hanging when sending message
          Paul Ferraro Master

          This looks like a flow control issue.  JGroups will throttle multicast senders if receivers aren't able to cope with the message volume. See: Reliable group communication with JGroups
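
          The protocol doing the throttling here is MFC (multicast flow control), which is the frame you see in both traces. In the stock JGroups 3.x stacks it is configured roughly as below; the values shown are the shipped defaults, not necessarily what is in the attached file:

          <!-- multicast flow control; values are the stock defaults -->
          <MFC max_credits="2M"
               min_threshold="0.4"/>

          Each sender gets max_credits bytes of credit per receiver. When the slowest receiver's credit is used up, the sender blocks in CreditMap.decrement() until credits are replenished, and min_threshold controls (roughly) how low the remaining credit may drop before receivers top the sender up again.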

          • 2. Re: JGroup 3.6.4 thread hanging when sending message
            Anuj Shah Newbie

            Are these threads hung indefinitely? If MFC is working as intended, the thread should just be waiting until enough credits have been sent back from the other cluster members. Otherwise, I know of two race conditions that can result in a deadlock.


            One is the size of the internal thread pool: if all of its threads are consumed and blocked by MFC, there are no threads left to process the credits coming back. In my case the threads were stuck handling a view-change message that triggered a flush, just like the stack traces above.
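
            That first situation is easy to reproduce in isolation with a plain bounded executor, completely outside JGroups: once every worker is parked waiting for something that only another task on the same pool could deliver, nothing ever makes progress. A toy sketch of the pattern (JDK only, not JGroups code):

            import java.util.concurrent.*;

            public class PoolExhaustionDemo {
                public static void main(String[] args) throws Exception {
                    // a small bounded pool, standing in for the JGroups internal thread pool
                    ExecutorService pool = Executors.newFixedThreadPool(2);
                    CountDownLatch credits = new CountDownLatch(1);

                    // both workers park "waiting for credits", like the threads stuck in CreditMap.decrement()
                    for (int i = 0; i < 2; i++)
                        pool.submit(() -> {
                            try { credits.await(); } catch (InterruptedException ignored) { }
                        });

                    // the task that would deliver the credits never runs: the pool is already exhausted
                    pool.submit(credits::countDown);

                    pool.shutdown();
                    // prints "terminated = false": the pool is deadlocked
                    System.out.println("terminated = " + pool.awaitTermination(3, TimeUnit.SECONDS));
                    pool.shutdownNow();   // interrupt the stuck workers so the demo JVM can exit
                }
            }

            In the real case the "credits" are the replenishment messages from the other members, and processing them needs a thread from the very pool that is already blocked.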


            The second cause was in NAKACK2, which pushes all up-processing onto a single thread that may itself be stuck in the MFC protocol; again, something further up (like FLUSH) has to trigger a down message as part of its handling, which then gets caught in MFC.


            In either case, you can manually break the deadlock by replenishing all credits in the MFC protocol; there is an unblock method for this, which is also available on the exposed MBean.
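
            If you have a handle on the JChannel, that is roughly the following (the protocol lookup is the standard ProtocolStack API; the unblock call is the same operation the MBean exposes):

            import org.jgroups.JChannel;
            import org.jgroups.protocols.MFC;

            public final class UnblockMfc {
                // replenish all multicast flow-control credits, releasing any sender
                // parked in CreditMap.decrement(); equivalent to invoking "unblock" on the MFC MBean
                static void unblockSenders(JChannel channel) {
                    MFC mfc = (MFC) channel.getProtocolStack().findProtocol(MFC.class);
                    if (mfc != null)
                        mfc.unblock();
                }
            }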