3 Replies Latest reply on Jan 30, 2013 1:28 AM by swapnath

    Cluster falls apart: FD_SOCK errors

    kennardconsulting

      Guys,

      We are seeing these warnings on our cluster nodes, sometimes after a redeploy. The primary cluster node stays up,
      but the other two (we have three) die and keep logging this:

      2009-10-01 20:04:34,302 main WARN [org.jgroups.protocols.pbcast.GMS] join(192.168.1.2:37802) sent to 192.168.1.1:52187 timed out (after 3000 ms), retrying
      2009-10-01 20:04:37,315 main WARN [org.jgroups.protocols.pbcast.GMS] join(192.168.1.2:37802) sent to 192.168.1.1:52187 timed out (after 3000 ms), retrying
      2009-10-01 20:04:40,318 main WARN [org.jgroups.protocols.pbcast.GMS] join(192.168.1.2:37802) sent to 192.168.1.1:52187 timed out (after 3000 ms), retrying
      2009-10-01 20:04:43,321 main WARN [org.jgroups.protocols.pbcast.GMS] join(192.168.1.2:37802) sent to 192.168.1.1:52187 timed out (after 3000 ms), retrying
      2009-10-01 20:04:46,323 main WARN [org.jgroups.protocols.pbcast.GMS] join(192.168.1.2:37802) sent to 192.168.1.1:52187 timed out (after 3000 ms), retrying
      2009-10-01 20:04:49,326 main WARN [org.jgroups.protocols.pbcast.GMS] join(192.168.1.2:37802) sent to 192.168.1.1:52187 timed out (after 3000 ms), retrying
      2009-10-01 20:04:52,329 main WARN [org.jgroups.protocols.pbcast.GMS] join(192.168.1.2:37802) sent to 192.168.1.1:52187 timed out (after 3000 ms), retrying
      2009-10-01 20:04:55,332 main WARN [org.jgroups.protocols.pbcast.GMS] join(192.168.1.2:37802) sent to 192.168.1.1:52187 timed out (after 3000 ms), retrying


      They keep logging this even after a restart; only restarting the primary node fixes things. We can see that 192.168.1.1:52187 is listening on the primary with netstat -panu.

      On that server, the logs say:

      192.168.1.2:37802 ERROR [org.jgroups.protocols.FD_SOCK] socket address for 192.168.1.2:37802 could not be fetched, retrying.


      However, on the 192.168.1.2 box, we can see 192.168.1.2:37802 is also listening.

      We can ping and telnet to TCP ports on both boxes and between each box, and ifconfig shows no packet loss, collisions or other
      networking errors. Also, a restart of JBoss on the primary fixes the problem.

      Therefore the problem seems to be something within JBoss.

      On the primary server there are a number of threads referring to port 52187 and none of these seem locked up (as far as I can see).

      Here are some of the threads from the Primary:

      This looks like the server socket:

      "FD_SOCK server socket acceptor,192.168.1.1:52187" daemon prio=10 tid=0x0ac19800 nid=0x6281 runnable [0x66293000]
       java.lang.Thread.State: RUNNABLE
       at java.net.PlainSocketImpl.socketAccept(Native Method)
       at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390)
       - locked <0xab976ea0> (a java.net.SocksSocketImpl)
       at java.net.ServerSocket.implAccept(ServerSocket.java:453)
       at java.net.ServerSocket.accept(ServerSocket.java:421)
       at org.jgroups.protocols.FD_SOCK$ServerSocketHandler.run(FD_SOCK.java:1022)
       at java.lang.Thread.run(Thread.java:619)



      This looks like the connection handler:

      "FD_SOCK client connection handler,AvantProduction,192.168.1.1:52187" daemon prio=10 tid=0x097e8000 nid=0x1a6c runnable [0x62c71000]
       java.lang.Thread.State: RUNNABLE
       at java.net.SocketInputStream.socketRead0(Native Method)
       at java.net.SocketInputStream.read(SocketInputStream.java:129)
       at java.net.SocketInputStream.read(SocketInputStream.java:182)
       at org.jgroups.protocols.FD_SOCK$ClientConnectionHandler.run(FD_SOCK.java:1089)
       at java.lang.Thread.run(Thread.java:619)



      And these are threads waiting in a pool by the look of it:

      "Incoming-14,192.168.1.1:52187" prio=10 tid=0x091c9400 nid=0x5f7f waiting on condition [0x68e1f000]
       java.lang.Thread.State: WAITING (parking)
       at sun.misc.Unsafe.park(Native Method)
       - parking to wait for <0x8b246af0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
       at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
       at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
       at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
       at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:947)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
       at java.lang.Thread.run(Thread.java:619)
      
      "Incoming-13,192.168.1.1:52187" prio=10 tid=0x091c7c00 nid=0x5f7e waiting on condition [0x68e70000]
       java.lang.Thread.State: WAITING (parking)
       at sun.misc.Unsafe.park(Native Method)
       - parking to wait for <0x8b246af0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
       at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
       at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
       at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
       at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:947)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
       at java.lang.Thread.run(Thread.java:619)



      Here is a list of all threads referring to port 52187:

      "VERIFY_SUSPECT.TimerThread,AvantProduction-HAPartitionCache,192.168.1.1:52187" daemon prio=10 tid=0x0b838400 nid=0x1d9d waiting on condition [0x5f1fe000]
      "OOB-680,192.168.1.1:52187" prio=10 tid=0x73163400 nid=0x1d9c waiting on condition [0x66c48000]
      "OOB-678,192.168.1.1:52187" prio=10 tid=0x6ef51800 nid=0x1c8e waiting on condition [0x66a62000]
      "FD_SOCK client connection handler,AvantProduction,192.168.1.1:52187" daemon prio=10 tid=0x097e8000 nid=0x1a6c runnable [0x62c71000]
      "FD_SOCK pinger,AvantProduction-HAPartitionCache,192.168.1.1:52187" daemon prio=10 tid=0x092ba400 nid=0x69b7 waiting on condition [0x5f15c000]
      "ViewHandler,AvantProduction-HAPartitionCache,192.168.1.1:52187" prio=10 tid=0x0c48d400 nid=0x6506 waiting on condition [0x5f1ad000]
      "FD_SOCK server socket acceptor,192.168.1.1:52187" daemon prio=10 tid=0x0ac19800 nid=0x6281 runnable [0x66293000]
      "Incoming-20,192.168.1.1:52187" prio=10 tid=0x71df4400 nid=0x5f8d waiting on condition [0x68a02000]
      "Incoming-19,192.168.1.1:52187" prio=10 tid=0x091ed400 nid=0x5f89 waiting on condition [0x68b46000]
      "Incoming-18,192.168.1.1:52187" prio=10 tid=0x7161dc00 nid=0x5f88 waiting on condition [0x68b97000]
      "Incoming-17,192.168.1.1:52187" prio=10 tid=0x091ecc00 nid=0x5f84 waiting on condition [0x68c8a000]
      "Incoming-16,192.168.1.1:52187" prio=10 tid=0x72a4a800 nid=0x5f83 waiting on condition [0x68cdb000]
      "Incoming-15,192.168.1.1:52187" prio=10 tid=0x091ca400 nid=0x5f80 waiting on condition [0x68dce000]
      "Incoming-14,192.168.1.1:52187" prio=10 tid=0x091c9400 nid=0x5f7f waiting on condition [0x68e1f000]
      "Incoming-13,192.168.1.1:52187" prio=10 tid=0x091c7c00 nid=0x5f7e waiting on condition [0x68e70000]
      "Incoming-12,192.168.1.1:52187" prio=10 tid=0x08f7ec00 nid=0x5f7d waiting on condition [0x68ec1000]
      "Incoming-11,192.168.1.1:52187" prio=10 tid=0x6d3f8400 nid=0x5f7c waiting on condition [0x68f12000]
      "Incoming-10,192.168.1.1:52187" prio=10 tid=0x091c5800 nid=0x5f7b waiting on condition [0x68f63000]
      "Incoming-9,192.168.1.1:52187" prio=10 tid=0x713f5800 nid=0x5f7a waiting on condition [0x68fb4000]
      "Incoming-8,192.168.1.1:52187" prio=10 tid=0x09cbe000 nid=0x5f79 waiting on condition [0x69005000]
      "Timer-12,192.168.1.1:52187" daemon prio=10 tid=0x713e5800 nid=0x5f78 waiting on condition [0x69056000]
      "Timer-11,192.168.1.1:52187" daemon prio=10 tid=0x6cdf8800 nid=0x5f77 waiting on condition [0x690a7000]
      "Timer-10,192.168.1.1:52187" daemon prio=10 tid=0x71471c00 nid=0x5f75 waiting on condition [0x69149000]
      "Incoming-7,192.168.1.1:52187" prio=10 tid=0x09cbac00 nid=0x5f74 waiting on condition [0x6919a000]
      "Incoming-6,192.168.1.1:52187" prio=10 tid=0x08f7f800 nid=0x5f72 waiting on condition [0x6923c000]
      "Timer-9,192.168.1.1:52187" daemon prio=10 tid=0x08d8a800 nid=0x5f71 waiting on condition [0x6928d000]
      "Timer-8,192.168.1.1:52187" daemon prio=10 tid=0x08fd1000 nid=0x5f6f waiting on condition [0x6932f000]
      "Timer-7,192.168.1.1:52187" daemon prio=10 tid=0x0927f800 nid=0x5f6d waiting on condition [0x693d1000]
      "Incoming-5,192.168.1.1:52187" prio=10 tid=0x091ae800 nid=0x5f6c waiting on condition [0x69422000]
      "Incoming-4,192.168.1.1:52187" prio=10 tid=0x091ad400 nid=0x5f6b waiting on condition [0x69473000]
      "Incoming-3,192.168.1.1:52187" prio=10 tid=0x08e98800 nid=0x5f6a waiting on condition [0x694c4000]
      "Incoming-2,192.168.1.1:52187" prio=10 tid=0x08e97800 nid=0x5f69 waiting on condition [0x69515000]
      "Incoming-1,192.168.1.1:52187" prio=10 tid=0x08ee5800 nid=0x5f68 waiting on condition [0x69566000]
      "Timer-6,192.168.1.1:52187" daemon prio=10 tid=0x717d2800 nid=0x5f67 waiting on condition [0x695b7000]
      "Timer-5,192.168.1.1:52187" daemon prio=10 tid=0x712ff000 nid=0x5f66 waiting on condition [0x69608000]
      "Timer-4,192.168.1.1:52187" daemon prio=10 tid=0x711fd800 nid=0x5f64 waiting on condition [0x696aa000]
      "Timer-3,192.168.1.1:52187" daemon prio=10 tid=0x09297800 nid=0x5f62 waiting on condition [0x696fb000]
      "Timer-2,192.168.1.1:52187" daemon prio=10 tid=0x090a6800 nid=0x5f5f waiting on condition [0x697ee000]
      "UDP mcast,192.168.1.1:52187" prio=10 tid=0x09956c00 nid=0x5f5e runnable [0x6983f000]
      "UDP ucast,192.168.1.1:52187" prio=10 tid=0x09956800 nid=0x5f5d runnable [0x69890000]
      "DiagnosticsHandler,192.168.1.1:52187" daemon prio=10 tid=0x08ee6c00 nid=0x5f5c runnable [0x698e1000]
      "Timer-1,192.168.1.1:52187" daemon prio=10 tid=0x08e56c00 nid=0x5f5a waiting on condition [0x69983000]


      This is really hurting our uptime. Could anybody point us in the right direction?

      Regards,

      Richard.

        • 1. Re: Cluster falls apart: FD_SOCK errors
          kennardconsulting

          A tip for those who come this way after us: we found that a large part of the problem was that the cluster nodes rely on being in constant communication.

          If one of them is under high load (say, running some reports), its CPU usage may be so high that it does not respond to the cluster ping quickly enough (within 3 seconds). The cluster then treats it as dead and removes it from the cluster, even though it is not dead, just busy.

          We increased the org.jgroups.protocols.pbcast.GMS timeout and it helped a great deal.
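
          For anyone searching later, here is a minimal sketch of the kind of change involved, assuming the default UDP protocol stack. The file it lives in depends on the JBoss version (for example deploy/cluster-service.xml in JBoss AS 4, or the jgroups-channelfactory stacks file in AS 5), and the numbers below are illustrative rather than the exact values we used:

          <!-- Raise the GMS join timeout from the 3000 ms seen in the log above,
               giving a busy node more time to answer a JOIN request. -->
          <pbcast.GMS print_local_addr="true"
                      join_timeout="10000"
                      join_retry_timeout="2000"
                      view_bundling="true"/>

          <!-- If the stack also includes the heartbeat-based FD protocol (the default
               JBoss stacks do), its timeouts can be raised as well, so a node that is
               merely busy is not suspected and dropped so quickly. -->
          <FD timeout="10000" max_tries="5" shun="true"/>

          FD is the protocol doing the ping mentioned above; FD_SOCK, which produced the errors in the original post, detects failures through a closed TCP connection rather than a timeout.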

          • 2. Re: Cluster falls apart: FD_SOCK errors
            payne51558

            Richard, did you ever resolve this? Is there an easy way to tell which server is the primary node in a cluster? I am having the same issue of late.

            2010-10-21 16:11:27,987 INFO  [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] (main) Initializing partition DefaultPartition
            2010-10-21 16:11:28,076 INFO  [STDOUT] (JBoss System Threads(1)-3)
            ---------------------------------------------------------
            GMS: address is 10.230.3.15:56646 (cluster=DefaultPartition)
            ---------------------------------------------------------
            2010-10-21 16:11:28,173 INFO  [org.jboss.cache.jmx.PlatformMBeanServerRegistration] (main) JBossCache MBeans were successfully registered to the platform mbean server.
            2010-10-21 16:11:28,235 INFO  [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] (JBoss System Threads(1)-3) Number of cluster members: 15
            2010-10-21 16:11:28,237 INFO  [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] (JBoss System Threads(1)-3) Other members: 14
            2010-10-21 16:11:28,287 INFO  [STDOUT] (main)
            ---------------------------------------------------------
            GMS: address is 10.230.3.15:56646 (cluster=DefaultPartition-HAPartitionCache)
            ---------------------------------------------------------
            2010-10-21 16:11:31,294 WARN  [org.jgroups.protocols.pbcast.GMS] (main) join(10.230.3.15:56646) sent to 10.230.3.16:42616 timed out (after 3000 ms), retrying
            2010-10-21 16:11:34,297 WARN  [org.jgroups.protocols.pbcast.GMS] (main) join(10.230.3.15:56646) sent to 10.230.3.16:42616 timed out (after 3000 ms), retrying

            Getting this after restarts of nodes in my cluster


            Also seeing this on the 10.230.3.16 node that 10.230.3.15 is trying to connect to:

            2010-10-21 17:02:14,146 WARN  [org.jgroups.protocols.pbcast.GMS] (ViewHandler,DefaultPartition-HAPartitionCache,10.230.3.16:42616) GMS flush by coordinator at 10.230.3.16:42616 failed


            Thank You


            Cody

            • 3. Re: Cluster falls apart: FD_SOCK errors
              swapnath

              Hi Cody,


              I have exactly the same issue. Did you find a solution for it?


              Thanks

              Swapna