2 Replies Latest reply on Nov 24, 2008 5:29 AM by Galder Zamarreño


    Brian Stansberry Master

      Discussion (perhaps with myself) re: logs attached to http://jira.jboss.com/jira/browse/JBPAPP-653.

      Couple indications point to a problem with perf02 receiving messages.

      1) perf04, which would use perf02 as a buddy no longer logs remoting's "closing" messages after this:

      2008-02-21 08:33:32,828 DEBUG [org.jboss.remoting.transport.socket.SocketWrapper] ServerSocketWrapper[Socket[addr=/,port=33698,localport=3873].4d8918] closing

      Those messages are logged at the end of an EJB request; not logging means the request isn't completed yet. Lot's of logging of intermediate steps in request, but no completion.

      The cache is REPL_ASYNC, so that implies to me the requests are blocking in JGroups FC. (No JBC locking issues reported).

      2) On perf03, a similar pattern of no "closing" begins at 08:33:35,130. But then you get a few of these:

      2008-02-21 08:34:01,393 WARN [org.jgroups.protocols.FC] Received two credit requests from without any intervening messages; sending 1999633 credits

      Indicates perf02 is not getting FC credits from perf03.

      Then this:

      2008-02-21 08:34:53,627 WARN [org.jgroups.protocols.FD] I was suspected by; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      again indicating non-delivery of heartbeat responses to perf02.FD.

      All 3 live nodes begin at roughly 08:33:30 a pattern of logging EJB request activity but no "closing". Looks like everyone is blocking waiting for FC credits. Any multicasts in the message mix? Only one that comes to mind is data gravitation. But the troublesome logging pattern begins nearly a minute after perf01 is stopped, so why would there still be a lot of data gravitation going on?

      Another weird thing: when perf01 restarts, its Tomcat cache (not the EJB3 SFSB one under load) has trouble joining the group:

      2008-02-21 08:34:41,546 WARN [org.jgroups.protocols.pbcast.GMS] join( sent to timed out, retrying
      2008-02-21 08:34:48,553 WARN [org.jgroups.protocols.pbcast.GMS] join( sent to timed out, retrying
      2008-02-21 08:34:55,559 WARN [org.jgroups.protocols.pbcast.GMS] join( sent to timed out, retrying

      Non-responsive node is again perf02, but this is a completely different channel. Just a CPU overload on perf02?

      Perf02's Tomcat cache does finally issue a view with perf01 in it, but only after logging this:

      2008-02-21 08:35:11,784 WARN [org.jgroups.protocols.pbcast.GMS] failed to collect all ACKs (3) for view [|5] [,,,] after 5000ms, missing ACKs from [,,] (received=[]), local_addr=

      Note it doesn't get an ACK from itself!

      Eventually, perf03's sfsb cache channel excludes perf02:

      2008-02-21 08:35:03,233 INFO [org.jboss.cache.TreeCache] viewAccepted(): [|5] [,]

      But perf04 doesn't acknowledge the new view:

      2008-02-21 08:35:08,202 WARN [org.jgroups.protocols.pbcast.GMS] failed to collect all ACKs (2) for view [|5] [,] after 5000ms, missing ACKs from [] (received=[]), local_addr=

      No viewAccepted() with the new view ever appears on the perf04 logs.

      perf02 also doesn't know it's been excluded, continuing to send suspect messages to perf03:

      2008-02-21 08:35:11,785 WARN [org.jgroups.protocols.FD] I was suspected by; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

        • 1. Re: JBPAPP-653
          Mohit Anchlia Novice

          This is related to above discussion. I am seeing some weird behavior. We are running into this serious issue where Nodes join the cluster and soon after they disappear from the cluster. So for eg: If I have 5 nodes then they initially join the cluster and then after some time we see any one of the following:

          1. Dead member message for one of the nodes - even though the dead node is up and running. I can run Jgroup sender reciever test from dead node to other nodes with no problems. I would assume that dead member would try to communicate back after some time in case there was a temporary problem. But that doesn't seem to be happening.

          2. As noted above in your discussion I get 2008-02-21 08:35:11,784 WARN [org.jgroups.protocols.pbcast.GMS] failed to collect all ACKs (3) for view [|5] [,,,] after 5000ms, missing ACKs from [,, 172.17.6 .....

          I am not sure why that;s happening and what it really means. All I can guess is that it's not able to get the datagram. Also, I am assuming this is the coordinator.

          3. In a cluster of 5, all 5 initially join cluster and after some time what we see is that node 1,2,3 become part of one cluster and 4,5 becomes another cluster. All of them have same udp group, name and port so I don't really understand how they can split and why they don't get merged back together if there was a temporary issue.

          Overall I am not able to understand this wierdness. I am planning to run some Jgroup load test. We've spoken to our network team and they don't see any issues on switch. I've looked at the NIC and don't see any problems. IGMP is enabled on all the routers. Also, how can I tell which node is now the coordinator?

          I also did tracroute to make sure ttl is not the problem.

          It would really be helpful if you could let me know how I can debug this issue. It's really weird. Below is the UDP jgroups config:


          <!-- The JGroups protocol configuration -->

          The default UDP stack:
          - If you have a multihomed machine, set the UDP protocol's bind_addr attribute to the
          appropriate NIC IP address, e.g bind_addr="".
          - On Windows machines, because of the media sense feature being broken with multicast
          (even after disabling media sense) set the UDP protocol's loopback attribute to true

          <UDP mcast_addr="${efe.partition.udpGroup:}"
          down_thread="false" up_thread="false"/>
          <PING timeout="2000"
          down_thread="false" up_thread="false" num_initial_members="3"/>
          <MERGE2 max_interval="100000"
          down_thread="false" up_thread="false" min_interval="20000"/>
          <FD_SOCK down_thread="false" up_thread="false"/>
          <FD timeout="10000" max_tries="5" down_thread="false" up_thread="false" shun="true"/>
          <VERIFY_SUSPECT timeout="1500" down_thread="false" up_thread="false"/>
          <pbcast.NAKACK max_xmit_size="60000"
          use_mcast_xmit="false" gc_lag="0"
          down_thread="false" up_thread="false"
          <UNICAST timeout="300,600,1200,2400,3600"
          down_thread="false" up_thread="false"/>
          <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
          down_thread="false" up_thread="false"
          <pbcast.GMS print_local_addr="true" join_timeout="3000"
          down_thread="false" up_thread="false"
          join_retry_timeout="2000" shun="true"
          <FRAG2 frag_size="60000" down_thread="false" up_thread="false"/>
          <pbcast.STATE_TRANSFER down_thread="false" up_thread="false" use_flush="false"/>