JBPAPP-653
brian.stansberry Mar 24, 2008 5:41 PM
Discussion (perhaps with myself) re: logs attached to http://jira.jboss.com/jira/browse/JBPAPP-653.
A couple of indications point to a problem with perf02 receiving messages.
1) perf04, which would use perf02 as a buddy, no longer logs remoting's "closing" messages after this:
2008-02-21 08:33:32,828 DEBUG [org.jboss.remoting.transport.socket.SocketWrapper] ServerSocketWrapper[Socket[addr=/172.17.68.254,port=33698,localport=3873].4d8918] closing
Those messages are logged at the end of an EJB request; not logging them means the request isn't completed yet. Lots of logging of intermediate steps in requests, but no completions.
The cache is REPL_ASYNC, so that implies to me that the requests are blocking in JGroups FC (no JBC locking issues reported); see the flow-control sketch below.
2) On perf03, a similar pattern of no "closing" begins at 08:33:35,130. But then you get a few of these:
2008-02-21 08:34:01,393 WARN [org.jgroups.protocols.FC] Received two credit requests from 172.17.65.39:40890 without any intervening messages; sending 1999633 credits
This indicates perf02 is not receiving the FC credits perf03 sends back to it.
Then this:
2008-02-21 08:34:53,627 WARN [org.jgroups.protocols.FD] I was suspected by 172.17.65.39:40890; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
Again indicating non-delivery of heartbeat responses (HEARTBEAT_ACKs) to perf02's FD.
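For anyone following along: FD has each member ping its neighbour in the view, and only after several unanswered heartbeats does it send out a SUSPECT; the suspected member, on seeing a SUSPECT naming itself, logs the warning above and answers with a HEARTBEAT_ACK. Something like this hypothetical stanza (the attribute names are real FD properties; the values are illustrative, not pulled from the attached configs):

// Hypothetical FD stanza in the old plain-string stack format; values are
// illustrative only. perf02 suspects perf03 once max_tries consecutive
// heartbeats go unanswered (timeout ms each); perf03's HEARTBEAT_ACK replies
// apparently never make it back to perf02.
String fdProps = "FD(timeout=10000;max_tries=5;shun=true)";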
At roughly 08:33:30, all 3 live nodes begin a pattern of logging EJB request activity but no "closing". Looks like everyone is blocking waiting for FC credits. Are there any multicasts in the message mix? The only one that comes to mind is data gravitation. But the troublesome logging pattern begins nearly a minute after perf01 is stopped, so why would there still be a lot of data gravitation going on?
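To elaborate on the "blocking waiting for FC credits" theory: FC gives each sender a pool of credits per receiver, decremented as bytes are sent; receivers send credits back as they process messages, and a sender whose credits hit zero blocks in FC until replenishment arrives, even for an "async" replication. Here's a minimal, self-contained toy model of that credit scheme (my own sketch, not JGroups' FC code; the names and numbers are made up) showing how losing one node's credit responses eventually stalls every sending thread:

import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/**
 * Toy model of credit-based flow control (NOT JGroups' FC implementation).
 * The sender spends one credit per message and blocks when credits run out;
 * only the receiver's credit responses restore them. Stop those responses --
 * which is effectively what seems to be happening to credits headed for
 * perf02 -- and "asynchronous" sends stall.
 */
public class CreditFlowControlSketch {

    static final int MAX_CREDITS = 5;
    static final Semaphore credits = new Semaphore(MAX_CREDITS);
    static volatile boolean receiverHealthy = true;

    // Sender side: analogous to an async replication call passing down through FC.
    static boolean send(int msgNo) throws InterruptedException {
        if (credits.tryAcquire(2, TimeUnit.SECONDS)) {
            System.out.println("sent message " + msgNo);
            return true;
        }
        // Real flow control would simply leave the caller blocked here; we give
        // up so the demo terminates. This is where the EJB request threads sit.
        System.out.println("message " + msgNo + " blocked waiting for credits");
        return false;
    }

    // Receiver side: periodically replenishes credits, standing in for the
    // credit responses a receiver sends back as it processes messages. When
    // receiverHealthy is false those responses are "lost" and nothing comes back.
    static void receiverLoop() {
        while (true) {
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                return;
            }
            if (receiverHealthy) {
                credits.release();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Thread receiver = new Thread(CreditFlowControlSketch::receiverLoop);
        receiver.setDaemon(true);
        receiver.start();

        for (int i = 1; i <= 20; i++) {
            if (i == 10) {
                receiverHealthy = false;   // credit responses stop arriving
                System.out.println("-- receiver's credit responses now lost --");
            }
            if (!send(i)) {
                break;                     // sender is stuck from here on
            }
        }
    }
}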
Another weird thing: when perf01 restarts, its Tomcat cache (not the EJB3 SFSB one under load) has trouble joining the group:
2008-02-21 08:34:41,546 WARN [org.jgroups.protocols.pbcast.GMS] join(172.17.64.39:52927) sent to 172.17.65.39:40883 timed out, retrying
2008-02-21 08:34:48,553 WARN [org.jgroups.protocols.pbcast.GMS] join(172.17.64.39:52927) sent to 172.17.65.39:40883 timed out, retrying
2008-02-21 08:34:55,559 WARN [org.jgroups.protocols.pbcast.GMS] join(172.17.64.39:52927) sent to 172.17.65.39:40883 timed out, retrying
The non-responsive node is again perf02, but this is a completely different channel. Is this just CPU overload on perf02?
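The ~7 second spacing of those retries fits GMS's join handling: the JOIN request is a unicast to the current coordinator of that channel (perf02 here), and if no JOIN_RSP comes back within join_timeout the joiner waits join_retry_timeout and tries again. A hypothetical stanza, with values picked only because 5000 + 2000 ms matches the cadence above (and 5000 ms matches the ACK-collection window below); the real config attached to the JIRA may use different numbers:

// Hypothetical pbcast.GMS stanza in the old plain-string stack format; the
// values are guesses chosen to match the timings in these logs, not read
// from the actual test configuration.
String gmsProps = "pbcast.GMS(join_timeout=5000;join_retry_timeout=2000;"
                + "view_ack_collection_timeout=5000;print_local_addr=true)";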
Perf02's Tomcat cache does finally issue a view with perf01 in it, but only after logging this:
2008-02-21 08:35:11,784 WARN [org.jgroups.protocols.pbcast.GMS] failed to collect all ACKs (3) for view [172.17.65.39:40883|5] [172.17.65.39:40883, 172.17.66.39:35267, 172.17.67.39:39896, 172.17.64.39:52927] after 5000ms, missing ACKs from [172.17.65.39:40883, 172.17.66.39:35267, 172.17.67.39:39896] (received=[]), local_addr=172.17.65.39:40883
Note it doesn't get an ACK even from itself, which suggests delivery up perf02's own stack is backed up as well!
Eventually, perf03's SFSB cache channel excludes perf02:
2008-02-21 08:35:03,233 INFO [org.jboss.cache.TreeCache] viewAccepted(): [172.17.66.39:35274|5] [172.17.66.39:35274, 172.17.67.39:39903]
But perf04 doesn't acknowledge the new view:
2008-02-21 08:35:08,202 WARN [org.jgroups.protocols.pbcast.GMS] failed to collect all ACKs (2) for view [172.17.66.39:35274|5] [172.17.66.39:35274, 172.17.67.39:39903] after 5000ms, missing ACKs from [172.17.67.39:39903] (received=[172.17.66.39:35274]), local_addr=172.17.66.39:35274
No viewAccepted() with the new view ever appears in the perf04 logs.
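If it would help to confirm whether views are being delivered up the stack at all on perf04 (as opposed to TreeCache just not logging them), a standalone listener along these lines could be run next to the test. This is a rough sketch against the JGroups 2.x API; the default stack and the "view-logger-test" cluster name are placeholders, not the channel config used in the run:

import org.jgroups.JChannel;
import org.jgroups.ReceiverAdapter;
import org.jgroups.View;

// Rough diagnostic sketch (JGroups 2.x-era API assumed): join a test group and
// log every view the stack delivers, so a missing viewAccepted() callback can
// be distinguished from a missing log statement.
public class ViewLogger {
    public static void main(String[] args) throws Exception {
        JChannel ch = new JChannel();     // default stack; the real test uses its own config
        ch.setReceiver(new ReceiverAdapter() {
            @Override
            public void viewAccepted(View newView) {
                System.out.println(System.currentTimeMillis() + " view: " + newView);
            }
        });
        ch.connect("view-logger-test");   // placeholder cluster name
        Thread.sleep(60000);              // watch views for a minute
        ch.close();
    }
}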
perf02 also doesn't know it's been excluded, continuing to send suspect messages to perf03:
2008-02-21 08:35:11,785 WARN [org.jgroups.protocols.FD] I was suspected by 172.17.65.39:40890; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK