-
1. Re: JBAS-4608 Adding FLUSH to 4.2.x stacks
belaban Aug 15, 2007 2:37 AM (in response to brian.stansberry)I'm not in favor of adding FLUSH in a point release. Let's find out first what happened, why the call wasn't received, and what changed between the 2 releases so this could happen
-
2. Re: JBAS-4608 Adding FLUSH to 4.2.x stacks
brian.stansberry Aug 15, 2007 10:16 AM (in response to brian.stansberry)Agreed on the point release change. If another minor release ever comes out of 4.x (more likely than you might imagine ;) ) then we can consider it.
Re: chasing down what changed, to unify things on this thread I'm going to cut and paste the relevant log analysis from David's support case that led to this issue. Background is David runs a demo app very similar to the ejb lab in the clustering course. 2 nodes on same machine; client in 3rd process exercises a simple counter SFSB in a loop; kill nodes and observe failover. David runs with 2 client processes, not one like the lab. I don't think that's relevant. Problem occurs when the killed node is restarted -- during the restart, client invocations pause for extended periods.
From the support case:
What's going on here is the SFSB service is attempting to replicate a change before the newly joined node has received a view. The JGroups channel on the new node drops the RPC message. The sender then blocks waiting for a response.
Node1 installs a view with node2 included:
2007-08-09 15:13:14,397 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.DefaultPartition] New cluster view for partition DefaultPartition (id: 5, delta: 1) : [192.168.200.1:1099, 192.168.200.2:1099]
Client requests on node1 result in session replication RPCs:
2007-08-09 15:13:14,431 INFO [STDOUT] Total arrested guys: 79
2007-08-09 15:13:14,432 TRACE [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] callMethodOnCluster(true), objName=SessionState-'/HASessionState/Default', methodName=_setState, members=[192.168.200.2:33177]
2007-08-09 15:13:14,432 TRACE [org.jboss.ha.framework.server.HAPartitionImpl] dests=[192.168.200.2:33177], method_call=SessionState-'/HASessionState/Default'._setState(GangDispute, org.jboss.ha.hasessionstate.server.PackagedSessionImpl@1209d56), mode=2, timeout=60000
2007-08-09 15:13:14,432 TRACE [org.jboss.ha.framework.server.HAPartitionImpl] real_dests=[192.168.200.2:33177]
2007-08-09 15:13:14,432 TRACE [org.jgroups.blocks.GroupRequest] sending request (id=1186686794432)
2007-08-09 15:13:14,432 TRACE [org.jgroups.protocols.pbcast.NAKACK] sending 192.168.200.1:33167#27
2007-08-09 15:13:14,433 TRACE [org.jgroups.protocols.UDP] sending msg to null (src=192.168.200.1:33167), headers are {NAKACK=[MSG, seqno=27], MessageDispatcher=[Header: name=MessageDispatcher, type=REQ, id=1186686794432, rsp_expected=true], dest_mbrs=[192.168.200.2:33177], UDP=[channel_name=DefaultPartition]}
2007-08-09 15:13:14,433 TRACE [org.jgroups.protocols.UDP] received (mcast) 940 bytes from 192.168.200.1:33168
2007-08-09 15:13:14,433 TRACE [org.jgroups.protocols.UDP] message is [dst: 228.1.2.3:45566, src: 192.168.200.1:33167 (3 headers), size = 759 bytes], headers are {NAKACK=[MSG, seqno=27], MessageDispatcher=[Header: name=MessageDispatcher, type=REQ, id=1186686794432, rsp_expected=true], dest_mbrs=[192.168.200.2:33177], UDP=[channel_name=DefaultPartition]}
2007-08-09 15:13:14,434 TRACE [org.jgroups.protocols.pbcast.NAKACK] 192.168.200.1:33167: received 192.168.200.1:33167#27
2007-08-09 15:13:14,434 TRACE [org.jgroups.blocks.RequestCorrelator] discarded request from 192.168.200.1:33167 as we are not part of destination list (local_addr=192.168.200.1:33167, hdr=[Header: name=MessageDispatcher, type=REQ, id=1186686794432, rsp_expected=true], dest_mbrs=[192.168.200.2:33177])
2007-08-09 15:13:14,580 INFO [STDOUT] Total arrested guys: 79
2007-08-09 15:13:14,581 TRACE [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] callMethodOnCluster(true), objName=SessionState-'/HASessionState/Default', methodName=_setState, members=[192.168.200.2:33177]
2007-08-09 15:13:14,581 TRACE [org.jboss.ha.framework.server.HAPartitionImpl] dests=[192.168.200.2:33177], method_call=SessionState-'/HASessionState/Default'._setState(GangDispute, org.jboss.ha.hasessionstate.server.PackagedSessionImpl@347cdb), mode=2, timeout=60000
2007-08-09 15:13:14,581 TRACE [org.jboss.ha.framework.server.HAPartitionImpl] real_dests=[192.168.200.2:33177]
2007-08-09 15:13:14,581 TRACE [org.jgroups.blocks.GroupRequest] sending request (id=1186686794581)
Node2 gets and discards the RPC messages:
2007-08-09 15:13:14,433 TRACE [org.jgroups.protocols.UDP] received (mcast) 940 bytes from 192.168.200.1:33168
2007-08-09 15:13:14,434 TRACE [org.jgroups.protocols.UDP] message is [dst: 228.1.2.3:45566, src: 192.168.200.1:33167 (3 headers), size = 759 bytes], headers are {NAKACK=[MSG, seqno=27], MessageDispatcher=[Header: name=MessageDispatcher, type=REQ, id=1186686794432, rsp_expected=true], dest_mbrs=[192.168.200.2:33177], UDP=[channel_name=DefaultPartition]}
2007-08-09 15:13:14,434 TRACE [org.jgroups.protocols.pbcast.NAKACK] message was discarded (not yet server)
2007-08-09 15:13:14,582 TRACE [org.jgroups.protocols.UDP] received (mcast) 940 bytes from 192.168.200.1:33168
2007-08-09 15:13:14,583 TRACE [org.jgroups.protocols.UDP] message is [dst: 228.1.2.3:45566, src: 192.168.200.1:33167 (3 headers), size = 759 bytes], headers are {NAKACK=[MSG, seqno=28], MessageDispatcher=[Header: name=MessageDispatcher, type=REQ, id=1186686794581, rsp_expected=true], dest_mbrs=[192.168.200.2:33177], UDP=[channel_name=DefaultPartition]}
2007-08-09 15:13:14,583 TRACE [org.jgroups.protocols.pbcast.NAKACK] message was discarded (not yet server)
Node2 installs view:
2007-08-09 15:13:14,712 DEBUG [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] ViewAccepted: initial members set
A minute later the RPCs time out on node1:
2007-08-09 15:14:14,443 TRACE [org.jgroups.blocks.GroupRequest] call did not execute correctly, request is [GroupRequest:
req_id=1186686794432
caller=192.168.200.1:33167
192.168.200.2:33177: sender=192.168.200.2:33177, retval=null, received=false, suspected=false
request_msg: [dst: , src: 192.168.200.1:33167 (3 headers), size = 759 bytes]
rsp_mode: GET_ALL
done: false
timeout: 60000
expected_mbrs: 0
]
Full logs with org.jgroups and org.jboss.ha are attached to David's support case 00017603. -
3. Re: JBAS-4608 Adding FLUSH to 4.2.x stacks
brian.stansberry Aug 22, 2007 9:48 AM (in response to brian.stansberry)See http://www.jboss.com/index.html?module=bb&op=viewtopic&t=116669 for another problem that is quite similar -- view change is exposed to the application before all the protocols in the stack are aware of it. In this case with JBC 2.0 and JGroups 2.5.
I was thinking this might have something to do with the change introduced in http://jira.jboss.com/jira/browse/JGRP-347 , but in both David's case and the referenced case the failure occurred on the coordinator. My (quite possibly incorrect) read of the code involved in JGRP-347 leads me to believe that change would only affect the non-coordinator nodes.