3 Replies Latest reply on Aug 22, 2007 9:48 AM by brian.stansberry

JBAS-4608 Adding FLUSH to 4.2.x stacks

brian.stansberry Aug 14, 2007 1:15 PM

Discussion of http://jira.jboss.com/jira/browse/JBAS-4608 .

From a support case discussion:

"bstansberry" wrote:

Will adding FLUSH to a 2.4.1.SP3/4 stack do anything to ensure that a view is received on all nodes before messages for that view are allowed down the stack?
"bela" wrote:

In most cases, yes. But in 2.4.x, FLUSH didn't have the reconcile code in it, which meant that we just stopped everyone from sending messages until the new view V2 was installed. However, we didn't reconcile all messages in V1, meaning that we didn't ensure that everyone received exactly the same messages in V1. To do that, we added code in 2.5, which resends missed messages to all members in V1 until V2 is installed.

From this it seems adding FLUSH would have some benefit. The HAPartition and JBC 1.x don't do anything with the block() callback, so it would seem there'd be no negative side effect of suddenly invoking new code paths in the apps via block().

A downside is it's a config change, which isn't so good in a point release.

A concern I have about this is the virtual synchrony issues David Ward experienced apparently weren't there in 4.0.5, which didn't have FLUSH. That implies some semi-regression occurred. If the regression was a known effect of some design change where the intent was to counteract it via use of FLUSH, that's OK. But it would be good to have a clearer understanding of what changed.

1. Re: JBAS-4608 Adding FLUSH to 4.2.x stacks

belaban Aug 15, 2007 2:37 AM (in response to brian.stansberry)

I'm not in favor of adding FLUSH in a point release. Let's find out first what happened, why the call wasn't received, and what changed between the 2 releases so this could happen
Actions
2. Re: JBAS-4608 Adding FLUSH to 4.2.x stacks

brian.stansberry Aug 15, 2007 10:16 AM (in response to brian.stansberry)

Agreed on the point release change. If another minor release ever comes out of 4.x (more likely than you might imagine ;) ) then we can consider it.

Re: chasing down what changed, to unify things on this thread I'm going to cut and paste the relevant log analysis from David's support case that led to this issue. Background is David runs a demo app very similar to the ejb lab in the clustering course. 2 nodes on same machine; client in 3rd process exercises a simple counter SFSB in a loop; kill nodes and observe failover. David runs with 2 client processes, not one like the lab. I don't think that's relevant. Problem occurs when the killed node is restarted -- during the restart, client invocations pause for extended periods.

From the support case:

What's going on here is the SFSB service is attempting to replicate a change before the newly joined node has received a view. The JGroups channel on the new node drops the RPC message. The sender then blocks waiting for a response.

Node1 installs a view with node2 included:

2007-08-09 15:13:14,397 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.DefaultPartition] New cluster view for partition DefaultPartition (id: 5, delta: 1) : [192.168.200.1:1099, 192.168.200.2:1099]

Client requests on node1 result in session replication RPCs:

2007-08-09 15:13:14,431 INFO [STDOUT] Total arrested guys: 79
2007-08-09 15:13:14,432 TRACE [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] callMethodOnCluster(true), objName=SessionState-'/HASessionState/Default', methodName=_setState, members=[192.168.200.2:33177]
2007-08-09 15:13:14,432 TRACE [org.jboss.ha.framework.server.HAPartitionImpl] dests=[192.168.200.2:33177], method_call=SessionState-'/HASessionState/Default'._setState(GangDispute, org.jboss.ha.hasessionstate.server.PackagedSessionImpl@1209d56), mode=2, timeout=60000
2007-08-09 15:13:14,432 TRACE [org.jboss.ha.framework.server.HAPartitionImpl] real_dests=[192.168.200.2:33177]
2007-08-09 15:13:14,432 TRACE [org.jgroups.blocks.GroupRequest] sending request (id=1186686794432)
2007-08-09 15:13:14,432 TRACE [org.jgroups.protocols.pbcast.NAKACK] sending 192.168.200.1:33167#27
2007-08-09 15:13:14,433 TRACE [org.jgroups.protocols.UDP] sending msg to null (src=192.168.200.1:33167), headers are {NAKACK=[MSG, seqno=27], MessageDispatcher=[Header: name=MessageDispatcher, type=REQ, id=1186686794432, rsp_expected=true], dest_mbrs=[192.168.200.2:33177], UDP=[channel_name=DefaultPartition]}
2007-08-09 15:13:14,433 TRACE [org.jgroups.protocols.UDP] received (mcast) 940 bytes from 192.168.200.1:33168
2007-08-09 15:13:14,433 TRACE [org.jgroups.protocols.UDP] message is [dst: 228.1.2.3:45566, src: 192.168.200.1:33167 (3 headers), size = 759 bytes], headers are {NAKACK=[MSG, seqno=27], MessageDispatcher=[Header: name=MessageDispatcher, type=REQ, id=1186686794432, rsp_expected=true], dest_mbrs=[192.168.200.2:33177], UDP=[channel_name=DefaultPartition]}
2007-08-09 15:13:14,434 TRACE [org.jgroups.protocols.pbcast.NAKACK] 192.168.200.1:33167: received 192.168.200.1:33167#27
2007-08-09 15:13:14,434 TRACE [org.jgroups.blocks.RequestCorrelator] discarded request from 192.168.200.1:33167 as we are not part of destination list (local_addr=192.168.200.1:33167, hdr=[Header: name=MessageDispatcher, type=REQ, id=1186686794432, rsp_expected=true], dest_mbrs=[192.168.200.2:33177])
2007-08-09 15:13:14,580 INFO [STDOUT] Total arrested guys: 79
2007-08-09 15:13:14,581 TRACE [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] callMethodOnCluster(true), objName=SessionState-'/HASessionState/Default', methodName=_setState, members=[192.168.200.2:33177]
2007-08-09 15:13:14,581 TRACE [org.jboss.ha.framework.server.HAPartitionImpl] dests=[192.168.200.2:33177], method_call=SessionState-'/HASessionState/Default'._setState(GangDispute, org.jboss.ha.hasessionstate.server.PackagedSessionImpl@347cdb), mode=2, timeout=60000
2007-08-09 15:13:14,581 TRACE [org.jboss.ha.framework.server.HAPartitionImpl] real_dests=[192.168.200.2:33177]
2007-08-09 15:13:14,581 TRACE [org.jgroups.blocks.GroupRequest] sending request (id=1186686794581)

Node2 gets and discards the RPC messages:

2007-08-09 15:13:14,433 TRACE [org.jgroups.protocols.UDP] received (mcast) 940 bytes from 192.168.200.1:33168
2007-08-09 15:13:14,434 TRACE [org.jgroups.protocols.UDP] message is [dst: 228.1.2.3:45566, src: 192.168.200.1:33167 (3 headers), size = 759 bytes], headers are {NAKACK=[MSG, seqno=27], MessageDispatcher=[Header: name=MessageDispatcher, type=REQ, id=1186686794432, rsp_expected=true], dest_mbrs=[192.168.200.2:33177], UDP=[channel_name=DefaultPartition]}
2007-08-09 15:13:14,434 TRACE [org.jgroups.protocols.pbcast.NAKACK] message was discarded (not yet server)
2007-08-09 15:13:14,582 TRACE [org.jgroups.protocols.UDP] received (mcast) 940 bytes from 192.168.200.1:33168
2007-08-09 15:13:14,583 TRACE [org.jgroups.protocols.UDP] message is [dst: 228.1.2.3:45566, src: 192.168.200.1:33167 (3 headers), size = 759 bytes], headers are {NAKACK=[MSG, seqno=28], MessageDispatcher=[Header: name=MessageDispatcher, type=REQ, id=1186686794581, rsp_expected=true], dest_mbrs=[192.168.200.2:33177], UDP=[channel_name=DefaultPartition]}
2007-08-09 15:13:14,583 TRACE [org.jgroups.protocols.pbcast.NAKACK] message was discarded (not yet server)

Node2 installs view:

2007-08-09 15:13:14,712 DEBUG [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] ViewAccepted: initial members set

A minute later the RPCs time out on node1:

2007-08-09 15:14:14,443 TRACE [org.jgroups.blocks.GroupRequest] call did not execute correctly, request is [GroupRequest:
req_id=1186686794432
caller=192.168.200.1:33167
192.168.200.2:33177: sender=192.168.200.2:33177, retval=null, received=false, suspected=false

request_msg: [dst: , src: 192.168.200.1:33167 (3 headers), size = 759 bytes]
rsp_mode: GET_ALL
done: false
timeout: 60000
expected_mbrs: 0
]

Full logs with org.jgroups and org.jboss.ha are attached to David's support case 00017603.
Actions
3. Re: JBAS-4608 Adding FLUSH to 4.2.x stacks

brian.stansberry Aug 22, 2007 9:48 AM (in response to brian.stansberry)

See http://www.jboss.com/index.html?module=bb&op=viewtopic&t=116669 for another problem that is quite similar -- view change is exposed to the application before all the protocols in the stack are aware of it. In this case with JBC 2.0 and JGroups 2.5.

I was thinking this might have something to do with the change introduced in http://jira.jboss.com/jira/browse/JGRP-347 , but in both David's case and the referenced case the failure occurred on the coordinator. My (quite possibly incorrect) read of the code involved in JGRP-347 leads me to believe that change would only affect the non-coordinator nodes.
Actions

Go to original post