1 Reply Latest reply on Feb 5, 2005 7:23 PM by belaban

Dropped packets causing problems

damiandiaz Feb 3, 2005 6:59 PM

Test Setup:
2-node cluster between machines that occasionally lose packets when talking (caused by an interface mistakenly being in half-duplex mode, but that's beside the point)

Set up cluster using UDP transport.

Start up TreeCache in SYNC mode.

Run TreeCacheAop.putObject in a tight loop on one of the machines.

Results:
Runs for a bit. When we drop a message, the current putObject call blocks until it eventually times out.

Question:
From what I understand, NAKACK takes care of re-requesting dropped packets by noticing messages received out of order. Since we don't send any more messages (because putObject is blocked), the far end doesn't yet know that it has missed a message and doesn't issue the re-request. Is there a standard way to overcome this?

Side notes:
We tried using TCP transport to overcome the dropped packet problem. This fixed the problem but the performance suffered by a few orders of magnitude. Does this seem right for TCP transport?

Another thing we tried was to ensure that more messages were sent between the machines by running other cache updates in a separate thread. This also fixed the problem, which confirms that NAKACK is doing its job.

Thanks for your help.

Our config:

<mbean code="org.jboss.cache.aop.TreeCacheAop" name="company.app:service=AppCacheSync">
  <depends>jboss:service=Naming</depends>
  <depends>jboss:service=TransactionManager</depends>
  <attribute name="JndiName">/AppCacheSync</attribute>
  <attribute name="TransactionManagerLookupClass">org.jboss.cache.JBossTransactionManagerLookup</attribute>
  <attribute name="IsolationLevel">REPEATABLE_READ</attribute>
  <attribute name="CacheMode">REPL_SYNC</attribute>
  <attribute name="ClusterName">AppTestCluster</attribute>
  <attribute name="ClusterConfig">
    <config>
      <UDP mcast_addr="228.1.2.3" mcast_port="48866" ip_ttl="64" ip_mcast="true"
           mcast_send_buf_size="150000" mcast_recv_buf_size="80000"
           ucast_send_buf_size="150000" ucast_recv_buf_size="80000"
           loopback="false"/>
      <PING timeout="2000" num_initial_members="3" up_thread="false" down_thread="false"/>
      <MERGE2 min_interval="10000" max_interval="20000"/>
      <FD_SOCK/>
      <VERIFY_SUSPECT timeout="1500" up_thread="false" down_thread="false"/>
      <pbcast.NAKACK gc_lag="50" retransmit_timeout="600,1200,2400,4800" max_xmit_size="8192" up_thread="false" down_thread="false"/>
      <UNICAST timeout="600,1200,2400" window_size="100" min_threshold="10" down_thread="false"/>
      <pbcast.STABLE desired_avg_gossip="20000" up_thread="false" down_thread="false"/>
      <FRAG frag_size="8192" down_thread="false" up_thread="false"/>
      <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="true" print_local_addr="true"/>
      <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>

      <!--
      <TCP start_port="7800"/>
      <TCPPING initial_hosts="10.32.8.11[7800],10.32.8.14[7800]" port_range="5" timeout="3000" num_initial_members="3" up_thread="true" down_thread="true"/>
      <MERGE2 min_interval="10000" max_interval="20000"/>
      <VERIFY_SUSPECT timeout="1500" down_thread="false" up_thread="false"/>
      <pbcast.NAKACK down_thread="true" up_thread="true" gc_lag="100" retransmit_timeout="3000"/>
      <pbcast.STABLE desired_avg_gossip="20000" down_thread="false" up_thread="false"/>
      <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="false" print_local_addr="false" down_thread="true" up_thread="true"/>
      <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
      -->
    </config>
  </attribute>
  <attribute name="LockAcquisitionTimeout">15000</attribute>
</mbean>

1. Re: Dropped packets causing problems

belaban Feb 5, 2005 7:23 PM (in response to damiandiaz)

Below is a description and solution for your problem (in JGroups/src/org/jgroups/protocols/pbcast/DESIGN).

The solution is to have STABLE gossip more often, i.e. lower desired_avg_gossip (the average interval between gossips, in milliseconds).

Last Message dropped in NAKACK
------------------------------

When a negative acknowledgment scheme (NACK or NAK) is used, senders
send monotonically increasing sequence numbers (seqnos) and receivers
expect them in the same sequence. If a gap is detected at a receiver
R, R will send a retransmit request to the sender of that
message. However, there is a problem: if a receiver R does not receive
the last message M sent by P, and P does not send more messages, then
R will not know that P sent M and therefore not request
retransmission. This will be the case until P sends another message
M'. At this point, R will request retransmission of M from P and only
deliver M' after M has been received. Since this may never be the
case, or take a long time, the following solution has been adopted:
the STABLE layer includes an array of the highest seqnos received for
each member. When a gossip has been received from each member, the
stability vector will be sent by the STABLE layer up the stack to the
NAKACK layer. The NAKACK protocol will then do its garbage collection
based on the stability vector received. In addition, it will also
check whether it has a copy of the highest messages for each sender,
as indicated in the stability vector. If it doesn't, it will request
retransmission of the missing message(s). A retransmission would only
occur if (a) a message was not received and (b) it was the last
message.
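
Applied to the configuration posted above, where pbcast.STABLE has desired_avg_gossip="20000", a dropped last message may go unnoticed for roughly that long, since the retransmission check only runs when a stability vector arrives. A minimal sketch of gossiping more often follows. The value 5000 is an illustrative assumption, not a figure from this thread or from the DESIGN file, and more frequent gossip adds some traffic.

```xml
<!-- Sketch only: gossip more often so a dropped last message is noticed sooner.
     5000 ms is an assumed, illustrative value; tune it against your own timeouts. -->
<pbcast.STABLE desired_avg_gossip="5000" up_thread="false" down_thread="false"/>
```

The other attributes are copied unchanged from the original STABLE line. Whatever value is chosen probably needs to sit well below the synchronous replication and lock timeouts for the blocked putObject call to recover before it times out.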