Dropped packets causing problems
damiandiaz Feb 3, 2005 6:59 PM
Test Setup:
2 node cluster between machines that occasionally lose packets when talking to each other (caused by an interface mistakenly left in half-duplex mode, but that's beside the point)
Set up the cluster using UDP transport.
Start up TreeCache in SYNC mode.
Run TreeCacheAop.putObject in a tight loop on one of the machines.
Results:
Runs for a bit. When we drop a message, the current putObject call blocks until it eventually times out.
Question:
From what I understand, NAKACK takes care of re-requesting dropped packets by noticing messages received out of order. Since we don't send any more messages (because putObject is blocked), the far end doesn't know yet that it has missed a message and doesn't do the re-request. Is there a standard way to overcome this?
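To make the scenario concrete, here is a toy model of gap-based loss detection as I understand it. This is not JGroups code: the class NakReceiver and its receive method are made up for illustration, and the only assumption is that the receiver learns about a loss solely by seeing a higher sequence number than expected.

```python
class NakReceiver:
    """Toy receiver that detects loss only from gaps in sequence numbers.

    Hypothetical illustration, not the JGroups NAKACK implementation.
    """

    def __init__(self):
        self.highest_seen = 0  # highest sequence number received so far
        self.missing = set()   # sequence numbers known to be missing

    def receive(self, seqno):
        """Process one message and return the seqnos to re-request."""
        self.missing.discard(seqno)  # a retransmission fills its gap
        if seqno > self.highest_seen:
            # Everything between the old high-water mark and this
            # message was skipped, so mark it as missing.
            self.missing.update(range(self.highest_seen + 1, seqno))
            self.highest_seen = seqno
        return sorted(self.missing)


receiver = NakReceiver()
receiver.receive(1)
receiver.receive(2)
# Message 3 is dropped. While the sender is blocked, nothing else arrives,
# so the receiver has no gap to notice.
print(sorted(receiver.missing))
# Only when a later message (4) shows up does the gap become visible.
print(receiver.receive(4))
```

In this model the loss of message 3 stays invisible until message 4 arrives, which matches what we see: the blocked putObject call is the only sender, so no later message ever exposes the gap.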
Side notes:
We tried using TCP transport to overcome the dropped packet problem. This fixed the problem but the performance suffered by a few orders of magnitude. Does this seem right for TCP transport?
Another thing we tried was to ensure that more messages were sent between the machines by running other cache updates in a separate thread. This also fixed the problem, which confirms that NAKACK is doing its job.
Thanks for your help.
Our config:
<mbean code="org.jboss.cache.aop.TreeCacheAop" name="company.app:service=AppCacheSync">
  <depends>jboss:service=Naming</depends>
  <depends>jboss:service=TransactionManager</depends>
  <attribute name="JndiName">/AppCacheSync</attribute>
  <attribute name="TransactionManagerLookupClass">org.jboss.cache.JBossTransactionManagerLookup</attribute>
  <attribute name="IsolationLevel">REPEATABLE_READ</attribute>
  <attribute name="CacheMode">REPL_SYNC</attribute>
  <attribute name="ClusterName">AppTestCluster</attribute>
  <attribute name="ClusterConfig">
    <config>
      <UDP mcast_addr="228.1.2.3" mcast_port="48866" ip_ttl="64" ip_mcast="true"
           mcast_send_buf_size="150000" mcast_recv_buf_size="80000"
           ucast_send_buf_size="150000" ucast_recv_buf_size="80000" loopback="false"/>
      <PING timeout="2000" num_initial_members="3" up_thread="false" down_thread="false"/>
      <MERGE2 min_interval="10000" max_interval="20000"/>
      <FD_SOCK/>
      <VERIFY_SUSPECT timeout="1500" up_thread="false" down_thread="false"/>
      <pbcast.NAKACK gc_lag="50" retransmit_timeout="600,1200,2400,4800" max_xmit_size="8192"
                     up_thread="false" down_thread="false"/>
      <UNICAST timeout="600,1200,2400" window_size="100" min_threshold="10" down_thread="false"/>
      <pbcast.STABLE desired_avg_gossip="20000" up_thread="false" down_thread="false"/>
      <FRAG frag_size="8192" down_thread="false" up_thread="false"/>
      <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="true" print_local_addr="true"/>
      <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
      <!--
      <TCP start_port="7800"/>
      <TCPPING initial_hosts="10.32.8.11[7800],10.32.8.14[7800]" port_range="5" timeout="3000"
               num_initial_members="3" up_thread="true" down_thread="true"/>
      <MERGE2 min_interval="10000" max_interval="20000"/>
      <VERIFY_SUSPECT timeout="1500" down_thread="false" up_thread="false"/>
      <pbcast.NAKACK down_thread="true" up_thread="true" gc_lag="100" retransmit_timeout="3000"/>
      <pbcast.STABLE desired_avg_gossip="20000" down_thread="false" up_thread="false"/>
      <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="false" print_local_addr="false"
                  down_thread="true" up_thread="true"/>
      <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
      -->
    </config>
  </attribute>
  <attribute name="LockAcquisitionTimeout">15000</attribute>
</mbean>