1 Reply Latest reply on Feb 5, 2005 7:23 PM by belaban

    Dropped packets causing problems

    damiandiaz

      Test Setup:
      A 2-node cluster between machines that occasionally lose packets when talking to each other (caused by an interface mistakenly being in half-duplex mode, but that's beside the point).

      Setup cluster using UDP transport.

      Startup TreeCache in SYNC mode.

      Run TreeCacheAop.putObject in a tight loop on one of the machines.

      Results:
      Runs for a bit. When we drop a message, the current putObject call blocks until it eventually times out.

      Question:
      From what I understand, NAKACK takes care of re-requesting dropped packets by noticing messages received out of order. Since we don't send any more messages (because putObject is blocked), the far end doesn't yet know that it has missed a message and doesn't issue the re-request. Is there a standard way to overcome this?

      Side notes:
      We tried using TCP transport to overcome the dropped packet problem. This fixed the problem but the performance suffered by a few orders of magnitude. Does this seem right for TCP transport?

      Another thing we tried was to ensure that more messages were sent between the machines by running other cache updates in a separate thread. This also fixed the problem, which confirms that NAKACK is doing its job.


      Thanks for your help.

      Our config:

      <mbean code="org.jboss.cache.aop.TreeCacheAop" name="company.app:service=AppCacheSync">
       <depends>jboss:service=Naming</depends>
       <depends>jboss:service=TransactionManager</depends>
       <attribute name="JndiName">/AppCacheSync</attribute>
       <attribute name="TransactionManagerLookupClass">org.jboss.cache.JBossTransactionManagerLookup</attribute>
       <attribute name="IsolationLevel">REPEATABLE_READ</attribute>
       <attribute name="CacheMode">REPL_SYNC</attribute>
       <attribute name="ClusterName">AppTestCluster</attribute>
       <attribute name="ClusterConfig">
       <config>
       <UDP mcast_addr="228.1.2.3" mcast_port="48866" ip_ttl="64" ip_mcast="true"
       mcast_send_buf_size="150000" mcast_recv_buf_size="80000"
       ucast_send_buf_size="150000" ucast_recv_buf_size="80000"
       loopback="false"/>
       <PING timeout="2000" num_initial_members="3" up_thread="false" down_thread="false"/>
       <MERGE2 min_interval="10000" max_interval="20000"/>
       <FD_SOCK/>
       <VERIFY_SUSPECT timeout="1500" up_thread="false" down_thread="false"/>
       <pbcast.NAKACK gc_lag="50" retransmit_timeout="600,1200,2400,4800" max_xmit_size="8192" up_thread="false" down_thread="false"/>
       <UNICAST timeout="600,1200,2400" window_size="100" min_threshold="10" down_thread="false"/>
       <pbcast.STABLE desired_avg_gossip="20000" up_thread="false" down_thread="false"/>
       <FRAG frag_size="8192" down_thread="false" up_thread="false"/>
       <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="true" print_local_addr="true"/>
       <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
      
       <!--
       <TCP start_port="7800"/>
       <TCPPING initial_hosts="10.32.8.11[7800],10.32.8.14[7800]" port_range="5" timeout="3000" num_initial_members="3" up_thread="true" down_thread="true"/>
       <MERGE2 min_interval="10000" max_interval="20000"/>
       <VERIFY_SUSPECT timeout="1500" down_thread="false" up_thread="false"/>
       <pbcast.NAKACK down_thread="true" up_thread="true" gc_lag="100" retransmit_timeout="3000"/>
       <pbcast.STABLE desired_avg_gossip="20000" down_thread="false" up_thread="false"/>
       <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="false" print_local_addr="false" down_thread="true" up_thread="true"/>
       <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
       -->
       </config>
       </attribute>
       <attribute name="LockAcquisitionTimeout">15000</attribute>
       </mbean>


        • 1. Re: Dropped packets causing problems
          belaban

          Below is a description and solution for your problem (in JGroups/src/org/jgroups/protocols/pbcast/DESIGN).

          The solution is to make STABLE gossip more often, i.e. reduce desired_avg_gossip (the gossip interval, 20000 in your config).


          Last Message dropped in NAKACK
          ------------------------------

          When a negative acknowledgment scheme (NACK or NAK) is used, senders
          send monotonically increasing sequence numbers (seqnos) and receivers
          expect them in the same sequence. If a gap is detected at a receiver
          R, R will send a retransmit request to the sender of that
          message. However, there is a problem: if a receiver R does not receive
          the last message M sent by P, and P does not send more messages, then
          R will not know that P sent M and therefore not request
          retransmission. This will be the case until P sends another message
          M'. At this point, R will request retransmission of M from P and only
          deliver M' after M has been received. Since this may never be the
          case, or take a long time, the following solution has been adopted:
          the STABLE layer includes an array of the highest seqnos received for
          each member. When a gossip has been received from each member, the
          stability vector will be sent by the STABLE layer up the stack to the
          NAKACK layer. The NAKACK protocol will then do its garbage collection
          based on the stability vector received. In addition, it will also
          check whether it has a copy of the highest messages for each sender,
          as indicated in the stability vector. If it doesn't, it will request
          retransmission of the missing message(s). A retransmission would only
          occur if (a) a message was not received and (b) it was the last
          message.
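
The scenario described above can be illustrated with a small toy model. This is not JGroups code: the names LastMessageDropDemo, Receiver, receive and onStability are invented for illustration, and the stability vector is reduced to a single highest seqno for one sender. It only sketches why a dropped last message goes unnoticed until either another message or the stability information arrives.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy model (hypothetical, not the JGroups implementation) of a NAK-based receiver. */
public class LastMessageDropDemo {

    /** Receiver-side state for a single sender P. */
    static final class Receiver {
        private long nextToDeliver = 1;                                // next seqno expected from P
        private final TreeMap<Long, String> buffer = new TreeMap<>();  // messages received out of order
        final List<String> delivered = new ArrayList<>();              // messages delivered in seqno order

        /** Handles a message from P; returns the seqnos to request for retransmission. */
        List<Long> receive(long seqno, String payload) {
            if (seqno >= nextToDeliver) {
                buffer.put(seqno, payload);
            }
            // Deliver the contiguous run starting at nextToDeliver.
            while (buffer.containsKey(nextToDeliver)) {
                delivered.add(buffer.remove(nextToDeliver));
                nextToDeliver++;
            }
            // A gap is only visible if a higher seqno is still waiting in the buffer.
            long highestSeen = buffer.isEmpty() ? nextToDeliver - 1 : buffer.lastKey();
            return missingUpTo(highestSeen);
        }

        /** Handles stability information: the highest seqno P is known to have sent. */
        List<Long> onStability(long highestSeqnoInVector) {
            return missingUpTo(highestSeqnoInVector);
        }

        /** Lists every seqno from nextToDeliver to highest that has not been received. */
        private List<Long> missingUpTo(long highest) {
            List<Long> missing = new ArrayList<>();
            for (long s = nextToDeliver; s <= highest; s++) {
                if (!buffer.containsKey(s)) {
                    missing.add(s);
                }
            }
            return missing;
        }
    }

    public static void main(String[] args) {
        Receiver r = new Receiver();
        // P sends 1, 2, 3 and then goes quiet; message 3 is dropped on the wire.
        System.out.println("after 1: request " + r.receive(1, "m1"));
        System.out.println("after 2: request " + r.receive(2, "m2"));
        // No gap is visible yet, so nothing is requested until the stability info arrives.
        System.out.println("after stability(3): request " + r.onStability(3));
        // The retransmitted message 3 arrives and is delivered.
        r.receive(3, "m3");
        System.out.println("delivered: " + r.delivered);
    }
}
```

In this model the retransmit request for message 3 is produced only by onStability, which mirrors conditions (a) and (b) above: the message was not received and it was the last one sent. How quickly that happens in a real stack depends on how often STABLE gossips.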