1 Reply Latest reply on Sep 24, 2013 2:57 AM by stefanrinderle

    5 second pauses on startup replication

    stefanrinderle

      Hey guys,

       

      We are running Infinispan 5.2 in replicated sync mode on multiple nodes, using JGroups over TCP (see attached infinispan.xml and jgroups.xml) and WebLogic as the app server.

       

      The problem occurs in the following scenario:

      We have about 500000 objects in the cache of a running node. If we start another node, the cache replication starts as expected, but it takes a very long time. We took a closer look at the log files:

       

      2013-08-26 17:47:04,130 - DEBUG -  - () - StateConsumerImpl: Applying new state for segment 0 of cache ACCESS_TOKEN from node X: received 10000 cache entries

      2013-08-26 17:47:04,148 - DEBUG -  - () - StateConsumerImpl: Finished applying state for segment 0 of cache ACCESS_TOKEN

      2013-08-26 17:47:09,318 - DEBUG -  - () - StateConsumerImpl: Applying new state for segment 0 of cache ACCESS_TOKEN from node X: received 10000 cache entries

      2013-08-26 17:47:09,337 - DEBUG -  - () - StateConsumerImpl: Finished applying state for segment 0 of cache ACCESS_TOKEN

      2013-08-26 17:47:14,497 - DEBUG -  - () - StateConsumerImpl: Applying new state for segment 0 of cache ACCESS_TOKEN from node X: received 10000 cache entries

      2013-08-26 17:47:14,517 - DEBUG -  - () - StateConsumerImpl: Finished applying state for segment 0 of cache ACCESS_TOKEN

       

      As you can see, there is a pause of about 5 seconds after every 10000 entries, so the replication would need over 8 minutes to transfer 500000 objects, but a timeout occurred after 60 seconds. The pauses are also visible in the network monitor: a short peak for each batch of 10000 entries, followed by 5 seconds of nothing.
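
      To put exact numbers on the pauses, the timestamps in the excerpt above can be compared mechanically. The following is a minimal sketch in Python, assuming the log format shown above; the helper names parse_events and idle_gaps_ms are made up for this illustration. It measures the idle time between each "Finished applying state" line and the next "Applying new state" line.

```python
import re
from datetime import datetime, timedelta

# The six log lines quoted in the post, copied verbatim.
LOG = """\
2013-08-26 17:47:04,130 - DEBUG -  - () - StateConsumerImpl: Applying new state for segment 0 of cache ACCESS_TOKEN from node X: received 10000 cache entries
2013-08-26 17:47:04,148 - DEBUG -  - () - StateConsumerImpl: Finished applying state for segment 0 of cache ACCESS_TOKEN
2013-08-26 17:47:09,318 - DEBUG -  - () - StateConsumerImpl: Applying new state for segment 0 of cache ACCESS_TOKEN from node X: received 10000 cache entries
2013-08-26 17:47:09,337 - DEBUG -  - () - StateConsumerImpl: Finished applying state for segment 0 of cache ACCESS_TOKEN
2013-08-26 17:47:14,497 - DEBUG -  - () - StateConsumerImpl: Applying new state for segment 0 of cache ACCESS_TOKEN from node X: received 10000 cache entries
2013-08-26 17:47:14,517 - DEBUG -  - () - StateConsumerImpl: Finished applying state for segment 0 of cache ACCESS_TOKEN
"""

# Leading timestamp, e.g. "2013-08-26 17:47:04,130".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")


def parse_events(text):
    """Return a list of (timestamp, kind) with kind 'applying' or 'finished'."""
    events = []
    for line in text.splitlines():
        match = TIMESTAMP.match(line.strip())
        if not match:
            continue
        ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if "Finished applying state" in line:
            events.append((ts, "finished"))
        elif "Applying new state" in line:
            events.append((ts, "applying"))
    return events


def idle_gaps_ms(events):
    """Milliseconds between each 'finished' event and the following 'applying' event."""
    gaps = []
    for (t0, kind0), (t1, kind1) in zip(events, events[1:]):
        if kind0 == "finished" and kind1 == "applying":
            gaps.append((t1 - t0) // timedelta(milliseconds=1))
    return gaps


if __name__ == "__main__":
    print(idle_gaps_ms(parse_events(LOG)))  # [5170, 5160]
```

      For this excerpt the idle gaps come out at 5170 ms and 5160 ms, while applying each batch of 10000 entries takes only about 20 ms, so almost all of the elapsed time is spent waiting for the next batch to arrive.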

       

      We already found the chunkSize parameter, which increases the number of entries replicated in one peak, but we have no idea where the 5 second pauses come from or why they are there. The cached objects are pretty small, so we expected the replication to finish within a few seconds; more than 5 minutes is really a long time :-)

       

      It would be very helpful if anyone has an idea where to look or why these pauses occur. If you need any further details, we will provide them.

       

      Thanks in advance

       

      Stefan

        • 1. Re: 5 second pauses on startup replication
          stefanrinderle

          We still have the issue described above, but we have identified that it is probably related to JGroups. If we change the UFC and MFC configuration to

           

          <UFC max_credits="200k" min_threshold="0.20" max_block_time="500" />

          <MFC max_credits="200k" min_threshold="0.20" max_block_time="500" />

           

          the pause is only 500 ms instead of 5 seconds, so it must have something to do with max_block_time.
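
          One way to see why the pause would track max_block_time is to model credit-based flow control, which is what UFC and MFC provide: a sender may send up to max_credits bytes to a receiver and then has to wait for new credits, blocking for at most max_block_time milliseconds. The sketch below is a toy model in Python, not JGroups code. It assumes a hypothetical worst case in which replenishments never arrive on their own, so every stall lasts the full max_block_time, and it reads "200k" as 200000 bytes. The function name and the message sizes are made up for illustration.

```python
def simulate_blocked_time(total_bytes, message_bytes, max_credits, max_block_time_ms):
    """Toy model of a credit-based sender (not the JGroups implementation).

    Assumption: credits are never replenished in time, so each time the
    sender runs out it waits the full max_block_time_ms and then gets a
    complete refill. Returns (number_of_stalls, total_blocked_ms).
    """
    if message_bytes <= 0 or max_credits <= 0:
        raise ValueError("message_bytes and max_credits must be positive")

    credits = max_credits
    stalls = 0
    sent = 0
    while sent < total_bytes:
        size = min(message_bytes, total_bytes - sent)
        if size > credits:
            # Not enough credits for this message: block, then refill.
            stalls += 1
            credits = max_credits
        credits -= size
        sent += size
    return stalls, stalls * max_block_time_ms


if __name__ == "__main__":
    # Hypothetical workload: 1 MB sent in 50 KB messages with 200000 credits.
    for block_ms in (5000, 500):
        stalls, blocked = simulate_blocked_time(1_000_000, 50_000, 200_000, block_ms)
        print(block_ms, stalls, blocked)
```

          In this model the number of stalls depends only on the data volume and max_credits, while the length of each stall is max_block_time, which matches the drop from 5 seconds to 500 ms. It does not explain why the credits fail to arrive in time in the first place; that is still the open question.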

           

          We would like to know why or where this blocking occurs (root cause). It would be great if anyone has an idea where to look.

           

          Thanks in advance

           

          Stefan