
    Problem with a PUT on a large row in 4.2.0.FINAL - works with stateTransfer

    cbo_

      Need Help!  Using 4.2.0.FINAL and doing some performance testing, and just found a functional problem.  This is a replicated scenario using a ReplicationQueue between 2 nodes, with only one node doing the updates.

       

      The special focus is on a single row that has an ArrayList with many, many entries.  The depth of this row may raise your eyebrows, but I don't want to focus on that so much as point one thing out: this entry does get to the partner app on a stateTransfer.

       

      To clarify: I will call the app that actively does the updates to the cache A, and the one that is passive B.  When A comes up and does updates to its caches ahead of starting B (which is our normal case), the stateTransfer works correctly and brings all rows to B.  When the scenario includes a restart of A (so A comes up, then B, and then some situation requires A to restart while B stays up), it is up to the put on A to get the data across to B, because we first do a clear on the cache when A comes up.  In this case all rows get across from A to B except for this wide/deep row.  I turned on detailed trace levels during a test where this occurred, and it is evident that the put is done on A, but there is no evidence it is received by B at all.  Here is a snippet of the logs at that time:

       

      This is the A side:

      2010-12-23 10:55:00,194 124082 TRACE [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Scheduled-ReplicationQueueThread-0:) dests=null, command=MultipleRpcCommand{commands=[PutKeyValueCommand{key=5, value=com.cboe.server.distributedCache.DistributeCacheValueBaseImpl@6e64f6d5, putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1}], cacheName='GenericCache'}, mode=ASYNCHRONOUS, timeout=15000
      2010-12-23 10:55:00,194 124082 TRACE [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Scheduled-ReplicationQueueThread-0:) dests=null, command=MultipleRpcCommand{commands=[PutKeyValueCommand{key=5, value=com.cboe.server.distributedCache.DistributeCacheValueBaseImpl@6e64f6d5, putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1}], cacheName='GenericCache'}, mode=ASYNCHRONOUS, timeout=15000
      2010-12-23 10:55:00,195 124083 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (AsyncSerializationThread-0:) Replication task sending MultipleRpcCommand{commands=[PutKeyValueCommand{key=5, value=com.cboe.server.distributedCache.DistributeCacheValueBaseImpl@6e64f6d5, putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1}], cacheName='GenericCache'} to addresses null
      2010-12-23 10:55:00,205 124093 TRACE [org.infinispan.remoting.ReplicationQueue] (Scheduled-ReplicationQueueThread-0:) flush(): flushing repl queue (num elements=3)
      2010-12-23 10:55:00,205 124093 TRACE [org.infinispan.remoting.ReplicationQueue] (Scheduled-ReplicationQueueThread-0:) Flushing 3 elements
      2010-12-23 10:55:00,205 124093 TRACE [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Scheduled-ReplicationQueueThread-0:) dests=null, command=MultipleRpcCommand{commands=[PutKeyValueCommand{key=6, value=com.cboe.server.distributedCache.DistributeCacheValueBaseImpl@71130ea6, putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1}, PutKeyValueCommand{key=1, value=com.cboe.server.distributedCache.DistributeCacheValueBaseImpl@1a26ee49, putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1}, PutKeyValueCommand{key=2, value=com.cboe.server.distributedCache.DistributeCacheValueBaseImpl@71e60808, putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1}], cacheName='GenericCache'}, mode=ASYNCHRONOUS, timeout=15000
      2010-12-23 10:55:00,216 124104 TRACE [org.infinispan.remoting.ReplicationQueue] (Scheduled-ReplicationQueueThread-0:) flush(): flushing repl queue (num elements=0)
      2010-12-23 10:55:00,228 124116 TRACE [org.infinispan.remoting.ReplicationQueue] (Scheduled-ReplicationQueueThread-0:) flush(): flushing repl queue (num elements=0)
      2010-12-23 10:55:00,240 124128 TRACE [org.infinispan.remoting.ReplicationQueue] (Scheduled-ReplicationQueueThread-0:) flush(): flushing repl queue (num elements=0)
      2010-12-23 10:55:00,251 124139 TRACE [org.infinispan.remoting.ReplicationQueue] (Scheduled-ReplicationQueueThread-0:) flush(): flushing repl queue (num elements=0)
      2010-12-23 10:55:00,262 124150 TRACE [org.infinispan.remoting.ReplicationQueue] (Scheduled-ReplicationQueueThread-0:) flush(): flushing repl queue (num elements=0)
      2010-12-23 10:55:00,272 124160 TRACE [org.infinispan.remoting.ReplicationQueue] (Scheduled-ReplicationQueueThread-0:) flush(): flushing repl queue (num elements=0)
      2010-12-23 10:55:00,288 124176 TRACE [org.infinispan.remoting.ReplicationQueue] (Scheduled-ReplicationQueueThread-0:) flush(): flushing repl queue (num elements=0)
      2010-12-23 10:55:00,304 124192 TRACE [org.infinispan.remoting.ReplicationQueue] (Scheduled-ReplicationQueueThread-0:) flush(): flushing repl queue (num elements=0)
      2010-12-23 10:55:00,315 124203 TRACE [org.infinispan.remoting.ReplicationQueue] (Scheduled-ReplicationQueueThread-0:) flush(): flushing repl queue (num elements=0)
      2010-12-23 10:55:00,326 124214 TRACE [org.infinispan.remoting.ReplicationQueue] (Scheduled-ReplicationQueueThread-0:) flush(): flushing repl queue (num elements=0)
      2010-12-23 10:55:00,338 124226 TRACE [org.infinispan.remoting.ReplicationQueue] (Scheduled-ReplicationQueueThread-0:) flush(): flushing repl queue (num elements=0)
      2010-12-23 10:55:00,351 124239 TRACE [org.infinispan.remoting.ReplicationQueue] (Scheduled-ReplicationQueueThread-0:) flush(): flushing repl queue (num elements=0)
      2010-12-23 10:55:00,353 124241 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (AsyncSerializationThread-0:) Replication task sending MultipleRpcCommand{commands=[PutKeyValueCommand{key=6, value=com.cboe.server.distributedCache.DistributeCacheValueBaseImpl@71130ea6, putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1}, PutKeyValueCommand{key=1, value=com.cboe.server.distributedCache.DistributeCacheValueBaseImpl@1a26ee49, putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1}, PutKeyValueCommand{key=2, value=com.cboe.server.distributedCache.DistributeCacheValueBaseImpl@71e60808, putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1}], cacheName='GenericCache'} to addresses null
      2010-12-23 10:55:00,362 124250 TRACE [org.infinispan.remoting.ReplicationQueue] (Scheduled-ReplicationQueueThread-0:) flush(): flushing repl queue (num elements=0)
      2010-12-23 10:55:00,366 124254 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (AsyncSerializationThread-0:) real_dests=[test16_devenv1-61733]
      2010-12-23 10:55:00,367 124255 TRACE [org.jgroups.protocols.pbcast.NAKACK] (AsyncSerializationThread-0:) sending test16_devenv2-53120#4
      2010-12-23 10:55:00,367 124255 TRACE [org.jgroups.protocols.TCP] (AsyncSerializationThread-0:) sending msg to null, src=test16_devenv2-53120, headers are RequestCorrelator: id=201, type=REQ, id=3, rsp_expected=false, NAKACK: [MSG, seqno=4], TCP: [channel_name=infinispan-cluster-OHServerHybrid-test16]
      2010-12-23 10:55:00,367 124255 TRACE [org.jgroups.protocols.TCP] (AsyncSerializationThread-0:) looping back message [dst: <null>, src: test16_devenv2-53120 (3 headers), size=494 bytes]
      2010-12-23 10:55:00,367 124255 TRACE [org.jgroups.protocols.TCP] (AsyncSerializationThread-0:) dest=170.137.237.83:11600 (598 bytes)
      2010-12-23 10:55:00,367 124255 TRACE [org.jgroups.protocols.TCP] (AsyncSerializationThread-0:) dest=170.137.237.81:11600 (598 bytes)
      2010-12-23 10:55:00,368 124256 TRACE [org.jgroups.protocols.TCP] (Incoming-1,infinispan-cluster-OHServerHybrid-test16,test16_devenv2-53120:) received [dst: <null>, src: test16_devenv2-53120 (3 headers), size=494 bytes], headers are RequestCorrelator: id=201, type=REQ, id=3, rsp_expected=false, NAKACK: [MSG, seqno=4], TCP: [channel_name=infinispan-cluster-OHServerHybrid-test16]
      
      

       

      The log lines for key=5 represent the wide row.  You can also see the other items (key=6, key=2, key=1).  Those other items get to the B side; the key=5 item does not.

       

      Here is the B side (receiving side):

       

      2010-12-23 10:55:00,359 17676491 TRACE [org.infinispan.remoting.ReplicationQueue] (Scheduled-ReplicationQueueThread-0:) flush(): flushing repl queue (num elements=0)
      2010-12-23 10:55:00,359 17676491 TRACE [org.infinispan.remoting.ReplicationQueue] (Scheduled-ReplicationQueueThread-0:) flush(): flushing repl queue (num elements=0)
      2010-12-23 10:55:00,369 17676501 TRACE [org.infinispan.remoting.ReplicationQueue] (Scheduled-ReplicationQueueThread-0:) flush(): flushing repl queue (num elements=0)
      2010-12-23 10:55:00,369 17676501 TRACE [org.jgroups.protocols.TCP] (Incoming-2,infinispan-cluster-OHServerHybrid-test16,test16_devenv1-61733:) received [dst: <null>, src: test16_devenv2-53120 (3 headers), size=494 bytes], headers are RequestCorrelator: id=201, type=REQ, id=3, rsp_expected=false, NAKACK: [MSG, seqno=4], TCP: [channel_name=infinispan-cluster-OHServerHybrid-test16]
      2010-12-23 10:55:00,369 17676501 TRACE [org.infinispan.remoting.ReplicationQueue] (Scheduled-ReplicationQueueThread-0:) flush(): flushing repl queue (num elements=0)
      2010-12-23 10:55:00,369 17676501 TRACE [org.jgroups.protocols.pbcast.NAKACK] (Incoming-2,infinispan-cluster-OHServerHybrid-test16,test16_devenv1-61733:) test16_devenv1-61733: received test16_devenv2-53120#4
      2010-12-23 10:55:00,369 17676501 TRACE [org.jgroups.protocols.FC] (Incoming-2,infinispan-cluster-OHServerHybrid-test16,test16_devenv1-61733:) sender test16_devenv2-53120 minus
      494 credits, 1999479 remaining
      2010-12-23 10:55:00,370 17676502 TRACE [org.jgroups.blocks.RequestCorrelator] (Incoming-2,infinispan-cluster-OHServerHybrid-test16,test16_devenv1-61733:) calling (org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher) with request 3
      2010-12-23 10:55:00,372 17676504 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (Incoming-2,infinispan-cluster-OHServerHybrid-test16,test16_devenv1-61733:) Attempting to execute command: MultipleRpcCommand{commands=[PutKeyValueCommand{key=6, value=com.cboe.server.distributedCache.DistributeCacheValueBaseImpl@7139d335, putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1}, PutKeyValueCommand{key=1, value=com.cboe.server.distributedCache.DistributeCacheValueBaseImpl@b5976a6, putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1}, PutKeyValueCommand{key=2, value=com.cboe.server.distributedCache.DistributeCacheValueBaseImpl@1e7c921c, putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1}], cacheName='GenericCache'} [sender=test16_devenv2-53120]
      2010-12-23 10:55:00,372 17676504 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (Incoming-2,infinispan-cluster-OHServerHybrid-test16,test16_devenv1-61733:) Enough waiting; replayIgnored = false, sr STATE_PREEXISTED
      2010-12-23 10:55:00,372 17676504 TRACE [org.infinispan.remoting.InboundInvocationHandlerImpl] (Incoming-2,infinispan-cluster-OHServerHybrid-test16,test16_devenv1-61733:) Calling perform() on MultipleRpcCommand{commands=[PutKeyValueCommand{key=6, value=com.cboe.server.distributedCache.DistributeCacheValueBaseImpl@7139d335, putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1}, PutKeyValueCommand{key=1, value=com.cboe.server.distributedCache.DistributeCacheValueBaseImpl@b5976a6, putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1}, PutKeyValueCommand{key=2, value=com.cboe.server.distributedCache.DistributeCacheValueBaseImpl@1e7c921c, putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1}], cacheName='GenericCache'}
      2010-12-23 10:55:00,379 17676511 TRACE [org.infinispan.remoting.ReplicationQueue] (Scheduled-ReplicationQueueThread-0:) flush(): flushing repl queue (num elements=0)
      
       
      

      The log lines just above again seem to confirm that those keys (6, 2, and 1) are received, but key=5 never is.  I am not sure whether the NAKACK line is significant.  Otherwise there is no evidence at all of key=5 being received by this app.

       

      We are in performance testing and just uncovered this functional issue, which puts us in significant jeopardy. 

        • 1. Re: Problem with a PUT on a large row in 4.2.0.FINAL - works with stateTransfer
          galder.zamarreno

          If you know the size of the array and its contents, can you replicate this in a basic unit test with the same Infinispan and JGroups configuration file?  Even if you don't know the contents, try with an array of that size containing a simple serializable Pojo or something.  You can use the Maven archetypes we've just published to help you write the unit test: http://community.jboss.org/docs/DOC-16201
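
          For illustration, a minimal sketch of such a test might look like the following (the config file name, cache name, entry count, and the fixed sleep are placeholder assumptions; the configuration must allow the two managers to form a cluster with each other):

          import java.io.Serializable;
          import java.util.ArrayList;
          import java.util.List;

          import org.infinispan.Cache;
          import org.infinispan.manager.DefaultCacheManager;

          public class WideRowReplicationTest {

              // Simple serializable POJO standing in for the real entry contents.
              public static class Pojo implements Serializable {
                  private static final long serialVersionUID = 1L;
                  final String payload;
                  Pojo(String payload) { this.payload = payload; }
              }

              public static void main(String[] args) throws Exception {
                  // Both managers use the same Infinispan/JGroups configuration as the real app.
                  DefaultCacheManager cmA = new DefaultCacheManager("infinispan-repl.xml");
                  DefaultCacheManager cmB = new DefaultCacheManager("infinispan-repl.xml");
                  try {
                      Cache<Object, Object> cacheA = cmA.getCache("GenericCache");
                      Cache<Object, Object> cacheB = cmB.getCache("GenericCache");

                      // Build a "wide row": an ArrayList with a very large number of entries.
                      List<Pojo> wideRow = new ArrayList<Pojo>();
                      for (int i = 0; i < 500000; i++) {
                          wideRow.add(new Pojo("entry-" + i));
                      }

                      cacheA.clear();
                      cacheA.put(5, wideRow);              // the problem entry
                      cacheA.put(6, new Pojo("six"));      // smaller entries that do replicate
                      cacheA.put(1, new Pojo("one"));
                      cacheA.put(2, new Pojo("two"));

                      // With an async ReplicationQueue, give the flush a chance to run.
                      Thread.sleep(5000);

                      System.out.println("key=5 present on B: " + (cacheB.get(5) != null));
                  } finally {
                      cmA.stop();
                      cmB.stop();
                  }
              }
          }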

          • 2. Re: Problem with a PUT on a large row in 4.2.0.FINAL - works with stateTransfer
            cbo_

            I have worked up a unit test, but so far I have not been able to recreate the problem, even though it occurs with regularity in the real app.  The thing that seems really strange about the logging from the real app is that the connection to the cache occurs at one moment, which should be followed by a clear() in the code and then by the put (all in the same thread).  However, in the logs (jgroups details) I see the connection to the cache followed by the put (of the problem entry), which is then followed by the clear (a whopping 7 seconds later).  This is the case both on my master/primary side and on the slave/passive side.  Strange.  Still trying to recreate the problem in my unit test.  Just wanted to pass along this info in case you can think of some reason why the clear would get async'd in a ReplicationQueue scenario.

            • 3. Problem with a PUT on a large row in 4.2.0.FINAL - works with stateTransfer
              cbo_

              I continue to be unable to reproduce this scenario in my unit test.  However, I added some additional logging to the real app and discovered a detail that I really need to point out since prior information was misleading.

               

              The real app is connecting to and working with more than one cache.  I neglected to notice that the ClearCommand that I saw in the logs occurring after the put was from a different cache.  So, the problem is very much still there but the order of things both in the app and in the logs is as follows:

                   - restart the primary app A

                   - the passive app B remains

                   - primary app A does a clear()

                   - clear is there in the jgroups logs on A

                   - clear is there in the jgroups logs on B

                   - primary app A does a put on the wide row (key=5)

                   - the put of the wide row is there in the jgroups logs on A

                   - primary app A does a put on rows with keys 6, 1, 2

                   - the put of the rows with keys 6, 1, 2 are there in the jgroups logs on A

                   - the put of the rows with keys 6, 1, 2 are there in the jgroups logs on B (never saw key 5)

              • 4. Problem with a PUT on a large row in 4.2.0.FINAL - works with stateTransfer
                manik

                Is it just the case that the "wide row" takes some time?  I.e., if it is large, serializing it may be slow, as would transmitting those bytes across the network.  Do you see it eventually turning up on B?

                 

                Also, how big are we talking?  Could you try measuring how long it takes to serialize the wide row, and how many bytes you end up with?
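
                For what it's worth, a quick way to get those two numbers with plain JDK serialization would be something like the sketch below (the real wire format goes through Infinispan's marshaller, so treat the byte count as a rough indication rather than the exact size on the wire):

                import java.io.ByteArrayOutputStream;
                import java.io.ObjectOutputStream;

                public class MeasureWideRow {
                    // Serializes the given (Serializable) value and reports its size and the elapsed time.
                    public static void measure(Object wideRow) throws Exception {
                        long start = System.nanoTime();
                        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
                        ObjectOutputStream out = new ObjectOutputStream(bytes);
                        out.writeObject(wideRow);
                        out.close();
                        long elapsedMs = (System.nanoTime() - start) / 1000000;
                        System.out.println("serialized size = " + bytes.size()
                                + " bytes, serialization took " + elapsedMs + " ms");
                    }
                }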

                 

                Also, are you using TCP or UDP?  UDP packet sizes may have something to do with this, along with tuning your OS network stack.  Also, does your JGroups stack have bundling enabled?
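
                For reference, bundling and fragmentation are controlled in the JGroups stack itself.  A fragment of a JGroups 2.x TCP configuration showing where the relevant attributes sit might look like this (the values are illustrative only, not recommendations):

                <!-- Transport: message bundling is configured on the transport protocol -->
                <TCP bind_port="11600"
                     enable_bundling="true"
                     max_bundle_size="64000"
                     max_bundle_timeout="30"/>

                <!-- Fragmentation: large messages are split into frag_size-byte chunks -->
                <FRAG2 frag_size="60000"/>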

                 

                Cheers

                Manik

                • 5. Problem with a PUT on a large row in 4.2.0.FINAL - works with stateTransfer
                  mircea.markus

                  I'm not sure I understand the setup:

                  1. A runs and is updated.  Key5 (the big row) is in A at this point.

                  2. B starts. stateTransfer brings key5 to B correctly.

                  Can you please detail the next steps?

                  • 6. Problem with a PUT on a large row in 4.2.0.FINAL - works with stateTransfer
                    cbo_

                    The complete list is this (a code sketch of the sequence follows the list):

                     

                    1. A starts up and does a cache.clear()

                    2. A does put on key=5 (the big row), then put on key=6, then key=1, then key=2

                    3. B starts up in passive mode.  Connects to the cache and retrieves all rows via stateTransfer correctly

                    4. We decide to restart A at this point and intend to leave B in passive mode.

                    5. When A restarts it does a cache.clear(), which clears entries on the B side as well.

                    6. A does put on key=5 (the big row), then 6, 2, and 1.

                    7. There is evidence in A's log file that the replicationQueue receives and flushes key=5 as well as 6, 2, and 1.

                    8. The log on the B side shows the clear being done and then only keys 6, 2, and 1 being received.  Nothing for key=5!
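
                    In code terms, the sequence above is roughly the following (a sketch only; the config file name, value objects, and the fixed wait are illustrative assumptions, not taken from the real app):

                    import java.util.ArrayList;
                    import java.util.List;

                    import org.infinispan.Cache;
                    import org.infinispan.manager.DefaultCacheManager;

                    public class RestartScenario {
                        public static void main(String[] args) throws Exception {
                            List<String> wideRow = new ArrayList<String>();
                            for (int i = 0; i < 500000; i++) wideRow.add("entry-" + i);

                            // steps 1-2: A starts, clears, and loads the cache
                            DefaultCacheManager managerA = new DefaultCacheManager("infinispan-repl.xml");
                            Cache<Object, Object> cacheA = managerA.getCache("GenericCache");
                            cacheA.clear();
                            cacheA.put(5, wideRow);                       // the big row
                            cacheA.put(6, "six"); cacheA.put(1, "one"); cacheA.put(2, "two");

                            // step 3: B starts passively; stateTransfer should bring all four keys
                            DefaultCacheManager managerB = new DefaultCacheManager("infinispan-repl.xml");
                            Cache<Object, Object> cacheB = managerB.getCache("GenericCache");

                            // steps 4-6: A is restarted while B stays up, clears, and repeats the puts
                            managerA.stop();
                            managerA = new DefaultCacheManager("infinispan-repl.xml");
                            cacheA = managerA.getCache("GenericCache");
                            cacheA.clear();                               // step 5: clears B as well
                            cacheA.put(5, wideRow);                       // step 6: the big row again
                            cacheA.put(6, "six"); cacheA.put(2, "two"); cacheA.put(1, "one");

                            // step 7 happens on A's side (ReplicationQueue flush); give it time
                            Thread.sleep(5000);

                            // step 8: in the real app, B ends up with 6, 2 and 1 but never key=5
                            System.out.println("key=5 present on B: " + (cacheB.get(5) != null));
                            managerA.stop();
                            managerB.stop();
                        }
                    }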

                    • 7. Problem with a PUT on a large row in 4.2.0.FINAL - works with stateTransfer
                      mircea.markus

                      Thanks for the detailed description.

                      Re: 8.  Just wondering whether the clear (step 5) has finished by the time key 6 is migrated across.  Or is there no reference to key 6 being migrated at all?  Can you please check that?

                      • 8. Problem with a PUT on a large row in 4.2.0.FINAL - works with stateTransfer
                        cbo_

                        That very same thing has been the focus of my attention.  It is definitely the case that key=5 (the wide row) is never received on B.  It does not appear to be an out-of-order issue between the clear and the receiving of key=5.

                         

                        In fact, notice that steps 1 through 8 above do not restart application B (the only restart is of A).  Reviewing the log for the passive application B, you will only see key=5 (the wide row) received once (from the stateTransfer done in step 3).  Searching on the other keys (6, 2, 1), you will find them in the log twice (once for the stateTransfer in step 3, and once in step 8 after the clear is received).  So key=5 never appears to be received on B.

                        • 9. Problem with a PUT on a large row in 4.2.0.FINAL - works with stateTransfer
                          cbo_

                          Perhaps we are not exposing the details we need.  Here are the logging levels we are using:

                           

                             <category name="org.jboss.system">
                                <priority value="TRACE"/>
                             </category>
                             -->

                             <category name="org.infinispan">
                                <priority value="TRACE"/>
                             </category>
                             <category name="org.jboss.serial">
                                <priority value="INFO"/>
                             </category>
                             <category name="org.jboss.tm">
                                <priority value="WARN"/>
                             </category>

                             <!-- Entries below were set to TRACE during debugging...set back to INFO for normal processing, EXCEPT where noted -->
                             <category name="org.infinispan.statetransfer">
                                <priority value="TRACE"/>
                             </category>
                             <category name="org.infinispan.remoting">
                                <priority value="TRACE"/>
                             </category>
                             <category name="org.jgroups.protocols.MPING">
                                <priority value="TRACE"/>
                             </category>
                             <category name="org.jgroups.protocols.pbcast.GMS">
                                <priority value="TRACE"/>
                             </category>
                             <category name="org.jgroups.protocols.FD_SOCK">
                                <priority value="TRACE"/>
                             </category>

                             <!-- NOTE: only used INFO while debugging on org.jgroups.protocols.pbcast.FLUSH since it puts out lots of data -->
                             <category name="org.jgroups.protocols.pbcast.FLUSH">
                                <priority value="INFO"/>
                             </category>

                             <category name="org.jgroups.protocols.MERGE2">
                                <priority value="TRACE"/>
                             </category>
                             <category name="org.jgroups.protocols.FD">
                                <priority value="TRACE"/>
                             </category>
                             <category name="org.jgroups.protocols.VERIFY_SUSPECT">
                                <priority value="TRACE"/>
                             </category>
                             <category name="org.jgroups">
                                <priority value="TRACE"/>
                             </category>

                          • 10. Problem with a PUT on a large row in 4.2.0.FINAL - works with stateTransfer
                            mircea.markus

                            Trace on org.infinispan should show what happens with key5.  Can you update ReplicationQueueImpl (e.g. in the 4.2.1 branch) with detailed logging about each element that is taken from the queue and then sent over the wire?  Having the jgroups logs enabled would allow us to see if the packet fails to be sent at some point.

                            • 11. Problem with a PUT on a large row in 4.2.0.FINAL - works with stateTransfer
                              cbo_

                              Not sure I understand.  I have org.jgroups set to TRACE already (described above).  Is there a different setting you are suggesting?  And, does the ReplicationQueueImpl in 4.2.1 output detailed info or are you asking that I modify it to do that?

                              • 12. Problem with a PUT on a large row in 4.2.0.FINAL - works with stateTransfer
                                mircea.markus

                                "Is there a different setting you are suggesting?  And, does the ReplicationQueueImpl in 4.2.1 output detailed info or are you asking that I modify it to do that?"

                                I was suggesting that you modify the source code to add more debug logging, just to see where the entry gets lost.

                                • 13. Problem with a PUT on a large row in 4.2.0.FINAL - works with stateTransfer
                                  cbo_

                                  I already have org.jgroups set to TRACE.  Is there some other package or class I can set to TRACE to expose the details we need to diagnose this?  I am thinking it is never received by the passive side (B).  Is there something we can set to TRACE to prove that?  What class/method is the earliest point of entry on a receiving side?

                                  • 14. Problem with a PUT on a large row in 4.2.0.FINAL - works with stateTransfer
                                    mircea.markus

                                    ReplicationQueueImpl.flush() takes the elements out of the queue.  I'd add a log statement there to see what elements are taken out of the queue and how they are sent over: this way we can identify when key5 is taken out and what the async thread does with it.
