
    exceptions during parallel updates

    yelin66

      I was testing Infinispan with parallel updates from multiple cache instances, and I am getting a couple of exceptions intermittently.

       

      1. The following replication timeout exception occurred about 10 seconds after the update was kicked off, even though my replTimeout was set much higher than that. Also, when an update succeeded, it took only around 60 milliseconds on average. Could you please suggest what may potentially cause the timeout? (A rough sketch of the relevant cache settings follows the log below.)

      13:20:53,888 | INFO  | Thread-14        | ReadUpdateTestDriver             | t.drivers.ReadUpdateTestDriver$5  341 | updating 2500 entries
      13:21:03,230 | ERROR | Thread-14        | InvocationContextInterceptor     | ors.InvocationContextInterceptor   55 | Execution error:
      org.infinispan.util.concurrent.TimeoutException: Replication timeout for SSELabCl07-53048
          at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:421)
          at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:100)
          at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:124)
          at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:229)
          at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:216)
          at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:212)
          at org.infinispan.interceptors.DistributionInterceptor.handleWriteCommand(DistributionInterceptor.java:304)
          at org.infinispan.interceptors.DistributionInterceptor.visitPutMapCommand(DistributionInterceptor.java:168)
          at org.infinispan.commands.write.PutMapCommand.acceptVisitor(PutMapCommand.java:64)
          at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:118)
          at org.infinispan.interceptors.LockingInterceptor.visitPutMapCommand(LockingInterceptor.java:210)
          at org.infinispan.commands.write.PutMapCommand.acceptVisitor(PutMapCommand.java:64)
          at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:118)
          at org.infinispan.interceptors.base.CommandInterceptor.handleDefault(CommandInterceptor.java:132)
          at org.infinispan.commands.AbstractVisitor.visitPutMapCommand(AbstractVisitor.java:73)
          at org.infinispan.commands.write.PutMapCommand.acceptVisitor(PutMapCommand.java:64)
          at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:118)
          at org.infinispan.interceptors.TxInterceptor.enlistWriteAndInvokeNext(TxInterceptor.java:183)
          at org.infinispan.interceptors.TxInterceptor.visitPutMapCommand(TxInterceptor.java:152)
          at org.infinispan.interceptors.DistTxInterceptor.visitPutMapCommand(DistTxInterceptor.java:104)
          at org.infinispan.commands.write.PutMapCommand.acceptVisitor(PutMapCommand.java:64)
          at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:118)
          at org.infinispan.interceptors.InvocationContextInterceptor.handleAll(InvocationContextInterceptor.java:48)
          at org.infinispan.interceptors.InvocationContextInterceptor.handleDefault(InvocationContextInterceptor.java:34)
          at org.infinispan.commands.AbstractVisitor.visitPutMapCommand(AbstractVisitor.java:73)
          at org.infinispan.commands.write.PutMapCommand.acceptVisitor(PutMapCommand.java:64)
          at org.infinispan.interceptors.InterceptorChain.invoke(InterceptorChain.java:269)
          at org.infinispan.CacheDelegate.putAll(CacheDelegate.java:447)
          at org.infinispan.CacheDelegate.putAll(CacheDelegate.java:216)
          at ge.energy.smartgrid.applications.omtest.controllers.InfinispanCacheController.putAll(InfinispanCacheController.java:180)
          at ge.energy.smartgrid.applications.omtest.drivers.ReadUpdateTestDriver$5.run(ReadUpdateTestDriver.java:370)
      13:21:03,232 | ERROR | Thread-14        | jgroups                          | org.jgroups.util.Util$1            76 | uncaught exception in Thread[Thread-14,5,Thread Pools] (thread group=org.jgroups.util.Util$1[name=JGroups,maxpri=10] )
      org.infinispan.util.concurrent.TimeoutException: Replication timeout for SSELabCl07-53048
          at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:421)
          at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:100)
          at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:124)
          at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:229)
          at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:216)
          at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:212)
          at org.infinispan.interceptors.DistributionInterceptor.handleWriteCommand(DistributionInterceptor.java:304)
          at org.infinispan.interceptors.DistributionInterceptor.visitPutMapCommand(DistributionInterceptor.java:168)
          at org.infinispan.commands.write.PutMapCommand.acceptVisitor(PutMapCommand.java:64)
          at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:118)
          at org.infinispan.interceptors.LockingInterceptor.visitPutMapCommand(LockingInterceptor.java:210)
          at org.infinispan.commands.write.PutMapCommand.acceptVisitor(PutMapCommand.java:64)
          at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:118)
          at org.infinispan.interceptors.base.CommandInterceptor.handleDefault(CommandInterceptor.java:132)
          at org.infinispan.commands.AbstractVisitor.visitPutMapCommand(AbstractVisitor.java:73)
          at org.infinispan.commands.write.PutMapCommand.acceptVisitor(PutMapCommand.java:64)
          at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:118)
          at org.infinispan.interceptors.TxInterceptor.enlistWriteAndInvokeNext(TxInterceptor.java:183)
          at org.infinispan.interceptors.TxInterceptor.visitPutMapCommand(TxInterceptor.java:152)
          at org.infinispan.interceptors.DistTxInterceptor.visitPutMapCommand(DistTxInterceptor.java:104)
          at org.infinispan.commands.write.PutMapCommand.acceptVisitor(PutMapCommand.java:64)
          at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:118)
          at org.infinispan.interceptors.InvocationContextInterceptor.handleAll(InvocationContextInterceptor.java:48)
          at org.infinispan.interceptors.InvocationContextInterceptor.handleDefault(InvocationContextInterceptor.java:34)
          at org.infinispan.commands.AbstractVisitor.visitPutMapCommand(AbstractVisitor.java:73)
          at org.infinispan.commands.write.PutMapCommand.acceptVisitor(PutMapCommand.java:64)
          at org.infinispan.interceptors.InterceptorChain.invoke(InterceptorChain.java:269)
          at org.infinispan.CacheDelegate.putAll(CacheDelegate.java:447)
          at org.infinispan.CacheDelegate.putAll(CacheDelegate.java:216)
          at ge.energy.smartgrid.applications.omtest.controllers.InfinispanCacheController.putAll(InfinispanCacheController.java:180)
          at ge.energy.smartgrid.applications.omtest.drivers.ReadUpdateTestDriver$5.run(ReadUpdateTestDriver.java:370)
      13:21:03,492 | WARN  | bCluster01-36455 | NAKACK                           | .jgroups.protocols.pbcast.NAKACK  965 | (requester=SSELabClusterMaster-58229, local_addr=LabCluster01-36455) message SSELabCl04-17481::7670 not found in retransmission table of SSELabCl04-17481:
      [7654 : 7671 (7671)]
      13:21:03,492 | WARN  | bCluster01-36455 | NAKACK                           | .jgroups.protocols.pbcast.NAKACK  965 | (requester=SSELabCl07-53048, local_addr=LabCluster01-36455) message SSELabCl04-17481::7670 not found in retransmission table of SSELabCl04-17481:
      [7654 : 7671 (7671)]
      13:21:03,493 | WARN  | bCluster01-36455 | NAKACK                           | .jgroups.protocols.pbcast.NAKACK  965 | (requester=SSELabCl05-11846, local_addr=LabCluster01-36455) message SSELabCl04-17481::7670 not found in retransmission table of SSELabCl04-17481:
      [7654 : 7671 (7671)]
      13:21:03,492 | WARN  | bCluster01-36455 | NAKACK                           | .jgroups.protocols.pbcast.NAKACK  965 | (requester=SSELabCl08-8505, local_addr=LabCluster01-36455) message SSELabCl04-17481::7670 not found in retransmission table of SSELabCl04-17481:
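
      For context, the cache runs in distributed, synchronous mode. A stripped-down sketch of the kind of declarative configuration involved is below - illustrative only, not my exact file; the cluster name and the replTimeout/numOwners values are placeholders:

      <infinispan>
          <global>
              <transport clusterName="testCluster">
                  <properties>
                      <!-- default JGroups UDP stack shipped with jgroups-2.9.0.GA -->
                      <property name="configurationFile" value="udp.xml"/>
                  </properties>
              </transport>
          </global>
          <default>
              <clustering mode="distribution">
                  <!-- how long a synchronous putAll waits for remote acknowledgements -->
                  <sync replTimeout="60000"/>
                  <!-- total number of copies of each key across the cluster -->
                  <hash numOwners="2"/>
              </clustering>
          </default>
      </infinispan>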

       

      2. I got the following SuspectException, but I'm not sure what it means. Could you please advise?

      11:35:20,063 | WARN  | SSELabCl02-54447 | GMS                              | org.jgroups.protocols.pbcast.GMS  562 | SSELabCl02-54447: not member of view [SSELabClusterMaster-6376|10] [SSELabClusterMaster-6376, LabCluster01-42248, SSELabCl03-25876, SSELabCl04-52466, SSELabCl05-1712, SSELabCl06-47092, SSELabCl08-59019, SSELabCl09-22842]; discarding it
      11:35:22,181 | WARN  | SSELabCl02-54447 | GMS                              | org.jgroups.protocols.pbcast.GMS  562 | SSELabCl02-54447: not member of view [SSELabClusterMaster-6376|11] [SSELabClusterMaster-6376]; discarding it
      11:35:24,090 | ERROR | Thread-8         | InvocationContextInterceptor     | ors.InvocationContextInterceptor   55 | Execution error:
      org.infinispan.remoting.transport.jgroups.SuspectException: Suspected member: SSELabCl08-59019
          at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:418)
          at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:100)
          at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:124)
          at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:229)
          at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:216)
          at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:212)
          at org.infinispan.interceptors.DistributionInterceptor.handleWriteCommand(DistributionInterceptor.java:304)
          at org.infinispan.interceptors.DistributionInterceptor.visitPutMapCommand(DistributionInterceptor.java:168)
          at org.infinispan.commands.write.PutMapCommand.acceptVisitor(PutMapCommand.java:64)
          at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:118)
          at org.infinispan.interceptors.LockingInterceptor.visitPutMapCommand(LockingInterceptor.java:210)
          at org.infinispan.commands.write.PutMapCommand.acceptVisitor(PutMapCommand.java:64)
          at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:118)
          at org.infinispan.interceptors.base.CommandInterceptor.handleDefault(CommandInterceptor.java:132)
          at org.infinispan.commands.AbstractVisitor.visitPutMapCommand(AbstractVisitor.java:73)
          at org.infinispan.commands.write.PutMapCommand.acceptVisitor(PutMapCommand.java:64)
          at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:118)
          at org.infinispan.interceptors.TxInterceptor.enlistWriteAndInvokeNext(TxInterceptor.java:183)
          at org.infinispan.interceptors.TxInterceptor.visitPutMapCommand(TxInterceptor.java:152)
          at org.infinispan.interceptors.DistTxInterceptor.visitPutMapCommand(DistTxInterceptor.java:104)
          at org.infinispan.commands.write.PutMapCommand.acceptVisitor(PutMapCommand.java:64)
          at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:118)
          at org.infinispan.interceptors.InvocationContextInterceptor.handleAll(InvocationContextInterceptor.java:48)
          at org.infinispan.interceptors.InvocationContextInterceptor.handleDefault(InvocationContextInterceptor.java:34)
          at org.infinispan.commands.AbstractVisitor.visitPutMapCommand(AbstractVisitor.java:73)
          at org.infinispan.commands.write.PutMapCommand.acceptVisitor(PutMapCommand.java:64)
          at org.infinispan.interceptors.InterceptorChain.invoke(InterceptorChain.java:269)
          at org.infinispan.CacheDelegate.putAll(CacheDelegate.java:447)
          at org.infinispan.CacheDelegate.putAll(CacheDelegate.java:216)
          at ge.energy.smartgrid.applications.omtest.controllers.InfinispanCacheController.putAll(InfinispanCacheController.java:180)
          at ge.energy.smartgrid.applications.omtest.drivers.ReadUpdateTestDriver$2.run(ReadUpdateTestDriver.java:182)
      11:35:24,092 | ERROR | Thread-8         | jgroups                          | org.jgroups.util.Util$1            76 | uncaught exception in Thread[Thread-8,5,Thread Pools] (thread group=org.jgroups.util.Util$1[name=JGroups,maxpri=10] )
      org.infinispan.remoting.transport.jgroups.SuspectException: Suspected member: SSELabCl08-59019
          at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:418)
          at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:100)
          at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:124)
          at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:229)
          at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:216)
          at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:212)
          at org.infinispan.interceptors.DistributionInterceptor.handleWriteCommand(DistributionInterceptor.java:304)
          at org.infinispan.interceptors.DistributionInterceptor.visitPutMapCommand(DistributionInterceptor.java:168)
          at org.infinispan.commands.write.PutMapCommand.acceptVisitor(PutMapCommand.java:64)
          at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:118)
          at org.infinispan.interceptors.LockingInterceptor.visitPutMapCommand(LockingInterceptor.java:210)
          at org.infinispan.commands.write.PutMapCommand.acceptVisitor(PutMapCommand.java:64)
          at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:118)
          at org.infinispan.interceptors.base.CommandInterceptor.handleDefault(CommandInterceptor.java:132)
          at org.infinispan.commands.AbstractVisitor.visitPutMapCommand(AbstractVisitor.java:73)
          at org.infinispan.commands.write.PutMapCommand.acceptVisitor(PutMapCommand.java:64)
          at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:118)
          at org.infinispan.interceptors.TxInterceptor.enlistWriteAndInvokeNext(TxInterceptor.java:183)
          at org.infinispan.interceptors.TxInterceptor.visitPutMapCommand(TxInterceptor.java:152)
          at org.infinispan.interceptors.DistTxInterceptor.visitPutMapCommand(DistTxInterceptor.java:104)
          at org.infinispan.commands.write.PutMapCommand.acceptVisitor(PutMapCommand.java:64)
          at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:118)
          at org.infinispan.interceptors.InvocationContextInterceptor.handleAll(InvocationContextInterceptor.java:48)
          at org.infinispan.interceptors.InvocationContextInterceptor.handleDefault(InvocationContextInterceptor.java:34)
          at org.infinispan.commands.AbstractVisitor.visitPutMapCommand(AbstractVisitor.java:73)
          at org.infinispan.commands.write.PutMapCommand.acceptVisitor(PutMapCommand.java:64)
          at org.infinispan.interceptors.InterceptorChain.invoke(InterceptorChain.java:269)
          at org.infinispan.CacheDelegate.putAll(CacheDelegate.java:447)
          at org.infinispan.CacheDelegate.putAll(CacheDelegate.java:216)
          at ge.energy.smartgrid.applications.omtest.controllers.InfinispanCacheController.putAll(InfinispanCacheController.java:180)
          at ge.energy.smartgrid.applications.omtest.drivers.ReadUpdateTestDriver$2.run(ReadUpdateTestDriver.java:182)

        • 1. Re: exceptions during parallel updates
          manik

          This bit is probably key:

          Execution error:

          org.infinispan.remoting.transport.jgroups.SuspectException: Suspected member: SSELabCl08-59019
              at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:418)
              at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:100)
              at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:124)
              at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:229)
              at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:216)
              at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:212)

          Does your test involve nodes failing or processes being killed midway?  If not, what does your JGroups config look like (or are you using one of the default JGroups stacks we ship)?  Just wondering whether the JGroups failure detection protocol is being too picky, reporting node failures when it should not.

           

          Also, do you have logs on the receiver end from around the time this happens on the sender?  You should be able to find the relevant thread based on the key being updated and the sender's name....

          • 2. Re: exceptions during parallel updates
            yelin66

            Manik, thanks for the suggestions. I roughly know what's causing the SuspectException and can avoid it. However, I am concerned about what happened before the exception. Basically, on the node whose log I sent before, just above the exception logging, there is something like:

             

            11:35:20,063 | WARN  | SSELabCl02-54447 | GMS                              | org.jgroups.protocols.pbcast.GMS  562 | SSELabCl02-54447: not member of view [SSELabClusterMaster-6376|10] [SSELabClusterMaster-6376, LabCluster01-42248, SSELabCl03-25876, SSELabCl04-52466, SSELabCl05-1712, SSELabCl06-47092, SSELabCl08-59019, SSELabCl09-22842]; discarding it

             

            SSELabCl02-54447 is that node itself, yet it is not listed in the view. Around the same time, on another node, there is a LEAVE event indicating that the node above has just left. In my code, the other node detected the node drop and exited the test; that's why the SuspectException was thrown. I can change my code so the other node doesn't exit, which avoids the SuspectException. However, the warning about the local node not being in the cluster view and the LEAVE event on the remote node are my concern. What may cause that? What happens if data is being updated in that situation? Could it end up with inconsistent data on different nodes?

             

            I am using the default udp.xml that ships with jgroups-2.9.0.GA.jar. I would appreciate your reply.

            • 3. Re: exceptions during parallel updates
              manik

              I've pinged the JGroups core devs for some more help on this; they ought to respond soon regarding the node not being listed in the view.

               

              When you say "the other node detected the node drop", have you registered a listener and are you acting on a view change event?  Also, how does the node exit the test - is it a clean shutdown?  Clean shutdowns should not cause SuspectExceptions...

               

              If nodes are being updated during such network inconsistencies, you should not see data loss, since (a) there is only a small probability that the affected node is where your data is mapped, and (b) your data is also mapped to other, healthy nodes as a backup.  We just need to make sure such a network inconsistency doesn't prevent the backup RPC from taking place as well.
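
              For reference, the number of copies is controlled by the numOwners setting on a distributed cache - a sketch with a placeholder value, where numOwners="2" means one primary owner plus one backup per key:

              <clustering mode="distribution">
                  <!-- numOwners = total copies of each key: 1 primary + (numOwners - 1) backups -->
                  <hash numOwners="2"/>
              </clustering>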

              • 4. Re: exceptions during parallel updates
                belaban

                Here's what happens: let's say we have members A, B, C and D

                • The current view is {A,B,C,D}
                • Now C is slow responding to heartbeats, is suspected (by B), and is excluded from the cluster (by A)
                • A, B and D now install the new view {A,B,D}
                • C does get this view (as we use multicasting), but rejects it, as it is not in this view
                • So C's view is still {A,B,C,D}
                • The effect of this is that A, B and D will not accept any messages from C (as it is not in their view); however, C will accept messages from {A,B,D}
                • After some time, MERGE2 will merge {A,B,D} and {A,B,C,D} back into a single view.

                 

                This is what's called an overlapping merge. When this happens, members talk to each other before the merge, and agree on a common set of messages that have been seen by everyone. Then all messages not in this set will be re-sent, so that - at the end of the merge - everyone's on the same page.

                 

                To reduce spurious suspicions and exclusions, I recommend using FD_SOCK plus FD_ALL (or FD), the latter with a high timeout value. See details at http://community.jboss.org/docs/DOC-9739
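
                In udp.xml that could look something like the following - an illustrative fragment only, with placeholder timeout values you would tune for your environment:

                <FD_SOCK/>
                <!-- heartbeat-based detection; suspect a member only after a long silence -->
                <FD_ALL timeout="60000" interval="10000"/>
                <!-- or keep FD instead of FD_ALL, with a high timeout: -->
                <!-- <FD timeout="10000" max_tries="5"/> -->
                <VERIFY_SUSPECT timeout="1500"/>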

                • 5. Re: exceptions during parallel updates
                  yelin66

                  Bela, thanks a lot for the detailed explanation. I'll try what you recommended. Also, I'd like to confirm one thing with you - if this situation does happen, then after the MERGE process you explained there won't be any inconsistent data, is that correct?

                   

                  And one more thing, could you please take a look at the Replication timeout exception described in the first message of this thread, and suggest how to avoid it? I appreciate your response.

                  • 6. Re: exceptions during parallel updates
                    belaban

                    Yes, after a merge, there should not be any inconsistent data. Having said that, the application using JGroups must of course handle the merge as well, e.g. merge data back together in a meaningful way.

                     

                    The timeout exception was thrown by Infinispan, I guess because there wasn't an answer for more than N seconds. If you look at my explanation above, there are situations where a member doesn't respond: for example, when it receives a message from someone *not* in its view, it will simply drop that message. That means the caller won't get any response and will time out.

                    • 7. Re: exceptions during parallel updates
                      yelin66

                      Bela, thanks for the further explanation. It makes a lot of sense. I updated my JGroups configuration per your suggestions, and it seems to be working.

                       

                      One last question on the merge, in case it's needed: in my case, the application using JGroups is Infinispan. Infinispan would handle the merge in a meaningful way, so it should be transparent to the users of Infinispan - is that correct?

                      • 8. Re: exceptions during parallel updates
                        belaban

                        That's a question for the Infinispan folks. This also depends on whether you use replication or distribution...

                        • 9. Re: exceptions during parallel updates
                          manik

                          Lin - we don't handle split brains (the MERGE case) at the moment, because it is hard to know, without domain knowledge, how state from two parallel streams of updates should be combined.  We have an open JIRA for this with some suggestions on how it could be done (ISPN-263); if it is critical to you, vote for it and we can see whether we can include it in an earlier release.

                          • 10. Re: exceptions during parallel updates
                            yelin66

                            Fair enough. Thanks a lot for the responses, guys!