4 Replies Latest reply on Oct 21, 2008 4:50 AM by manik

    Synchronous replication result with suspect member

    afelle1

      QUESTION: When putting an attribute / node into the cache, if a replication exception occurs because of a timeout or because a remote machine is suspected, is the attribute / node still put into the cache on the other machines that did respond?

      We have our cache configured for synchronous replication with a cluster of 5 initial members. Occasionally we get a stack trace in the error log for a replication error in which one of the members timed out or was suspected, but I am unsure what state the cache is in afterwards. These errors occur when we are putting a new attribute into the cache, so I want to know whether the attribute was put into the cache on all machines except the suspected / timed-out machine, or whether it wasn't put into the cache on any of the machines.

      JGroups configuration (sans TCP and TCPPING)

      <MERGE2 min_interval="5000" max_interval="10000" />
      <FD_SOCK />
      <FD shun="true" timeout="2500" max_tries="5" />
      <VERIFY_SUSPECT timeout="1500" />
      <UNICAST retransmit_timeout="300,600,1200,2400,4800,9600" />
      <pbcast.STABLE desired_avg_gossip="20000" stability_delay="1500" />
      <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="true" />
      <pbcast.STATE_TRANSFER />


      Stack Trace
      2008-09-29 00:02:21,740 ERROR [org.jasig.cas.ticket.registry.JBossCacheTicketRegistry] - org.jboss.cache.ReplicationException: rsp=sender=XXX.XXX.XXX.XXX:XXXX, retval=null, received=false, suspected=false
      org.jboss.cache.ReplicationException: rsp=sender=XXX.XXX.XXX.XXX:XXXX, retval=null, received=false, suspected=false
      at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4338)
      at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4260)
      at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4372)
      at org.jboss.cache.interceptors.BaseRpcInterceptor.replicateCall(BaseRpcInterceptor.java:110)
      at org.jboss.cache.interceptors.BaseRpcInterceptor.replicateCall(BaseRpcInterceptor.java:88)
      at org.jboss.cache.interceptors.ReplicationInterceptor.handleReplicatedMethod(ReplicationInterceptor.java:124)
      at org.jboss.cache.interceptors.ReplicationInterceptor.invoke(ReplicationInterceptor.java:88)
      at org.jboss.cache.interceptors.Interceptor.invoke(Interceptor.java:68)
      at org.jboss.cache.interceptors.TxInterceptor.handleNonTxMethod(TxInterceptor.java:364)
      at org.jboss.cache.interceptors.TxInterceptor.invoke(TxInterceptor.java:160)
      at org.jboss.cache.interceptors.Interceptor.invoke(Interceptor.java:68)
      at org.jboss.cache.interceptors.CacheMgmtInterceptor.invoke(CacheMgmtInterceptor.java:157)
      at org.jboss.cache.TreeCache.invokeMethod(TreeCache.java:5710)
      at org.jboss.cache.TreeCache.put(TreeCache.java:3782)
      at org.jboss.cache.TreeCache.put(TreeCache.java:3720)


      Thank you for any assistance!
      Andrew

        • 1. Re: Synchronous replication result with suspect member
          afelle1

          Stack trace on machine A:

          2008-09-29 00:02:21,740 ERROR [org.jasig.cas.ticket.registry.JBossCacheTicketRegistry] - org.jboss.cache.ReplicationException: rsp=sender=XXX.XXX.XXX.XXX:XXXX, retval=null, received=false, suspected=false
          org.jboss.cache.ReplicationException: rsp=sender=XXX.XXX.XXX.XXX:XXXX, retval=null, received=false, suspected=false
          at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4338)
          at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4260)
          at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4372)
          at org.jboss.cache.interceptors.BaseRpcInterceptor.replicateCall(BaseRpcInterceptor.java:110)
          at org.jboss.cache.interceptors.BaseRpcInterceptor.replicateCall(BaseRpcInterceptor.java:88)
          at org.jboss.cache.interceptors.ReplicationInterceptor.handleReplicatedMethod(ReplicationInterceptor.java:124)
          at org.jboss.cache.interceptors.ReplicationInterceptor.invoke(ReplicationInterceptor.java:88)
          at org.jboss.cache.interceptors.Interceptor.invoke(Interceptor.java:68)
          at org.jboss.cache.interceptors.TxInterceptor.handleNonTxMethod(TxInterceptor.java:364)
          at org.jboss.cache.interceptors.TxInterceptor.invoke(TxInterceptor.java:160)
          at org.jboss.cache.interceptors.Interceptor.invoke(Interceptor.java:68)
          at org.jboss.cache.interceptors.CacheMgmtInterceptor.invoke(CacheMgmtInterceptor.java:157)
          at org.jboss.cache.TreeCache.invokeMethod(TreeCache.java:5710)
          at org.jboss.cache.TreeCache.put(TreeCache.java:3782)
          at org.jboss.cache.TreeCache.put(TreeCache.java:3720)
          ... X more
          Caused by: org.jboss.cache.lock.TimeoutException: Response timed out: sender=XXX.XXX.XXX.XXX:XXXX, retval=null, received=false, suspected=false
          at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4336)
          ... 72 more


          Stack trace on machine B:
          2008-09-29 08:14:34,471 INFO [org.jasig.cas.authentication.AuthenticationManagerImpl] - AuthenticationHandler: org.jasig.cas.adaptors.ldap.BindLdapAuthenticationHandler successfully authenticated the user which provided the following credentials: kattest
          2008-09-29 08:14:36,869 ERROR [org.jasig.cas.ticket.registry.JBossCacheTicketRegistry] - org.jboss.cache.ReplicationException: rsp=sender=YYY.YYY.YYY.YYY:YYYY, retval=null, received=false, suspected=true
          org.jboss.cache.ReplicationException: rsp=sender=YYY.YYY.YYY.YYY:YYYY, retval=null, received=false, suspected=true
          at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4338)
          at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4260)
          at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4372)
          at org.jboss.cache.interceptors.BaseRpcInterceptor.replicateCall(BaseRpcInterceptor.java:110)
          at org.jboss.cache.interceptors.BaseRpcInterceptor.replicateCall(BaseRpcInterceptor.java:88)
          at org.jboss.cache.interceptors.ReplicationInterceptor.handleReplicatedMethod(ReplicationInterceptor.java:124)
          at org.jboss.cache.interceptors.ReplicationInterceptor.invoke(ReplicationInterceptor.java:88)
          at org.jboss.cache.interceptors.Interceptor.invoke(Interceptor.java:68)
          at org.jboss.cache.interceptors.TxInterceptor.handleNonTxMethod(TxInterceptor.java:364)
          at org.jboss.cache.interceptors.TxInterceptor.invoke(TxInterceptor.java:160)
          at org.jboss.cache.interceptors.Interceptor.invoke(Interceptor.java:68)
          at org.jboss.cache.interceptors.CacheMgmtInterceptor.invoke(CacheMgmtInterceptor.java:157)
          at org.jboss.cache.TreeCache.invokeMethod(TreeCache.java:5710)
          at org.jboss.cache.TreeCache.put(TreeCache.java:3782)
          at org.jboss.cache.TreeCache.put(TreeCache.java:3720)
          ... N more
          Caused by: org.jboss.cache.SuspectException: Response suspected: sender=YYY.YYY.YYY.YYY:YYYY, retval=null, received=false, suspected=true
          at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4332)
          ... 72 more


          • 2. Re: Synchronous replication result with suspect member
            afelle1

            Oh, and I'm using JBoss Cache 1.4.1 GA and JGroups 2.4.1.

            • 3. Re: Synchronous replication result with suspect member
              afelle1

              Bumping thread as I would really appreciate feedback

              • 4. Re: Synchronous replication result with suspect member
                manik

                 

                "afelle1" wrote:
                QUESTION: When putting an attribute / node into the cache, if a replication exception occurs because of a timeout or because a remote machine is suspected, is the attribute / node still put into the cache on the other machines that did respond?

                We have our cache configured for synchronous replication with a cluster of 5 initial members. Occasionally we get a stack trace in the error log for a replication error in which one of the members timed out or was suspected, but I am unsure what state the cache is in afterwards. These errors occur when we are putting a new attribute into the cache, so I want to know whether the attribute was put into the cache on all machines except the suspected / timed-out machine, or whether it wasn't put into the cache on any of the machines.


                This depends on your configuration. You said you are using synchronous replication, but are you running in a transaction? And if so, are you using a synchronous commit phase?

                Basically, if:

                1) If you are not running in a TX, then the state of the nodes *could* be out of sync. I.e., the put() could have completed on node 1 and node 2 while node 3 fails. Your app sees an exception, but the put() has already succeeded on node 1 and node 2 and there is no way to roll it back.

                2) If you are running in a TX and node 3 fails, the node propagating the change will broadcast a rollback to node 1 and node 2 - but only IF node 3 fails in the prepare phase of a 2-phase commit.

                3) If node 3 fails during the commit phase, the outcome is indeterminate, as dictated by JTA. I.e., nodes 1 and 2 have been asked to commit state, and while attempting to tell node 3 to commit we get a suspect exception. There is no way to tell nodes 1 and 2 to roll back after we have already told them to commit.
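
The three cases above can be illustrated with a small stand-alone simulation. This is only a sketch under stated assumptions: it does not use JBoss Cache or JGroups at all, and `Node`, `Phase`, `ReplicationFailure`, `putWithoutTx`, `putInTx` and `run` are hypothetical names invented for the illustration. It models three members, with node 3 failing at a chosen point, and reports which members end up holding the value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation; none of these types are JBoss Cache classes.
public class SyncReplicationSketch {

    enum Phase { PUT, PREPARE, COMMIT }

    // Stand-in for the exception the calling application would see.
    static class ReplicationFailure extends RuntimeException {
        ReplicationFailure(String message) { super(message); }
    }

    static class Node {
        final String name;
        final Phase failsAt; // null means the node always responds
        final Map<String, String> committed = new HashMap<>();
        final Map<String, String> pending = new HashMap<>();

        Node(String name, Phase failsAt) {
            this.name = name;
            this.failsAt = failsAt;
        }

        private void respond(Phase phase) {
            if (phase == failsAt) {
                throw new ReplicationFailure(name + " timed out or was suspected during " + phase);
            }
        }

        void put(String key, String value) { respond(Phase.PUT); committed.put(key, value); }
        void prepare(String key, String value) { respond(Phase.PREPARE); pending.put(key, value); }
        void commit() { respond(Phase.COMMIT); committed.putAll(pending); pending.clear(); }
        void rollback() { pending.clear(); }
    }

    // Case 1: no transaction. Every responding node applies the change at once.
    static void putWithoutTx(List<Node> nodes, String key, String value) {
        ReplicationFailure failure = null;
        for (Node node : nodes) {
            try { node.put(key, value); } catch (ReplicationFailure e) { failure = e; }
        }
        if (failure != null) throw failure; // nothing can be undone at this point
    }

    // Cases 2 and 3: a simplified two-phase commit.
    static void putInTx(List<Node> nodes, String key, String value) {
        try {
            for (Node node : nodes) node.prepare(key, value);
        } catch (ReplicationFailure e) {
            // Failure during prepare: nobody has committed yet, so roll everyone back.
            for (Node node : nodes) node.rollback();
            throw e;
        }
        ReplicationFailure failure = null;
        for (Node node : nodes) {
            // Failure during commit: the other nodes commit anyway and cannot be recalled.
            try { node.commit(); } catch (ReplicationFailure e) { failure = e; }
        }
        if (failure != null) throw failure;
    }

    // Runs one scenario with node3 failing at the given phase; returns who holds the value.
    static List<String> run(Phase node3FailsAt, boolean inTx) {
        List<Node> nodes = Arrays.asList(
                new Node("node1", null), new Node("node2", null), new Node("node3", node3FailsAt));
        try {
            if (inTx) putInTx(nodes, "ticket", "value"); else putWithoutTx(nodes, "ticket", "value");
        } catch (ReplicationFailure e) {
            System.out.println("caller sees: " + e.getMessage());
        }
        List<String> holders = new ArrayList<>();
        for (Node node : nodes) {
            if (node.committed.containsKey("ticket")) holders.add(node.name);
        }
        return holders;
    }

    public static void main(String[] args) {
        System.out.println("no TX:            " + run(Phase.PUT, false));
        System.out.println("TX, prepare fail: " + run(Phase.PREPARE, true));
        System.out.println("TX, commit fail:  " + run(Phase.COMMIT, true));
    }
}
```

In this model the caller sees an exception in all three scenarios, but only the prepare-phase failure leaves every member without the value. The other two leave node 1 and node 2 holding it while node 3 does not.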

                HTH,
                Manik