8 Replies Latest reply on Mar 3, 2014 9:58 AM by Vijay Bhaskar Chintalapati

    Cross Data Center Replication Issues with JDG 6.2

    Vijay Bhaskar Chintalapati Newbie

      I have been trying to set up a demo of Cross Data Center Replication using JDG 6.2, and so far I haven't been able to get it to work properly. I need help understanding what should be fixed so that the testing steps listed below complete successfully.

       

      SETUP (please refer to the attachments)

      • Single VM (Mac OS X) (with IP address configured to be 192.168.1.5 for testing purposes)
      • Two sites: site-1 and site-2 and a cluster (of 3 JDG 6.2 nodes), one for each site
      • One distributed cache, labCache, configured to be backed up, under its own cache-container: xsite
      • All the JDG nodes are launched with port offsets that differ by 100: node-1 is given a port offset of 100, node-2 200, and so on up to node-6 with 600
      • Each node is given its own copy of the JDG 6.2 standalone folder: node-1 runs in folder standalone1, and so on up to node-6 in standalone6
      • Nodes in the site-1 cluster use the multicast address 239.1.1.1, and nodes in the site-2 cluster use 239.2.2.2
      • Because the two clusters use different multicast addresses, each cluster discovers the other via MPING using address: 234.99.54.14 and port: 12000
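The port arithmetic in the setup list can be sketched in shell. This is only an illustration: the base port of 9999 is an assumption (the default native management port before any offset), although it is consistent with the 10099 used for node-1 in the CLI URL. The function name node_ports is hypothetical, and the node-to-site mapping assumes nodes 1 to 3 belong to site-1 and nodes 4 to 6 to site-2.

```shell
# Print site, port offset, and resulting CLI (remoting) port for each node.
# BASE_PORT=9999 is an assumption: the default management port before offset.
BASE_PORT=9999

node_ports() {
  for i in 1 2 3 4 5 6; do
    offset=$((i * 100))              # node-1 -> 100, ..., node-6 -> 600
    site=$(( (i - 1) / 3 + 1 ))      # three nodes per site
    echo "node-$i site-$site offset=$offset cli-port=$((BASE_PORT + offset))"
  done
}

node_ports
```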


      TESTING STEPS

      1. Bring both sites up and ensure clustering of nodes under each site
      2. Use the Infinispan CLI to push 100 entries into the cache hosted by node-1. Ensure the entries are distributed across the cluster and backed up to the other site

        ./ispn-cli.sh -c remoting://192.168.1.5:10099/xsite/labCache -f input-data-site-1.txt

      3. Use the Infinispan CLI again to log in to the same node, and this time just issue the clear command. Ensure that the entries are erased from all the caches
      4. Repeat step #2
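The contents of input-data-site-1.txt are not shown in this thread. As a rough sketch, assuming the file is simply a batch of Infinispan CLI put commands with keys 0 to 99 and the value "Some data" (the key and value that appear in the PutKeyValueCommand log entry), it could be generated like this. The function name gen_input is hypothetical.

```shell
# Generate a batch file of 100 CLI put commands (keys 0..99).
# The file layout is an assumption; the real input file is not shown.
gen_input() {
  i=0
  while [ "$i" -lt 100 ]; do
    printf 'put %s "Some data"\n' "$i"
    i=$((i + 1))
  done
}

gen_input > input-data-site-1.txt
```

The resulting file would then be passed to ispn-cli.sh with -f, as in step 2.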

       

      FAILURE

      The failure happened at step #4: repopulating the cache with the same data dragged on for several minutes, with node 1 eventually timing out in the CLI and the logs showing various warnings and exceptions, as listed below

       

      WARNINGS AND EXCEPTIONS

      The following warnings and exceptions appear repeatedly:

      standalone1.log:

      19:54:51,027 ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (remote-thread-18) ISPN000136: Execution error: org.infinispan.util.concurrent.TimeoutException: Replication timeout for jdg-2/site-1

      standalone4.log:

      19:54:02,268 ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (remote-thread-2) ISPN000136: Execution error: org.infinispan.util.concurrent.TimeoutException: Unable to acquire lock after [30 seconds] on key [0] for requestor [Thread[remote-thread-2,5,main]]! Lock held by [Thread[remote-thread-0,5,main]]

      standalone6.log:

      19:54:45,728 ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (OOB-7,shared=relay) ISPN000136: Execution error: org.infinispan.util.concurrent.TimeoutException: Replication timeout for jdg-4/site-2

      standalone6.log:

      19:54:45,765 WARN  [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (OOB-3,shared=relay) ISPN000071: Caught exception when handling command SingleRpcCommand{cacheName='labCache', command=ClearCommand{flags=[IGNORE_RETURN_VALUES, SKIP_XSITE_BACKUP]}}: org.infinispan.util.concurrent.TimeoutException: Replication timeout for jdg-4/site-2

      standalone6.log:

      19:54:02,339 WARN  [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (remote-thread-8) ISPN000071: Caught exception when handling command SingleRpcCommand{cacheName='labCache', command=ClearCommand{flags=[IGNORE_RETURN_VALUES, SKIP_XSITE_BACKUP]}}: org.infinispan.util.concurrent.TimeoutException: Unable to acquire lock after [30 seconds] on key [2] for requestor [Thread[remote-thread-8,5,main]]! Lock held by [Thread[remote-thread-3,5,main]]

      standalone5.log:

      19:54:53,532 WARN  [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (OOB-187,shared=tcp) ISPN000071: Caught exception when handling command SingleRpcCommand{cacheName='labCache', command=PutKeyValueCommand{key=0, value=Some data, flags=null, putIfAbsent=false, valueMatcher=MATCH_ALWAYS, metadata=EmbeddedMetadata{version=null}, successful=true}}: org.infinispan.util.concurrent.TimeoutException: Node jdg-4/site-2 timed out
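To see at a glance which message IDs dominate across the six logs, the entries above can be tallied with a small shell helper. This is a sketch only: the function name count_ispn is hypothetical, and the log paths in the usage comment are assumptions about where each node writes its log.

```shell
# Count occurrences of each ISPN message ID (e.g. ISPN000136) in the
# given log files, most frequent first.
count_ispn() {
  grep -ho 'ISPN[0-9]\{6\}' "$@" | sort | uniq -c | sort -rn
}

# Usage (paths are assumptions):
#   count_ispn standalone1/log/server.log standalone4/log/server.log
```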

        • 1. Re: Cross Data Center Replication Issues with JDG 6.2
          Radim Vansa Master

          How many times have you issued the clear command? It seems that there are many clear commands running in parallel, and since this command has to acquire locks for all entries in the cache, multiple clears would contend with each other.

           

          Can you also set up TRACE-level logging on org.infinispan and post the full logs? (You'd probably have to store the zipped logs at an external location such as Dropbox.)

          • 2. Re: Cross Data Center Replication Issues with JDG 6.2
            Vijay Bhaskar Chintalapati Newbie

            The clear was issued just once and only on node-1. I will upload the logs with TRACE enabled to a site and let you know shortly.

            • 3. Re: Re: Cross Data Center Replication Issues with JDG 6.2
              Vijay Bhaskar Chintalapati Newbie

              Attached is the ZIP file of all the necessary logs. In case it helps, here is the history of the commands and the times at which I executed them, to show how the last command gets stuck for a long time.

              PROMPT:bin>./ispn-cli.sh -c remoting://192.168.1.6:10099/xsite/labCache -f input-data-site-1.txt

              Mar 03, 2014 12:47:12 AM org.xnio.Xnio <clinit>

              INFO: XNIO Version 3.0.7.GA-redhat-1

              Mar 03, 2014 12:47:12 AM org.xnio.nio.NioXnio <clinit>

              INFO: XNIO NIO Implementation Version 3.0.7.GA-redhat-1

              Mar 03, 2014 12:47:12 AM org.jboss.remoting3.EndpointImpl <clinit>

              INFO: JBoss Remoting version 3.2.16.GA-redhat-1

              PROMPT:bin>./ispn-cli.sh -c remoting://192.168.1.6:10099/xsite/labCache

              Mar 03, 2014 12:47:36 AM org.xnio.Xnio <clinit>

              INFO: XNIO Version 3.0.7.GA-redhat-1

              Mar 03, 2014 12:47:36 AM org.xnio.nio.NioXnio <clinit>

              INFO: XNIO NIO Implementation Version 3.0.7.GA-redhat-1

              Mar 03, 2014 12:47:36 AM org.jboss.remoting3.EndpointImpl <clinit>

              INFO: JBoss Remoting version 3.2.16.GA-redhat-1

              [remoting://192.168.1.6:10099/xsite/labCache]> clear

              [remoting://192.168.1.6:10099/xsite/labCache]> quit

              PROMPT:bin>./ispn-cli.sh -c remoting://192.168.1.6:10099/xsite/labCache -f input-data-site-1.txt

              Mar 03, 2014 12:48:14 AM org.xnio.Xnio <clinit>

              INFO: XNIO Version 3.0.7.GA-redhat-1

              Mar 03, 2014 12:48:14 AM org.xnio.nio.NioXnio <clinit>

              INFO: XNIO NIO Implementation Version 3.0.7.GA-redhat-1

              Mar 03, 2014 12:48:14 AM org.jboss.remoting3.EndpointImpl <clinit>

              INFO: JBoss Remoting version 3.2.16.GA-redhat-1

              Node jdg-3/site-1 timed out

              Node jdg-3/site-1 timed out

              Node jdg-3/site-1 timed out

              ^Cjava.io.IOException: Connection Ended

              org.jboss.remoting3.NotOpenException: Writes closed

               

              ~~~~~~~~ Multiple repeats of "Writes closed" due to forced termination ~~~~~~~~~

               

              PROMPT:bin>date

              Mon Mar  3 00:51:02 PST 2014

              PROMPT:bin>

              • 4. Re: Re: Cross Data Center Replication Issues with JDG 6.2
                Radim Vansa Master

                Thanks, the logs show that this is really an issue. The ClearCommand is propagated to the backup site multiple times, which results in deadlock (when it is started on both node A and B in parallel, each one locks all local entries and then attempts to lock entries on the other node).

                 

                However, I still can't understand why some of the replications occur; in fact, I've seen the command being replicated from the backup to the main cache. Would you be so kind as to run it once more with TRACE level on org.jgroups as well (in addition to org.infinispan)? Just load the first 100 entries and then issue the clear command.

                • 5. Re: Cross Data Center Replication Issues with JDG 6.2
                  Vijay Bhaskar Chintalapati Newbie

                  I am on it! Meanwhile, could you please take a look at the Infinispan subsystems in site-1.xml and site-2.xml (attached to the original post) to check whether the backup configurations are all correct?

                   

                  For site-1, site-2 is configured as the backup, and for site-2, site-1 is configured as the backup. Could that be the reason why the commands are looping between the sites incessantly?

                  • 6. Re: Cross Data Center Replication Issues with JDG 6.2
                    Radim Vansa Master

                    Backing up sites to each other in this fashion is supported; it seemed rather that the message which was broadcast in site-2 somehow made it into site-1. I'll see more from the JGroups logs; for now I am just guessing from the timestamps.

                    • 7. Re: Re: Cross Data Center Replication Issues with JDG 6.2
                      Radim Vansa Master

                      So, it looks like a problem in the documentation: I had thought that RELAY2.relay_multicasts was false by default, but that's not the case.

                      Please add the configuration for this attribute to both configs, as below:

                       

                      <relay site="site-1">
                          <remote-site name="site-2" stack="tcp" cluster="global"/>
                          <property name="relay_multicasts">false</property>
                      </relay>
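
Since the property has to be present in both site configurations, a quick shell check can confirm that neither file was missed. This is a sketch under assumptions: check_relay is a hypothetical helper, it expects the property on a single line exactly as written above, and the file names site-1.xml and site-2.xml are taken from the attachments mentioned earlier in the thread (their location on disk is not stated).

```shell
# Report whether each given config file sets relay_multicasts to false.
# Matches the single-line form of the property shown above.
check_relay() {
  for f in "$@"; do
    if grep -q '<property name="relay_multicasts">false</property>' "$f"; then
      echo "$f: ok"
    else
      echo "$f: relay_multicasts=false missing"
    fi
  done
}

# Usage (paths are assumptions):
#   check_relay site-1.xml site-2.xml
```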
                      
                      • 8. Re: Cross Data Center Replication Issues with JDG 6.2
                        Vijay Bhaskar Chintalapati Newbie

                        That was it!! I tested the solution and it worked beautifully. I cannot tell you how relieved I am!! Many, many... many thanks!