
    Cross Data Center Replication Issues with JDG 6.2

    vbchin2

      I have been trying to set up a demo of Cross Data Center Replication using JDG 6.2 and so far haven't been able to get it to work properly. I need help understanding what must be fixed so that the testing steps listed below run correctly and successfully.

       

      SETUP (please refer to the attachments)

      • Single VM (Mac OS X), with its IP address configured as 192.168.1.5 for testing purposes
      • Two sites, site-1 and site-2, each with a cluster of 3 JDG 6.2 nodes
      • One distributed cache, labCache, configured to be backed up across sites under its own cache container: xsite
      • Each JDG node is launched with a port-offset 100 apart: node-1 has a port offset of 100, node-2 200, and so on, up to node-6 with 600 (see the launch sketch after this list)
      • Each node has its own copy of the JDG 6.2 standalone folder: node-1 runs in standalone1, and so on, with node-6 running in standalone6
      • Nodes in the site-1 cluster use a multicast address of 239.1.1.1 and nodes in site-2 use 239.2.2.2
      • With the two clusters on different multicast addresses, each cluster discovers the other via MPING, using address 234.99.54.14 and port 12000
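
      For illustration, node-1 is launched roughly as below; treat this as a sketch - the standalone1 base directory and site-1.xml are from my setup, while the standard jboss.* system property names are what I believe the stock standalone.sh honors:

        # hypothetical launch of node-1: port offset 100, site-1 multicast address,
        # base dir pointed at this node's private copy of the standalone folder
        ./bin/standalone.sh --server-config=site-1.xml \
            -Djboss.server.base.dir=$PWD/standalone1 \
            -Djboss.socket.binding.port-offset=100 \
            -Djboss.default.multicast.address=239.1.1.1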


      TESTING STEPS

      1. Bring both sites up and verify that the nodes in each site form a cluster
      2. Use the Infinispan CLI to push 100 entries into the cache hosted by node-1, then verify distribution across the cluster and backup to the other site (a sketch of the input file follows these steps)

        ./ispn-cli.sh -c remoting://192.168.1.5:10099/xsite/labCache -f input-data-site-1.txt

      3. Use the Infinispan CLI again to log into the same node, this time issuing just the clear command, and verify that the entries are erased from all the caches
      4. Repeat step #2
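
      For reference, input-data-site-1.txt is just a batch of CLI put statements, one per entry - roughly the sketch below (keys 0-99 and the value string match the PutKeyValueCommand entries in the logs; the exact contents are illustrative):

        put 0 "Some data"
        put 1 "Some data"
        ...
        put 99 "Some data"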

       

      FAILURE

      The failure happened at step #4: repopulating the cache with the same data dragged on for several minutes, with node-1 eventually timing out in the CLI and the various warnings and exceptions shown below appearing in the server logs

       

      WARNINGS AND EXCEPTIONS

      The following warnings and exceptions show up repeatedly:

      standalone1.log:

      19:54:51,027 ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (remote-thread-18) ISPN000136: Execution error: org.infinispan.util.concurrent.TimeoutException: Replication timeout for jdg-2/site-1

      standalone4.log:

      19:54:02,268 ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (remote-thread-2) ISPN000136: Execution error: org.infinispan.util.concurrent.TimeoutException: Unable to acquire lock after [30 seconds] on key [0] for requestor [Thread[remote-thread-2,5,main]]! Lock held by [Thread[remote-thread-0,5,main]]

      standalone6.log:

      19:54:45,728 ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (OOB-7,shared=relay) ISPN000136: Execution error: org.infinispan.util.concurrent.TimeoutException: Replication timeout for jdg-4/site-2

      standalone6.log:

      19:54:45,765 WARN  [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (OOB-3,shared=relay) ISPN000071: Caught exception when handling command SingleRpcCommand{cacheName='labCache', command=ClearCommand{flags=[IGNORE_RETURN_VALUES, SKIP_XSITE_BACKUP]}}: org.infinispan.util.concurrent.TimeoutException: Replication timeout for jdg-4/site-2

      standalone6.log:

      19:54:02,339 WARN  [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (remote-thread-8) ISPN000071: Caught exception when handling command SingleRpcCommand{cacheName='labCache', command=ClearCommand{flags=[IGNORE_RETURN_VALUES, SKIP_XSITE_BACKUP]}}: org.infinispan.util.concurrent.TimeoutException: Unable to acquire lock after [30 seconds] on key [2] for requestor [Thread[remote-thread-8,5,main]]! Lock held by [Thread[remote-thread-3,5,main]]

      standalone5.log:

      19:54:53,532 WARN  [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (OOB-187,shared=tcp) ISPN000071: Caught exception when handling command SingleRpcCommand{cacheName='labCache', command=PutKeyValueCommand{key=0, value=Some data, flags=null, putIfAbsent=false, valueMatcher=MATCH_ALWAYS, metadata=EmbeddedMetadata{version=null}, successful=true}}: org.infinispan.util.concurrent.TimeoutException: Node jdg-4/site-2 timed out

        • 1. Re: Cross Data Center Replication Issues with JDG 6.2
          rvansa

          How many times did you issue the clear command? It seems that there are many clear commands running in parallel, and since this command has to acquire locks on all entries in the cache, multiple clears would contend with each other.

           

          Can you also set up TRACE level logging on org.infinispan and post the full logs? (You'd probably have to store the zipped logs somewhere external, such as Dropbox.)
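
          In each node's site-*.xml that means roughly the following in the logging subsystem (a sketch; the subsystem namespace version may differ in your build):

            <subsystem xmlns="urn:jboss:domain:logging:1.2">
                <!-- existing handler definitions stay as they are -->
                <!-- capture Infinispan TRACE in server.log via the default FILE handler -->
                <logger category="org.infinispan">
                    <level name="TRACE"/>
                </logger>
                <!-- root-logger etc. unchanged -->
            </subsystem>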

          • 2. Re: Cross Data Center Replication Issues with JDG 6.2
            vbchin2

            The clear was issued just once and only on node-1. I will upload the logs with TRACE enabled to a site and let you know shortly.

            • 3. Re: Re: Cross Data Center Replication Issues with JDG 6.2
              vbchin2

               Attached is a ZIP file with all the necessary logs. If it helps, I am also listing the command history with timestamps, to show how the last command gets stuck for a long time.

              PROMPT:bin>./ispn-cli.sh -c remoting://192.168.1.6:10099/xsite/labCache -f input-data-site-1.txt

              Mar 03, 2014 12:47:12 AM org.xnio.Xnio <clinit>

              INFO: XNIO Version 3.0.7.GA-redhat-1

              Mar 03, 2014 12:47:12 AM org.xnio.nio.NioXnio <clinit>

              INFO: XNIO NIO Implementation Version 3.0.7.GA-redhat-1

              Mar 03, 2014 12:47:12 AM org.jboss.remoting3.EndpointImpl <clinit>

              INFO: JBoss Remoting version 3.2.16.GA-redhat-1

              PROMPT:bin>./ispn-cli.sh -c remoting://192.168.1.6:10099/xsite/labCache

              Mar 03, 2014 12:47:36 AM org.xnio.Xnio <clinit>

              INFO: XNIO Version 3.0.7.GA-redhat-1

              Mar 03, 2014 12:47:36 AM org.xnio.nio.NioXnio <clinit>

              INFO: XNIO NIO Implementation Version 3.0.7.GA-redhat-1

              Mar 03, 2014 12:47:36 AM org.jboss.remoting3.EndpointImpl <clinit>

              INFO: JBoss Remoting version 3.2.16.GA-redhat-1

              [remoting://192.168.1.6:10099/xsite/labCache]> clear

              [remoting://192.168.1.6:10099/xsite/labCache]> quit

              PROMPT:bin>./ispn-cli.sh -c remoting://192.168.1.6:10099/xsite/labCache -f input-data-site-1.txt

              Mar 03, 2014 12:48:14 AM org.xnio.Xnio <clinit>

              INFO: XNIO Version 3.0.7.GA-redhat-1

              Mar 03, 2014 12:48:14 AM org.xnio.nio.NioXnio <clinit>

              INFO: XNIO NIO Implementation Version 3.0.7.GA-redhat-1

              Mar 03, 2014 12:48:14 AM org.jboss.remoting3.EndpointImpl <clinit>

              INFO: JBoss Remoting version 3.2.16.GA-redhat-1

              Node jdg-3/site-1 timed out

              Node jdg-3/site-1 timed out

              Node jdg-3/site-1 timed out

              ^Cjava.io.IOException: Connection Ended

              org.jboss.remoting3.NotOpenException: Writes closed

               

               ~~~~~~~~ "Writes closed" repeated multiple times due to the forced termination ~~~~~~~~~

               

              PROMPT:bin>date

              Mon Mar  3 00:51:02 PST 2014

              PROMPT:bin>

              • 4. Re: Re: Cross Data Center Replication Issues with JDG 6.2
                rvansa

                 Thanks, the logs show that this really is an issue. The ClearCommand is propagated to the backup site multiple times, which results in deadlock: when it is started on nodes A and B in parallel, each one locks all of its local entries and then attempts to lock the entries on the other node.
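
                 Schematically, the interleaving would be something like this (a hypothetical timeline, just to illustrate the lock ordering):

                   t0: clear on node A -> locks every entry A holds locally
                   t0: clear on node B -> locks every entry B holds locally
                   t1: A asks B to lock B's entries -> blocks, B holds them
                   t1: B asks A to lock A's entries -> blocks, A holds them
                   t2: both sides wait until the 30 second lock timeout fires
                       (the "Unable to acquire lock after [30 seconds]" errors above)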

                 

                 However, I still can't understand why some of the replications occur - in fact, I've seen the command being replicated from the backup back to the main cache. Would you be so kind as to run it once more with TRACE level on org.jgroups as well (in addition to org.infinispan)? Just load the first 100 entries and then issue the clear command.
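
                 That is, next to the org.infinispan logger from before, also add (same sketch caveats apply):

                   <logger category="org.jgroups">
                       <level name="TRACE"/>
                   </logger>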

                • 5. Re: Cross Data Center Replication Issues with JDG 6.2
                  vbchin2

                   I am on it! Meanwhile, if you could, please take a look at the Infinispan subsystems in site-1.xml and site-2.xml (attached to the original post) to check whether the configuration of the backups is correct.

                   

                   For site-1, site-2 is configured as the backup, and for site-2, site-1 is configured as the backup. Could that be the reason why the commands are looping between the sites incessantly?
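
                   For reference, the relevant piece of site-1.xml looks roughly like this (sketched from memory against the JDG 6.2 server schema; site-2.xml mirrors it with the site names swapped):

                     <cache-container name="xsite" default-cache="labCache">
                         <!-- transport etc. elided -->
                         <distributed-cache name="labCache" mode="SYNC">
                             <backups>
                                 <!-- site-2.xml carries the mirror image: site="site-1" -->
                                 <backup site="site-2" strategy="SYNC" enabled="true"/>
                             </backups>
                         </distributed-cache>
                     </cache-container>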

                  • 6. Re: Cross Data Center Replication Issues with JDG 6.2
                    rvansa

                     Backing up sites to each other in this fashion is supported, but it seemed rather that a message broadcast in site-2 somehow made it into site-1. I'll see more from the JGroups logs; right now I am just guessing from the timestamps.

                    • 7. Re: Re: Cross Data Center Replication Issues with JDG 6.2
                      rvansa

                      So, it looks like a problem in the documentation - I had thought that RELAY2's relay_multicasts defaults to false, but that's not the case.

                      Please add the configuration for this attribute to both configs, as below:

                       

                      <relay site="site-1">
                          <remote-site name="site-2" stack="tcp" cluster="global"/>
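                          <!-- relay_multicasts defaults to true, which also relays regular
                               cluster multicasts to the remote site on top of the explicit
                               x-site backup traffic; setting it to false stops the commands
                               from looping between the sites -->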
                          <property name="relay_multicasts">false</property>
                      </relay>
                      
                      • 8. Re: Cross Data Center Replication Issues with JDG 6.2
                        vbchin2

                        That was it!! Tested the solution and it worked beautifully. Cannot tell you how relieved I am!! Many, many... many thanks!