Just for anyone interested in this topic: xsite replication in an AWS environment.
I spent several hours debugging this problem and trying to come up with a workaround.
Here are some findings:
1) some more background info:
In an AWS environment, each VM gets a private IP assigned by AWS, which is not accessible from outside. The private IP is what eth0 gets.
To access those VMs from outside, I assigned each VM an Elastic IP (EIP). Note: the VM does not really know its EIP. It is more like a NAT thing.
I configured the relay2 jgroups_tcp.xml using EIPs (the initialHostStr).
With this configuration, Infinispan/JGroups is able to find those nodes in TCPPING discovery. However, it still tries to connect to the private IPs in addition to connecting to the corresponding EIPs. The connection attempts to the private IPs hang the cache manager startup. While the startup was hanging, I saw a connection in the SYN_SENT state on a node that was connecting to the private IP of a node in the other site. In this environment, that node will never receive a response to its SYN from the other node. But even after the SYN_SENT connection times out, the cache manager still does not come up fully.
I think the connection attempts to the private IPs happen because TCPPING discovery (on the EIPs) returns the private IP (the one associated with eth0), and the JGroups TCP protocol layer then uses that private IP to connect, instead of the EIPs configured in the jgroups-tcp.xml file.
After several hours of struggling, I came up with a workaround: I added NAT rules on each node that forward traffic destined for the remote private IPs to the corresponding EIPs. This way, we still use private IPs in jgroups-tcp.xml for the xsite relay.
In short, the findings are:
1) the cache manager getting stuck in startup because of unresponsive or bad connectivity to the other site is not desirable behavior. It should not impact the local cluster/site.
2) out of the box, current Infinispan/JGroups is not able to support xsite replication in an AWS environment.
I had the same issue and was able to overcome it in the following way. In the XML file that specifies the TCP settings for your WAN xsite replication cluster, ensure the following elements and attributes are set:
1) in the <TCP> element add the following attribute, where elasticIP is that node's unique EC2 Elastic IP: external_addr="elasticIP"
2) if using TCPPING (note: I was only successful getting TCPPING to work in this use case), all of the members in the WAN xsite cluster must have an Elastic IP associated with them. Make sure all of the entries in the initial_hosts attribute of the <TCPPING> element follow this pattern, where each elasticIP is a unique Elastic IP and port is the port number for that server: initial_hosts="elasticIP[port],elasticIP[port]"
3) in the <FD_SOCK> element add the following attribute, where elasticIP is that node's unique EC2 Elastic IP: external_addr="elasticIP"
4) ensure that there is NO <BARRIER> element in your protocol stack for your WAN xsite replication cluster
5) in the <pbcast.NAKACK2> element add or change the discard_delivered_msgs attribute to be: discard_delivered_msgs="false"
6) change the <pbcast.STATE_TRANSFER> element to a <pbcast.STATE_SOCK> element. This will enable TCP streaming state transfer
7) in the <pbcast.STATE_SOCK> element add the following attribute, where elasticIP is that node's unique EC2 Elastic IP: external_addr="elasticIP"
8) in your EC2 security group in both regions, enable the following custom rules:
TCP Rule, Port Range=30000, Source=0.0.0.0/0
ICMP Rule, Port Range=all, Source=0.0.0.0/0
Here is an example of what this should look like, more or less:
<MERGE2 min_interval="10000" max_interval="30000"/>
<FD_SOCK external_addr="SomeElasticIPnumber1" />
<FD timeout="3000" max_tries="3" />
<VERIFY_SUSPECT timeout="1500" />
<pbcast.NAKACK2 use_mcast_xmit="false" discard_delivered_msgs="false"/>
<pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000" max_bytes="4M"/>
<pbcast.GMS print_local_addr="true" join_timeout="3000" view_bundling="true"/>
<UFC max_credits="2M" min_threshold="0.4"/>
<MFC max_credits="2M" min_threshold="0.4"/>
<FRAG2 frag_size="60K" />
<!--RSVP resend_interval="2000" timeout="10000"/-->
<pbcast.STATE_SOCK external_addr="SomeElasticIPnumber1" />
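The example above starts at MERGE2, so it does not show the <TCP> and <TCPPING> elements from steps 1 and 2. A rough sketch of how those two elements might look at the top of the same stack, assuming a two-node WAN cluster: the addresses 203.0.113.10 and 203.0.113.20 are made-up placeholders for the two Elastic IPs, and bind_port="30000" is only an assumption chosen to match the security group rule above. Every other attribute is left at its default here.

```xml
<!-- Sketch only: placeholder addresses and port, not from a real deployment -->
<!-- external_addr is this node's own Elastic IP (assumed 203.0.113.10) -->
<TCP bind_port="30000"
     external_addr="203.0.113.10" />
<!-- initial_hosts lists the Elastic IP and port of every WAN cluster member -->
<TCPPING initial_hosts="203.0.113.10[30000],203.0.113.20[30000]" />
```

On the other node, external_addr would be 203.0.113.20, while initial_hosts stays the same on both.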