Cross DC Relay2 - Master node doesn't hand over cluster information to new master during failure
vikrant02 · Feb 6, 2018 11:09 AM

Hi,
We have an Infinispan setup in an OpenShift environment deployed across two data centers. We use the kubernetes JGroups stack to form the local cluster, and the tcp stack with TCPPING as the discovery protocol to form the cross-site bridge.
Following is the JGroups configuration:
<subsystem xmlns="urn:infinispan:server:jgroups:9.0">
    <channels default="cluster">
        <channel name="cluster"/>
        <channel name="xsite" stack="tcp"/>
    </channels>
    <stacks default="${jboss.default.jgroups.stack:kubernetes}">
        <stack name="tcp">
            <transport type="TCP" socket-binding="jgroups-tcp">
                <property name="external_addr">${jgroups.tcp.external_addr:}</property>
            </transport>
            <protocol type="TCPPING">
                <property name="initial_hosts">${jgroups.tcpping.initial_hosts:}</property>
                <property name="ergonomics">false</property>
            </protocol>
            <protocol type="MERGE3">
                <property name="min_interval">10000</property>
                <property name="max_interval">30000</property>
            </protocol>
            <protocol type="FD_SOCK" socket-binding="jgroups-tcp-fd"/>
            <protocol type="FD_ALL">
                <property name="timeout">60000</property>
                <property name="interval">15000</property>
                <property name="timeout_check_interval">5000</property>
            </protocol>
            <protocol type="VERIFY_SUSPECT">
                <property name="timeout">5000</property>
            </protocol>
            <protocol type="pbcast.NAKACK2">
                <property name="use_mcast_xmit">false</property>
                <property name="xmit_interval">100</property>
                <property name="xmit_table_num_rows">50</property>
                <property name="xmit_table_msgs_per_row">1024</property>
                <property name="xmit_table_max_compaction_time">30000</property>
                <property name="resend_last_seqno">true</property>
            </protocol>
            <protocol type="UNICAST3">
                <property name="xmit_interval">100</property>
                <property name="xmit_table_num_rows">50</property>
                <property name="xmit_table_msgs_per_row">1024</property>
                <property name="xmit_table_max_compaction_time">30000</property>
                <property name="conn_expiry_timeout">0</property>
            </protocol>
            <protocol type="pbcast.STABLE">
                <property name="stability_delay">500</property>
                <property name="desired_avg_gossip">5000</property>
                <property name="max_bytes">1M</property>
            </protocol>
            <protocol type="pbcast.GMS">
                <property name="print_local_addr">true</property>
                <property name="install_view_locally_first">true</property>
                <property name="join_timeout">${jgroups.join_timeout:5000}</property>
            </protocol>
            <protocol type="MFC">
                <property name="max_credits">2m</property>
                <property name="min_threshold">0.40</property>
            </protocol>
            <protocol type="FRAG3"/>
            <protocol type="RSVP"/>
        </stack>
        <stack name="kubernetes">
            <transport type="TCP" socket-binding="jgroups-tcp">
                <property name="logical_addr_cache_expiration">360000</property>
            </transport>
            <protocol type="kubernetes.KUBE_PING"/>
            <protocol type="MERGE3">
                <property name="min_interval">10000</property>
                <property name="max_interval">30000</property>
            </protocol>
            <protocol type="FD_SOCK" socket-binding="jgroups-tcp-fd"/>
            <protocol type="FD_ALL">
                <property name="timeout">60000</property>
                <property name="interval">15000</property>
                <property name="timeout_check_interval">5000</property>
            </protocol>
            <protocol type="VERIFY_SUSPECT">
                <property name="timeout">5000</property>
            </protocol>
            <protocol type="pbcast.NAKACK2">
                <property name="use_mcast_xmit">false</property>
                <property name="xmit_interval">100</property>
                <property name="xmit_table_num_rows">50</property>
                <property name="xmit_table_msgs_per_row">1024</property>
                <property name="xmit_table_max_compaction_time">30000</property>
                <property name="resend_last_seqno">true</property>
            </protocol>
            <protocol type="UNICAST3">
                <property name="xmit_interval">100</property>
                <property name="xmit_table_num_rows">50</property>
                <property name="xmit_table_msgs_per_row">1024</property>
                <property name="xmit_table_max_compaction_time">30000</property>
                <property name="conn_expiry_timeout">0</property>
            </protocol>
            <protocol type="pbcast.STABLE">
                <property name="stability_delay">500</property>
                <property name="desired_avg_gossip">5000</property>
                <property name="max_bytes">1M</property>
            </protocol>
            <protocol type="pbcast.GMS">
                <property name="print_local_addr">true</property>
                <property name="install_view_locally_first">true</property>
                <property name="join_timeout">${jgroups.join_timeout:5000}</property>
            </protocol>
            <protocol type="MFC">
                <property name="max_credits">2m</property>
                <property name="min_threshold">0.40</property>
            </protocol>
            <protocol type="FRAG3"/>
            <relay site="SITE1">
                <property name="relay_multicasts">false</property>
                <property name="max_site_masters">2</property>
            </relay>
        </stack>
    </stacks>
</subsystem>
Cluster startup is sequential, i.e. site1 starts first and then site2. We populate the initial_hosts list from the pods available at Infinispan startup, i.e. the first Infinispan pod in site1 has only its own hostname in its initial_hosts list, while the last Infinispan pod in site2 has the hostnames of all Infinispan pods in its list.
Since it is an OpenShift deployment, we do not know a pod's IP address until it starts, which means we cannot pre-populate initial_hosts with the IP addresses of all Infinispan pods.
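As a minimal sketch of how such a list can be built at pod startup (not our exact script), assuming a hypothetical headless service infinispan-hs whose DNS A records cover the pods started so far, and the default JGroups TCP port 7800:

import java.net.InetAddress;
import java.util.StringJoiner;

public class InitialHostsBuilder {
    public static void main(String[] args) throws Exception {
        // Hypothetical headless service; its A records list the Infinispan
        // pods that have started so far.
        String service = "infinispan-hs.myproject.svc.cluster.local";
        int port = 7800; // assumed jgroups-tcp port

        // TCPPING expects "host1[port1],host2[port2],..."
        StringJoiner hosts = new StringJoiner(",");
        for (InetAddress addr : InetAddress.getAllByName(service)) {
            hosts.add(addr.getHostAddress() + "[" + port + "]");
        }
        // The result is what gets passed via -Djgroups.tcpping.initial_hosts=...
        System.out.println(hosts);
    }
}

The lookup could run again later, but the initial_hosts value is fixed once the channel starts, which is exactly the limitation described below.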
After the initial sequential startup, both data centers successfully join the cross-DC bridge. But if the site master in site1 goes down, the address of the site2 master is not handed over to the new site master in site1, so the new master has to run discovery again. Because the Infinispan pods in site1 started first, their initial_hosts lists do not contain the hostname of the site2 master, and the cross-site bridge stays broken until the site2 master rediscovers site1. This causes approximately 1 minute of cross-DC replication downtime every time.
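When diagnosing this, it can help to dump the routes RELAY2 currently knows. A minimal sketch using the embedded JGroups API (the config file name and cluster name are placeholders; in the server the same information should be reachable through the RELAY2 JMX attributes):

import org.jgroups.JChannel;
import org.jgroups.protocols.relay.RELAY2;

public class PrintRelayRoutes {
    public static void main(String[] args) throws Exception {
        // Placeholder config: any stack that carries the <relay> element,
        // like the "kubernetes" stack above.
        try (JChannel ch = new JChannel("kubernetes.xml")) {
            ch.connect("cluster");
            RELAY2 relay = ch.getProtocolStack().findProtocol(RELAY2.class);
            if (relay != null) {
                // Shows which remote site masters this node can currently reach;
                // after a site-master failover, SITE2 drops out of this list
                // until discovery re-establishes the bridge.
                System.out.println(relay.printRoutes());
            }
        }
    }
}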
Is there any way to make sure that when a site master goes down, the new site master receives the other site's master information? Or is there another way to manage cross-DC replication in a dynamic OpenShift environment?
Thanks,
Vikrant