1 Reply Latest reply on Sep 14, 2014 5:35 AM by prashant.thakur

    Failed node rejoin problem

    phanikotharu

      I am running into an issue when testing JGroups with Infinispan clustering. My setup is simple: I have 2 Infinispan nodes set up to replicate using JGroups. Initially everything works fine, and the sequence of steps is as follows:

      1) Start Node1; node1 is the only member of the cluster
      2) Start Node2; node2 joins the cluster
      3) The cluster is functional and node1 prints a message showing that node2 joined
      4) Replication works fine

      However, when node2 is shut down or terminated and then restarted, it fails to rejoin the previous cluster and instead creates a new cluster with only itself as a member.

      I am using UDP multicast for clustering. Here is my JGroups config; can anyone point out what I am doing wrong? A few seconds after Node2 starts, I see the following message printed a few times, and then the printing stops:

      WARN TransferQueueBundler,CentOS-MSM-03-33056 2014-09-11 16:44:24-UDP:JGRP000032: CentOS-MSM-03-33056: no physical address for 15059453-535f-7e26-3fb6-5a1f62d95734, dropping message

      WARN TransferQueueBundler,CentOS-MSM-03-33056 2014-09-11 16:44:38-UDP:JGRP000032: CentOS-MSM-03-33056: no physical address for 15059453-535f-7e26-3fb6-5a1f62d95734, dropping message

      WARN TransferQueueBundler,CentOS-MSM-03-33056 2014-09-11 16:45:01-UDP:JGRP000032: CentOS-MSM-03-33056: no physical address for 15059453-535f-7e26-3fb6-5a1f62d95734, dropping message


      <config xmlns="urn:org:jgroups"
              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
              xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.4.xsd">
         <UDP
               mcast_addr="${jgroups.udp.mcast_addr:228.6.7.8}"
               mcast_port="${jgroups.udp.mcast_port:46655}"
               tos="8"
               ucast_recv_buf_size="20m"
               ucast_send_buf_size="640k"
               mcast_recv_buf_size="25m"
               mcast_send_buf_size="640k"
               loopback="true"
               max_bundle_size="64k"
               ip_ttl="${jgroups.udp.ip_ttl:2}"
               enable_diagnostics="true"
               bundler_type="new"

               thread_naming_pattern="pl"

               thread_pool.enabled="true"
               thread_pool.min_threads="2"
               thread_pool.max_threads="30"
               thread_pool.keep_alive_time="60000"
               thread_pool.queue_enabled="true"
               thread_pool.queue_max_size="100"
               thread_pool.rejection_policy="Discard"

               oob_thread_pool.enabled="true"
               oob_thread_pool.min_threads="2"
               oob_thread_pool.max_threads="30"
               oob_thread_pool.keep_alive_time="60000"
               oob_thread_pool.queue_enabled="false"
               oob_thread_pool.queue_max_size="100"
               oob_thread_pool.rejection_policy="Discard"
               />

         <PING timeout="5000" num_initial_members="2"/>
         <MERGE2 />

         <FD_SOCK/>
         <FD_ALL timeout="15000"/>
         <VERIFY_SUSPECT timeout="5000"/>

         <pbcast.NAKACK2
                          xmit_interval="1000"
                          xmit_table_num_rows="100"
                          xmit_table_msgs_per_row="10000"
                          xmit_table_max_compaction_time="10000"
                          max_msg_batch_size="100"/>

         <!-- UNICAST3 is the better strategy moving forward but not
              yet compatible with the JGroups version included in EAP6 -->
         <!-- <UNICAST3
                    xmit_interval="500"
                    xmit_table_num_rows="20"
                    xmit_table_msgs_per_row="10000"
                    xmit_table_max_compaction_time="10000"
                    max_msg_batch_size="100"
                    conn_expiry_timeout="0"/> -->

         <UNICAST2
                    stable_interval="5000"
                    xmit_interval="500"
                    max_bytes="1m"
                    xmit_table_num_rows="20"
                    xmit_table_msgs_per_row="10000"
                    xmit_table_max_compaction_time="10000"
                    max_msg_batch_size="100"
                    conn_expiry_timeout="0"/>

         <pbcast.STABLE stability_delay="500" desired_avg_gossip="50000" max_bytes="1m"/>
         <pbcast.GMS print_local_addr="true" join_timeout="5000" view_bundling="false" merge_timeout="50000" stats="true" log_view_warnings="true"/>
         <tom.TOA/> <!-- TOA is only needed for total order transactions -->

         <UFC max_credits="500k" min_threshold="0.20"/>
         <MFC max_credits="500k" min_threshold="0.20"/>
         <FRAG2 frag_size="8000"/>
         <RSVP timeout="10000" resend_interval="1000" ack_on_delivery="true" throw_exception_on_timeout="false"/>
      </config>

       

        • 1. Re: Failed node rejoin problem
          prashant.thakur

          We had a similar issue, with Docker installed on the same box. The first network interface pointed to Docker's bridge, so JGroups bound to the wrong interface and the node was not able to connect.

          Can you please check the output of ifconfig -a? Is the first interface something other than what you are expecting?

          The bind_addr parameter can be added to resolve this issue, specifying the exact address the UDP transport should bind to.
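
          A minimal sketch of that change, where 192.168.1.10 is a placeholder for the node's real address (substitute your own, or set it via the -Djgroups.bind_addr system property):

          <!-- hypothetical example: bind_addr pins the UDP transport to one
               interface instead of letting JGroups pick the first one;
               192.168.1.10 is a placeholder, not a real recommendation -->
          <UDP
                bind_addr="${jgroups.bind_addr:192.168.1.10}"
                mcast_addr="${jgroups.udp.mcast_addr:228.6.7.8}"
                mcast_port="${jgroups.udp.mcast_port:46655}"
                ip_ttl="${jgroups.udp.ip_ttl:2}"
                ... keep the rest of your existing UDP attributes ... />

          With an explicit bind_addr, both nodes advertise a reachable physical address, which should also make the "no physical address ... dropping message" warnings go away if the wrong interface was the cause.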