1 Reply Latest reply on Sep 14, 2014 5:35 AM by prashant.thakur

    Failed node rejoin problem

    phanikotharu

      I am running into an issue when testing JGroups with Infinispan clustering. My setup is simple: I have 2 Infinispan nodes set up to replicate using JGroups. Initially everything works fine, and the sequence of steps is as follows:

      1) Start Node1; node1 is the only member of the cluster
      2) Start Node2; node2 joins the cluster
      3) The cluster is functional and node1 prints a message showing that node2 joined
      4) Replication works fine

      However, when node2 is shut down or terminated and then restarted, it fails to rejoin the previous cluster and instead creates a new cluster with only itself as a member.

      I am using UDP multicast for clustering. Here is my JGroups config; can anyone point out what I am doing wrong? A few seconds after Node2 starts, I see the following message printed a few times, and then the printing stops:

      WARN TransferQueueBundler,CentOS-MSM-03-33056 2014-09-11 16:44:24-UDP:JGRP000032: CentOS-MSM-03-33056: no physical address for 15059453-535f-7e26-3fb6-5a1f62d95734, dropping message

      WARN TransferQueueBundler,CentOS-MSM-03-33056 2014-09-11 16:44:38-UDP:JGRP000032: CentOS-MSM-03-33056: no physical address for 15059453-535f-7e26-3fb6-5a1f62d95734, dropping message

      WARN TransferQueueBundler,CentOS-MSM-03-33056 2014-09-11 16:45:01-UDP:JGRP000032: CentOS-MSM-03-33056: no physical address for 15059453-535f-7e26-3fb6-5a1f62d95734, dropping message


      <config xmlns="urn:org:jgroups"
              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
              xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.4.xsd">
         <UDP
               mcast_addr="${jgroups.udp.mcast_addr:228.6.7.8}"
               mcast_port="${jgroups.udp.mcast_port:46655}"
               tos="8"
               ucast_recv_buf_size="20m"
               ucast_send_buf_size="640k"
               mcast_recv_buf_size="25m"
               mcast_send_buf_size="640k"
               loopback="true"
               max_bundle_size="64k"
               ip_ttl="${jgroups.udp.ip_ttl:2}"
               enable_diagnostics="true"
               bundler_type="new"

               thread_naming_pattern="pl"

               thread_pool.enabled="true"
               thread_pool.min_threads="2"
               thread_pool.max_threads="30"
               thread_pool.keep_alive_time="60000"
               thread_pool.queue_enabled="true"
               thread_pool.queue_max_size="100"
               thread_pool.rejection_policy="Discard"

               oob_thread_pool.enabled="true"
               oob_thread_pool.min_threads="2"
               oob_thread_pool.max_threads="30"
               oob_thread_pool.keep_alive_time="60000"
               oob_thread_pool.queue_enabled="false"
               oob_thread_pool.queue_max_size="100"
               oob_thread_pool.rejection_policy="Discard"
               />

         <PING timeout="5000" num_initial_members="2"/>
         <MERGE2 />

         <FD_SOCK/>
         <FD_ALL timeout="15000"/>
         <VERIFY_SUSPECT timeout="5000"/>

         <pbcast.NAKACK2
                          xmit_interval="1000"
                          xmit_table_num_rows="100"
                          xmit_table_msgs_per_row="10000"
                          xmit_table_max_compaction_time="10000"
                          max_msg_batch_size="100"/>

         <!-- UNICAST3 is the better strategy moving forward but not
              yet compatible with the JGroups version included in EAP6 -->
         <!-- <UNICAST3
                    xmit_interval="500"
                    xmit_table_num_rows="20"
                    xmit_table_msgs_per_row="10000"
                    xmit_table_max_compaction_time="10000"
                    max_msg_batch_size="100"
                    conn_expiry_timeout="0"/> -->

         <UNICAST2
                    stable_interval="5000"
                    xmit_interval="500"
                    max_bytes="1m"
                    xmit_table_num_rows="20"
                    xmit_table_msgs_per_row="10000"
                    xmit_table_max_compaction_time="10000"
                    max_msg_batch_size="100"
                    conn_expiry_timeout="0"/>

         <pbcast.STABLE stability_delay="500" desired_avg_gossip="50000" max_bytes="1m"/>
         <pbcast.GMS print_local_addr="true" join_timeout="5000" view_bundling="false" merge_timeout="50000" stats="true" log_view_warnings="true"/>
         <tom.TOA/> <!-- TOA is only needed for total order transactions -->

         <UFC max_credits="500k" min_threshold="0.20"/>
         <MFC max_credits="500k" min_threshold="0.20"/>
         <FRAG2 frag_size="8000"/>
         <RSVP timeout="10000" resend_interval="1000" ack_on_delivery="true" throw_exception_on_timeout="false"/>
      </config>

       

        • 1. Re: Failed node rejoin problem
          prashant.thakur

          We had a similar issue, with Docker installed on the same box. The first network interface pointed to Docker's bridge, so JGroups bound to the wrong interface and the node was not able to connect.

          Can you please check the output of ifconfig -a? Is the first interface something other than what you are expecting?

          The bind_addr parameter can be added to resolve this issue, specifying the exact address the UDP transport should bind to.
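
          A minimal sketch of that change, where 192.168.1.10 is a placeholder for the node's real address (substitute your own, or set it via the -Djgroups.bind_addr system property):

          <!-- hypothetical example: bind_addr pins the UDP transport to one
               interface instead of letting JGroups pick the first one;
               192.168.1.10 is a placeholder, not a real recommendation -->
          <UDP
                bind_addr="${jgroups.bind_addr:192.168.1.10}"
                mcast_addr="${jgroups.udp.mcast_addr:228.6.7.8}"
                mcast_port="${jgroups.udp.mcast_port:46655}"
                ip_ttl="${jgroups.udp.ip_ttl:2}"
                ... keep the rest of your existing UDP attributes ... />

          With an explicit bind_addr, both nodes advertise a reachable physical address, which should also make the "no physical address ... dropping message" warnings go away if the wrong interface was the cause.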