12 Replies Latest reply on Apr 15, 2009 4:57 AM by Srinivasan Bakthavachalam

    FD Issue

    Srinivasan Bakthavachalam Newbie

      We are using JBoss Cache 1.4.1 SP6 with JGroups 2.4.x
      We have a cluster of cache instances with two Sun Solaris and multiple RHEL machines.

      When one of the RHEL instances is restarted, the VIEW held by the cache instances on the SOLARIS machines isn't updated.
      i.e. viewAccepted() still contains the old RHEL instance alongside the new RHEL instance (the one that was restarted).

      e.g.: [172.16.11.200:65261, 172.16.11.12:50903, 172.16.11.10:41912, 172.16.11.20:51156, 172.16.11.10:43789, 172.16.11.20:57771, 172.16.11.10:51722, 172.16.11.20:35858, 172.16.11.11:51210]

      172.16.11.10 - RHEL Instance 1
      172.16.11.20 - RHEL Instance 2

      Our understanding is that when a cache instance goes down, the view should be updated immediately when FD_SOCK is configured. But it wasn't updated as expected.

      The viewAccepted() view was eventually updated with only the active members, but that happened only after several hours.

      We got a ReplicationException timeout:

      Received Throwable from remote node org.jboss.cache.ReplicationException: rsp=sender=172.16.11.10:41912, retval=null, received=false, suspected=false

      The configuration is as follows:

      <attribute name="ClusterConfig">
        <config>
          <!-- UDP: if you have a multihomed machine, set the bind_addr attribute
               to the appropriate NIC IP address, e.g. bind_addr="192.168.0.2" -->
          <!-- UDP: on Windows machines, because the media sense feature is broken
               with multicast (even after disabling media sense), set the loopback
               attribute to true -->
          <UDP mcast_addr="224.7.8.9" mcast_port="45567"
               ip_ttl="64" ip_mcast="true"
               mcast_send_buf_size="150000" mcast_recv_buf_size="80000"
               ucast_send_buf_size="150000" ucast_recv_buf_size="80000"
               loopback="true" bind_addr="16.150.24.69"/>
          <PING timeout="2000" num_initial_members="3"
                up_thread="false" down_thread="false"/>
          <MERGE2 min_interval="10000" max_interval="20000"/>
          <!-- <FD shun="true" up_thread="true" down_thread="true"/> -->
          <FD_SOCK/>
          <VERIFY_SUSPECT timeout="1500"
                          up_thread="false" down_thread="false"/>
          <pbcast.NAKACK gc_lag="50" retransmit_timeout="600,1200,2400,4800"
                         max_xmit_size="8192" up_thread="false" down_thread="false"/>
          <UNICAST timeout="600,1200,2400" window_size="100" min_threshold="10"
                   down_thread="false"/>
          <pbcast.STABLE desired_avg_gossip="20000"
                         up_thread="false" down_thread="false"/>
          <FRAG frag_size="8192"
                down_thread="false" up_thread="false"/>
          <pbcast.GMS join_timeout="5000" join_retry_timeout="2000"
                      shun="true" print_local_addr="true"/>
          <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
        </config>
      </attribute>




      From the exception message we infer that the cache instance 172.16.11.10:41912 had been restarted, and that the currently active instance was 172.16.11.10:51722.
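
      For context on our setup: our understanding is that FD_SOCK detects a peer only when its TCP socket closes (i.e. a clean process death), while FD sends periodic are-you-alive heartbeats and so can also catch hung or unreachable members. Running both together is the commonly recommended layout; a sketch of such a combined failure-detection section (the timeout/max_tries values here are illustrative, not taken from our actual config):

      ```xml
      <!-- FD catches unreachable/hung members via heartbeats;
           FD_SOCK catches clean process deaths immediately via closed sockets.
           timeout/max_tries values below are illustrative. -->
      <FD timeout="3000" max_tries="2" shun="true"
          up_thread="true" down_thread="true"/>
      <FD_SOCK/>
      <VERIFY_SUSPECT timeout="1500" up_thread="false" down_thread="false"/>
      ```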

        • 1. Re: FD Issue
          Srinivasan Bakthavachalam Newbie

          The configuration posted above is a sample, so the IP (bind_addr) differs, but all the other settings are the same.

          • 3. Re: FD Issue
            Srinivasan Bakthavachalam Newbie

            When one of the RHEL instances was brought down by killing the process forcefully (kill -9 {PID}), the VIEW on the SOLARIS machines still wasn't updated, even after testing with your suggestion.

            The sample config is as follows:

            <attribute name="ClusterConfig">
              <config>
                <!-- UDP: if you have a multihomed machine, set the bind_addr attribute
                     to the appropriate NIC IP address, e.g. bind_addr="192.168.0.2" -->
                <!-- UDP: on Windows machines, because the media sense feature is broken
                     with multicast (even after disabling media sense), set the loopback
                     attribute to true -->
                <UDP mcast_addr="224.7.8.9" mcast_port="45567"
                     ip_ttl="64" ip_mcast="true"
                     mcast_send_buf_size="150000" mcast_recv_buf_size="200000"
                     ucast_send_buf_size="150000" ucast_recv_buf_size="200000"
                     loopback="false" bind_addr="XX.XX.XX.XX"/>
                <PING timeout="2000" num_initial_members="3"
                      up_thread="false" down_thread="false"/>
                <MERGE2 min_interval="10000" max_interval="20000"/>
                <FD shun="true" up_thread="true" down_thread="true"/>
                <FD_SOCK/>
                <VERIFY_SUSPECT timeout="1500"
                                up_thread="false" down_thread="false"/>
                <pbcast.NAKACK gc_lag="50" retransmit_timeout="600,1200,2400,4800"
                               max_xmit_size="8192" up_thread="false" down_thread="false"
                               discard_delivered_msgs="true"/>
                <UNICAST timeout="600,1200,2400" window_size="100" min_threshold="10"
                         down_thread="false"/>
                <pbcast.STABLE desired_avg_gossip="20000"
                               up_thread="false" down_thread="false"/>
                <FRAG frag_size="8192"
                      down_thread="false" up_thread="false"/>
                <pbcast.GMS join_timeout="5000" join_retry_timeout="2000"
                            shun="true" print_local_addr="true"/>
                <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
              </config>
            </attribute>


            Please share your thoughts on this.
            TCP keep-alive on both Solaris and RHEL is at its default value (2 hours).

            • 4. Re: FD Issue
              Brian Stansberry Master

              With this config, the old RHEL server instance should be removed from the view after 7.5 seconds (6 seconds for FD to suspect it, plus 1.5 seconds for VERIFY_SUSPECT).
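
              To spell that arithmetic out: FD suspects a member after max_tries * timeout of missed are-you-alive responses, and VERIFY_SUSPECT then waits its own timeout before passing the SUSPECT event up. Assuming no overrides of FD's defaults (timeout=3000 ms, max_tries=2 in JGroups 2.4), that is:

              ```xml
              <!-- suspicion after max_tries * timeout = 2 * 3000 ms = 6000 ms -->
              <FD shun="true" timeout="3000" max_tries="2"
                  up_thread="true" down_thread="true"/>
              <!-- plus 1500 ms to verify: 6000 + 1500 = 7500 ms total -->
              <VERIFY_SUSPECT timeout="1500" up_thread="false" down_thread="false"/>
              ```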

              Are you saying you are still seeing a view that has two incarnations of the RHEL server in it? Is this after at least 7.5 seconds have passed?

              • 5. Re: FD Issue
                Srinivasan Bakthavachalam Newbie

                YES. viewAccepted() on the SOLARIS machines still contains the dead members even after several hours.

                How can these members be removed from the VIEW?
                Should we set any other value to resolve this?

                • 6. Re: FD Issue
                  Srinivasan Bakthavachalam Newbie

                  One of the RHEL instances has detected the failure and logged the following messages:

                  org.jgroups.protocols.FD --> [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[xx.xx.xx.xx:xx])]
                  org.jgroups.protocols.VERIFY_SUSPECT --> verifying that xx.xx.xx.xx is dead

                  But this message does not reach the coordinator (SOLARIS) instance, so the VIEW still contains the dead members.

                  • 7. Re: FD Issue
                    Srinivasan Bakthavachalam Newbie

                    We tested the same with JBoss Cache 1.4.1 SP8 and JGroups 2.4.1.SP8, but the problem still exists.

                    • 8. Re: FD Issue
                      Srinivasan Bakthavachalam Newbie

                       

                      "karnivas" wrote:
                      We tested the same with JBoss Cache 1.4.1 SP8 and JGroups 2.4.1.SP8, but the problem still exists.


                      We also tested with JBoss Cache 1.4.1 SP8 and JGroups 2.4.1.SP4, but the problem still exists.

                      • 9. Re: FD Issue
                        Brian Stansberry Master

                        First, I recommend you try the tests discussed at http://www.jboss.org/community/docs/DOC-12375 and http://www.jgroups.org/manual/html/ch02.html#ItDoesntWork. There could be a communication problem between the RHEL nodes and the Solaris nodes.

                        If all is well there, I recommend you turn on TRACE logging for FD, VERIFY_SUSPECT and GMS and see if that tells you anything more. If not, enable TRACE for all org.jgroups classes.
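
                        For a log4j-based setup, per-protocol TRACE can be enabled with category entries along these lines (a sketch; the category names assume the standard JGroups package layout, and the TRACE level needs log4j 1.2.12 or later):

                        ```xml
                        <category name="org.jgroups.protocols.FD">
                          <priority value="TRACE"/>
                        </category>
                        <category name="org.jgroups.protocols.VERIFY_SUSPECT">
                          <priority value="TRACE"/>
                        </category>
                        <category name="org.jgroups.protocols.pbcast.GMS">
                          <priority value="TRACE"/>
                        </category>
                        <!-- fallback: everything under org.jgroups -->
                        <!-- <category name="org.jgroups"><priority value="TRACE"/></category> -->
                        ```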

                        • 10. Re: FD Issue
                          Srinivasan Bakthavachalam Newbie

                          We brought down one of the SOLARIS machines (P1, the coordinator) to check the view on all machines.

                          As expected, the coordinator role moved to one of the RHEL machines and P1 was removed from all views, but the dead RHEL members still weren't removed from the VIEW.

                          Please find the DEBUG messages from jgroups.log below:

                          org.jgroups.protocols.pbcast.GMS --> new=[172.16.11.200:32790], suspected=[], leaving=[], new view: [172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790]
                          org.jgroups.protocols.pbcast.GMS --> mcasting view {[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790]} (9 mbrs)
                          org.jgroups.protocols.UDP --> sending msg to null (src=172.16.11.20:35858), headers are {NAKACK=[MSG, seqno=3782], GMS= GmsHeader[VIEW]: view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790], UDP =[channel_name=ProvCache-LABS]}
                          org.jgroups.protocols.UDP --> message is [dst: 224.7.8.9:45567, src: 172.16.11.20:35858 (3 headers), size = 0 bytes], headers are {GMS=GmsHeader[VIEW]: view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790], NAKACK=[MSG, seqno=3782], UDP=[channel_name=ProvCache-LABS]}
                          org.jgroups.protocols.pbcast.GMS --> view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.13:38538, 172.16.11.200:32790]
                          org.jgroups.protocols.pbcast.GMS --> [local_addr=172.16.11.20:35858] view is [172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790]
                          org.jgroups.protocols.UDP --> message is [dst: 172.16.11.20:35858, src: 172.16.11.12:40087 (3 headers), size = 0 bytes], headers are {GMS=GmsHeader[VIEW_ACK]: view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790], UNICAST=[UNICAST: DATA, seqno=1], UDP=[channel_name=ProvCache-LABS]}
                          org.jgroups.protocols.UDP --> message is [dst: 172.16.11.20:35858, src: 172.16.11.10:51918 (3 headers), size = 0 bytes], headers are {GMS=GmsHeader[VIEW_ACK]: view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790], UNICAST=[UNICAST: DATA, seqno=1], UDP=[channel_name=ProvCache-LABS]}
                          org.jgroups.protocols.UDP --> sending msg to 172.16.11.20:35858 (src=172.16.11.20:35858), headers are {GMS=GmsHeader[VIEW_ACK]: view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790], UDP=[channel_name=ProvCache-LABS], UNICAST=[UNICAST: DATA, seqno=1]}
                          org.jgroups.protocols.UDP --> message is [dst: 172.16.11.20:35858, src: 172.16.11.20:35858 (3 headers), size = 0 bytes], headers are {GMS=GmsHeader[VIEW_ACK]: view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790], UNICAST=[UNICAST: DATA, seqno=1], UDP=[channel_name=ProvCache-LABS]}
                          org.jgroups.protocols.UDP --> message is [dst: 172.16.11.20:35858, src: 172.16.11.11:51210 (3 headers), size = 0 bytes], headers are {GMS=GmsHeader[VIEW_ACK]: view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790], UNICAST=[UNICAST: DATA, seqno=1], UDP=[channel_name=ProvCache-LABS]}
                          org.jgroups.protocols.UDP --> message is [dst: 172.16.11.20:35858, src: 172.16.11.191:37204 (3 headers), size = 0 bytes], headers are {GMS=GmsHeader[VIEW_ACK]: view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790], UNICAST=[UNICAST: DATA, seqno=1], UDP=[channel_name=ProvCache-LABS]}
                          org.jgroups.protocols.pbcast.GMS --> failed to collect all ACKs (11) for view [172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790] after 2000ms, missing ACKs from [172.16.11.13:38513, 172.16.11.13:38515, 172.16.11.13:38520, 172.16.11.13:38533] (received=[172.16.11.11:51210, 172.16.11.20:35858, 172.16.11.191:37204, 172.16.11.12:40087, 172.16.11.10:51918]), local_addr=172.16.11.20:35858
                          org.jgroups.protocols.UDP --> sending msg to 172.16.11.200:32790 (src=172.16.11.20:35858), headers are {GMS=GmsHeader[JOIN_RSP]: join_rsp=view: [172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790], digest: 172.16.11.11:51210: [0 : 0], 172.16.11.13:38513: [0 : 0], 172.16.11.10:51918: [4481 : 4482], 172.16.11.12:40087: [0 : 0], 172.16.11.13:38520: [0 : 0], 172.16.11.200:32790: [0 : 0], 172.16.11.20:35858: [3781 : 3782], 172.16.11.13:38533: [0 : 0], 172.16.11.191:37204: [3685 : 3686], UDP=[channel_name=ProvCache-LABS], UNICAST=[UNICAST: DATA, seqno=1]}
                          org.jgroups.protocols.UDP --> message is [dst: 172.16.11.20:35858, src: 172.16.11.200:32790 (3 headers), size = 0 bytes], headers are {GMS=GmsHeader[VIEW_ACK]: view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790], UNICAST=[UNICAST: DATA, seqno=2], UDP=[channel_name=ProvCache-LABS]}



                          org.jgroups.protocols.UDP --> message is [dst: 224.7.8.9:45567, src: 172.16.11.12:40087 (2 headers), size = 0 bytes], headers are {UDP=[channel_name=ProvCache-LABS], FD=[FD: SUSPECT (suspected_mbrs=[172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533], from=172.16.11.12:40087)]}
                          org.jgroups.protocols.FD --> [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533], from=172.16.11.12:40087)]
                          org.jgroups.protocols.VERIFY_SUSPECT --> verifying that 172.16.11.13:38513 is dead
                          org.jgroups.protocols.UDP --> sending msg to 172.16.11.13:38513 (src=172.16.11.10:51918), headers are {VERIFY_SUSPECT=[VERIFY_SUSPECT: ARE_YOU_DEAD], UDP=[channel_name=ProvCache-LABS]}
                          org.jgroups.protocols.VERIFY_SUSPECT --> diff=2034, mbr 172.16.11.13:38513 is dead (passing up SUSPECT event)
                          org.jgroups.protocols.VERIFY_SUSPECT --> diff=2034, mbr 172.16.11.13:38533 is dead (passing up SUSPECT event)
                          org.jgroups.protocols.VERIFY_SUSPECT --> diff=2034, mbr 172.16.11.13:38520 is dead (passing up SUSPECT event)
                          org.jgroups.protocols.pbcast.GMS --> processing [SUSPECT(172.16.11.13:38513), SUSPECT(172.16.11.13:38533), SUSPECT(172.16.11.13:38520)]
                          org.jgroups.blocks.RequestCorrelator --> suspect=172.16.11.13:38513
                          org.jgroups.blocks.RequestCorrelator --> suspect=172.16.11.13:38533
                          org.jgroups.blocks.RequestCorrelator --> suspect=172.16.11.13:38520
                          org.jgroups.protocols.pbcast.GMS --> suspected members=[172.16.11.13:38513, 172.16.11.13:38533, 172.16.11.13:38520], suspected_mbrs=[172.16.11.13:38513, 172.16.11.13:38533, 172.16.11.13:38520]
                          



                          As per these logs, the coordinator identifies the dead members correctly but doesn't update the view properly. Please advise on this.

                          Please tell us how to overcome this.

                          • 11. Re: FD Issue
                            Srinivasan Bakthavachalam Newbie

                            Just to give an update on this: we identified that a neighbor of the dead (suspected) member had a wrong gateway entry in its routing table configuration.

                            Instead of pointing to its own IP, it was pointing to another IP which was also in the same cluster.

                            Once that instance was restarted, the VIEW was updated successfully and replicated to all members.

                            Could this have been the cause of the problem?

                            Please clarify, so that we can analyze this further.

                            • 12. Re: FD Issue
                              Srinivasan Bakthavachalam Newbie

                              Any thoughts on this?
                              Please reply.