12 Replies Latest reply on Apr 15, 2009 4:57 AM by karnivas

    FD Issue

    karnivas

      We are using JBoss Cache 1.4.1 SP6 with JGroups 2.4.x.
      We have a cluster of cache instances spread across two Sun Solaris machines and multiple RHEL machines.

      When one of the RHEL instances is restarted, the VIEW on the Solaris machines is not updated.
      That is, viewAccepted() still contains the old RHEL instance along with the new (restarted) RHEL instance.

      eg: [172.16.11.200:65261, 172.16.11.12:50903, 172.16.11.10:41912, 172.16.11.20:51156, 172.16.11.10:43789, 172.16.11.20:57771, 172.16.11.10:51722, 172.16.11.20:35858, 172.16.11.11:51210]

      172.16.11.10 - RHEL Instance 1
      172.16.11.20 - RHEL Instance 2

      Our assumption is that when a cache instance goes down, the view should be updated immediately when FD_SOCK is configured, but it wasn't updated as expected.

      viewAccepted() was eventually called with only the active members, but it took several hours to be resolved.

      We also got a ReplicationException timeout:

      Received Throwable from remote node org.jboss.cache.ReplicationException: rsp=sender=172.16.11.10:41912, retval=null, received=false, suspected=false

      The configuration is as follows:

      <attribute name="ClusterConfig">
       <config>
       <!-- UDP: if you have a multihomed machine,
       set the bind_addr attribute to the appropriate NIC IP address, e.g bind_addr="192.168.0.2"
       -->
       <!-- UDP: On Windows machines, because of the media sense feature
       being broken with multicast (even after disabling media sense)
       set the loopback attribute to true -->
       <UDP mcast_addr="224.7.8.9" mcast_port="45567"
       ip_ttl="64" ip_mcast="true"
       mcast_send_buf_size="150000" mcast_recv_buf_size="80000"
       ucast_send_buf_size="150000" ucast_recv_buf_size="80000"
       loopback="true" bind_addr="16.150.24.69"/>
       <PING timeout="2000" num_initial_members="3"
       up_thread="false" down_thread="false"/>
       <MERGE2 min_interval="10000" max_interval="20000"/>
       <!-- <FD shun="true" up_thread="true" down_thread="true" />-->
       <FD_SOCK/>
       <VERIFY_SUSPECT timeout="1500"
       up_thread="false" down_thread="false"/>
       <pbcast.NAKACK gc_lag="50" retransmit_timeout="600,1200,2400,4800"
       max_xmit_size="8192" up_thread="false" down_thread="false"/>
       <UNICAST timeout="600,1200,2400" window_size="100" min_threshold="10"
       down_thread="false"/>
       <pbcast.STABLE desired_avg_gossip="20000"
       up_thread="false" down_thread="false"/>
       <FRAG frag_size="8192"
       down_thread="false" up_thread="false"/>
       <pbcast.GMS join_timeout="5000" join_retry_timeout="2000"
       shun="true" print_local_addr="true"/>
       <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
       </config>
       </attribute>




      From the exception message we infer that the cache instance 172.16.11.10:41912 had been restarted and that the currently active instance was 172.16.11.10:51722.

        • 1. Re: FD Issue
          karnivas

          The configuration above is a sample, so the IP (bind_addr) is different, but all the other settings are the same.

          • 2. Re: FD Issue
            brian.stansberry

            Use both FD and FD_SOCK. See explanation at http://www.jboss.org/community/docs/DOC-9739
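
            The two mechanisms complement each other: FD_SOCK notices a member whose sockets close abruptly (e.g. the process is killed), while heartbeat-based FD also catches members that are hung or unreachable even though their sockets stay open. As a rough sketch only (the values are illustrative, not a tuning recommendation), the failure-detection part of the stack would then look like:

            <FD timeout="3000" max_tries="2" shun="true" up_thread="true" down_thread="true"/>
            <FD_SOCK/>
            <VERIFY_SUSPECT timeout="1500" up_thread="false" down_thread="false"/>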

            • 3. Re: FD Issue
              karnivas

              When one of the RHEL instances was brought down by killing the process forcefully (kill -9 {PID}), the view on the Solaris machines still wasn't updated, even after testing with your suggestion.

              The sample config is as follows:

              <attribute name="ClusterConfig">
               <config>
               <!-- UDP: if you have a multihomed machine,
               set the bind_addr attribute to the appropriate NIC IP address, e.g bind_addr="192.168.0.2"
               -->
               <!-- UDP: On Windows machines, because of the media sense feature
               being broken with multicast (even after disabling media sense)
               set the loopback attribute to true -->
               <UDP mcast_addr="224.7.8.9" mcast_port="45567"
               ip_ttl="64" ip_mcast="true"
               mcast_send_buf_size="150000" mcast_recv_buf_size="200000"
               ucast_send_buf_size="150000" ucast_recv_buf_size="200000"
               loopback="false" bind_addr="XX.XX.XX.XX"/>
               <PING timeout="2000" num_initial_members="3"
               up_thread="false" down_thread="false"/>
               <MERGE2 min_interval="10000" max_interval="20000"/>
               <FD shun="true" up_thread="true" down_thread="true" />
               <FD_SOCK/>
               <VERIFY_SUSPECT timeout="1500"
               up_thread="false" down_thread="false"/>
               <pbcast.NAKACK gc_lag="50" retransmit_timeout="600,1200,2400,4800"
               max_xmit_size="8192" up_thread="false" down_thread="false" discard_delivered_msgs="true"/>
               <UNICAST timeout="600,1200,2400" window_size="100" min_threshold="10"
               down_thread="false"/>
               <pbcast.STABLE desired_avg_gossip="20000"
               up_thread="false" down_thread="false"/>
               <FRAG frag_size="8192"
               down_thread="false" up_thread="false"/>
               <pbcast.GMS join_timeout="5000" join_retry_timeout="2000"
               shun="true" print_local_addr="true"/>
               <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
               </config>
               </attribute>


              Please share your thoughts on this.
              KEEP_ALIVE on both Solaris and RHEL is at its default value (2 hours).

              • 4. Re: FD Issue
                brian.stansberry

                With this config the old RHEL server instance should be removed from the view after about 7.5 seconds (6 seconds for FD to suspect it, plus 1.5 seconds for VERIFY_SUSPECT).

                Are you saying you are still seeing a view that has two incarnations of the RHEL server in it? Is this after at least 7.5 seconds have passed?
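
                To spell out where the 7.5 seconds comes from (assuming FD's defaults of timeout=3000 ms and max_tries=2, since your config doesn't override them):

                <!-- max_tries (2) x timeout (3000 ms) = 6 s until FD suspects the dead member -->
                <FD shun="true" up_thread="true" down_thread="true"/>
                <FD_SOCK/>
                <!-- plus 1.5 s for VERIFY_SUSPECT to confirm, i.e. roughly 7.5 s before GMS should install the new view -->
                <VERIFY_SUSPECT timeout="1500" up_thread="false" down_thread="false"/>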

                • 5. Re: FD Issue
                  karnivas

                   Yes. viewAccepted() on the Solaris machines still shows the dead members even after several hours.

                   How can these members be removed from the view?
                   Should we set any other values to resolve this?

                  • 6. Re: FD Issue
                    karnivas

                     One of the RHEL instances has detected the failure and logged the following messages:

                    org.jgroups.protocols.FD --> [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[xx.xx.xx.xx:xx)]
                    org.jgroups.protocols.VERIFY_SUSPECT --> verifying that xx.xx.xx.xx is dead

                     But this suspicion doesn't reach the coordinator (Solaris) instance, so the view still contains the dead members.

                    • 7. Re: FD Issue
                      karnivas

                       We tested the same scenario with JBoss Cache 1.4.1 SP8 and JGroups 2.4.1.SP8, but the problem still exists.

                      • 8. Re: FD Issue
                        karnivas

                         

                        "karnivas" wrote:
                        We tested the same scenario with JBoss Cache 1.4.1 SP8 and JGroups 2.4.1.SP8, but the problem still exists.


                        We tested the same scenario with JBoss Cache 1.4.1 SP8 and JGroups 2.4.1.SP4, but the problem still exists.

                        • 9. Re: FD Issue
                          brian.stansberry

                          First, I recommend you try the tests discussed at http://www.jboss.org/community/docs/DOC-12375 and http://www.jgroups.org/manual/html/ch02.html#ItDoesntWork. There could be a communication problem between the RHEL nodes and the Solaris nodes.

                          If all is well there, I recommend you turn on TRACE logging for FD, VERIFY_SUSPECT and GMS and see if that tells you anything more. If not, enable TRACE for all org.jgroups classes.
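
                          If you are using log4j (the usual setup with JBoss Cache 1.4), the categories to bump to TRACE would look something like this in log4j.xml (assuming log4j 1.2.12 or later, where TRACE is a standard level):

                          <category name="org.jgroups.protocols.FD">
                             <priority value="TRACE"/>
                          </category>
                          <category name="org.jgroups.protocols.VERIFY_SUSPECT">
                             <priority value="TRACE"/>
                          </category>
                          <category name="org.jgroups.protocols.pbcast.GMS">
                             <priority value="TRACE"/>
                          </category>
                          <!-- or, if those don't reveal anything, everything under org.jgroups: -->
                          <!--
                          <category name="org.jgroups">
                             <priority value="TRACE"/>
                          </category>
                          -->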

                          • 10. Re: FD Issue
                            karnivas

                            We brought down one of the Solaris machines (P1, the coordinator) to check the view on all machines.

                            As expected, the coordinator role moved to one of the RHEL machines and P1 was removed from all views, but the dead RHEL members still weren't removed from the view.

                            Please find the DEBUG messages from jgroups.log below:

                            org.jgroups.protocols.pbcast.GMS --> new=[172.16.11.200:32790], suspected=[], leaving=[], new view: [172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790]
                            org.jgroups.protocols.pbcast.GMS --> mcasting view {[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790]} (9 mbrs)
                            org.jgroups.protocols.UDP --> sending msg to null (src=172.16.11.20:35858), headers are {NAKACK=[MSG, seqno=3782], GMS=GmsHeader[VIEW]: view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790], UDP=[channel_name=ProvCache-LABS]}
                            org.jgroups.protocols.UDP --> message is [dst: 224.7.8.9:45567, src: 172.16.11.20:35858 (3 headers), size = 0 bytes], headers are {GMS=GmsHeader[VIEW]: view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790], NAKACK=[MSG, seqno=3782], UDP=[channel_name=ProvCache-LABS]}
                            org.jgroups.protocols.pbcast.GMS --> view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.13:38538, 172.16.11.200:32790]
                            org.jgroups.protocols.pbcast.GMS --> [local_addr=172.16.11.20:35858] view is [172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790]
                            org.jgroups.protocols.UDP --> message is [dst: 172.16.11.20:35858, src: 172.16.11.12:40087 (3 headers), size = 0 bytes], headers are {GMS=GmsHeader[VIEW_ACK]: view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790], UNICAST=[UNICAST: DATA, seqno=1], UDP=[channel_name=ProvCache-LABS]}
                            org.jgroups.protocols.UDP --> message is [dst: 172.16.11.20:35858, src: 172.16.11.10:51918 (3 headers), size = 0 bytes], headers are {GMS=GmsHeader[VIEW_ACK]: view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790], UNICAST=[UNICAST: DATA, seqno=1], UDP=[channel_name=ProvCache-LABS]}
                            org.jgroups.protocols.UDP --> sending msg to 172.16.11.20:35858 (src=172.16.11.20:35858), headers are {GMS=GmsHeader[VIEW_ACK]: view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790], UDP=[channel_name=ProvCache-LABS], UNICAST=[UNICAST: DATA, seqno=1]}
                            org.jgroups.protocols.UDP --> message is [dst: 172.16.11.20:35858, src: 172.16.11.20:35858 (3 headers), size = 0 bytes], headers are {GMS=GmsHeader[VIEW_ACK]: view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790], UNICAST=[UNICAST: DATA, seqno=1], UDP=[channel_name=ProvCache-LABS]}
                            org.jgroups.protocols.UDP --> message is [dst: 172.16.11.20:35858, src: 172.16.11.11:51210 (3 headers), size = 0 bytes], headers are {GMS=GmsHeader[VIEW_ACK]: view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790], UNICAST=[UNICAST: DATA, seqno=1], UDP=[channel_name=ProvCache-LABS]}
                            org.jgroups.protocols.UDP --> message is [dst: 172.16.11.20:35858, src: 172.16.11.191:37204 (3 headers), size = 0 bytes], headers are {GMS=GmsHeader[VIEW_ACK]: view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790], UNICAST=[UNICAST: DATA, seqno=1], UDP=[channel_name=ProvCache-LABS]}
                            org.jgroups.protocols.pbcast.GMS --> failed to collect all ACKs (11) for view [172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790] after 2000ms, missing ACKs from [172.16.11.13:38513, 172.16.11.13:38515, 172.16.11.13:38520, 172.16.11.13:38533] (received=[172.16.11.11:51210, 172.16.11.20:35858, 172.16.11.191:37204, 172.16.11.12:40087, 172.16.11.10:51918]), local_addr=172.16.11.20:35858
                            org.jgroups.protocols.UDP --> sending msg to 172.16.11.200:32790 (src=172.16.11.20:35858), headers are {GMS=GmsHeader[JOIN_RSP]: join_rsp=view: [172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790], digest: 172.16.11.11:51210: [0 : 0], 172.16.11.13:38513: [0 : 0], 172.16.11.10:51918: [4481 : 4482], 172.16.11.12:40087: [0 : 0], 172.16.11.13:38520: [0 : 0], 172.16.11.200:32790: [0 : 0], 172.16.11.20:35858: [3781 : 3782], 172.16.11.13:38533: [0 : 0], 172.16.11.191:37204: [3685 : 3686], UDP=[channel_name=ProvCache-LABS], UNICAST=[UNICAST: DATA, seqno=1]}
                            org.jgroups.protocols.UDP --> message is [dst: 172.16.11.20:35858, src: 172.16.11.200:32790 (3 headers), size = 0 bytes], headers are {GMS=GmsHeader[VIEW_ACK]: view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790], UNICAST=[UNICAST: DATA, seqno=2], UDP=[channel_name=ProvCache-LABS]}



                            org.jgroups.protocols.UDP --> message is [dst: 224.7.8.9:45567, src: 172.16.11.12:40087 (2 headers), size = 0 bytes], headers are {UDP=[channel_name=ProvCache-LABS], FD=[FD: SUSPECT (suspected_mbrs=[172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533], from=172.16.11.12:40087)]}
                            org.jgroups.protocols.FD --> [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533], from=172.16.11.12:40087)]
                            org.jgroups.protocols.VERIFY_SUSPECT --> verifying that 172.16.11.13:38513 is dead
                            org.jgroups.protocols.UDP --> sending msg to 172.16.11.13:38513 (src=172.16.11.10:51918), headers are {VERIFY_SUSPECT=[VERIFY_SUSPECT: ARE_YOU_DEAD], UDP=[channel_name=ProvCache-LABS]}
                            org.jgroups.protocols.VERIFY_SUSPECT --> diff=2034, mbr 172.16.11.13:38513 is dead (passing up SUSPECT event)
                            org.jgroups.protocols.VERIFY_SUSPECT --> diff=2034, mbr 172.16.11.13:38533 is dead (passing up SUSPECT event)
                            org.jgroups.protocols.VERIFY_SUSPECT --> diff=2034, mbr 172.16.11.13:38520 is dead (passing up SUSPECT event)
                            org.jgroups.protocols.pbcast.GMS --> processing [SUSPECT(172.16.11.13:38513), SUSPECT(172.16.11.13:38533), SUSPECT(172.16.11.13:38520)]
                            org.jgroups.blocks.RequestCorrelator --> suspect=172.16.11.13:38513
                            org.jgroups.blocks.RequestCorrelator --> suspect=172.16.11.13:38533
                            org.jgroups.blocks.RequestCorrelator --> suspect=172.16.11.13:38520
                            org.jgroups.protocols.pbcast.GMS --> suspected members=[172.16.11.13:38513, 172.16.11.13:38533, 172.16.11.13:38520], suspected_mbrs=[172.16.11.13:38513, 172.16.11.13:38533, 172.16.11.13:38520]
                            



                            As per these logs, the coordinator identifies the dead members correctly but doesn't update the view properly. Please advise on this.

                            Please tell us how to overcome this.

                            • 11. Re: FD Issue
                              karnivas

                              Just to give an update on this: we identified that the dead (suspected) member's neighbour had a wrong gateway entry in its routing table configuration.

                              Instead of its own IP, the gateway entry pointed to another IP which was also part of the same cluster.

                              Once that instance was restarted, the view was updated successfully and replicated to all members.

                              Could this have been the cause of the problem?

                              Please clarify so that we can analyze this further.

                              • 12. Re: FD Issue
                                karnivas

                                Any thoughts on this?
                                Please reply.