12 Replies Latest reply on Apr 15, 2009 4:57 AM by karnivas

    FD Issue

    karnivas

      We are using JBoss Cache 1.4.1 SP6 with JGroups 2.4.x.
      We have a cluster of cache instances spread across two Sun Solaris machines and multiple RHEL machines.

      When one of the RHEL instances is restarted, the VIEW on the Solaris machines is not updated.
      That is, viewAccepted() still contains the old RHEL instance along with the new (restarted) RHEL instance.

      eg: [172.16.11.200:65261, 172.16.11.12:50903, 172.16.11.10:41912, 172.16.11.20:51156, 172.16.11.10:43789, 172.16.11.20:57771, 172.16.11.10:51722, 172.16.11.20:35858, 172.16.11.11:51210]

      172.16.11.10 - RHEL Instance 1
      172.16.11.20 - RHEL Instance 2

      Our assumption is that when a cache instance goes down, the view should be updated immediately when FD_SOCK is configured, but it wasn't updated as expected.

      viewAccepted() was eventually called with only the active members, but it took several hours to be resolved.

      We also got a ReplicationException timeout:

      Received Throwable from remote node org.jboss.cache.ReplicationException: rsp=sender=172.16.11.10:41912, retval=null, received=false, suspected=false

      The configuration is as follows:

      <attribute name="ClusterConfig">
       <config>
       <!-- UDP: if you have a multihomed machine,
       set the bind_addr attribute to the appropriate NIC IP address, e.g bind_addr="192.168.0.2"
       -->
       <!-- UDP: On Windows machines, because of the media sense feature
       being broken with multicast (even after disabling media sense)
       set the loopback attribute to true -->
       <UDP mcast_addr="224.7.8.9" mcast_port="45567"
       ip_ttl="64" ip_mcast="true"
       mcast_send_buf_size="150000" mcast_recv_buf_size="80000"
       ucast_send_buf_size="150000" ucast_recv_buf_size="80000"
       loopback="true" bind_addr="16.150.24.69"/>
       <PING timeout="2000" num_initial_members="3"
       up_thread="false" down_thread="false"/>
       <MERGE2 min_interval="10000" max_interval="20000"/>
       <!-- <FD shun="true" up_thread="true" down_thread="true" />-->
       <FD_SOCK/>
       <VERIFY_SUSPECT timeout="1500"
       up_thread="false" down_thread="false"/>
       <pbcast.NAKACK gc_lag="50" retransmit_timeout="600,1200,2400,4800"
       max_xmit_size="8192" up_thread="false" down_thread="false"/>
       <UNICAST timeout="600,1200,2400" window_size="100" min_threshold="10"
       down_thread="false"/>
       <pbcast.STABLE desired_avg_gossip="20000"
       up_thread="false" down_thread="false"/>
       <FRAG frag_size="8192"
       down_thread="false" up_thread="false"/>
       <pbcast.GMS join_timeout="5000" join_retry_timeout="2000"
       shun="true" print_local_addr="true"/>
       <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
       </config>
       </attribute>




      From the exception message we infer that the cache instance 172.16.11.10:41912 had been restarted and that the currently active instance was 172.16.11.10:51722.

        • 1. Re: FD Issue
          karnivas

          The configuration above is a sample, so the IP (bind_addr) is different, but all the other settings are the same.

          • 2. Re: FD Issue
            brian.stansberry

            Use both FD and FD_SOCK. See explanation at http://www.jboss.org/community/docs/DOC-9739
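
            The two mechanisms complement each other: FD_SOCK notices a member whose sockets close abruptly (e.g. the process is killed), while heartbeat-based FD also catches members that are hung or unreachable even though their sockets stay open. As a rough sketch only (the values are illustrative, not a tuning recommendation), the failure-detection part of the stack would then look like:

            <FD timeout="3000" max_tries="2" shun="true" up_thread="true" down_thread="true"/>
            <FD_SOCK/>
            <VERIFY_SUSPECT timeout="1500" up_thread="false" down_thread="false"/>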

            • 3. Re: FD Issue
              karnivas

              When one of the RHEL instances was brought down by killing the process forcefully (kill -9 {PID}), the view on the Solaris machines still wasn't updated, even after testing with your suggestion.

              The sample config is as follows:

              <attribute name="ClusterConfig">
               <config>
               <!-- UDP: if you have a multihomed machine,
               set the bind_addr attribute to the appropriate NIC IP address, e.g bind_addr="192.168.0.2"
               -->
               <!-- UDP: On Windows machines, because of the media sense feature
               being broken with multicast (even after disabling media sense)
               set the loopback attribute to true -->
               <UDP mcast_addr="224.7.8.9" mcast_port="45567"
               ip_ttl="64" ip_mcast="true"
               mcast_send_buf_size="150000" mcast_recv_buf_size="200000"
               ucast_send_buf_size="150000" ucast_recv_buf_size="200000"
               loopback="false" bind_addr="XX.XX.XX.XX"/>
               <PING timeout="2000" num_initial_members="3"
               up_thread="false" down_thread="false"/>
               <MERGE2 min_interval="10000" max_interval="20000"/>
               <FD shun="true" up_thread="true" down_thread="true" />
               <FD_SOCK/>
               <VERIFY_SUSPECT timeout="1500"
               up_thread="false" down_thread="false"/>
               <pbcast.NAKACK gc_lag="50" retransmit_timeout="600,1200,2400,4800"
               max_xmit_size="8192" up_thread="false" down_thread="false" discard_delivered_msgs="true"/>
               <UNICAST timeout="600,1200,2400" window_size="100" min_threshold="10"
               down_thread="false"/>
               <pbcast.STABLE desired_avg_gossip="20000"
               up_thread="false" down_thread="false"/>
               <FRAG frag_size="8192"
               down_thread="false" up_thread="false"/>
               <pbcast.GMS join_timeout="5000" join_retry_timeout="2000"
               shun="true" print_local_addr="true"/>
               <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
               </config>
               </attribute>


              Please share your thoughts on this.
              KEEP_ALIVE on both Solaris and RHEL is at its default value (2 hours).

              • 4. Re: FD Issue
                brian.stansberry

                With this config the old RHEL server instance should be removed from the view after about 7.5 seconds (6 seconds for FD to suspect it, plus 1.5 seconds for VERIFY_SUSPECT).

                Are you saying you are still seeing a view that has two incarnations of the RHEL server in it? Is this after at least 7.5 seconds have passed?
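
                To spell out where the 7.5 seconds comes from (assuming FD's defaults of timeout=3000 ms and max_tries=2, since your config doesn't override them):

                <!-- max_tries (2) x timeout (3000 ms) = 6 s until FD suspects the dead member -->
                <FD shun="true" up_thread="true" down_thread="true"/>
                <FD_SOCK/>
                <!-- plus 1.5 s for VERIFY_SUSPECT to confirm, i.e. roughly 7.5 s before GMS should install the new view -->
                <VERIFY_SUSPECT timeout="1500" up_thread="false" down_thread="false"/>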

                • 5. Re: FD Issue
                  karnivas

                   Yes. viewAccepted() on the Solaris machines still shows the dead members even after several hours.

                   How can these members be removed from the view?
                   Should we set any other values to resolve this?

                  • 6. Re: FD Issue
                    karnivas

                     One of the RHEL instances has detected the failure and logged the following messages:

                    org.jgroups.protocols.FD --> [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[xx.xx.xx.xx:xx)]
                    org.jgroups.protocols.VERIFY_SUSPECT --> verifying that xx.xx.xx.xx is dead

                     But this suspicion doesn't reach the coordinator (Solaris) instance, so the view still contains the dead members.

                    • 7. Re: FD Issue
                      karnivas

                       We tested the same scenario with JBoss Cache 1.4.1 SP8 and JGroups 2.4.1.SP8, but the problem still exists.

                      • 8. Re: FD Issue
                        karnivas

                         

                        "karnivas" wrote:
                        We tested the same scenario with JBoss Cache 1.4.1 SP8 and JGroups 2.4.1.SP8, but the problem still exists.


                        We tested the same scenario with JBoss Cache 1.4.1 SP8 and JGroups 2.4.1.SP4, but the problem still exists.

                        • 9. Re: FD Issue
                          brian.stansberry

                          First, I recommend you try the tests discussed at http://www.jboss.org/community/docs/DOC-12375 and http://www.jgroups.org/manual/html/ch02.html#ItDoesntWork. There could be a communication problem between the RHEL nodes and the Solaris nodes.

                          If all is well there, I recommend you turn on TRACE logging for FD, VERIFY_SUSPECT and GMS and see if that tells you anything more. If not, enable TRACE for all org.jgroups classes.
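
                          If you are using log4j (the usual setup with JBoss Cache 1.4), the categories to bump to TRACE would look something like this in log4j.xml (assuming log4j 1.2.12 or later, where TRACE is a standard level):

                          <category name="org.jgroups.protocols.FD">
                             <priority value="TRACE"/>
                          </category>
                          <category name="org.jgroups.protocols.VERIFY_SUSPECT">
                             <priority value="TRACE"/>
                          </category>
                          <category name="org.jgroups.protocols.pbcast.GMS">
                             <priority value="TRACE"/>
                          </category>
                          <!-- or, if those don't reveal anything, everything under org.jgroups: -->
                          <!--
                          <category name="org.jgroups">
                             <priority value="TRACE"/>
                          </category>
                          -->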

                          • 10. Re: FD Issue
                            karnivas

                            We brought down one of the Solaris machines (P1, the coordinator) to check the view on all machines.

                            As expected, the coordinator role moved to one of the RHEL machines and P1 was removed from all views, but the dead RHEL members still weren't removed from the view.

                            Please find the DEBUG messages from jgroups.log below:

                            org.jgroups.protocols.pbcast.GMS --> new=[172.16.11.200:32790], suspected=[], leaving=[], new view: [172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790]
                            org.jgroups.protocols.pbcast.GMS --> mcasting view {[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790]} (9 mbrs)
                            org.jgroups.protocols.UDP --> sending msg to null (src=172.16.11.20:35858), headers are {NAKACK=[MSG, seqno=3782], GMS=GmsHeader[VIEW]: view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790], UDP=[channel_name=ProvCache-LABS]}
                            org.jgroups.protocols.UDP --> message is [dst: 224.7.8.9:45567, src: 172.16.11.20:35858 (3 headers), size = 0 bytes], headers are {GMS=GmsHeader[VIEW]: view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790], NAKACK=[MSG, seqno=3782], UDP=[channel_name=ProvCache-LABS]}
                            org.jgroups.protocols.pbcast.GMS --> view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.13:38538, 172.16.11.200:32790]
                            org.jgroups.protocols.pbcast.GMS --> [local_addr=172.16.11.20:35858] view is [172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790]
                            org.jgroups.protocols.UDP --> message is [dst: 172.16.11.20:35858, src: 172.16.11.12:40087 (3 headers), size = 0 bytes], headers are {GMS=GmsHeader[VIEW_ACK]: view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790], UNICAST=[UNICAST: DATA, seqno=1], UDP=[channel_name=ProvCache-LABS]}
                            org.jgroups.protocols.UDP --> message is [dst: 172.16.11.20:35858, src: 172.16.11.10:51918 (3 headers), size = 0 bytes], headers are {GMS=GmsHeader[VIEW_ACK]: view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790], UNICAST=[UNICAST: DATA, seqno=1], UDP=[channel_name=ProvCache-LABS]}
                            org.jgroups.protocols.UDP --> sending msg to 172.16.11.20:35858 (src=172.16.11.20:35858), headers are {GMS=GmsHeader[VIEW_ACK]: view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790], UDP=[channel_name=ProvCache-LABS], UNICAST=[UNICAST: DATA, seqno=1]}
                            org.jgroups.protocols.UDP --> message is [dst: 172.16.11.20:35858, src: 172.16.11.20:35858 (3 headers), size = 0 bytes], headers are {GMS=GmsHeader[VIEW_ACK]: view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790], UNICAST=[UNICAST: DATA, seqno=1], UDP=[channel_name=ProvCache-LABS]}
                            org.jgroups.protocols.UDP --> message is [dst: 172.16.11.20:35858, src: 172.16.11.11:51210 (3 headers), size = 0 bytes], headers are {GMS=GmsHeader[VIEW_ACK]: view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790], UNICAST=[UNICAST: DATA, seqno=1], UDP=[channel_name=ProvCache-LABS]}
                            org.jgroups.protocols.UDP --> message is [dst: 172.16.11.20:35858, src: 172.16.11.191:37204 (3 headers), size = 0 bytes], headers are {GMS=GmsHeader[VIEW_ACK]: view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790], UNICAST=[UNICAST: DATA, seqno=1], UDP=[channel_name=ProvCache-LABS]}
                            org.jgroups.protocols.pbcast.GMS --> failed to collect all ACKs (11) for view [172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790] after 2000ms, missing ACKs from [172.16.11.13:38513, 172.16.11.13:38515, 172.16.11.13:38520, 172.16.11.13:38533] (received=[172.16.11.11:51210, 172.16.11.20:35858, 172.16.11.191:37204, 172.16.11.12:40087, 172.16.11.10:51918]), local_addr=172.16.11.20:35858
                            org.jgroups.protocols.UDP --> sending msg to 172.16.11.200:32790 (src=172.16.11.20:35858), headers are {GMS=GmsHeader[JOIN_RSP]: join_rsp=view: [172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790], digest: 172.16.11.11:51210: [0 : 0], 172.16.11.13:38513: [0 : 0], 172.16.11.10:51918: [4481 : 4482], 172.16.11.12:40087: [0 : 0], 172.16.11.13:38520: [0 : 0], 172.16.11.200:32790: [0 : 0], 172.16.11.20:35858: [3781 : 3782], 172.16.11.13:38533: [0 : 0], 172.16.11.191:37204: [3685 : 3686], UDP=[channel_name=ProvCache-LABS], UNICAST=[UNICAST: DATA, seqno=1]}
                            org.jgroups.protocols.UDP --> message is [dst: 172.16.11.20:35858, src: 172.16.11.200:32790 (3 headers), size = 0 bytes], headers are {GMS=GmsHeader[VIEW_ACK]: view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790], UNICAST=[UNICAST: DATA, seqno=2], UDP=[channel_name=ProvCache-LABS]}



                            org.jgroups.protocols.UDP --> message is [dst: 224.7.8.9:45567, src: 172.16.11.12:40087 (2 headers), size = 0 bytes], headers are {UDP=[channel_name=ProvCache-LABS], FD=[FD: SUSPECT (suspected_mbrs=[172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533], from=172.16.11.12:40087)]}
                            org.jgroups.protocols.FD --> [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533], from=172.16.11.12:40087)]
                            org.jgroups.protocols.VERIFY_SUSPECT --> verifying that 172.16.11.13:38513 is dead
                            org.jgroups.protocols.UDP --> sending msg to 172.16.11.13:38513 (src=172.16.11.10:51918), headers are {VERIFY_SUSPECT=[VERIFY_SUSPECT: ARE_YOU_DEAD], UDP=[channel_name=ProvCache-LABS]}
                            org.jgroups.protocols.VERIFY_SUSPECT --> diff=2034, mbr 172.16.11.13:38513 is dead (passing up SUSPECT event)
                            org.jgroups.protocols.VERIFY_SUSPECT --> diff=2034, mbr 172.16.11.13:38533 is dead (passing up SUSPECT event)
                            org.jgroups.protocols.VERIFY_SUSPECT --> diff=2034, mbr 172.16.11.13:38520 is dead (passing up SUSPECT event)
                            org.jgroups.protocols.pbcast.GMS --> processing [SUSPECT(172.16.11.13:38513), SUSPECT(172.16.11.13:38533), SUSPECT(172.16.11.13:38520)]
                            org.jgroups.blocks.RequestCorrelator --> suspect=172.16.11.13:38513
                            org.jgroups.blocks.RequestCorrelator --> suspect=172.16.11.13:38533
                            org.jgroups.blocks.RequestCorrelator --> suspect=172.16.11.13:38520
                            org.jgroups.protocols.pbcast.GMS --> suspected members=[172.16.11.13:38513, 172.16.11.13:38533, 172.16.11.13:38520], suspected_mbrs=[172.16.11.13:38513, 172.16.11.13:38533, 172.16.11.13:38520]
                            



                            As per these logs, the coordinator identifies the dead members correctly but doesn't update the view properly. Please advise on this.

                            Please tell us how to overcome this.

                            • 11. Re: FD Issue
                              karnivas

                              Just to give an update on this: we identified that the dead (suspected) member's neighbour had a wrong gateway entry in its routing table configuration.

                              Instead of its own IP, the gateway entry pointed to another IP which was also part of the same cluster.

                              Once that instance was restarted, the view was updated successfully and replicated to all members.

                              Could this have been the cause of the problem?

                              Please clarify so that we can analyze this further.

                              • 12. Re: FD Issue
                                karnivas

                                Any thoughts on this?
                                Please reply.