2 Replies Latest reply on Mar 6, 2007 3:26 AM by kkrivopustov

Retrieving state problem

kkrivopustov Mar 2, 2007 11:45 AM

Hello,
We have a cluster with 2 nodes and need to restart one server while the other keeps running. Sometimes, after several restarts of each node in turn, the TreeCache doesn't retrieve the state on startup because it doesn't see the other node. After some time it finds the other node, but its state is still not updated. This problem is very serious for us: we store user sessions in the cache, so if the new node doesn't receive the state from the existing node, all user requests to the new node fail.

Here is a log excerpt from the starting node:

2007-03-02 19:45:08,367 DEBUG [org.jboss.cache.TreeCache] Starting jboss.cache:service=GearSessionsTreeCache
2007-03-02 19:45:11,878 INFO [org.jboss.cache.TreeCache] viewAccepted(): [192.168.3.71:7810|0] [192.168.3.71:7810]
2007-03-02 19:45:11,878 INFO [org.jboss.cache.TreeCache] TreeCache local address is 192.168.3.71:7810
2007-03-02 19:45:11,878 DEBUG [org.jboss.cache.TreeCache] transferred state is null (may be first member in cluster)
2007-03-02 19:45:11,894 INFO [org.jboss.cache.TreeCache] State could not be retrieved (we are the first member in group)
...
2007-03-02 19:45:24,192 INFO [org.jboss.cache.TreeCache] viewAccepted(): MergeView::[192.168.3.65:7810|5] [192.168.3.65:7810, 192.168.3.71:7810], subgroups=[[192.168.3.71:7810|4] [192.168.3.65:7810], [192.168.3.71:7810|0] [192.168.3.71:7810]]

We use the JGroups TCP stack:

 <TCP bind_addr="192.168.3.71" start_port="7810" loopback="true"/>
 <TCPPING initial_hosts="192.168.3.65[7810]"
 port_range="3"
 timeout="3500"
 num_initial_members="3"
 up_thread="true"
 down_thread="true"/>
 <MERGE2 min_interval="5000" max_interval="10000"/>
 <FD shun="true" timeout="2500" max_tries="5" up_thread="true" down_thread="true" />
 <VERIFY_SUSPECT timeout="1500" down_thread="false" up_thread="false" />
 <pbcast.NAKACK down_thread="true" up_thread="true" gc_lag="100" retransmit_timeout="3000" />
 <pbcast.STABLE desired_avg_gossip="20000" down_thread="false" up_thread="false" />
 <pbcast.GMS join_timeout="5000"
 join_retry_timeout="2000"
 shun="false"
 print_local_addr="false"
 down_thread="true"
 up_thread="true"/>
 <pbcast.STATE_TRANSFER up_thread="true" down_thread="true" />

Any help with this would be greatly appreciated.

1. Re: Retrieving state problem

manik Mar 5, 2007 7:32 AM (in response to kkrivopustov)

This is because when the node comes online, it thinks it is the first node in the cluster. The merge view that is sent out later will not trigger a state transfer event.

Do you see anything on the other node's logs about this node coming up?

Also, have you tried increasing the timeouts on TCPPING to give the node more time to find the rest of the cluster before it decides that it is alone?
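
For illustration, a longer discovery window could look like the fragment below. It is only a sketch based on the TCPPING element posted above: the timeout of 10000 ms is an assumed example value, not a tested recommendation, and all other attributes are copied unchanged.

```xml
<!-- Sketch only: timeout raised from 3500 to an assumed 10000 ms so that
     discovery waits longer before the node concludes it is alone. -->
<TCPPING initial_hosts="192.168.3.65[7810]"
         port_range="3"
         timeout="10000"
         num_initial_members="3"
         up_thread="true"
         down_thread="true"/>
```

A longer timeout lengthens startup when the node really is the first member, so the value is a trade-off between startup time and the risk of missing the existing member.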
2. Re: Retrieving state problem

kkrivopustov Mar 6, 2007 3:26 AM (in response to kkrivopustov)

Thank you for the reply.

Two days ago I set the logging level for org.jgroups to INFO (previously it was WARN), and the problem magically disappeared: I still can't reproduce it. Now it says "created socket to ..." on the new node and "input_cookie is bela" on the master node.

If it appears again, I will play with logging and TCPPING timeouts.
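
For reference, a logging change like the one described above is typically made in the server's log4j configuration. The fragment below is a sketch that assumes a log4j.xml file using category elements; the actual file and its location in this deployment are not given in the thread.

```xml
<!-- Sketch only: assumes a log4j.xml-style configuration.
     Raises org.jgroups from WARN to INFO, as described in the post. -->
<category name="org.jgroups">
   <priority value="INFO"/>
</category>
```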