7 Replies Latest reply on Jan 25, 2007 8:04 AM by belaban

handleJoin(node:port) failed, retrying

dode Feb 20, 2006 6:50 AM

I have set up a cluster with four nodes and custom partition name following the documentation. I have a test server running in the same subnet, but with the DefaultPartition name.

All four production nodes and the test server were running fine until node1 of the production cluster was taken down. It did not come up again and kept repeating the warning "handleJoin(node1:port) failed, retrying". I then took down the other three production nodes as well, but after that none of the nodes would come up; they all failed with the same warning.

Finally I took down the test server as well and now I could start all four prod. nodes normally.

Now my question is: how can I avoid this? Should each cluster partition run on its own network? If so, what is the point of the partition name?

I read in a post describing a similar problem that using TCP instead of UDP solved it. Should I do that as well? If yes, what should the "initial_hosts" be: "thishost + othernodes" on each node?

Thanks in advance
Torsten

1. Re: handleJoin(node:port) failed, retrying

belaban Feb 20, 2006 11:21 AM (in response to dode)

Hmm, do you have some logs available? What's the difference between the prod and test servers in terms of configuration?
Have you seen http://wiki.jboss.org/wiki/Wiki.jsp?page=HandleJoinProblem ?
2. Re: handleJoin(node:port) failed, retrying

webservicesadmin Feb 20, 2006 12:07 PM (in response to dode)

Hi,
You didn't mention whether you are using a web server or not.
Anyway, make sure you are starting your instances with this command:
run -c node1 -Djboss.partition.name=CLUSTER_NAME

Regards,
Rajesh.G
3. Re: handleJoin(node:port) failed, retrying

dode Feb 20, 2006 12:20 PM (in response to dode)

"bela@jboss.com" wrote:
Hmm, do you have some logs available? What's the difference between the prod and test servers in terms of configuration?
Have you seen http://wiki.jboss.org/wiki/Wiki.jsp?page=HandleJoinProblem ?

Yes, I have a lot of logs. Which part would be interesting to look at? Everything looks fine until the warning occurs, and from then on there is nothing but the warning repeating. The test server is configured identically, apart from the partition name.
Thanks for the wiki link, I should have searched there as well! I will read it now of course and get back if I find out something interesting. I still have some time to play with the environment before it goes into production :-)

Torsten
4. Re: handleJoin(node:port) failed, retrying

dode Feb 20, 2006 12:28 PM (in response to dode)

"webservicesadmin" wrote:
Hi,
You didn't mention whether you are using a web server or not.
Anyway, make sure you are starting your instances with this command:
run -c node1 -Djboss.partition.name=CLUSTER_NAME

Regards,
Rajesh.G

What do you mean by "using any webserver"? The application includes a servlet, and I have set up HTTP failover/load balancing with Apache + mod_jk as described in the documentation. It works perfectly well.
Isn't the -c argument supposed to be used for the configuration such as all, default, ...?

Torsten
5. Re: handleJoin(node:port) failed, retrying

dode Feb 21, 2006 11:03 AM (in response to dode)

Hmmm strange, now I cannot reproduce the problem anymore. I haven't changed anything, but I can restart any node in any order, or all of them, and the same test server is still running in the same subnet.
Really weird, because yesterday, I couldn't get any of the nodes up until I took down the test server. Somehow I have the impression that the nodes were trying to contact a controller that wasn't available or unable to serve as controller (the test server???).

Anyway, more out of interest, I have now switched from UDP to TCP. It also seems to work fine, so let's see if the problem ever occurs again.

BR
Torsten
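
On the "initial_hosts" question from the first post, here is a rough sketch of what the TCP part of a JGroups stack might look like. The host names node1 to node4 and port 7800 are hypothetical placeholders, the timeout and member counts are assumed values, and the rest of the stack is left out. It assumes the same list, including the local host, is used on every node, with only bind_addr differing per node.

```xml
<!-- Fragment only: the remaining protocols of the stack are omitted.
     Host names and port 7800 are hypothetical placeholders. -->
<TCP bind_addr="node1" start_port="7800"/>
<!-- Same initial_hosts list on every node (assumption: listing the
     local host alongside the others is acceptable). -->
<TCPPING initial_hosts="node1[7800],node2[7800],node3[7800],node4[7800]"
         port_range="1" timeout="3000" num_initial_members="3"/>
```

With a static host list like this, discovery does not use IP multicast, so it would not depend on other partitions multicasting in the same subnet.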
6. Re: handleJoin(node:port) failed, retrying

mazx Jan 25, 2007 7:22 AM (in response to dode)

In our environment this usually happens when a node gets "killed" (as in, the Java process receives a 'kill -9' signal), and this is how we usually reproduce the problem.
7. Re: handleJoin(node:port) failed, retrying

belaban Jan 25, 2007 8:04 AM (in response to dode)

Make sure you have both FD and FD_SOCK in your stack
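
For illustration, a minimal sketch of how the two failure detectors could sit in a JGroups 2.x-style stack definition. The timeout and max_tries values are assumptions chosen for the example, not recommendations, and the other protocols are elided; compare against the stack shipped with your version.

```xml
<!-- Fragment only: other protocols of the stack are omitted. -->
<!-- FD_SOCK: notices a crashed member through its closed TCP socket,
     for example when the Java process is killed with 'kill -9'. -->
<FD_SOCK/>
<!-- FD: heartbeat-based detection for members that hang or become
     unreachable. Values are assumed for illustration. -->
<FD timeout="10000" max_tries="5" shun="true"/>
```

The two complement each other: FD_SOCK reacts quickly to a dead process, while FD catches members whose process is still alive but no longer responding.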