7 Replies Latest reply on Jan 25, 2007 8:04 AM by belaban

    handleJoin(node:port) failed, retrying

    dode

      I have set up a cluster with four nodes and a custom partition name, following the documentation. I also have a test server running in the same subnet, but with the DefaultPartition name.

      All four production nodes and the test server were running fine until node1 of the four production nodes was taken down. It did not come up again and kept repeating the warning "handleJoin(node1:port) failed, retrying". I then took down the other three production nodes as well, but after that none of the nodes would come up, all failing with the same warning.

      Finally I took down the test server as well, and then I could start all four production nodes normally.

      Now my question is: how can I avoid this? Should each cluster partition run on its own network? What then is the point of the partition name?

      I read in a post describing a similar problem that using TCP instead of UDP solved it. Should I do that as well? If so, what should the "initial_hosts" be? Should it be "thishost + othernodes" on each node?

      Thanks in advance
      Torsten

        • 1. Re: handleJoin(node:port) failed, retrying
          belaban

          Hmm, do you have any logs available? What's the difference between the prod and test servers in terms of configuration?
          Have you seen http://wiki.jboss.org/wiki/Wiki.jsp?page=HandleJoinProblem ?

          • 2. Re: handleJoin(node:port) failed, retrying
            webservicesadmin

            Hi,
            You didn't mention whether you are using any webserver or not.
            Anyway, make sure you are starting your instances with this command:
            run -c node1 -Djboss.partition.name=CLUSTER_NAME


            Regards,
            Rajesh.G

            • 3. Re: handleJoin(node:port) failed, retrying
              dode

               

              "bela@jboss.com" wrote:
              Hmm, do you have any logs available? What's the difference between the prod and test servers in terms of configuration?
              Have you seen http://wiki.jboss.org/wiki/Wiki.jsp?page=HandleJoinProblem ?


              Yes, I have a lot of logs. Which part would be interesting to look at? Everything looks fine until the warning occurs, and from then on there is nothing but the warning repeating. The test server is configured identically, apart from the partition name.
              Thanks for the wiki link, I should have searched there as well! I will read it now of course and get back if I find out something interesting. I still have some time to play with the environment before it goes into production :-)

              Torsten

              • 4. Re: handleJoin(node:port) failed, retrying
                dode

                 

                "webservicesadmin" wrote:
                Hi,
                You didn't mention whether you are using any webserver or not.
                Anyway, make sure you are starting your instances with this command:
                run -c node1 -Djboss.partition.name=CLUSTER_NAME


                Regards,
                Rajesh.G


                What do you mean by "using any webserver"? The application includes a servlet, and I have set up HTTP failover/load balancing with Apache + mod_jk as described in the documentation. It works perfectly well.
                Isn't the -c argument supposed to be used for the configuration such as all, default, ...?

                Torsten

                • 5. Re: handleJoin(node:port) failed, retrying
                  dode

                  Hmmm strange, now I cannot reproduce the problem anymore. I haven't changed anything, but I can restart any node in any order, or all of them, and the same test server is still running in the same subnet.
                  Really weird, because yesterday, I couldn't get any of the nodes up until I took down the test server. Somehow I have the impression that the nodes were trying to contact a controller that wasn't available or unable to serve as controller (the test server???).

                  Anyway, more out of interest, I have now switched from UDP to TCP. It also seems to work fine, so let's see if the problem ever occurs again.
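
                  For readers wondering about the "initial_hosts" question from the first post, here is a minimal sketch of the transport and discovery part of a TCP-based JGroups stack. The host names node1 to node4 and port 7800 are assumptions for illustration, the attribute values are not tuned recommendations, and the rest of the stack is omitted. The same list, naming every node including the local one, would go on each node; only bind_addr differs per node.

                  ```xml
                  <!-- Sketch only: host names and port are hypothetical. -->
                  <!-- bind_addr is this node's own address; change it on each node. -->
                  <TCP bind_addr="node1" start_port="7800" loopback="true"/>
                  <!-- initial_hosts lists all cluster members, including this node. -->
                  <TCPPING initial_hosts="node1[7800],node2[7800],node3[7800],node4[7800]"
                           port_range="1" timeout="3500" num_initial_members="3"/>
                  ```

                  Because discovery then only contacts the listed hosts, a separate partition on the same subnet that is not in the list should not be asked during the join.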

                  BR
                  Torsten

                  • 6. Re: handleJoin(node:port) failed, retrying
                    mazx

                    In our environment this usually happens when a node gets "killed" (as in the Java process gets a 'kill -9' signal), and this is how we usually reproduce the problem.

                    • 7. Re: handleJoin(node:port) failed, retrying
                      belaban

                      Make sure you have both FD and FD_SOCK in your stack.
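
                      As a rough illustration, the failure detection part of a JGroups protocol stack could look like the fragment below. The timeout and retry values are assumptions for illustration only, and all other protocols in the stack are omitted. FD_SOCK notices a crashed process (for example after 'kill -9') through its closed TCP socket, while FD uses heartbeats to catch members that hang or become unreachable.

                      ```xml
                      <!-- Fragment of a JGroups stack; values are illustrative, not recommendations. -->
                      <!-- FD_SOCK: detects a dead process via its closed socket. -->
                      <FD_SOCK/>
                      <!-- FD: heartbeat-based detection for hung or unreachable members. -->
                      <FD timeout="2500" max_tries="5" shun="true"/>
                      <!-- VERIFY_SUSPECT: double-checks a suspected member before it is excluded. -->
                      <VERIFY_SUSPECT timeout="1500"/>
                      ```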