8 Replies Latest reply on Mar 23, 2003 2:36 AM by slaboure

    Lost node behavior

    usiwill

      Hello all,

      I have begun experimenting with clustering under JBoss, and have a question regarding the behavior of a cached interface to a SLSB deployed across my nodes. Here's my setup:

      3 nodes, each running JBoss 3.0.6 on Java 2 SDK 1.4.1_01. I have a Stateless Session Bean (SLSB) deployed on each of these 3 nodes. The SLSB has a method that accepts an int as a parameter and simply writes a message to the server logs.

      I have a client running on a different server and am seeing the following behavior when I drop nodes out of the cluster:

      The client properly distributes requests across all three nodes. If I drop node 3, the client pauses and, after a lengthy period of time, properly distributes requests across the remaining two nodes. When I add node 3 back in, the client continues to send requests to those two nodes only, and not to the third.

      Is there a way to have the client detect that the third node has returned and to have it start using it again?

      Thanks,

      Will

        • 1. Re: Lost node behavior
          usiwill

          Sorry, forgot to mention I am running RedHat Linux 7.3 on all machines.

          -Will

          • 2. Re: Lost node behavior
            nick.mills

            You said you were caching the interface? It isn't a good idea to cache for clustering, as the cluster topology is stored in the HAProxy(?), which gets downloaded anew every time you do context.lookup, home.create (and of course EJB method calls).

            Try not using your cache and repeat the test.

            • 3. Re: Lost node behavior
              nick.mills

              Maybe I should qualify the above. If you do cache you run the risk of session-timeout exceptions:
              "Could not activate; failed to recover session (session has been probably removed by session-timeout)"
              - or your application holding outdated topology information, and hence not being able to connect to your nodes.
              However, if you catch these exceptions and retry your context.lookup, etc., you should be OK.

              That way you can have your cake and eat it too.
              I could be wrong, but maybe Sacha has something to say.
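
The catch-and-retry idea above can be sketched roughly as follows. This is only an illustration written for a current JDK (it uses generics and lambdas, which the JDK 1.4 in this thread does not have). `RefreshingRef`, its `Call` interface and the `maxAttempts` parameter are hypothetical names, not part of JBoss; the `lookup` callback stands in for whatever `context.lookup` / `home.create` sequence the client normally performs. It catches every `Exception` for brevity, whereas a real client would probably narrow that to the remote and session-timeout exceptions it actually expects.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper: caches a looked-up reference and refreshes it when a call fails. */
final class RefreshingRef<T> {

    /** A call against the cached reference that may fail with a checked exception. */
    interface Call<T, R> {
        R apply(T target) throws Exception;
    }

    private final Callable<T> lookup; // stands in for context.lookup / home.create
    private T cached;                 // null means "needs a fresh lookup"

    RefreshingRef(Callable<T> lookup) {
        this.lookup = lookup;
    }

    /** Runs the call, dropping the cached reference and retrying after each failure. */
    synchronized <R> R invoke(Call<T, R> call, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (cached == null) {
                    cached = lookup.call(); // fresh lookup, fresh reference
                }
                return call.apply(cached);
            } catch (Exception e) {
                last = e;
                cached = null; // assume the reference is stale; look it up again next time
            }
        }
        throw last; // every attempt failed; surface the most recent error
    }
}
```

A client would keep one `RefreshingRef` per bean and route each bean call through `invoke`, so the cached reference is reused while it works and replaced only after a failure.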

              • 4. Re: Lost node behavior
                usiwill

                > Maybe I should qualify the above. If you do cache you
                > run the risk of session-timeout exceptions:

                Nick,

                Thanks for your replies.

                My understanding was that if you didn't cache, the requests to the session bean would not round-robin. Am I wrong about this?

                Thanks,

                Will

                • 5. Re: Lost node behavior
                  slaboure

                  Hello,

                  What you are doing is correct; the behaviour you get is strange.

                  When you say "when I drop node3", what do you mean? Do you kill it or do you simply disconnect it?

                  If you kill it and it doesn't work when you restart it, then that's very strange.

                  If you remove it from the network, could you please try to modify your cluster-service.xml file and, in the HAPartition definition, modify the GMS protocol: set the shun attribute to true.

                  Cheers,


                  sacha

                  • 6. Re: Lost node behavior
                    usiwill

                    Sacha,

                    Thanks for your response. When I said drop node 3, I meant that I am just unplugging the patch cable from my switch. I am not shutting down JBoss.

                    What I am seeing is that when I unplug one of the nodes, the client freezes (each thread blocks on the request it was trying to process). When I plug the node back in, the client is able to continue where it left off.

                    Setting the shun attribute to true did not change the behaviour.

                    The only modification I made to cluster-service.xml was to add the bind_addr attribute to the UDP tag. I had to do this because the server has multiple interfaces.

                    Thanks,

                    Will

                      • 8. Re: Lost node behavior
                        slaboure

                        Hello Will,

                        That's probably correct behaviour. You are focusing on a single client; let's take a more global approach.

                        1°) your client
                        When you disconnected server 3, your client most probably had an open connection to it (and was exchanging data). Once the cable is unplugged, your client has no way to differentiate between a dead node and a node that is hard to reach (network congestion, slow server, etc.), so it has to rely on a timeout. If you wait long enough, your client will fail over correctly. To change the timeout, take a look at the RMI properties in the Sun documentation.

                        2°) other clients
                        If you run other clients at the same time, you will see that clients that had no existing connection to node 3 keep working correctly without being disturbed.
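
As a rough illustration of the timeout advice in point 1, the sketch below sets one RMI-related system property from client code. The thread does not say which property is meant; `sun.rmi.transport.tcp.responseTimeout` is an assumption here. It is a Sun-implementation-specific property (a read timeout in milliseconds for responses on an established connection), and whether it governs the stall seen by a given JBoss client depends on the invoker in use. `RmiTimeouts` and `configure` are hypothetical names.

```java
/** Hypothetical client-side setup; the property choice is an assumption, not from the thread. */
final class RmiTimeouts {

    // Sun-specific RMI property: how long (ms) the client waits for a reply on an open connection.
    static final String RESPONSE_TIMEOUT = "sun.rmi.transport.tcp.responseTimeout";

    /** Sets the timeout; call this early, before the client makes its first remote call. */
    static void configure(long millis) {
        if (millis <= 0) {
            throw new IllegalArgumentException("timeout must be positive");
        }
        System.setProperty(RESPONSE_TIMEOUT, Long.toString(millis));
    }

    public static void main(String[] args) {
        configure(10000L); // illustrative value only
        System.out.println(RESPONSE_TIMEOUT + "=" + System.getProperty(RESPONSE_TIMEOUT));
    }
}
```

The same property can be passed on the client's command line with `-D` instead of being set in code.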

                        Furthermore, it always takes much more time to detect an unplugged node than a running, plugged-in node with a dead JBoss instance. In the first case, only a timeout can be used. In the second case, the OS can answer very quickly and say: nobody is listening on this TCP port.
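
The difference between the two cases can be seen with a plain TCP connect, independent of JBoss. The sketch below is only an illustration: `NodeProbe`, `probe` and the `Result` values are hypothetical names. A host that is up but has nothing listening on the port typically rejects the connection at once, while an unplugged host gives no answer at all, so the caller only learns something when its own timeout expires.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

/** Hypothetical probe showing "refused quickly" versus "no answer until the timeout". */
final class NodeProbe {

    enum Result { REACHABLE, REFUSED, TIMED_OUT, OTHER_ERROR }

    static Result probe(String host, int port, int timeoutMillis) {
        Socket socket = new Socket();
        try {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return Result.REACHABLE;   // something accepted the connection
        } catch (SocketTimeoutException e) {
            return Result.TIMED_OUT;   // no answer at all, e.g. an unplugged host
        } catch (ConnectException e) {
            return Result.REFUSED;     // typically: host is up, nobody listening on the port
        } catch (IOException e) {
            return Result.OTHER_ERROR; // e.g. no route to host, unknown host
        } finally {
            try {
                socket.close();
            } catch (IOException ignored) {
                // nothing useful to do if close fails
            }
        }
    }
}
```

Probing a node whose JBoss process has been killed would be expected to return `REFUSED` almost immediately, whereas probing a node with its cable pulled would be expected to block for roughly `timeoutMillis` before returning `TIMED_OUT`.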

                        Cheers,



                        Sacha