7 Replies Latest reply on Feb 1, 2008 3:04 AM by aparolini88

    Handling of 'deployments taking ~1 minute' scenario

    galder.zamarreno

      Quite often, users/customers encounter the ~1 min deployments scenario. A node with this slow startup can in theory be considered a working node, with the caveat that some nodes may be missing DRM information for EJB proxies, for example. However, I have yet to see a customer who was happy to carry on with such a slow node, IOW a customer who considered a node with these symptoms to be healthy.

      I covered debugging such a scenario in the http://www.jboss.com/index.html?module=bb&op=viewtopic&t=127721 thread, but should we react to this scenario in a different way? Should we maybe evict the nodes that fail to respond in time from the cluster? It might be the case that they're just slow, but if you kicked them out of the cluster, the next deployments would (potentially) work fine.

      Currently, we only log ignored missing responses at TRACE level.
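
      To illustrate, here's a rough sketch of what that looks like at the JGroups level. This is not the actual AS/HAPartition code; the method name and the logging are placeholders, and the exact RpcDispatcher/RspList API varies between JGroups releases. After a GET_ALL RPC, members that never answered show up in the RspList as "not received", and we could report them at WARN instead of TRACE:

          import org.jgroups.Address;
          import org.jgroups.blocks.GroupRequest;
          import org.jgroups.blocks.RpcDispatcher;
          import org.jgroups.util.Rsp;
          import org.jgroups.util.RspList;

          public class MissingRspLogger {

              // Invokes a cluster-wide RPC and reports members that never answered.
              // 'methodName' is a placeholder for whatever the clustering code invokes.
              public static void callAndReportMissing(RpcDispatcher dispatcher,
                                                      String methodName,
                                                      long timeoutMs) throws Exception {
                  RspList rsps = dispatcher.callRemoteMethods(
                          null,                  // null destination list = all cluster members
                          methodName,
                          new Object[0], new Class[0],
                          GroupRequest.GET_ALL,  // wait for a reply from every member...
                          timeoutMs);            // ...up to the RPC timeout

                  for (Object obj : rsps.values()) {  // iteration style differs per JGroups version
                      Rsp rsp = (Rsp) obj;
                      if (!rsp.wasReceived()) {
                          Address member = rsp.getSender();
                          // Today this is only visible at TRACE; a WARN would make a slow
                          // or wedged member much easier to spot.
                          System.err.println("WARN: no RPC response from " + member
                                  + " (suspected=" + rsp.wasSuspected() + ")");
                      }
                  }
              }
          }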

        • 1. Re: Handling of 'deployments taking ~1 minute' scenario
          brian.stansberry

          IIRC, the 1 min deployment scenario was due to a deadlock where the AS code used the JGroups up_handler to make an RPC, thus preventing the RPC response from arriving. Wasn't this a bug that was fixed?
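
          For reference, the pattern behind that deadlock is easy to reproduce outside of JGroups. The sketch below is purely illustrative (it is not the AS or JGroups code): a single "delivery" thread makes a synchronous call whose reply only that same thread could ever process, so the caller just sits there until the timeout fires.

              import java.util.concurrent.ExecutorService;
              import java.util.concurrent.Executors;
              import java.util.concurrent.Future;
              import java.util.concurrent.TimeUnit;
              import java.util.concurrent.TimeoutException;

              public class UpHandlerDeadlockDemo {
                  public static void main(String[] args) throws Exception {
                      // Single thread standing in for the JGroups message-delivery (up_handler) thread.
                      ExecutorService deliveryThread = Executors.newSingleThreadExecutor();

                      Future<String> outcome = deliveryThread.submit(() -> {
                          // While "delivering" a message, the handler makes a synchronous call.
                          // The reply would also have to be processed by this very thread,
                          // which is blocked right here waiting for it.
                          Future<String> rpcReply = deliveryThread.submit(() -> "reply");
                          try {
                              return rpcReply.get(5, TimeUnit.SECONDS); // stand-in for the ~1 min RPC timeout
                          } catch (TimeoutException e) {
                              return "timed out: the blocked delivery thread can never process the reply";
                          }
                      });

                      System.out.println(outcome.get());
                      deliveryThread.shutdownNow();
                  }
              }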

          In that case, it was the node sending the RPC that was faulty. In some other case where a remote node "isn't responding" all you could do would be to send a message to "commit suicide" -- there's no mechanism to evict a node from the group outside of JGroups' own failure detection. But if the node isn't responding to RPCs, it likely wouldn't respond to the "commit suicide" either.

          Logically, I could see some benefit in some sort of self-healing approach where cluster members detect faults and restart themselves or send commands to others telling them to restart. But this will take a lot of thought.

          • 2. Re: Handling of 'deployments taking ~1 minute' scenario
            galder.zamarreno

            Yeah, that bug was fixed. In this latest case, the root cause was log4j logging over NFS, which was having issues on some cluster nodes. Logging over NFS is never a good idea, but I needed to be 100% sure that this was the cause of the slow deployments and not another JGroups bug ;).

            I was lucky enough that, when digging through the case, I was able to match the nodes for which the RPC call failed to the logs of two nodes showing log4j issues ("stale NFS handle").

            brian.stansberry wrote: In some other case where a remote node "isn't responding" all you could do would be to send a message to "commit suicide" -- there's no mechanism to evict a node from the group outside of JGroups' own failure detection. But if the node isn't responding to RPCs, it likely wouldn't respond to the "commit suicide" either.


            If it wasn't responding to RPCs, FD/FD_SOCK would eventually discover that the node is not responding. In this case though, the failure detection layer was OK, so the cluster was not dismantled, yet something was still disrupting an otherwise healthy cluster. The customer was concerned about such a scenario.
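
            For reference, this is roughly how an application sees what FD/FD_SOCK decide: a member that stops responding is first suspected and, once the group membership protocol confirms it, drops out of the view. Here is a small sketch (JGroups 2.x-style API; the cluster name is a placeholder and details differ between versions) of listening for both events -- in this case neither ever fired, because failure detection still considered the node alive:

                import org.jgroups.Address;
                import org.jgroups.JChannel;
                import org.jgroups.ReceiverAdapter;
                import org.jgroups.View;

                public class SuspicionWatcher {
                    public static void main(String[] args) throws Exception {
                        // Default protocol stack; it typically includes FD_SOCK plus a heartbeat-based FD.
                        JChannel channel = new JChannel();
                        channel.setReceiver(new ReceiverAdapter() {
                            @Override
                            public void suspect(Address member) {
                                // Failure detection thinks this member is dead; GMS will try to confirm it.
                                System.out.println("Suspected member: " + member);
                            }
                            @Override
                            public void viewAccepted(View view) {
                                // A confirmed-dead member is excluded from the new view.
                                System.out.println("New view: " + view);
                            }
                        });
                        channel.connect("DefaultPartition"); // placeholder cluster/partition name
                    }
                }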

            brian.stansberry wrote: Logically, I could see some benefit in some sort of self-healing approach where cluster members detect faults and restart themselves or send commands to others telling them to restart. But this will take a lot of thought.


            I'll file a JIRA tomorrow to track this.

            • 4. Re: Handling of 'deployments taking ~1 minute' scenario
              aparolini88

              Hi,

              I am the "customer" that incoutered this issue with log4j and NFS. In our TEST Jboss cluster, we solved this by installing all our JBOSS AS into a local file system insteed of NFS. Since then, it never happed again.

              But just this morning, we had the same issue in our PRODUCTION environment, meaning that all inter-JBoss EJB calls were dreadfully slow. We solved it by restarting one of our JBoss instances that was somehow "stuck" (but not because of log4j that time!).

              The problem is this: when just one JBoss server is stuck, the whole JGroups cluster is stuck. This issue is terrible for us, because we have a big heterogeneous cluster with 20 applications (nodes) on it. So when this issue arises, none of the applications work.

              My question is this: if we switch to several homogeneous clusters (each application having its own small cluster (partition) of 2 to 4 nodes), will this lessen the impact of this issue?

              Thanks in advance.

              • 5. Re: Handling of 'deployments taking ~1 minute' scenario
                brian.stansberry

                It's hard for me to say for sure without knowing exactly what your issue is. All I can do is *assume* the issue was due to slow or non-existent replies to intra-cluster calls made by the jboss:service=DefaultPartition service.

                If that was the problem, then yes, using many smaller clusters could help isolate the problem to a few machines rather than having it affect all of them.

                But be careful about the assumption above. In most use cases there are not a lot of calls made via jboss:service=DefaultPartition after the nodes have started. If you are seeing continued cluster-wide issues after startup, there's a good chance this is not the issue.
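
                If you do go down that route, the usual way to keep the partitions from talking to each other is to give each one its own partition name and its own multicast address/port. Something along these lines with the run.sh startup switches (the partition names and addresses here are made up; check which options your AS version supports):

                    # application A gets its own small partition
                    ./run.sh -c all -g AppA-Partition -u 239.255.100.1

                    # application B runs in a separate partition on a different multicast group
                    ./run.sh -c all -g AppB-Partition -u 239.255.100.2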

                Please be sure to raise a support case on the customer support portal, where we have better tools available to diagnose your problem.

                • 6. Re: Handling of 'deployments taking ~1 minute' scenario
                  aparolini88

                  What we observed here is that, with our JGroups TCP stack (with a GossipRouter), when one node stops answering intra-cluster RMI calls, we see these two side effects: a 1 minute timeout per EJB on HA-JNDI name binding at JBoss boot time, and a 1 minute timeout on any HA-JNDI EJB call.

                  Ok. I'll reopen the previous support case I had with Galder Z.

                  • 7. Re: Handling of 'deployments taking ~1 minute' scenario
                    aparolini88

                    Correction: ", and 1 minute timeout on any HAJNDI EJB context lookup"