3 Replies Latest reply on Aug 9, 2007 1:49 PM by Alexey Kharlamov

    JNDIDetector removes detections

    Alexey Kharlamov Newbie

      Hello!

      I have a custom clustering system built on top of JBoss Remoting 2.2.0.SP4. It uses HAJndi and JNDIDetector to manage and discover nodes. HAJndi is hosted on the cluster coordinators. Everything worked fine until we moved to pre-production.

      The network load has increased, and the central nodes have begun to treat slave nodes as dead and remove them from the registry. However, at that very moment the slave nodes are communicating with the coordinators without any problem, so this is not a hardware failure. I can also open a TCP connection by telnet from a coordinator node to the "lost" slave node while the failure is happening.

      Below is an excerpt from the log. It seems that the slave and the master race with each other to remove and re-insert the detection object in JNDI.


      2007-07-04 10:04:50,406 DEBUG [org.jboss.remoting.detection.jndi.JNDIDetector] Removed detection Detection (org.jboss.remoting.detection.Detection@8cddac31)
      2007-07-04 10:05:56,848 DEBUG [org.jboss.remoting.detection.jndi.JNDIDetector] Removed detection Detection (org.jboss.remoting.detection.Detection@8cddac31)
      2


      The first thing I found while reviewing the Remoting code is that connection validation uses a small timeout and only a single retry. I'm going to patch the code and check whether that solves the problem. But I suppose there was a reason for making the timeout so small, and this does not seem like a proper solution anyway.

      I see another unusual artifact in the slaves' logs:

      2007-07-04 00:00:56,447 INFO [org.jboss.remoting.transport.socket.MicroSocketClientInvoker] Received version 254: treating as end of file
      2007-07-04 00:00:56,447 INFO [org.jboss.remoting.transport.socket.MicroSocketClientInvoker] Received version 254: treating as end of file


      Could this be connected with the first problem? What does it mean?

      Any hints or advice?

        • 1. Re: JNDIDetector removes detections
          Alexey Kharlamov Newbie

          I have increased the timeouts, and it helped. However, I still see the same problem, only less frequently.

          I propose moving the connection liveness check into a separate thread and increasing the timeouts further. Would that work? I can write a patch, but I want to be sure this is sensible and will not break something inside Remoting.

          Help from a Remoting guru would be much appreciated.
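
          To make the separate-thread proposal concrete, here is a minimal sketch in plain Java. It is not Remoting code: the class name BackgroundLivenessMonitor, the probe callback and the period parameter are all hypothetical, and the probe stands in for whatever connection check the detector would really perform.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of JBoss Remoting.
// Runs a liveness probe on its own daemon thread so that a slow probe
// never blocks the thread that reads the cached result.
final class BackgroundLivenessMonitor implements AutoCloseable {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "liveness-monitor");
                t.setDaemon(true);
                return t;
            });

    // Last observed state; assumed alive until the first probe says otherwise.
    private final AtomicBoolean alive = new AtomicBoolean(true);

    BackgroundLivenessMonitor(BooleanSupplier probe, long periodMillis) {
        // Fixed delay: the next probe starts periodMillis after the previous one ends.
        scheduler.scheduleWithFixedDelay(() -> {
            boolean ok;
            try {
                ok = probe.getAsBoolean();
            } catch (RuntimeException e) {
                ok = false; // a failing probe counts as "not alive"
            }
            alive.set(ok);
        }, 0L, periodMillis, TimeUnit.MILLISECONDS);
    }

    // Returns the most recent probe result without blocking.
    boolean isAlive() {
        return alive.get();
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

          Callers would read isAlive() instead of probing inline, so a long timeout only delays the monitor thread.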

          • 2. Re: JNDIDetector removes detections
            Ron Sigal Master

            Hi Alexey,

            Sorry for the delay in responding. Have you solved your problem, or is it still outstanding?

            -Ron

            • 3. Re: JNDIDetector removes detections
              Alexey Kharlamov Newbie

              Hi Ron,

              No, I have not resolved the problem completely. However, I understand its cause better now.

              As I said, there are two kinds of servers: coordinators and slaves. The slaves run with a very high degree of concurrency, i.e. the load average is around 6-7, so sometimes they are unable to answer ping requests quickly enough.

              To make things work, I patched the JBoss Remoting source to increase the connection checker timeout and the number of retries. This brought the failure rate down to an acceptable level. We now see a failure event once or twice per hour.

              However, it would be good to improve JBoss Remoting by making the server failure detection parameters configurable. Moreover, I think ConnectionValidator should pause between validity checks. At the moment it fires all of its validation attempts back to back, so an overloaded slave server may miss every one of them.

              It would also be a good idea to optimize the JNDI detector somehow. Currently every server checks the liveness of all the others. With a large number of nodes, if just one server fails to connect to a node for whatever reason, the global record in JNDI is updated, even though the other servers may still be able to reach that node. As the number of cluster nodes grows, the problem will appear more frequently.
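
              The idea of pausing between validity checks could look roughly like the sketch below. It is an illustration only, not the actual ConnectionValidator logic: PacedLivenessCheck, the probe callback, maxAttempts and pauseMillis are all hypothetical names.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not part of JBoss Remoting.
final class PacedLivenessCheck {
    // Tries the probe up to maxAttempts times and sleeps pauseMillis between
    // failed attempts, so a briefly overloaded peer gets time to recover
    // instead of receiving every attempt in one burst.
    static boolean isAlive(Callable<Boolean> probe, int maxAttempts, long pauseMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (Boolean.TRUE.equals(probe.call())) {
                    return true; // first successful attempt is enough
                }
            } catch (Exception e) {
                // A probe that throws counts as one failed attempt.
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    return false;                       // stop checking when interrupted
                }
            }
        }
        return false;
    }
}
```

              With, say, 3 attempts and a pause of a few seconds, the whole check spans a longer window than three back-to-back pings with the same per-attempt timeout.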
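
              A rough back-of-the-envelope model shows why this gets worse with cluster size. Assume (my assumption, not a measurement) that each observing server independently has probability p of wrongly failing a check against a healthy node, and that a single failed check is enough to remove the record. Then the chance of a false removal per round with n observers is 1 - (1 - p)^n. The class and parameter names below are hypothetical.

```java
// Hypothetical estimate, assuming independent observers and
// removal triggered by any single failed check.
final class FalseRemovalEstimate {
    // Probability that at least one of `observers` checks fails spuriously,
    // given a per-observer spurious failure probability in [0, 1].
    static double probabilityOfFalseRemoval(double perObserverFailure, int observers) {
        if (perObserverFailure < 0.0 || perObserverFailure > 1.0) {
            throw new IllegalArgumentException("probability must be in [0, 1]");
        }
        if (observers < 0) {
            throw new IllegalArgumentException("observers must be non-negative");
        }
        return 1.0 - Math.pow(1.0 - perObserverFailure, observers);
    }
}
```

              The result grows with the number of observers for any p between 0 and 1, which matches what we expect as nodes are added.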

              - Alexey