4 Replies Latest reply on May 15, 2007 4:41 AM by anre42

    Serious production problems

    marlig

      Hi,

      We are using JBoss 4.0.4 under Solaris 5.8 and JDK 1.5 in production. JBoss is fronted by Apache (2.0.43) with mod_jk (1.2.19) for load-balancing the web frontend. We have two JBoss cluster nodes running.
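
      For reference, the mod_jk side is the usual two-worker load-balancer setup, roughly along these lines in workers.properties (the worker names, host names and ports in this sketch are placeholders, not our actual values):

      # Sketch of a two-node mod_jk load-balancer config; hosts/ports are placeholders.
      worker.list=loadbalancer

      worker.node1.type=ajp13
      worker.node1.host=jboss-node1.example.com
      worker.node1.port=8009
      worker.node1.lbfactor=1

      worker.node2.type=ajp13
      worker.node2.host=jboss-node2.example.com
      worker.node2.port=8009
      worker.node2.lbfactor=1

      worker.loadbalancer.type=lb
      worker.loadbalancer.balance_workers=node1,node2
      worker.loadbalancer.sticky_session=1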

      Every couple of days (3-5 days) we run into serious problems which ultimately require a restart of the whole JBoss cluster. The first sign is that the number of Apache processes increases rapidly (from around 30 to well over 200), and then we get an OutOfMemoryError on one of the JBoss nodes. However, we are not sure which is cause and which is effect.

      However, we see one behaviour that we believe might be part of the problem. We were able to recover some information while the server was having the OutOfMemoryErrors, and noticed that some threads seem to be stuck in some sort of socket access.

      The Java stack dump shows entries like this:

      Thread t@125: (state = IN_NATIVE)
      Error occurred during stack walking:
      

      (nothing after the colon)

      for which jstack -m shows this:
      ----------------- t@125 -----------------
      0xff29eccc _read + 0x8
      0xfacdc1a4 Java_java_net_SocketInputStream_socketRead0 + 0x1fc
      

      (just that)
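
      For context, socketRead0 is where a plain blocking read() parks when no data has arrived; without an SO_TIMEOUT on the socket such a read can block indefinitely. A minimal sketch of the difference (the host and port are hypothetical, not taken from our setup):

      import java.io.InputStream;
      import java.net.Socket;
      import java.net.SocketTimeoutException;

      public class SocketReadSketch {
          public static void main(String[] args) throws Exception {
              // Hypothetical endpoint, just for illustration.
              Socket socket = new Socket("some-backend-host", 8009);
              // Without this call, read() below can sit in socketRead0 forever
              // if the peer never sends anything and never closes the connection.
              socket.setSoTimeout(30000);
              InputStream in = socket.getInputStream();
              try {
                  int b = in.read(); // waits at most 30 s because of SO_TIMEOUT
                  System.out.println("read byte: " + b);
              } catch (SocketTimeoutException e) {
                  System.out.println("no data within 30 s, thread not stuck");
              } finally {
                  socket.close();
              }
          }
      }

      Whether the stuck reads in our dump are AJP connections or something else we cannot tell from the trace alone.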

      Has any of you seen this kind of problem before? Or does anyone know that this is normal behaviour and definitely not the source of our OutOfMemoryErrors?

      Btw, we also noticed that during the OutOfMemory situation the memory consumed by byte[] arrays is much higher (by an order of magnitude) than during normal operation.

      Does anyone have any other ideas about the problem?

      Thanks a lot
      Martin

        • 1. Re: Serious production problems
          anre42

          We are struggling with the same kind of problem. Tomcat becomes unresponsive after a while, and sometimes we see an OutOfMemoryError, but not always.

          We are running JBoss 4.0.3_SP1 in a cluster with Apache in front; the Java version is 1.5.0_11 and the OS is Red Hat EL.

          The problem seems to affect only Tomcat; the rest of JBoss seems to run OK. From jstack we see that most threads are "BLOCKED", and the ones that are not are in the "IN_NATIVE" state doing either socketAccept, socketRead or receive. We cannot see any correlation to the load on the server; we can provoke this with only one user. However, it occurs very irregularly, sometimes several times per day, and sometimes a week can go by.

          We would appreciate any hints that could help us solve this problem.

          Cheers!

          /Andras

          • 2. Re: Serious production problems
            marlig

            Hi Andras,

            It's good to know we are not the only ones.

            We are still struggling to reproduce this problem in our test environments. When you say you can provoke it sometimes, what exactly are you doing? We tried with very heavy load, but never got any problems during our tests. It only happens in production, and only every couple of days, which is very annoying.

            Greetings
            Martin

            • 3. Re: Serious production problems
              visprar

              We are also facing similar problems. We are running 2 clusters with 4 nodes. Every 1-2 days all nodes in a cluster get locked up.

              We have no idea what needs to be done.

              Does anyone know under what conditions an entire node can be affected? The only thing I found was:

              "Also, a slow member could slow or even prevent purging of stable messages (http://wiki.jboss.org/wiki/Wiki.jsp?page=JGroupsPbcastSTABLE), so in the worst case, all members could run out of memory because they would never purge stable messages. Exclusion of such a member resumes progress in the stability protocol.

              +

              http://wiki.jboss.org/wiki/Wiki.jsp?page=Shunning"
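
              If it helps anyone cross-check, the protocols that quote talks about normally sit in the JGroups stack in deploy/cluster-service.xml; the snippet below is only a sketch with illustrative values, not recommended settings:

              <!-- STABLE drives the purging of messages already seen by all
                   members; a slow member can delay this, as described above -->
              <pbcast.STABLE desired_avg_gossip="20000"
                             down_thread="false" up_thread="false"/>

              <!-- FD does failure detection; suspected members get excluded,
                   and with shun="true" an excluded-but-alive member has to
                   leave and rejoin the group -->
              <FD timeout="10000" max_tries="5" shun="true"/>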

              • 4. Re: Serious production problems
                anre42

                What I meant by provoking it with one user is that we have had the problem in production with only one test user logged in, so it doesn't seem to be connected to the load on the system. We have now set up a replica (as close as possible, at least) of the prod env for testing, but we have not been able to reproduce the problem.

                We have to get more info about what is actually going on inside JBoss/Tomcat when this happens. Currently we don't have any good analysis tools, at least not any that we can put in the prod env, which is the only place we have this problem. Does anyone have good ideas about tools or techniques we can use to analyze the state and behaviour of JBoss?
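
                One low-impact possibility, using nothing beyond the standard Java 5 API, would be to dump the thread stacks from inside the JVM, something like this sketch (the class and method names are made up for illustration):

                import java.util.Map;

                public class ThreadStateDump {
                    // Print every live thread's name, state and stack trace,
                    // using only Thread.getAllStackTraces() from Java 5.
                    public static void dump() {
                        Map<Thread, StackTraceElement[]> stacks = Thread.getAllStackTraces();
                        for (Map.Entry<Thread, StackTraceElement[]> entry : stacks.entrySet()) {
                            Thread t = entry.getKey();
                            System.out.println(t.getName() + " [" + t.getState() + "]");
                            for (StackTraceElement frame : entry.getValue()) {
                                System.out.println("    at " + frame);
                            }
                            System.out.println();
                        }
                    }

                    public static void main(String[] args) {
                        dump();
                    }
                }

                Something like that could be wrapped in a timer task that logs a snapshot every few minutes, so there is at least something to look at after Tomcat stops responding.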

                /Andreas