8 Replies Latest reply on Jun 9, 2009 12:18 PM by mazz

    Jopr 2.2 server crashing sporadically

    fbrueseke

      Hi Jopr experts.

      I am currently load testing my measurement program with Jopr at the very heart of it. After about ~5.5h of operating with only the usual errors the server just crashes. I get a strange error message and then the server goes on and the crash occurs shortly after this. The following are the last lines in the "rhq-server-log4j.log" logfile:

      2009-06-05 19:22:56,246 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] Purging availablities that are older than Thu Jun 05 19:22:56 CEST 2008
      2009-06-05 19:22:58,121 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] Availability data purged [0] - completed in [1875]ms
      2009-06-05 19:22:58,121 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] Database maintenance starting at Fri Jun 05 19:22:58 CEST 2009
      2009-06-05 19:22:58,121 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] Performing hourly database maintenance
      2009-06-05 19:25:09,701 ERROR [org.jboss.remoting.transport.socket.ServerThread] Worker thread initialization failure
      java.lang.reflect.InvocationTargetException
       at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
       at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
       at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
       at java.lang.reflect.Constructor.newInstance(Constructor.java:494)
       at org.jboss.remoting.transport.socket.ServerThread.createServerSocketWrapper(ServerThread.java:706)
       at org.jboss.remoting.transport.socket.ServerThread.dorun(ServerThread.java:364)
       at org.jboss.remoting.transport.socket.ServerThread.run(ServerThread.java:165)
      Caused by: java.net.SocketException: Software caused connection abort: recv failed
       at java.net.SocketInputStream.socketRead0(Native Method)
       at java.net.SocketInputStream.read(SocketInputStream.java:129)
       at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
       at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
       at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
       at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2213)
       at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2226)
       at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2694)
       at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:761)
       at java.io.ObjectInputStream.<init>(ObjectInputStream.java:277)
       at org.jboss.remoting.loading.ObjectInputStreamWithClassLoader.<init>(ObjectInputStreamWithClassLoader.java:95)
       at org.jboss.remoting.serialization.impl.java.JavaSerializationManager.createInput(JavaSerializationManager.java:54)
       at org.jboss.remoting.marshal.serializable.SerializableUnMarshaller.getMarshallingStream(SerializableUnMarshaller.java:72)
       at org.jboss.remoting.marshal.serializable.SerializableUnMarshaller.getMarshallingStream(SerializableUnMarshaller.java:55)
       at org.jboss.remoting.transport.socket.ClientSocketWrapper.createInputStream(ClientSocketWrapper.java:185)
       at org.jboss.remoting.transport.socket.ClientSocketWrapper.createStreams(ClientSocketWrapper.java:164)
       at org.jboss.remoting.transport.socket.ClientSocketWrapper.<init>(ClientSocketWrapper.java:66)
       at org.jboss.remoting.transport.socket.ServerSocketWrapper.<init>(ServerSocketWrapper.java:46)
       ... 7 more
      2009-06-05 19:25:55,571 WARN [org.rhq.enterprise.server.core.AgentManagerBean] Have not heard from agent [PADFBRUESEKE1.pad.orga-systems.net] since [Fri Jun 05 19:23:50 CEST 2009]. Will be backfilled since we suspect it is down
      2009-06-05 19:26:22,694 INFO [org.rhq.enterprise.server.core.CoreServerServiceImpl] Agent [PADFBRUESEKE1.pad.orga-systems.net][1.2.0(3872)] would like to connect to this server
      2009-06-05 19:27:09,236 INFO [org.rhq.enterprise.server.core.CoreServerServiceImpl] Agent [PADFBRUESEKE1.pad.orga-systems.net] has connected to this server at Fri Jun 05 19:27:09 CEST 2009
      2009-06-05 19:27:28,906 INFO [org.rhq.enterprise.server.measurement.MeasurementServerServiceImpl] Performance: measurement merge [20] timing (19467)ms
      2009-06-05 19:27:29,499 INFO [org.rhq.enterprise.server.measurement.MeasurementServerServiceImpl] Performance: measurement merge [623] timing (20060)ms


      I am not sure if this Exception is correlated with the server crash. I also can see from my load test results that the system load was unusually high at that time. The measured system's throughput was exceptionally low for about 15min (i.e. in 3 five minute intervals) and it went right up after Jopr server was down.

      I have experienced this behaviour for a few times now and have no explanation whatsoever. So, what log configuration changes would you advocate to shed some light on this issue?

      Kind regards
      Frank

        • 1. Re: Jopr 2.2 server crashing sporadically
          mazz

          When you say "the server just crashes", do you mean the Jopr Server JVM process gets killed and goes away?

          We've never seen this short of perhaps the JVM running out of memory and the JVM crashing as a result. Do you see any "hs_err*" log files anywhere under the Jopr Server install directory (possibly in the bin directory)? When JVMs crash (when the JVM process actually dies and goes away), it will usually dump an hs_err* file that contains information about what the JVM was doing at the time of the crash.

          Caused by: java.net.SocketException: Software caused connection abort: recv failed


          This exception probably occurs due to some kind of timeout getting triggered, perhaps when load gets high on a machine (I'm guessing based on Googling that exception message - you might want to do the same and check out some of the causes of that exception). You can see one of your agents is soon after considered down - this exception was probably caused when trying to read a message coming from the agent and the agent went down (do you have an agent running on the same machine as your server? does it crash too? check its agent logs).

          You will notice that these error messages occur at the same time Jopr is doing database maintenance. What database are you using? I have a feeling this entire problem stems from your database not being powerful enough (or misconfigured) to handle the load you are putting on it - that might cause some cascading failures inside the server (and possibly all the processes on that box). Is your database running on the same machine as your server?

          So, to summarize the questions:

          1. Do you mean the Jopr Server JVM process crashes and dies?
          2. Do you get any hs_err* files somewhere under your server install dir? server bin/ dir?
          3. Is an agent running on the same box as your server? Does it crash too?
          4. What database are you using? (what is the vendor and version number)
          5. Is your database running on the same box as your server?

          • 2. Re: Jopr 2.2 server crashing sporadically
            fbrueseke

            Hi mazz,

            thanks for your quick reply.

            "mazz" wrote:
            Googling that exception message - you might want to do the same

            I have done that turns it could be due to running everything on Windows XP. Seems like I have to move my test system away from my workstation :(

            "mazz" wrote:
            So, to summarize the questions:

            1. Do you mean the Jopr Server JVM process crashes and dies?
            2. Do you get any hs_err* files somewhere under your server install dir? server bin/ dir?
            3. Is an agent running on the same box as your server? Does it crash too?
            4. What database are you using? (what is the vendor and version number)
            5. Is your database running on the same box as your server?

            1. Yes the process crashes and dies, but ...
            2. there is no hs_err* file on my system.
            3. The agent as well as the database is running on the same host. The agent does not crash!
            4. Postgres 8.2.7; and the Jopr database is getting kind of big (~800MB)
            5. Yes

            So what I can gather from your answer is this:
            Either database or Jopr server (probably DB) is wreaking havoc, i.e. using up a lot or most of the resources, due to database maintenance. This causes the agent to get so little resources that it can't talk to the server which thinks it is offline.

            Well, but this is pure speculation. Is there any means how I can proof what is happening, e.g. by raising server log level or something alike?
            What is more important, how can I prevent this from happening?
            What do you think?

            Kind regards
            Frank

            • 3. Re: Jopr 2.2 server crashing sporadically
              mazz

              FWIW, I run almost all of my development and testing on XP. That in and of itself probably isn't the problem. But I suspect because you are running everything (server, agent, database) on the same box, that its just too much and your hardware can't handle it for the amount of processing you are asking it to do. I run all of those on my XP box too - but I do very light testing. All of our perf testing occurs on beefier boxes with tuned databases.

              How much RAM do you have on that box? I can tell you one thing we hit recently that may be relevent. When we start the Jopr server, we set -Xms to a lower number than -Xmx. So the VM starts with some heap, but over time can grow to the value of -Xmx. In this one particular case, it just so happened that the machine didn't have enough memory to support the -Xmx setting! So even though we told the VM it could grow to -Xmx1024m, the machine didn't have that much free memory and the server got an out-of-memory-error after running fine for some time (I can't remember if the server process died or not but the OOM put the server in a very bad state). This is the JIRA that grew out of that: http://jira.rhq-project.org/browse/RHQ-2108. We now set -Xms == -Xmx and the same for perm gen space - that fix went into trunk and will be in the next release. This will at least force the server to immediately bomb at startup letting us know right away that our box doesn't have enough memory.

              Other than that, I think I would feel safe to say that you are taxing you system too much for what you are monitoring and processing - and I would first look to the database as being the main culprit. Without a good tuned database AND with a good/fast disk(s) for the database to use, you should be prepared for less-than-optimal performance.

              As for how to debug this, one thing we do is actually use Jopr itself to monitor itself! If your system slowly degrades over time, its possible to see the degradation and perhaps determine what needs to be fixed. We have a pretty good postgres plugin that ships with Jopr - import your postgres server and monitor it using Jopr. Take a look at the monitoring graphs and see if, over time, something starts to degrade.

              You could also look at the platform's monitoring graphs too - the platform plugin provides some Windows monitoring data (your top platform resource will show you this data). Perhaps that can show you things like memory usage, disk usage and the like. Also take a look at the Jopr Server itself - if you import it, you'll get measurment data for it. Of course, all of this will add to the data you need to insert into the database, so you may exaserbate the problem :) but it should still help. We have lots of measurement data we can collect on the postgres server, the windows platform, the Jopr Server and the Jopr Server internal services. You can speed up your collection intervals (down from 10 minutes to say 1 or 2 minutes) if your system degrades fast and you need many datapoints for many measurements over a span of say 30 minutes to see a pattern).

              • 4. Re: Jopr 2.2 server crashing sporadically
                fbrueseke

                Hi mazz.

                Your pointer with my box running out of memory was a hit. Seems like as free memory gradually increases the box starts swapping itself to death.

                I'll try tweaking memory settings of the running Java programs.

                Kind regards
                Frank

                • 5. Re: Jopr 2.2 server crashing sporadically
                  fbrueseke

                  Hi mazz,

                  I ran a quick test during lunch break. The result was all programs were still running (test was too short obviously). Memory was pretty tight and the database is using up all the CPU cycles it can get. It didn't even stop when I shutdown both the measured application and the Jopr agent.

                  Seems like Jopr server really puts some pressure on the database. What do you think is this all due to database maintenance? Can I avoid such a DB behaviour somehow?

                  Kind regards
                  Frank

                  • 6. Re: Jopr 2.2 server crashing sporadically
                    mazz

                    The "database maintenance" performed hourly is a necessary evil. Some of what it does is:

                    a) purges old data to free up space for new data coming in. We do this because we do not expect customers to have unbounded disk space allocated to their databases (e.g. Google :) plus we have seen degradation (to the point of it just not work) with Postgres tables that approach 10MB+ in size. We try to keep table sizes in check, and purging them of old, obsolete data is the way to do this. So we delete very old alerts, events, availability data, et. al.

                    b) compresses measurement data to, again, free up space in the database for the same reasons as above. Measurement data comes in, potentially, like a firehose. We need a way to throttle the beating the database takes. We do this by performing a round-robin of measurement data in 16 raw measurement tables (I won't get into the details - see http://jopr.org/confluence/display/RHQ/Measurement+Subsystem if interested). When raw measurement data is older than 7 days, we "compress" it by taking averages and such and taking the aggregate data and inserting it into other measurement tables - eventually compressing the data so much that after data is over 1 year old, we completely delete it. Again, this is to free up DB space to make room for more measurement data.

                    c) re-indexes some tables to keep the database "tuned" and, in postgres's case, we periodically vaccum it (to guarantee it gets done, otherwise, its performance will suffer)

                    d) calculates out-of-bounds information and baseline information - this is needed to support the "problem resources" subsystem view (out-of-bounds measurements) and baselines are needed for alerts that depend on them (plus it provides analysis data for those monitoring the behavior of resources over a span of several days/weeks).

                    "It didn't even stop when I shutdown both the measured application and the Jopr agent."


                    That isn't surprising - depending on the size of your database, the hardware you have (CPU/RAM) and the disks you have (how fast they are), this re-indexing, vacuuming, deleting of rows, compressing of measurement data, etc can tax the DB and all of this runs in the Jopr Server (regardless of whether or not the agent/managed servers are up and collecting additional data).

                    FWIW, our perf environment we've tested with had several hundred agents monitoring over 100K resources and our database was bored :), it showed very good peformance (we are using an Oracle10g instance on a quad-CPU box with I think it has 8gig RAM - and we made sure we tuned Oracle to support what we need). The point is, you need a database that can perform in order to support the load you plan to put on it. And our recommendation is to NOT run the server and database on the same box and not on the same box as your managed resources. You have the trifecta :) So unless you have good hardware and a tuned DB (or you aren't managing alot of things and collecting few metrics) it doesn't surprise me that things like what you are experiencing happens.

                    Can I ask how many managed servers you have in inventory? What's the counts for "platforms" , "servers" and "services" that you have? Did you change any collection intervals for your metrics so they are getting collected faster than their defaults? Did you enable anything like events logging for your platform or servers? All of these kinds of things increase the load on the Jopr server cloud including its database.

                    • 7. Re: Jopr 2.2 server crashing sporadically
                      fbrueseke

                      Just to make sure you understand what I meant with "it didn't stop when I shutdown the agent". I meant that I waited for about 45min and even then it hadn't stopped.

                      Now for the the rest of your questions:
                      I have one platform and apart from the usual resource (RHQ, postgres) one server with two sub-server. One of them is a JVM node. The other is a server with 213 services having IIRC 17 metrics each. These 17 metrics are all call time data and they are collected at a 30 second interval. This has resulted so far in about 200MB of call time data in the DB. I have also lowered some of the JVM memory stats to ~1 min.

                      Well as it is now I will migrate the installation to a real server soon.

                      Kind regards
                      Frank

                      • 8. Re: Jopr 2.2 server crashing sporadically
                        mazz

                         

                        The other is a server with 213 services having IIRC 17 metrics each. These 17 metrics are all call time data and they are collected at a 30 second interval. This has resulted so far in about 200MB of call time data in the DB. I have also lowered some of the JVM memory stats to ~1 min.


                        213 services x 17 metrics = 3,621 measurements

                        At 30 seconds a piece, that's 7k measurements per minute.

                        Add your JVM stats getting collected at 1 min, that's alot for a deployment where everything is running on the same box.

                        This is why we default most of the out of box metrics to 5mins, 10 mins or 20 mins. Turning everything on full blast at 30 seconds or 60 seconds and you will run into the situation you are hitting unless you have a database that can handle the load.

                        If you've imported the Jopr Server resource, you can see that we collect measurements on the measurement subsystem itself (how many measurements were inserted for example). You can follow that data to see what the system is currently able to do on your deployment hardware.