On our production app servers I'm seeing high system CPU time compared with user CPU time. The ratio is 1:1, which I feel can be improved upon ... Yes?/No?
We're running apache and JBoss on the same server with mod_jk connector and have narrowed the high system cpu time down to the JBoss process (JVM). I'm just starting to take a look inside JBoss and the JVM to see what's up and figured I'd drop a quick line in the forums too...
The System/User load increases evenly @ 1:1 directly related to load. The server is pushing around 4 Mbps. Apache is snoozing...
Looking for suggestions on where/how to look at what is going on.
Server is dual-proc with 2 GB of RAM
OS is RHEL 4.5 with the latest patches
java version "1.5.0_11"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_11-b03)
Java HotSpot(TM) Server VM (build 1.5.0_11-b03, mixed mode)
JBoss version 4.0.4GA (build: CVSTag=JBoss_4_0_4_GA date=200605151000)
Snapshot of the heap from one of the app servers below:
[root@app3 ~]# jmap -heap 20166
Attaching to process ID 20166, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 1.5.0_11-b03
using thread-local object allocation.
Parallel GC with 4 thread(s)
MinHeapFreeRatio = 40
MaxHeapFreeRatio = 70
MaxHeapSize = 536870912 (512.0MB)
NewSize = 655360 (0.625MB)
MaxNewSize = 4294901760 (4095.9375MB)
OldSize = 1441792 (1.375MB)
NewRatio = 8
SurvivorRatio = 8
PermSize = 16777216 (16.0MB)
MaxPermSize = 67108864 (64.0MB)
PS Young Generation
Eden Space:
capacity = 58458112 (55.75MB)
used = 6423064 (6.125511169433594MB)
free = 52035048 (49.624488830566406MB)
From Space:
capacity = 524288 (0.5MB)
used = 98304 (0.09375MB)
free = 425984 (0.40625MB)
To Space:
capacity = 524288 (0.5MB)
used = 0 (0.0MB)
free = 524288 (0.5MB)
PS Old Generation
capacity = 313262080 (298.75MB)
used = 118485440 (112.99652099609375MB)
free = 194776640 (185.75347900390625MB)
PS Perm Generation
capacity = 40894464 (39.0MB)
used = 29614816 (28.242889404296875MB)
free = 11279648 (10.757110595703125MB)
From JMX Console - listThreadCpuUtilization() shows TP-Processorxxx at the top of the list.
about 200 or so threads total
These threads all have the same makeup:
Thread: TP-Processor26 : priority:5, demon:true, threadId:75, threadState:RUNNABLE, threadLockName:null
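If you want the same per-thread CPU numbers without going through the JMX console, the standard ThreadMXBean API (available since Java 5) reports roughly what listThreadCpuUtilization() shows. A minimal sketch (class name is mine, not part of JBoss):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadCpuTop {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isThreadCpuTimeSupported()) {
            System.out.println("per-thread CPU time not supported on this JVM");
            return;
        }
        long[] ids = mx.getAllThreadIds();
        for (int i = 0; i < ids.length; i++) {
            ThreadInfo info = mx.getThreadInfo(ids[i]);
            if (info == null) continue; // thread died between the two calls
            long cpuNanos = mx.getThreadCpuTime(ids[i]); // -1 if disabled
            System.out.println(info.getThreadName()
                + " cpu=" + (cpuNanos / 1000000) + "ms"
                + " state=" + info.getThreadState()
                + " daemon-ish id=" + info.getThreadId());
        }
    }
}
```

Running this inside the app (or via a small MBean) a few times and comparing which TP-Processor threads accumulate CPU fastest narrows things down quickly.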
1) Set -Xms and -Xmx to the same size - this prevents the JVM from asking the OS for more memory, which can cause the system time to go up.
2) Set the NewSize to 1/3 or 1/4 of your heap size. With a heap of 512MB, set -XX:NewSize=150m and -XX:MaxNewSize=150m.
If the server is dedicated to running the application server, I would set heap to 1200MB and NewSize to 300MB. Those are usually good starting numbers for a system that has 2GB RAM.
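A quick way to confirm the new flags actually took effect is to read the heap sizes back from inside the JVM with the standard Runtime API. A minimal sketch (class name is mine):

```java
public class HeapCheck {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long maxMb = rt.maxMemory() / (1024 * 1024);       // roughly -Xmx
        long committedMb = rt.totalMemory() / (1024 * 1024); // heap committed so far
        // With -Xms and -Xmx set equal, committed should sit near max from
        // startup instead of growing under load (which is where the extra
        // system time from mmap/brk calls would come from).
        System.out.println("max=" + maxMb + "MB committed=" + committedMb + "MB");
    }
}
```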
Implemented each of the suggestions in 1 and 2 and also went up to 1024/250, with really no noticeable improvement in the ratio. I'll keep tweaking, but I'm wondering: does everyone else see a similar ratio between user and system CPU time?
Interesting there are no other takers on this ... Nobody else is seeing high system time (1:1 with User cpu time)?
I separated Apache and JBoss to different physical boxes and the problem followed JBoss/JVM. So under load, Apache is truly snoozing...
I'm looking into things like strace, oprofile, gdb, etc. now to try and figure out what is happening inside the JVM.
Might also consider JRockit as it supposedly has some nice features for peering inside the JVM ... Anyone have any thoughts on what the best approach would be for getting insight into the JVM or the JBoss microkernel?
Of course it could be the code too ... My developers are resisting.. :)
Getting a few thread dumps might help pinpoint where in the application the problem is. I recall helping one customer whose system was showing 80% CPU usage when no sessions were active. It turned out that the application code had an infinite loop, and three of the requests were in that loop. Doing several dumps in a row and noticing that those three threads were still in the same code after a couple of minutes helped pinpoint the problem.
We did take a closer look at some of the TP-Processorxxx threads and that got the developers thinking it might be their code. We've now isolated it to our application code.
Now the real fun begins...
Found the problem in the custom application code. Apparently there was a system call to get the system time which was executing thousands of times per transaction ... :)
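For anyone hitting the same thing: one common mitigation, assuming the code can tolerate coarse (~10ms) timestamps, is a cached clock updated by a background thread so that hot paths read a volatile field instead of making a system call each time. A hypothetical sketch (CachedClock is my name, not from the thread):

```java
public class CachedClock {
    private static volatile long now = System.currentTimeMillis();

    static {
        Thread ticker = new Thread(new Runnable() {
            public void run() {
                while (true) {
                    now = System.currentTimeMillis(); // one gettimeofday per tick
                    try { Thread.sleep(10); } catch (InterruptedException e) { return; }
                }
            }
        });
        ticker.setDaemon(true);
        ticker.start();
    }

    // Accurate to roughly the tick interval, but no system call per read.
    public static long millis() { return now; }
}
```

The simpler fix, where possible, is just to fetch the time once per transaction and pass it along.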
Thanks for all the assist/ideas ....