2 Replies Latest reply on Aug 7, 2005 5:57 PM by nwc

    JBoss effectively hangs due to massive context switching

    nwc Newbie

      Over the past few months we have had a recurring problem: every once in a while (1-5 times a month), our production app server (running JBoss) becomes extremely slow, with vmstat reporting several hundred thousand context switches per second and both CPUs (dual-processor machine) maxed out, split about 25/75 between user and system time.

      When it first starts happening, JBoss still responds to most requests (albeit very slowly), but others just time out, and eventually it reaches a point where nearly all requests time out. At that point, it is impossible to do much more than log in to the server and kill -9 the jboss process, and even that takes about 5 minutes to accomplish. As soon as the jboss process is killed, the entire system goes back to normal, and we can restart jboss and live happily until the problem recurs, typically after about a week of uptime.

      Here is what things look like right before restarting jboss:

      procs memory                      swap  io     system      cpu
       r  b   swpd    free   buff   cache si so bi  bo   in     cs us sy  id wa
      12  0 100952   16412 282520 1244400  0  0  0   0  106 306339 17 83   0  0
      13  0 100952   16408 282520 1244400  0  0  0  24  117 297674 23 77   0  0
      15  0 100952   16408 282520 1244400  0  0  0   0  108 336135 17 83   0  0
      16  0 100952   16408 282520 1244400  0  0  0   0  108 159630 20 80   0  0
      15  0 100952   16408 282520 1244400  0  0  0   0  116 176452 24 76   0  0
      14  0 100952   16408 282520 1244400  0  0  0   0  116  99453 27 73   0  0
      15  0 100952   16416 282520 1244400  0  0  0  24  117  96588 27 73   0  0

      And here is what it looks like right after killing jboss:

       1  0  91632 1409604 282520 1244436  0  0  0   0  171    130  0  0 100  0
       1  0  91632 1409604 282520 1244436  0  0  0 176  178     68  0  1  99  1
       1  0  91632 1409604 282520 1244436  0  0  0   0  125     38  0  0 100  0

      And here is what it looks like right after jboss has started back up:

       5  0  91632 1139332 286124 1321444  0  0  0   0  329    636 76  2  22  0
       7  0  91632 1133860 286124 1321448  0  0  0   0  447    839 98  1   1  0
       4  0  91632 1133520 286124 1321468  0  0  0   0 1284   2436 96  3   1  0
       3  0  91632 1131932 286124 1321468  0  0  0 292 2388   4477 86  4  10  0
       6  0  91632 1131932 286160 1321468  0  0 20   0  814   1428 94  1   4  1
       2  0  91632 1121720 286160 1321468  0  0  0   0  272    504 97  1   2  0

      We are running JBoss 4.0.2, but the same thing happened on 3.2.3 (we upgraded hoping the problem would go away, but it hasn't). The OS is "Red Hat Enterprise Linux ES release 3 (Taroon Update 5)", and the JVM is build 1.4.2_08-b03.

      One possibility I suspect is our use of temporary files that are marked "deleteOnExit". We use them a lot, and store them all in one directory. Because they are only removed when the JVM exits, their number only increases while the server is up. The last time this happened, there were about 64,000 temporary files in the directory.
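
      To make the temp-file pattern concrete, here is a minimal sketch of what is described above next to an alternative that deletes each file as soon as it is no longer needed. The class name, method names, file prefix, and directory are all hypothetical and are not taken from our application; whether this pattern has anything to do with the context switching is not established.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

// Hypothetical sketch; none of these names come from the real application.
public class TempFileSketch {

    // Pattern described above: the file is only removed at normal JVM exit,
    // so files pile up in the directory for as long as the server runs.
    public static File createWithDeleteOnExit(File dir) throws IOException {
        File f = File.createTempFile("work", ".tmp", dir);
        f.deleteOnExit();
        return f;
    }

    // Alternative: delete the file explicitly once the work is finished.
    // Returns true if the file is gone afterwards.
    public static boolean useAndDelete(File dir, String payload) throws IOException {
        File f = File.createTempFile("work", ".tmp", dir);
        try {
            Writer w = new FileWriter(f);
            try {
                w.write(payload);
            } finally {
                w.close();
            }
            // ... the real processing of the file would go here ...
        } finally {
            if (!f.delete()) {
                System.err.println("could not delete " + f);
            }
        }
        return !f.exists();
    }

    // Number of entries currently in the directory (0 if it cannot be listed).
    public static int countFiles(File dir) {
        String[] names = dir.list();
        return names == null ? 0 : names.length;
    }

    public static void main(String[] args) throws IOException {
        // Assumed scratch directory under the default temp location.
        File dir = new File(System.getProperty("java.io.tmpdir"), "tempfile-sketch");
        dir.mkdirs();
        int before = countFiles(dir);
        useAndDelete(dir, "example");
        createWithDeleteOnExit(dir);
        System.out.println("entries added: " + (countFiles(dir) - before));
    }
}
```

      Logging countFiles on the real temp directory at intervals would show whether the file count grows in step with the slowdowns.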