
    jboss 5.0.1: performance drops after one hour and strange GC behaviour

    mattdarwin

      We upgraded our software from jboss 4.0.5GA to 5.0.1GA and noticed that after one hour or so (or 90 minutes in some cases) performance drops dramatically.


      At the same moment, the garbage collector logs show minor garbage collection times jumping from 0.01s to ~1.5s, with the amount of heap reclaimed by each collection dropping from ~400MB to ~300MB.  (see GC viewer graph 1)

       

      GC viewer graph 1.png

       

      We think these are both symptoms of the same underlying root cause.


      JVM settings are:

      -server -Xms2048m -Xmx2048m -XX:NewSize=384m -XX:MaxNewSize=384m -XX:SurvivorRatio=4 -XX:MinHeapFreeRatio=11 -XX:PermSize=80m -verbose:gc -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+DisableExplicitGC -Djava.awt.headless=TRUE -DUseSunHttpHandler=TRUE -Dsun.net.client.defaultConnectTimeout=25000 -Dsun.net.client.defaultReadTimeout=50000 -Dfile.encoding=UTF-8


      The production environment is T5220 or T2000 hardware (SPARC) running a Solaris 10 local zone, with jboss 5.0.1.GA on java 1.6.0_17 (32 bit).

       

      We set up a test environment consisting of 2 identical boxes running the same software, but one using jboss 4.0.5GA and one using jboss 5.0.1.GA.  They are VMware VMs running on an HP ProLiant DL560 Gen8 with 4 x 2.2GHz Intel Xeon E5-4620 CPUs and 64GB RAM.  Guest VMs are 4 vCPU, 4096MB RAM, CentOS 6.4.


      We found that we could easily reproduce the problem in our environment.  The box running jboss 4.0.5GA ran fine, but on jboss 5.0.1GA we saw the same strange GC behaviour.  Performance itself can't easily be tested in our environment, since we don't have the same load as production.


      We don't think it's a memory leak, since after each major GC, the used heap size returns to the same size:

      Screen shot 2013-08-16 at 13.12.35.png


      Analysing heap dumps taken pre- and post-apocalypse, we discovered that the number of instances of the following class was different:

      org.jboss.virtual.plugins.context.file.FileSystemContext

      During the first hour there are about 8 of them; after the apocalypse hits, we see between 100 and 800.

      Other than that, the heap dumps look quite similar, and the top objects are either java or jboss objects (i.e. no application classes).
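
      For reference, this is roughly how comparable dumps and per-class instance counts can be captured with the stock JDK tools (12345 stands in for the jboss java pid, and the file names are just examples):

      jmap -dump:live,format=b,file=pre.hprof 12345    # heap dump taken before the slowdown
      jmap -dump:live,format=b,file=post.hprof 12345   # heap dump taken after the slowdown
      jmap -histo:live 12345 > histo.txt               # quick per-class instance counts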


      Setting -Djboss.vfs.forceVfsJar=true on our test environment fixed the problem (i.e. the strange GC behaviour disappeared), but when applied in production, both the strange GC pattern and the performance problem remained, although the GC times didn't increase as much (0.3 seconds rather than 1.5 seconds).
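
      For anyone wanting to try the same flag, a minimal sketch of how it can be passed, assuming the stock bin/run.conf is used to build JAVA_OPTS (adapt this to however you launch jboss):

      JAVA_OPTS="$JAVA_OPTS -Djboss.vfs.forceVfsJar=true"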

       

      In our test environment, we then deployed the same software in jboss 5.1.0 and found the same behaviour as with 5.0.1. 


      So the conclusion at this point is that something happens in jboss 5.x around the 60/90 minute mark which has an impact on both garbage collection and performance.

       

      Edit:

      I found two questions here which could be related:

      https://community.jboss.org/message/562017#562017

      and

      https://community.jboss.org/thread/163261

       

      They both refer to the jboss web services stack, and suggest upgrading it to jbossws-native-3.3.1.

       

      However, my application doesn't use any web services...

       

      UPDATE:

       

      We tried upgrading the web services stack to jbossws-native-3.3.1, which fixed the problem in our test environment. However, when we deployed to the next test environment (closer to the production environment), the problem was still there (albeit reduced).

        • 1. Re: jboss 5.1.0: performance drops after one hour and strange GC behaviour
          wdfink

          It might be that objects survive too many GCs in the YoungGen and are moved to the OldGen, where GC is more expensive; each copy between the survivor spaces also has a cost.

           

          One option might be to use the CMS GC and increase the heap size, especially the YoungGen, so that most objects are already dead before they would be promoted.

          It might also be that the survivor space is too small to hold all surviving objects; in that case the whole bunch is moved directly to the OldGen.

          I would check the behaviour in a test environment; you might use visualGC to see how objects move through the survivor spaces, but be aware that visualGC will put some load on the watched VM.
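
          As a starting point along those lines, something like the following might do (the sizes are only placeholders and need tuning against your real load; -XX:+PrintTenuringDistribution simply adds tenuring-age output to the GC log):

          -Xms2048m -Xmx2048m -XX:NewSize=768m -XX:MaxNewSize=768m -XX:SurvivorRatio=6 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+PrintTenuringDistribution -XX:+PrintGCDetails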

          • 2. Re: jboss 5.1.0: performance drops after one hour and strange GC behaviour
            mattdarwin

            Hi Wolf-Dieter,

             

            Thanks for the answer.  What I don't understand is why this works just fine in 4.0.5GA, and indeed in 5.0.1 for the first hour. 

             

            Something changes (perhaps related to the number of FileSystemContext objects) after either exactly one hour or 90 minutes, which means that lots more objects survive which previously weren't surviving.

             

            Is there any way of finding out what these objects are?  And supposing they are FileSystemContext objects, what then? How do we stop this from happening?

            • 3. Re: jboss 5.0.1: performance drops after one hour and strange GC behaviour
              mattdarwin

              UPDATE:

               

              We have resolved this by setting jboss.vfs.cache.TimedPolicyCaching.lifetime to a very large number equivalent to many years.
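
              For the record, a sketch of what that looks like, assuming the property is passed as a JVM system property and that (as with org.jboss.util.TimedCachePolicy) the value is in seconds; the exact number is arbitrary as long as it's huge (315360000 seconds is roughly ten years):

              -Djboss.vfs.cache.TimedPolicyCaching.lifetime=315360000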

               

              This feels like a workaround for a bug in jboss.  The default cache lifetime is 30 minutes (see org.jboss.util.TimedCachePolicy), and we saw problems after either 60 or 90 minutes.

               

              The VFS cache implementation is CombinedVFSCache and I think it's using a TimedVFSCache underneath.

               

              It seems like a better fix would be to change the cache implementation to a permanent cache, but we've wasted enough time on this problem and our workaround will have to do.