6 Replies Latest reply on May 18, 2007 12:31 PM by gavin.king

    A performance concern with Init.instance(), a suggested work

    raghinii

      The short version:

      I claim that in Init.instance()

      this line:

      Init init = (Init) Contexts.getApplicationContext().get(Init.class);
      


      should be replaced with :
      Init init = (Init) Contexts.getApplicationContext().get("org.jboss.seam.core.init")
      


      because the current version calles Class.getAnnotation which in turns calls synchronized method and this results in poor behavior under heavy load.


      The long version:

      First some history (and a similar previous problem):

      Over the past few months I've been working with a team converting an existing webapp of moderate complexity to a JSF / Seam stack w/ Facelets.

      It's been in production now for about a month and a half.
      Running on jboss 4.0.5.GA w/ seam 1.2.0.patch1 Sun JSF RI 1.2 b4 (?)

      Before we put it in production had to do some performance tuning. What we started out with was about 10% as fast as the original app , which wasn't acceptable, but eventually we got to about 60%, which was fine.

      One undesirable behavior was that under moderately heavy load, when requests were slowing down from their normal ~300ms to 3-4 seconds, the distribution of request response time got very wide with some requests taking 5 seconds while other, identical, requests were timing out at 5 minutes.

      With some testing we tracked this down to the fact that org/apache/catalina/core/ApplicationContext.get() is being called something like 50K - 100K times per request in our app. Internally this call is synchronized.

      The solution we tried and eventually adopted was to edit the tomcat ApplicationContext code and replace the HashMap with a ConcurrentHashMap and to remove the synchronization in the get().

      This resulted in something like a 30-40% performance improvement (if memory serves), but more importantly it significantly narrowed the response time curve. That is, the response time for requests ,under load, was a lot more consistent.
      (note 1: this makes sense since having lots of threads contending for a lock can lead to starvation. In this case, the fact that any request has to "win" the lock 100K times before it succeeds makes it more likely that starvation will occur despite any anti-starvation attempts by the OS/VM)

      (Note 2: This change not strictly speaking a safe thing to do, but it works for us. (Tomcat 6 also takes this approach but presumably they've gone further in validating that the rest of the app is safe with the concurrent hash map semantics)).

      THE CURRENT ISSUE:

      Basically, in production under heavy load, we saw the same distribution of request response time. (measured by mod-log-firstbyte in apache). Bascially once the per-page response got to ~7-10 seconds / page some requests start to take 100 - 300 seconds (there's some timeout in the system that caps things at 300 seconds). The situation then often spirals out of control with active connections climbing to the 100s and inevitably resulting in an OOM.

      This time a thread-dump of jboss showed that many threads were waiting in to lock a specific lock in java.lang.Class.initAnnotationsIfNecessary().

      Looking at the stack traces it became clear that the Class in question was Init.class and that all these threads were calling Init.instance().
      ( Init.instance() is called ... everywhere... all the time).

      The specific line that leads to the eventual lock contention is this one:
      Init init = (Init) Contexts.getApplicationContext().get(Init.class);

      Which internally leads to a call to get the value of the @Name annotation for the Init.class so that it can then look up the instance by that name.

      It is this lookup of the Name of the Init component that causes the contention.

      Replacing the get(Init.class) with get("org.jboss.seam.core.init") bypasses this lookup and also that synchronized method.

      I've been running this modification on half of the servers for the past two days and they are showing MUCH more linear performance under heavy load. Basically when subjected to a stressor (GC pause, DB related pause, traffic spike etc ) the modified machines return to normal much more quickly (say... 20 -30 seconds compared to 5-10 minutes) and they haven't shown a tendency to spiral out of control towards OOM.

      I generally like the compile time checking that one gets from using the get(Class) version of this method, however in this specific case i think the loss of safety is worth it given the deep loop nature of this call.


      I'm curious as to what you guys think.

      Thanks,
      radu