Having done a fair amount of Hibernate training and consulting, I thought I might post this mini-guide on some common tuning issues. These are the top 8 things I tend to find when tuning Hibernate/EJB3 apps. I am going to talk about this predominantly in Hibernate terms, but most of it applies to EJB3 persistence as well (some of the detached object stuff is a bit Hibernate specific). All of it obviously applies to JBoss's EJB3 persistence implementation (which is Hibernate), either by default or through the use of "extensions" (is it an extension if it predates the spec?).

  • appropriateness of Hibernate for the task
  • fetch strategies
  • set logic vs iterative logic
  • named vs ad hoc queries
  • cache abuse
  • version columns/locking strategy
  • bah...DB2 and friends
  • flushing your performance

appropriateness of Hibernate for the task

It may surprise you to read that sometimes the first thing out of my mouth to customers is "I really am not sure you should be using Hibernate for this". Too often I see code that could take seconds as a few "update/where" and "insert/from" type statements written as a set of iterative loops. This bad code was traditionally written as PL/SQL with cursors but is now written as Java with Hibernate. The problem is that if the PL/SQL code was really slow, the Java code is ridiculously slow (due to interprocess/network communication). Generally the "reason" is that they "want to reuse our business logic for batch processing". The problem is that no proper cost/benefit analysis is done. If you "reuse your business logic" and it takes 10 hours for your batch process to run when it could have taken 10 minutes...was it worth it? This comes up frequently in computing. It is "performance vs maintainability" and traditionally the answer is "maintainability, and buy a faster processor". The problem is that this is all about "how we do IO" and "how many processors are used to do the task". When we talk about millions or billions of records being processed, the decision matrix tends to shift in favor of hard-to-read Oracle-specific SQL statements with bizarre "parallel(x)" syntax running efficiently in the database. Using cursors and iteratively processing these in loops performs poorly and is bound to a single processor. Performance is even worse when doing remote communication by using a Java resultset for sequential access. Hibernate is really appropriate for transaction processing where we have short, sweet reads, writes, and updates happening over many threads. The scale is in concurrency and pause time rather than in records processed.
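
To make that concrete, here is the kind of set-based SQL I mean (the table and column names are invented purely for illustration); each statement replaces an entire cursor loop and runs inside the database against all rows at once:

   update orders set status = 'SHIPPED' where status = 'PENDING' and ship_date is not null

   insert into order_archive (id, status, ship_date)
       select id, status, ship_date from orders where status = 'CLOSED'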

fetch strategies

Many users define their fetch strategies in their mapping meta-data. In Hibernate, you can put the fetch strategy in your HQL query instead, which allows you to tune on a use-case basis. In H3, when you run an HQL query, the mapping file's fetch strategy is ignored for each association and associations are loaded lazily by default. Speaking from a performance perspective only, your DAO classes should reflect "use cases" such as "findCustomersAndOrdersByXXX()". This is a great advantage over mapping-defined fetch strategies, as I may sometimes want only customers and other times want customers and their orders. It depends on whether I am about to show the customer his address or about to show a sales person his detailed order history. You may wish to separate out a data access tier, but you should not do so naively. For example, to fetch customers with their orders eagerly in one query:

     from Customer c left outer join fetch c.orders
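
In DAO terms, that might look something like this sketch (the method name and the Customer/orders model are hypothetical, purely to illustrate the one-use-case-per-method idea):

   import java.util.List;
   import org.hibernate.Session;

   public class CustomerDao {
       // The fetch strategy lives in the query, not the mapping; a sibling
       // method could load customers alone for a use case without orders.
       @SuppressWarnings("unchecked")
       public List<Customer> findCustomersAndOrdersByTerritory(Session session, String territory) {
           return session.createQuery(
                   "from Customer c left outer join fetch c.orders where c.territory = :territory")
               .setParameter("territory", territory)
               .list();
       }
   }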

set logic vs iterative logic

This item overlaps with the appropriateness issue mentioned above. Unless I have already loaded a set of data into my session cache, iterating through it and updating a property entity-by-entity is often inadvisable. Meaning I should not load a customer, loop through his orders, check a condition, and call setStatus(STATUS.FOO) on each one. Instead, I should write this using the EJB3/H3 "executeUpdate()" method, which lets me write a SQL-style "update" statement (as well as delete, etc):

   update Customer c set c.status=:status where c.status=:prestatus and c.territory=:territory
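
Running that from Java is a one-liner against the Session (a sketch; the status values and territory are placeholders):

   int updated = session.createQuery(
           "update Customer c set c.status = :status " +
           "where c.status = :prestatus and c.territory = :territory")
       .setParameter("status", "FOO")        // hypothetical status values
       .setParameter("prestatus", "BAR")
       .setParameter("territory", "NE")
       .executeUpdate();                     // returns the number of rows touched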

named vs ad hoc queries

Hopefully most of you use Hibernate with an appserver. If you do not, I know where you can get a great one for free! Your appserver should supply a connection pool, and your connection pool should supply a statement cache. When you use named queries, you optimize the use of that statement cache: Hibernate will pre-parse the HQL, create the statement, and have it ready to go. For an "ad hoc" query this happens every time, on the fly. For EJB3 this means declaring your queries with the @NamedQuery annotation. This can make a difference in performance, but the main advantage is that any HQL errors are discovered at startup rather than at runtime.
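
For illustration, a named query in annotation form might look like this (the entity, query name, and fields are made up, and this assumes Hibernate Annotations is on your classpath):

   import javax.persistence.Entity;
   import javax.persistence.Id;
   import javax.persistence.NamedQuery;

   @Entity
   @NamedQuery(name = "Customer.findByTerritory",
               query = "from Customer c where c.territory = :territory")
   public class Customer {
       @Id private Long id;
       private String territory;
       // getters/setters omitted
   }

   // Usage, given an open Session -- the HQL was parsed and registered at startup:
   java.util.List customers = session.getNamedQuery("Customer.findByTerritory")
       .setParameter("territory", "NE")
       .list();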

cache abuse

Cache is King, but it can also be king of bad performance. I generally recommend turning cache on at the end of your project as part of your performance testing effort. However, most developers fundamentally misunderstand how Hibernate uses cache. Cache is not, by default, on at all. Turning the second-level cache on does not affect HQL/QbE/QbC/SQL queries. Caching with a low cache hit ratio will actually decrease performance: it de-optimizes the JVM garbage collector, forces memory fragmentation, and makes major and full garbage collections more frequent. The second-level cache primarily affects Session.get/find/load operations, as well as association traversal (but less so). The query cache affects queries, but has to be turned on for specific queries; it is keyed by the query and its parameters. This is good for things like list boxes. The only caching Hibernate does by default is Session caching: when you access entities through the session, they are kept referenced in the session, so if you access the same entity twice through Session.get/find/load it is retrieved from the session cache.
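
As a sketch of the per-query opt-in (this assumes hibernate.cache.use_query_cache is enabled in your configuration, and a hypothetical State entity for the classic list-box case):

   // The query cache is off unless requested per query, even when enabled globally.
   java.util.List states = session.createQuery("from State s order by s.name")
       .setCacheable(true)   // cache key: the query string plus its bind parameters
       .list();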

version columns/locking strategy

Most developers don't understand how most transactional databases work (see the DB2 section for a more in-depth discussion). When you are in a transaction, the data you update is a "copy" of the data; it is only saved over the original when you commit. However, if two concurrent transactions update the same data, the default behavior is "second one wins". Much of the time we want "first one wins, second one is informed". Hibernate can help. One way it can do this is to keep a copy of every field it read and execute a statement like this:

update thing set description='something in the db' where id=1 and date='10/10/01' and description='this here thing'

Meaning all of the original values are in the where clause. In order to do this, the session must remain open, which precludes the use of detached objects (for all intents and purposes). However, Hibernate also supports the use of version or timestamp columns. To use them, you must have a version or timestamp column in your table. You should have one anyhow, because even if Hibernate isn't your exclusive access method, you can do this without Hibernate:

update thing set description='something in the db', version=2 where id=1 and version=1

Even if the above DML were executed twice, the second execution would return "0 rows updated", which Hibernate uses (via the JDBC update count) to raise a StaleObjectStateException in the second updating thread. This is an efficient way to implement optimistic locking, and it of course works with detached objects. Obviously you can adapt legacy code to do this as well. While some data do require LockMode.UPGRADE or "select for update", much data do not. The great thing is that Hibernate manages the version matching/logic for you; you just catch the exception and decide what to do about the conflict.
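
A minimal sketch of what that looks like in annotated form (the entity and its fields are hypothetical):

   import javax.persistence.Entity;
   import javax.persistence.Id;
   import javax.persistence.Version;

   @Entity
   public class Thing {
       @Id private Long id;

       @Version              // Hibernate checks and increments this on every update
       private int version;

       private String description;
       // getters/setters omitted
   }

   // On flush/commit, a stale update raises org.hibernate.StaleObjectStateException;
   // catch it and decide how to resolve the conflict.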

DB2 and friends

With IBM's DB2 (and Informix) you get to do fun stuff like monitor lock contention. The big issue with DB2 is that it doesn't support something called Multi-Version Concurrency Control (MVCC). What is it? Well, this article gives a high-level explanation, but basically you work on a copy of the data in your transaction rather than on the real record; when the transaction commits, it is written back. DB2 basically supports row or page locks. This means that in the normal/default/usually-optimal "read committed" isolation level, DB2 will take a read lock on rows read in a transaction whenever Hibernate issues select statements. This can result in very slow performance if, for instance, you have lookup tables that are not (or cannot be) cached. Your options, though, suck. You can degrade the isolation level to "read uncommitted" (you do this in the datasource definition in JBoss). Since Hibernate uses a "write behind" strategy, you can use this as a "poor man's" MVCC: so long as you do not actually flush to the database, you get most of what MVCC tries to achieve. However, you have to be aware of Hibernate's flush mode behavior. There is a lot more to all this, and if you have to use DB2 (you'd have these problems with or without Hibernate) and you don't know all of this already, then you probably need help if you're going to achieve a high level of concurrency.

With all of that being said, while DB2 costs a lot less than Oracle, I can't say enough about how much more optimized for transactional computing Oracle is. However, Oracle is not your only option; this is not really a cola war. You can choose from several databases with MVCC or forms of it, including Microsoft SQL Server, MySQL and PostgreSQL. DB2 and Informix are basically left out in the cold among the "popular" databases. The benchmarks published by MySQL make this glaring. I'd put it to you that MVCC support is probably more important than a lot of other concerns in transactional computing (as opposed to batch processing and OLAP-type stuff). If licensing cost was the reason DB2 was chosen over Oracle, you may want to consider one of the other databases. Aside from Oracle, I have had the most experience with MySQL; they have great service and support, and with what you save on licensing you can probably buy some fatter hardware and possibly exceed the performance of the others for far less money.
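
If Hibernate manages its own connections, a sketch of degrading the isolation level looks like this; under JBoss you would instead set the corresponding transaction-isolation element in your *-ds.xml datasource file (the value is the standard java.sql.Connection constant):

   import java.sql.Connection;
   import org.hibernate.cfg.Configuration;

   Configuration cfg = new Configuration().configure();  // loads hibernate.cfg.xml
   // Dirty reads become possible at this level; only do this when you
   // understand the trade-off described above.
   cfg.setProperty("hibernate.connection.isolation",
           String.valueOf(Connection.TRANSACTION_READ_UNCOMMITTED));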

flushing your performance

Finally, a lot of developers call session.flush() excessively. Note that the default flush mode (AUTO) in Hibernate flushes data whenever a query is executed or whenever a transaction is committed, which makes calling flush() yourself normally unnecessary. It also means that if you do query, update, query, update, query, commit, flushes will happen at least 3 times. If you instead do query, query, query, update, update, commit and change the flush mode to COMMIT (instead of AUTO), you could achieve the same thing with only 1 flush. Flushing is pretty expensive (it is the point at which all pending DML is sent to the database), so economizing on it is good from a performance perspective. However, if you have a lot of junior developers (or careless people) on your team, then it is pretty hard to work in a FlushMode.COMMIT world: if I update thing.name='foo' and then execute "from thing where name='foo'", I will not get back the thing I just updated. You have to manage your own transactional space with regard to your own updates in a FlushMode.COMMIT world. So while it is good from a performance perspective, I wouldn't recommend it much of the time. In looking into this you may also want to look into the hibernate.jdbc.batch_size setting.
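
Here is a minimal sketch of the COMMIT pattern, with the caveat called out in the comments (sessionFactory, customerId, and the Customer entity are assumed to exist):

   import org.hibernate.FlushMode;
   import org.hibernate.Session;

   Session session = sessionFactory.openSession();
   session.setFlushMode(FlushMode.COMMIT);  // flush only at commit, not before queries
   session.beginTransaction();

   Customer c = (Customer) session.get(Customer.class, customerId);
   c.setName("foo");  // pending change, held in memory for now

   // CAVEAT: in COMMIT mode this query does not trigger a flush, so it will
   // NOT see the rename above -- you manage your own transactional space.
   java.util.List hits = session.createQuery(
           "from Customer c where c.name = 'foo'").list();

   session.getTransaction().commit();  // the single flush happens here
   session.close();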

While in most apps your performance tuning should start with your persistence layer, it should not stop there. Moreover, your settings in the persistence layer touch your appserver configuration (statement cache size), JVM tuning (garbage collection/heap size), and more. These are just a few of the biggies that I run across in my travels; there is obviously more to consider. We touch on all of this in the Hibernate 3 advanced training (except for the DB2 stuff) which JBoss and our partners offer.