I'm experiencing the same. Actually, for every insert, it iterates through all the entity beans, just to find that they are not dirty. It takes *time*.
These creates are done from a session bean. If I set the transaction-attribute on the session bean method to Required, the insertion stops after a while and triggers a timeout, causing the transaction to roll back (naturally).
With the transaction-attribute changed to NotSupported, the data finds its way into the database, but still slowly.
Why is this?
It might be (I'm speculating) that this is due to the code that ensures that fk's are there when you need them. (I think it was added after 2.4.3)
Does this problem occur if you insert each bean/row in a separate transaction? I would expect the code I'm thinking of to be bypassed in that case.
When doing smaller transactions, the problem doesn't occur. It appears to me that the cost of adding entity beans with relations in a single transaction grows far faster than linearly with bean count, which I think is rather bad...
For completeness, this is the kind of debug message that I get in the thousands when trying this:
[org.jboss.ejb.plugins.cmp.jdbc.JDBCStoreEntityCommand.Status] Store command NOT executed. Entity is not dirty: pk=1673
I'll take a look at this.
RH 20020105 from CVS, W2K, Oracle 8.1.7.
I am currently working on converting a set of about 3000 records from an external table into an Entity Bean extent inside a SLSB method with transaction attribute "Required" (maybe it should be "RequiresNew", but that's not relevant for the discussion below).
From looking at the debug log I found the following behaviour, without really knowing the JBoss internals (transaction manager default configuration):
1) The JDBCStoreCommand.execute generates a disastrous amount of debug output by printing "...NOT updated." for each bean instance it looks at. Getting rid of that would be easy.
2) The root cause of this large amount of debug output is the really interesting issue:
Whenever a new bean instance is about to be created the instance cache is "flushed". By this I mean the process where the storage manager walks through the list of all cached instances of the bean class, fetches the list of modified fields, and, if it detects a modification, writes the outstanding modifications to persistent store.
The problem with this approach is that it leads to O(m*n) behaviour, where m is the number of instances in the cache and n is the number of operations that trigger a flush on the cache.
Imagine this scenario: the instance cache is configured to hold 10000 instances. Now a bulk insert of 5000 new instances (as often occurs when importing data from an external table) is started. On the first insert JDBCStoreCommand.execute has to check 0 instances, on the second insert one instance, on the third insert two instances, and so on. All in all the execute() method will be called (0 + 1 + 2 + ... + n-1) times, where n is the number of inserts, resulting in n*(n-1)/2 calls to execute(). In the case of 5000 new instances this means that
there are 5000*4999/2 = 12497500, ie. more than 12 million calls to execute().
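The arithmetic can be sanity-checked with a standalone loop (my own sketch, nothing JBoss-specific):

```java
// Count the dirty-checks performed across a bulk insert of n beans:
// the i-th insert (0-based) has to scan the i instances already
// sitting in the cache.
public class FlushCost {
    static long executeCalls(int n) {
        long calls = 0;
        for (int i = 0; i < n; i++) {
            calls += i;  // i cached instances checked on this insert
        }
        return calls;    // equals n*(n-1)/2
    }

    public static void main(String[] args) {
        // For 5000 inserts this is more than 12 million checks.
        System.out.println(executeCalls(5000));
    }
}
```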
The number of lines of debug output grows with the square of the number of instances inserted. At some point a single insert creates more debug output than the logging file (default size = 0.5 MB) can hold. This results in a very interesting behaviour: the logging file has to be rotated on every single insert (if the number of instances is large, this will even happen several times per insert).
The "net gain" of all this is that I still have not succeeded in importing my 3000 records. First I had to set the transaction timeout to half an hour. Now the import chokes after 2700 records complaining about "Expected number of changed rows not 1" (or similar, from memory).
But that's a different story...
The core problem is the strategy for detecting modified instances: instead of inspecting every cached instance on each flush to decide whether it needs to be updated, the modified instances should be registered in a "modified instance" list, so that no extra effort is needed to detect "dirty" instances; the set of modified instances would already be known.
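A minimal sketch of what such a registry could look like (a hypothetical class of my own, not the actual JBoss storage manager):

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Sketch of a "modified instance" registry: CMP field setters mark an
// instance dirty, so a flush only has to visit the d dirty instances
// instead of scanning all m cached instances.
public class DirtyRegistry {
    private final Set<Object> dirty = new LinkedHashSet<Object>();

    // Called from every CMP field setter.
    public void markDirty(Object primaryKey) {
        dirty.add(primaryKey);
    }

    // Called when the container flushes: returns exactly the set of
    // instances that need a store, then clears the registry.
    public Set<Object> drain() {
        Set<Object> toStore = new LinkedHashSet<Object>(dirty);
        dirty.clear();
        return toStore;
    }
}
```

With this in place the flush cost depends on the number of modified instances, not on the cache size.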
If the behaviour described above is not changed, large instance caches, which should presumably speed up operations, will bring at least the insert performance down to a crawl even for very modest numbers of inserts (hey, 5000 instances are nothing in the real world of tables with millions of records, in particular not in the Oracle world).
I also would like to suggest adding a stress test to the test suite that inserts at least 10000 records in a single transaction. If this takes longer than a few minutes we have a problem here (I'd rather suggest 50,000 rows...).
Question: is there a configuration option to get rid of the behaviour described above (transaction configuration option A, B, C or D)?
Any other buttons to press or controls to adjust?
Appreciate any help on this.
As a temporary workaround, how slow is it if you disable the debugging output?
AFAIK there is no way to turn off the checking behavior.
Some degree of checking behavior is necessary to make sure fks get updated before deletes.
I think this can be fixed by adding to the txEntityMap not when the entity is enrolled in the tx, but when it cannot be determined to be in sync with the db. In other words, after every method call, see if we can determine whether it is dirty or not. If it is dirty or indeterminate (CMP 1.1 with no isModified() method), put it in the map; otherwise remove it. This will add constant time to each method call but eliminate the quadratic behavior.
I commented out the debug message and recompiled.
Now the time it takes to insert 3464 rows is 225 seconds (JUnit timing) compared to more than 700 seconds with this log statement.
The performance is now 15.4 rows per second. It is tolerable in my current situation, but in some respects misleading.
Because of the quadratic behaviour there are still rather tight hard limits on scalability. For instance, inserting 5000 rows might already be intolerably slow.
To get a feeling for this I did some experiments. The outcome is the following table:

records   time (seconds)   factor (n)   factor (time)
 1000          38              1             1
 2000         103              2             2.7
 3464         225              3.4           5.9
Computing the exact quadratic equation is left as an exercise :-).
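The "exercise" amounts to solving a 3x3 linear system through the three measurements above; here is a small standalone Java sketch (the data points are copied from the table, everything else is my own code):

```java
// Fit t(n) = a*n^2 + b*n + c exactly through the three (records,
// seconds) measurements, using Cramer's rule. The interesting part is
// that the n^2 coefficient comes out positive, consistent with the
// quadratic growth discussed above.
public class QuadraticFit {
    public static double[] fit() {
        double[] n = {1000, 2000, 3464};
        double[] t = {38, 103, 225};
        double[][] m = new double[3][3];
        for (int i = 0; i < 3; i++) {
            m[i][0] = n[i] * n[i];
            m[i][1] = n[i];
            m[i][2] = 1;
        }
        double d = det(m);
        return new double[] {
            det(withColumn(m, 0, t)) / d,   // a
            det(withColumn(m, 1, t)) / d,   // b
            det(withColumn(m, 2, t)) / d    // c
        };
    }

    static double det(double[][] m) {
        return m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
             - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
             + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
    }

    static double[][] withColumn(double[][] m, int col, double[] v) {
        double[][] r = new double[3][3];
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++)
                r[i][j] = (j == col) ? v[i] : m[i][j];
        return r;
    }

    public static void main(String[] args) {
        double[] abc = fit();
        System.out.printf("t(n) = %.3e*n^2 + %.3e*n + %.1f%n",
                          abc[0], abc[1], abc[2]);
    }
}
```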
BTW: The HW used is PIII 733 MHz, 256 MB, with JBoss running on the same server and the Oracle DB running on a Sun-250 with 1 GB RAM.
I think the enclosed diff fixes this. Could you check?
I'm thinking about a couple of other refactorings here, but I'd like to make sure this works first.
I've applied your patch and have seen an increase in performance for inserts. On my P3 733MHz w/ 512MB RAM, running Linux with MySQL, I can create 2000 entity beans in under 30 seconds.
Now, I am having a disastrously slow time in accessing the beans. My test beans are all local beans and I have a session bean that is commanded to modify every one of the beans... The session bean does a find all, then walks the returned collection modifying some attributes. My tests have revealed that it takes about 1 second to do:
MyBean b = (MyBean) iterator.next();
Now, when I only have 100 beans here is the output of my session bean:
processed 10 beans in 1172ms
processed 10 beans in 959ms
processed 10 beans in 800ms
processed 10 beans in 698ms
processed 10 beans in 565ms
processed 10 beans in 513ms
processed 10 beans in 359ms
processed 10 beans in 278ms
processed 10 beans in 224ms
processed 10 beans in 101ms
processed 100 beans in 5679ms
See how it gets faster as time goes on? Does read-ahead come into play here?
Is there some way we could configure the queries (like findAll) to load all the beans at once and not update them when they are retrieved from the returned collection?
Anyway, going from 55ms average read/write when you have 100 beans to 1sec average read/write when you have 2000 beans does not bode well for scalability...
I'm not at all convinced that I have configured JBoss for maximum performance, so, any additional tips would be greatly appreciated.
PS. as an aside, if I purchase the CMP 2.0 docs (for JBoss 3) do I get all the updates that will inevitably occur as JBoss 3 evolves?
The answer to the above problem lies with ReadAheadCache.
When it goes to the first entity bean it does a query for entity beans between 0 and 99, adding them to its preload data.
Well, that is all well and good, but when it goes to the second one, it does NOT find the second one in its preloaded data! Well, it now does a query from 1 to 99 and adds them to the preload data.
Well, again, that is all well and good, but when it goes to the third one, again, it does NOT find the third one in its preloaded data! Well, it now does a query from 2 to 99 and adds them to the preloaded data.
You can see where I'm going. Eventually, the number of items shrinks thereby resulting in the increased efficiency as it goes down the line...
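The shrinking-window behaviour can be modelled with a toy loop (my own sketch, not the actual ReadAheadCache code), counting how many rows are fetched for a single page of entities:

```java
// Model of the behaviour described above: each of the pageSize
// accesses misses the preload data and re-queries from its own
// position to the end of the page.
public class ReadAheadCost {
    static long rowsFetched(int pageSize) {
        long total = 0;
        for (int i = 0; i < pageSize; i++) {
            total += pageSize - i;  // this access loads rows i..pageSize-1
        }
        return total;               // pageSize*(pageSize+1)/2
    }

    public static void main(String[] args) {
        // For a page of 100 entities: 100 + 99 + ... + 1 = 5050 rows
        // fetched instead of 100.
        System.out.println(rowsFetched(100));
    }
}
```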
Now, why can't it find its previously preloaded data? Does it have something to do with the transaction?
I purchased the docs and I've already been notified once of a free update.
Ok, after further experimentation, I discovered that since I was doing the transaction only at the entity bean level, it was clearing the cache in between transactions. Well, I moved the transaction up to the session bean level and things are now staying in the cache.
This led me to another problem with read ahead. There is no way to set it. I mean, there is, but the code ignores the attribute and just uses 1000, which is way too high. I've found that 50 works pretty well for me.
Here is a patch to ReadAheadCache.java that uses the 'page-size' attribute correctly.
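With the patch applied, the page size should be picked up from the read-ahead settings in jbosscmp-jdbc.xml; a fragment along these lines (JBoss 3.x CMP syntax, to the best of my knowledge):

```xml
<jbosscmp-jdbc>
  <defaults>
    <read-ahead>
      <strategy>on-load</strategy>
      <!-- 50 worked well for me; the hard-coded 1000 is far too high -->
      <page-size>50</page-size>
      <eager-load-group>*</eager-load-group>
    </read-ahead>
  </defaults>
</jbosscmp-jdbc>
```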
However, I am still noticing some peculiar behavior. It only takes me 30 seconds to iterate through 9000 entity beans (via a findAll()) setting some attributes, on the server side. But my client (who called the session bean) waits 2.5 minutes to complete the call. Is the extra time because the transaction is being committed?