7 Replies Latest reply on Nov 27, 2013 9:30 AM by jonathandfields

Poor query performance when data is not in cache

jonathandfields Nov 25, 2013 12:08 PM

Hi All,

I am experiencing very poor query performance when the data is not in the ISPN cache. When data is not in the cache, a query like

select [jcr:path] from [nt:unstructured] where prop1 = 'val1'

can take 10-20 seconds to complete the first time. It is returning perhaps 5K nodes out of a possible ~6M nodes. The second time the query returns in less than a second. So I am assuming the difference is that the data is not in the ISPN cache the first time.

I have about 6M nodes loaded into a repository. Only the leaf nodes have properties and there are around 6 properties per node. The configuration is MS 3.6/EAP 6/BDBJE class loader. Eviction is LIRS with max-entries=20000. JVM min memory is 4G and max 8G (20 G server). JBoss config file is attached.

Any thoughts as to why this is happening or how I can narrow down the bottleneck? I tried to profile the application using Visual VM (following quite a few suggestions on the web how to get past classpath issues) but was unfortunately unable to get that to work (sigh).

I have not tried it with another cache loader (e.g. JDBC). While that might help narrow down the problem to BDBJE, another loader is not a viable option for this application.

Thanks!

modeshape-standalone.xml 22.8 KB
standalone.conf.zip 987 bytes

1. Re: Poor query performance when data is not in cache

rhauch Nov 25, 2013 12:39 PM (in response to jonathandfields)

I suspect that this is happening because, per the JCR specification, a JCR query (whether JCR-SQL2 or other language) can only return in the query result those nodes for which the session has read-access. Therefore, even though ModeShape knows via the indexes which nodes satisfy the query, ModeShape still has to load the node to verify that it has read access because it has to use the session's access check methods.

IIRC Infinispan will always use a power of 2 for the actual maximum number of entries, so with 'maxEntries=20000' this would be 32768 (or 2^15). The likelihood of the 5K nodes in the result set being among the 20K in the cache is relatively small. Basically, the cache hit rate is pretty small. Once a query is completed and all results have been loaded/processed, if the exact same query is run again then the nodes coming back from the indexes will all happen to be in-memory within the Infinispan cache (at least since 5K < 20K max entries size); this is why the second query is fast. However, if the repository would be used such that all/most of those 5K nodes were to be evicted, I suspect your query will take longer again.

Any chance you can bump up the 'maxEntries' value? But even in this case, you're still going to be at the mercy of your cache hit rate for query results.

I have not tried it with another cache loader (e.g. JDBC). While that might help narrow down the problem to BDBJE, another loader is not a viable option for this application.

I suspect the JDBC cache store will be slower than BDBJE.

Clustering would help speed this up, but that may not be an option for you.
Actions
2. Re: Poor query performance when data is not in cache

jonathandfields Nov 25, 2013 2:09 PM (in response to rhauch)

Thanks for the quick response. I have always assumed that it would be slower if the nodes were not in the cache, but not this much slower. Until now, I have not used queries much, as access has been mostly navigational so the cache loads were amortized over the child traversals. Therefore I was surprised when I first encountered this, but when the second query returned so quickly I suspected ISPN cache loading was the culprit.

Unfortunately my data set has many millions of nodes and that number will be steadily increasing. Even if I could manage to fit everything into memory (e.g. by a distributed ISPN cluster), I would still suffer poor query performance for the first query, unless I preloaded the cache. But, preloading a distributed cache with that much data (the BDBJE data files consume ~10G) is going to be operationally difficult (restart/reboot).

I will need to experiment, but would you expect using the limit and order by keyword to help - or does MS still need to load all of the nodes in the result set in that case? This could be a possible workaround....

I am surprised that it takes ISPN that much time to load 5K (small) nodes..... That definitely limits its usefulness for large disk resident data sets. I am wondering what the raw BDBJE performance is for fetching 5K objects? I may need to experiment with this in order to have some kind of data management facility for the the entire (disk resident) data set, and then perhaps only manage much smaller "working sets" in MS that can fit into memory.

Thanks,
Jon
Actions
3. Re: Poor query performance when data is not in cache

rhauch Nov 25, 2013 2:15 PM (in response to jonathandfields)

I will need to experiment, but would you expect using the limit and order by keyword to help - or does MS still need to load all of the nodes in the result set in that case? This could be a possible workaround....

Yes, IIRC limit should help because it should be applied before we even try to load the nodes for access checks.

I am surprised that it takes ISPN that much time to load 5K (small) nodes..... That definitely limits its usefulness for large disk resident data sets. I am wondering what the raw BDBJE performance is for fetching 5K objects? I may need to experiment with this in order to have some kind of data management facility for the the entire (disk resident) data set, and then perhaps only manage much smaller "working sets" in MS that can fit into memory.

You might want to enable debug or trace logging (or debug with your favorite IDE) to see where the code is taking the longest. My analysis above is based purely upon your statements saying that the first query takes a long time while running the exact query again is faster. You may want to try running completely different queries that each return completely different nodes, as this might tell us whether it is only the first query (of any queries) that incurs the overhead.
Actions
4. Re: Poor query performance when data is not in cache

jonathandfields Nov 26, 2013 12:00 PM (in response to rhauch)

The behavior appears to be the same whether it is a result of a query returning thousands of nodes, or a node having thousands of children: if the nodes are not in the cache (first query), it takes 30-60 secs ; otherwise if the nodes are in the cache (second query) it takes around a second. I expect cache loading performance to be slower, but am surprised it is that much slower. I admit that my expectations are based upon an RDBMS query where the results are quick and consistent. I had hoped for the un-cached case to be roughly equal to an RDBMS query, and the cached case to be faster.

If I use limit, offset and order by in the query, then it does reduce the time to a reasonable amount as hoped. The first (un-cached) query with a limit 10 results returns in a few seconds, whereas the unlimited query which returns thousands takes over a minute. However, for child node traversals, it can still take well over a minute if there are thousands of un-cached children.

I've run some profiles using YourKit (could not get Visual VM to work) and that appears to confirm that the bottleneck is the cache loading. I have not yet determined whether that is in low level BDBJE "gets", or whether it is spread out over all the MS/ISPN/BDBJE calls. The call stack per node load is pretty deep, so it is conceivable that overhead accumulates over many thousands of node loads.

Another observation is that the node storage overhead appears to be significant when compared to rows in an RDBMS. I'm guessing this is because the parent/child pointers, type, mixins, property names and types etc all being stored in each node. That is not an issue for disk storage, but it does make it more difficult to fit the data into memory, which is what yields the best performance :-) If the node footprint was less, then optimal performance would be easier to achieve with large repositories.

Thanks.
Actions
5. Re: Poor query performance when data is not in cache

jonathandfields Nov 26, 2013 3:46 PM (in response to jonathandfields)

Update: It appears that for large amounts of data, that increasing the BDBJE log file size to reduce the number of log files can improve cache loading time. I increased it from 10M to 1G and the results for loading thousands of nodes seems to be in the 10-15 sec range now - more in the ballpark of what I was hoping to see. Perhaps the overhead of opening and closing files was contributing to the load time since the query results are probably quite spread out. FYI this is accomplished by placing a je.properties file in the BDBJE home directory with the je.log.fileMax property set.

I hope to now set up a distributed ISPN/BDBJE cache (single MS instance, multiple ISPN instances, all in EAP6) so I can get more data in-memory....
1 of 1 people found this helpful
Actions
6. Re: Poor query performance when data is not in cache

hchiorean Nov 27, 2013 12:15 AM (in response to jonathandfields)

Thanks for the updates Jonathan, this is valuable info.
Actions
7. Re: Poor query performance when data is not in cache

jonathandfields Nov 27, 2013 9:30 AM (in response to hchiorean)

The other pertinent configuration variable appears to be je.log.fileCacheSize, which controls the open file descriptor cache. The default is 100. If the OS allows you to have a large number of open files, increasing this should accomplish the same result - reducing the number of file opens and closes. Larger log files are going to result in more wasted space since the cleaner cannot delete them as quickly.

The JE configuration is summarized here: http://docs.oracle.com/cd/E17277_02/html/java/com/sleepycat/je/EnvironmentConfig.html.
Actions

Go to original post