5 Replies Latest reply on Apr 7, 2013 7:39 PM by bwallis42

    Best way to configure ModeShape & Hibernate Search so that indexes are stored in Infinispan

    hchiorean

      As you may know, ModeShape has for some time offered the option of storing its indexes in Infinispan, via Hibernate Search's Infinispan integration. However, we've been having an increasing number of issues with this feature, both in clustered and in non-clustered environments.

       

      The purpose of this thread is to accumulate the knowledge gained so far on this topic and hopefully figure out a way to properly & reliably configure the Infinispan caches, so that indexing works in various setups.

       

      ModeShape - Hibernate Search integration


      ModeShape uses Hibernate Search in a pretty simple way that comes down to 2 key areas:

       

      And that's pretty much it; the rest depends on how Hibernate Search is configured.

      However, one very important aspect of the way indexes are defined is that for each and every repository node that gets indexed, ModeShape will always use the same index (same index name).
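
      To make that concrete, here is a minimal sketch (ordinary Hibernate Search properties, not ModeShape's actual internal wiring) of what a single shared, Infinispan-backed index amounts to; "nodeinfo" is taken from the index name visible in the stack trace later in this thread:

      {code}
      import java.util.Properties;

      public class SingleIndexSketch {
          public static Properties properties() {
              Properties props = new Properties();
              // Store the Lucene index segments in Infinispan rather than on disk.
              props.setProperty("hibernate.search.default.directory_provider", "infinispan");
              // Since every indexed node lands in the same index, a per-index
              // setting (here for "nodeinfo") effectively applies to everything.
              props.setProperty("hibernate.search.nodeinfo.exclusive_index_use", "true");
              return props;
          }
      }
      {code}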

       

      Hibernate Search - Infinispan local cache configuration

      It seems that even when running a single local node, indexing can behave quite differently based on how the index caches are configured. There are 3 caches used for indexing: "index-data", "index-locks" and "index-metadata". What we've seen so far is that various transactional combinations of those caches can result in different issues (ranging from lock timeouts to index data corruption).
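
      As an example of pinning that down, here is a minimal sketch using Infinispan's programmatic API (5.x era) that defines all three caches with one explicit, identical transaction mode, so at least the combination is deliberate; whether NON_TRANSACTIONAL is actually the right mode is part of what this thread needs to establish:

      {code}
      import org.infinispan.configuration.cache.Configuration;
      import org.infinispan.configuration.cache.ConfigurationBuilder;
      import org.infinispan.manager.DefaultCacheManager;
      import org.infinispan.transaction.TransactionMode;

      public class IndexCacheSetup {
          public static void main(String[] args) {
              DefaultCacheManager manager = new DefaultCacheManager();
              Configuration nonTx = new ConfigurationBuilder()
                      .transaction().transactionMode(TransactionMode.NON_TRANSACTIONAL)
                      .build();
              // Give all three index caches the same, explicit transaction setting;
              // mixing modes across them is one of the suspected failure sources.
              for (String name : new String[] { "index-data", "index-locks", "index-metadata" }) {
                  manager.defineConfiguration(name, nonTx);
              }
          }
      }
      {code}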

       

      More information on this topic can be found here: https://issues.jboss.org/browse/MODE-1876

      Hibernate Search - Infinispan clustered setup using multiple index writers

       

      Problems with this setup are described in more detail in issues such as https://issues.jboss.org/browse/MODE-1843 or https://issues.jboss.org/browse/MODE-1845.

      Based on those, and the fact that using the same index writer will always result in contention on the same index write lock (held as a cache entry in Infinispan), it seems the only viable way to cluster multiple index-writing nodes is a JMS master/slave configuration, where in effect only the master node updates the index. However, this may still exhibit the issues of a local, non-clustered node if the caches are not configured properly.
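
      For completeness, here is a minimal sketch of the slave-side Hibernate Search properties for such a JMS backend; the JNDI names are placeholders, and how (or whether) ModeShape exposes these properties depends on its own configuration format. The master node keeps the default Lucene backend and drains the queue, so only it ever takes the index write lock:

      {code}
      import java.util.Properties;

      public class JmsSlaveBackendSketch {
          public static Properties slaveProperties() {
              Properties props = new Properties();
              // Slaves delegate index updates to a JMS queue instead of writing.
              props.setProperty("hibernate.search.default.worker.backend", "jms");
              props.setProperty("hibernate.search.default.worker.jms.connection_factory",
                      "java:/ConnectionFactory"); // placeholder JNDI name
              props.setProperty("hibernate.search.default.worker.jms.queue",
                      "queue/hibernatesearch");   // placeholder queue name
              return props;
          }
      }
      {code}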

        • 1. Re: Best way to configure ModeShape & Hibernate Search so that indexes are stored in Infinispan
          clementp

          It sounds to me like ModeShape should, at the very least, shard by repository if not by node type. That should reduce index update contention dramatically.

          • 2. Re: Best way to configure ModeShape & Hibernate Search so that indexes are stored in Infinispan
            rhauch

            You should never have multiple repositories sharing the same set of indexes; always give each repository its own set of indexes. And depending upon the configuration, you may want each process to have its own indexes (e.g., if you're putting them on the file system).

             

            Sharding by node type would be great, but there are two problems:

             

            First, because a node has a primary type AND optional mixin types, a node can be explicitly declared to have multiple node types. Since we write only the explicit node types to the indexes (not the inherited ones), a node with a primary type and two explicit mixins would need to be written to 3 indexes (one for the primary type and one for each of the two mixins).

             

            Second, queries all have criteria against the node types (since the FROM clauses dictate node type criteria), and when the query is planned we use cached knowledge of the node type hierarchy to determine all of the effective node types that each FROM clause applies to. For example, consider a query that tries to find some "mix:created" nodes, so our query looks something like this: "SELECT * FROM [mix:created]". In this case, the actual node type we're looking for is just "mix:created", but the effective node type includes all node types for which "mix:created" is a supertype. Just thinking about a few of the built-in types (see image below), the effective types for "mix:created" are: "mix:created", "nt:hierarchyNode", "nt:folder", and "nt:file". So even considering just a few of the built-ins, that single node type criterion just got turned into 4 criteria.

             

            [Image: NodeTypeInheritance.png (inheritance among the built-in node types)]

             

            With even modest use of inheritance, the number of effective node types for most node types is likely to be higher than 4.
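
            A small illustrative sketch of that expansion (the subtype map below is hypothetical and covers only the built-ins from the example above): the effective types for a queried node type are the type itself plus every type that has it, directly or transitively, as a supertype:

            {code}
            import java.util.*;

            public class EffectiveTypes {
                // Hypothetical map of each node type to its direct subtypes.
                static final Map<String, List<String>> SUBTYPES = new HashMap<String, List<String>>();
                static {
                    SUBTYPES.put("mix:created", Arrays.asList("nt:hierarchyNode"));
                    SUBTYPES.put("nt:hierarchyNode", Arrays.asList("nt:folder", "nt:file"));
                }

                static Set<String> effectiveTypes(String queried) {
                    Set<String> result = new LinkedHashSet<String>();
                    Deque<String> todo = new ArrayDeque<String>();
                    todo.push(queried);
                    // Walk the subtype graph, collecting every reachable type.
                    while (!todo.isEmpty()) {
                        String type = todo.pop();
                        if (result.add(type)) {
                            List<String> subs = SUBTYPES.get(type);
                            if (subs != null) {
                                todo.addAll(subs);
                            }
                        }
                    }
                    return result; // for "mix:created": the 4 types listed above
                }
            }
            {code}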

             

            With a single index, we're actually able to very effectively handle any number of effective node types. We turn each FROM clause into a single Lucene OR criterion over all the effective node types. But changing to multiple indexes (one for each explicit node type) would greatly complicate this. All of these now-simple Lucene criteria for node types would need to be replaced with a separate query per effective node type, and we'd have to join all the results, since Lucene doesn't do joins between indexes.
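
            Concretely, here is a minimal sketch, against the Lucene 3.6 API in use at the time, of that single OR criterion; the "nodeTypes" field name is an assumption for illustration, not necessarily the field ModeShape actually uses:

            {code}
            import org.apache.lucene.index.Term;
            import org.apache.lucene.search.BooleanClause;
            import org.apache.lucene.search.BooleanQuery;
            import org.apache.lucene.search.TermQuery;

            public class NodeTypeCriterion {
                public static BooleanQuery forEffectiveTypes(Iterable<String> effectiveTypes) {
                    BooleanQuery query = new BooleanQuery();
                    for (String nodeType : effectiveTypes) {
                        // SHOULD clauses together behave as an OR over the types.
                        query.add(new TermQuery(new Term("nodeTypes", nodeType)),
                                  BooleanClause.Occur.SHOULD);
                    }
                    return query;
                }
            }
            {code}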

             

            So, yes, it is possible to shard in a number of different ways (by property, by node type, etc.) to improve the performance of writing/updating the indexes (which happens only when content is changed), but any sharding will greatly complicate our query execution, which will slow down all queries (even when no changes are made to content).

             

            With the current approach, we take the performance hit when content is changed and get very good performance for all query executions.

             

            Now, there may be ways of sharding the indexes in other ways. For example, we could shard by node keys, so each index "owns" a non-overlapping set of nodes. Most of our keys contain a UUID for the identifier part, and UUIDs have good spread. This certainly has the potential to reduce write lock contention, but given that many updates (e.g., session.save calls) involve multiple nodes, it's possible that a single update would require updating multiple indexes. So the risk of write lock contention goes back up.
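
            A hypothetical sketch of that idea (the helper below is purely illustrative); note that each node a save touches is routed independently, which is how a single save can end up holding several shard locks:

            {code}
            import java.util.UUID;

            public class KeyShard {
                // UUIDs spread well, so a simple modulo gives an even distribution.
                static int shardFor(UUID nodeId, int shardCount) {
                    return Math.abs(nodeId.hashCode() % shardCount);
                }
            }
            {code}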

             

            Another option might be to shard by path (or some other similar characteristic), trying to leverage the fact that a session will likely update nodes in one or a small number of "areas". Of course, this has a number of complicating factors, including how to break down the "areas" (e.g., determine the sharding regions), and how moves and renames will be handled (either could move a node from one index to another; we don't currently need to handle this, because we just update the node with the new path to overwrite the existing document).
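
            To make the move/rename complication concrete, here is a purely hypothetical sketch that shards on the first path segment; moving or renaming a node across "areas" changes which index owns it, so the update becomes a delete in one index plus an add in another:

            {code}
            public class PathShard {
                static String shardFor(String path) {
                    String[] segments = path.split("/");
                    return segments.length > 1 ? "index-" + segments[1] : "index-root";
                }

                public static void main(String[] args) {
                    System.out.println(shardFor("/projects/specs.txt")); // index-projects
                    // After a move to /archive/specs.txt the same node maps to a
                    // different index, which the single-index approach never faces.
                    System.out.println(shardFor("/archive/specs.txt"));  // index-archive
                }
            }
            {code}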

             

            I'm open to ideas.

            • 3. Re: Best way to configure ModeShape & Hibernate Search so that indexes are stored in Infinispan
              clementp

              I misspoke; I was referring to workspaces, not repositories. Could workspaces be the sharding criterion?

               

              As for the issue with nodes: when I looked at Hibernate Search's IndexShardingStrategy implementation, it does seem like you "can" make it do the joining of results for you (instead of looking at every shard, just look at a couple of "relevant" shards). I understand, though, the need for duplicating nodes across indices in order for inheritance to work, so that's probably not a good idea.

               

              Hence, what about sharding by workspace (our use case is heavily customer dependent, so we might as well split the data up at that level) and also exposing the index shard configuration (e.g. 5 shards per workspace)? I don't think you can currently configure the number of shards that Hibernate Search uses. I understand the scoring consequences of sharding in Lucene and the duplication of metadata.
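
              For reference, Hibernate Search itself takes a shard count through its sharding_strategy.nbr_of_shards property; a minimal sketch, with "nodeinfo" standing in for the single index name (whether ModeShape passes this property through is exactly the open question):

              {code}
              import java.util.Properties;

              public class ShardCountSketch {
                  public static Properties fiveShards() {
                      Properties props = new Properties();
                      // Splits the named index into 5 shards at the Hibernate
                      // Search level, independently of any ModeShape support.
                      props.setProperty(
                              "hibernate.search.nodeinfo.sharding_strategy.nbr_of_shards", "5");
                      return props;
                  }
              }
              {code}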

              • 4. Re: Best way to configure ModeShape & Hibernate Search so that indexes are stored in Infinispan
                bwallis42

                I have a simple setup with the indexes in a non-replicated local cache on a single appserver. The ModeShape version is 3.2-SNAPSHOT from a few days ago (last Thursday, I think). The JDK is 1.7.0_17 on Mac OS X. (More details in this thread.)

                 

                After running a load test for some time, I get this error:

                 

                {code}22:48:31,073 ERROR [org.hibernate.search.exception.impl.LogErrorHandler] (Lucene Merge Thread #3666 for index nodeinfo) HSEARCH000058: HSEARCH000118: Exception during index Merge operation: org.apache.lucene.index.MergePolicy$MergeException: java.io.FileNotFoundException: No sub-file with id .prx found (fileName=_5vo.cfs files: [.tii, .fnm, .tis, .frq, .fdt, .nrm, .fdx])
                        at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:509) [lucene-core-3.6.2.jar:3.6.2 1423725 - rmuir - 2012-12-18 19:45:40]
                        at org.hibernate.search.backend.impl.lucene.overrides.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:58) [hibernate-search-engine-4.2.0.Final.jar:4.2.0.Final]
                        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482) [lucene-core-3.6.2.jar:3.6.2 1423725 - rmuir - 2012-12-18 19:45:40]
                Caused by: java.io.FileNotFoundException: No sub-file with id .prx found (fileName=_5vo.cfs files: [.tii, .fnm, .tis, .frq, .fdt, .nrm, .fdx])
                        at org.apache.lucene.index.CompoundFileReader.openInput(CompoundFileReader.java:157) [lucene-core-3.6.2.jar:3.6.2 1423725 - rmuir - 2012-12-18 19:45:40]
                        at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:96) [lucene-core-3.6.2.jar:3.6.2 1423725 - rmuir - 2012-12-18 19:45:40]
                        at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:116) [lucene-core-3.6.2.jar:3.6.2 1423725 - rmuir - 2012-12-18 19:45:40]
                        at org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:696) [lucene-core-3.6.2.jar:3.6.2 1423725 - rmuir - 2012-12-18 19:45:40]
                        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4238) [lucene-core-3.6.2.jar:3.6.2 1423725 - rmuir - 2012-12-18 19:45:40]
                        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3908) [lucene-core-3.6.2.jar:3.6.2 1423725 - rmuir - 2012-12-18 19:45:40]
                        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:388) [lucene-core-3.6.2.jar:3.6.2 1423725 - rmuir - 2012-12-18 19:45:40]
                        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:456) [lucene-core-3.6.2.jar:3.6.2 1423725 - rmuir - 2012-12-18 19:45:40]
                {code}

                 

                I'm using the default indexing setup (synchronous), and the transaction mode for the three index caches is NONE (the full config is attached). The error doesn't occur until some thousands of nodes have been created. I don't yet have a simple test case that reproduces this (it happens in a client/server load test app).

                 

                Is this a known problem or should I try to create a test case for this?

                 

                thanks

                • 5. Re: Best way to configure ModeShape & Hibernate Search so that indexes are stored in Infinispan
                  bwallis42

                  In a persistence setup where I'm using write-through cache loaders for all the data and indexes, do I need to persist the "index-locks" cache? Is this data needed for a restart after a shutdown or crash?