Poor Lucene index query performance on a single string property
mbachmann Jan 19, 2017 5:19 PM
For my company's product, I'm looking at switching over to using the Lucene index provider due to some random (unfortunately not reproducible as of yet) index corruption issues that have been seen with the local index provider.
While looking at performance as part of that switch, I noticed that queries against VALUE indexes on simple string properties were performing poorly. I threw together a small test application which inserts 1000 nodes of a type that has an index on a single string field, populating the field with a unique sequence of values so that the cardinality is as high as possible. The test then runs 1000 searches using that field as the constraint with a random value. Test Java code, ModeShape config, and CND attached. The results were as follows (ModeShape 5.3.0, Windows 7 x64, Java 1.8.0_91):
Inserted 1000 in 580 ms.
Searched 1000 nodes 1000 times in 57314 ms.
Deleted 1000 in 25 ms.
Running this under a profiler shows that all of the time is spent in the IndexReader.document call within ConstantScoreWeightQuery.java. Looking at this code, it seems that this query is basically doing a linear scan of the index and forcing Lucene to instantiate a full document for each entry. Following that logic and digging into the code, I changed the EQUAL_TO case in LuceneQueryFactory.stringFieldQuery from:
case EQUAL_TO: return CompareStringQuery.createQueryForNodesWithFieldEqualTo(stringValue, field, factories, caseOperation);
to just using the built-in Lucene TermQuery:
case EQUAL_TO: return new TermQuery(new Term(field, stringValue));
The results running with this change are:
Inserted 1000 in 627 ms.
Searched 1000 nodes 1000 times in 1327 ms.
Deleted 1000 in 24 ms.
So search time improved by roughly 40x (57314 ms down to 1327 ms), which seems pretty good, and at least from my other testing the change still appears to return correct results.
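To make the difference between the two access patterns concrete, here is a minimal stand-alone sketch. It is not ModeShape or Lucene code: the class LookupSketch, the Doc type, and the methods scan and lookup are hypothetical names I made up for illustration. The scan method mimics what the profiler suggests the current query does (visit every stored document and compare the field), while lookup mimics what I assume a term query does against an inverted index (jump straight to the list of matching document ids).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: a toy index, not ModeShape or Lucene internals. */
public class LookupSketch {

    /** Hypothetical stand-in for a stored document with one string field. */
    static final class Doc {
        final int id;
        final String value;

        Doc(int id, String value) {
            this.id = id;
            this.value = value;
        }
    }

    // Stored documents, in insertion order (document id == list position).
    private final List<Doc> stored = new ArrayList<>();
    // Toy inverted index: field value -> ids of documents holding that value.
    private final Map<String, List<Integer>> postings = new HashMap<>();

    /** Adds a document and records its id under the field value. */
    public void add(String value) {
        int id = stored.size();
        stored.add(new Doc(id, value));
        postings.computeIfAbsent(value, k -> new ArrayList<>()).add(id);
    }

    /** Linear scan: touches every document, so cost grows with index size. */
    public List<Integer> scan(String wanted) {
        List<Integer> hits = new ArrayList<>();
        for (Doc d : stored) {
            if (d.value.equals(wanted)) {
                hits.add(d.id);
            }
        }
        return hits;
    }

    /** Term lookup: one map access, independent of how many documents exist. */
    public List<Integer> lookup(String wanted) {
        return postings.getOrDefault(wanted, new ArrayList<>());
    }

    public static void main(String[] args) {
        LookupSketch index = new LookupSketch();
        for (int i = 0; i < 1000; i++) {
            index.add("value-" + i); // unique values, as in my test
        }
        System.out.println(index.scan("value-742"));   // prints [742]
        System.out.println(index.lookup("value-742")); // prints [742]
    }
}
```

Both methods return the same hits for an exact match; only the amount of work per query differs, which is consistent with the correct-but-slow behaviour I measured before the change.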
So my question is: what am I missing? CompareStringQuery looks like it might be necessary for implementing things like regular expression matching, which are not implemented natively by Lucene, but the simple string equality case seems like it should be delegated to Lucene. I'm also a little confused by the comments in ConstantScoreWeightQuery saying that it always returns a weight of 10f, so there may be some interaction with ModeShape's overall query optimizer of which I'm not aware.
Any insight would be appreciated,
Matt
Attachments:
- simple-modeshape-config.json.zip (690 bytes)
- types.cnd.zip (258 bytes)