Could you help me understand your architecture first? I get that you're indexing data via Hibernate Search. Where are the entries you're indexing stored? Is Hibernate Search indexing Hibernate ORM entities (stored in a relational database), or are you storing entries in Infinispan as key/value pairs and indexing those? Or are you considering a custom storage integration (you don't have to use Infinispan or Hibernate ORM)?
Generally speaking, remember this:
- indexed fields can be retrieved as-is from the index using Projections, but this requires setting store = Store.YES on the @Field annotation of each field you might want to project.
- Hibernate Search only needs to be able to "load" the original entry (from either Infinispan or the database) when you don't use Projections; with a pure Projection query the underlying data storage is not touched.
- a powerful combination is to use Hibernate Search with an RDBMS as storage, mapped via Hibernate ORM, and enable 2nd level caching, so that you get the load performance of Infinispan together with the simplicity and reliability of RDBMS transactions.
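The difference between a projection query and an entity-loading query, as described in the points above, can be sketched with a toy model in plain Java. This is not the Hibernate Search or Infinispan API: the class names (`ProjectionToyModel`, `EntryStore`, `ToyIndex`) and the field names in `main` are hypothetical, and the "index" is just a pair of hash maps. It only illustrates that a projection on a stored field is answered from the index alone, while loading the entry goes to the backing store.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class ProjectionToyModel {

    /** Stands in for the primary storage (Infinispan or a database) and counts loads. */
    static final class EntryStore {
        private final Map<String, Map<String, String>> entries = new HashMap<>();
        private int loadCount = 0;

        void put(String key, Map<String, String> entry) { entries.put(key, entry); }

        Map<String, String> load(String key) {
            loadCount++;
            return entries.get(key);
        }

        int loadCount() { return loadCount; }
    }

    /** Stands in for the index: every field is searchable, only some are stored. */
    static final class ToyIndex {
        private final Map<String, String> keyByTerm = new HashMap<>();
        private final Map<String, Map<String, String>> storedByKey = new HashMap<>();

        void index(String key, Map<String, String> entry, Set<String> storedFields) {
            Map<String, String> stored = new HashMap<>();
            for (Map.Entry<String, String> field : entry.entrySet()) {
                keyByTerm.put(field.getKey() + "=" + field.getValue(), key);
                if (storedFields.contains(field.getKey())) {
                    stored.put(field.getKey(), field.getValue());
                }
            }
            storedByKey.put(key, stored);
        }

        /** Projection: answered from the index only; null if no match or field not stored. */
        String project(String field, String value, String projectedField) {
            String key = keyByTerm.get(field + "=" + value);
            return key == null ? null : storedByKey.get(key).get(projectedField);
        }

        /** Non-projection query: the index finds the key, the store loads the entry. */
        Map<String, String> loadEntity(String field, String value, EntryStore store) {
            String key = keyByTerm.get(field + "=" + value);
            return key == null ? null : store.load(key);
        }
    }

    public static void main(String[] args) {
        EntryStore store = new EntryStore();
        ToyIndex index = new ToyIndex();

        Map<String, String> entry = new HashMap<>();
        entry.put("id", "42");          // hypothetical field names
        entry.put("name", "alice");
        store.put("k1", entry);
        index.index("k1", entry, new HashSet<>(Arrays.asList("name")));

        System.out.println(index.project("id", "42", "name"));
        System.out.println("store loads after projection: " + store.loadCount());
        index.loadEntity("id", "42", store);
        System.out.println("store loads after entity query: " + store.loadCount());
    }
}
```

In this model a field that was not marked as stored cannot be projected, which mirrors why each projected field needs to be stored in the index.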
On point 4, I can't really predict that. Infinispan would probably be approximately one order of magnitude faster, but only when configured correctly, and I remember from your previous posts that there were some performance problems. This might indeed be faster as you'd have fewer moving parts, but I'd rather identify what's wrong with your Infinispan configuration. There are plenty of good diagnostic tools for Java which should be able to point you in the right direction, or you can share the reports with us.
The entries being indexed are currently stored in the Infinispan data store, and the indexing is done by Hibernate Search.
The present deployment is 4 distributed cache nodes running with 80G of memory each: 50G for data and 30G for indexes. There are C++ clients as well as Java clients interacting with this cluster through their respective Hot Rod clients. All requests are sent as queries, since in 80% of cases the lookup has to be done on different fields. In the other 20% of cases we use put, which synchronizes the data immediately with the DB.
There are 40 server worker threads on each node, so 160 threads overall. There are 50 client requests happening in parallel. The intent is to serve approximately 8 billion requests per day, for which we require around 60k requests per second from the system, but we are presently getting only around 15k. There are 240 million entries.
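As a back-of-the-envelope check of the figures above, here is a small sketch in plain Java. It uses only the numbers quoted in this post (8 billion requests per day, a 60k per second target, 15k per second measured, 4 nodes with 40 worker threads each); the class and method names are made up for illustration. It assumes load spread evenly over 24 hours, which may not be how the 60k figure was derived: under that assumption a flat average of 8 billion per day works out to roughly 92.6k per second.

```java
final class ThroughputBudget {
    static final long SECONDS_PER_DAY = 24L * 60 * 60;

    /** Average requests per second if a daily volume were spread evenly over 24 hours. */
    static double averagePerSecond(long requestsPerDay) {
        return (double) requestsPerDay / SECONDS_PER_DAY;
    }

    /** Requests served in one day at a sustained per-second rate. */
    static long perDay(long requestsPerSecond) {
        return requestsPerSecond * SECONDS_PER_DAY;
    }

    /** How many times larger the target rate is than the measured rate. */
    static double gap(long targetPerSecond, long measuredPerSecond) {
        return (double) targetPerSecond / measuredPerSecond;
    }

    public static void main(String[] args) {
        long dailyGoal = 8_000_000_000L; // from the post
        long target = 60_000;            // from the post, requests per second
        long measured = 15_000;          // from the post, requests per second
        int workerThreads = 4 * 40;      // 4 nodes x 40 server worker threads

        System.out.println("flat average needed per second: " + Math.round(averagePerSecond(dailyGoal)));
        System.out.println("requests per day at target rate: " + perDay(target));
        System.out.println("requests per day at measured rate: " + perDay(measured));
        System.out.println("target / measured: " + gap(target, measured));
        System.out.println("target per worker thread per second: " + (double) target / workerThreads);
    }
}
```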
When we ran the same query using Hibernate Search directly, we got around 14k requests per second from a single client thread. Hence we are inclined to use Hibernate Search directly and wrap it in our custom interfaces. That would require a major design change at present, so can you suggest which parameters to look at to get at least half of the speed that Hibernate Search provided?
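One way to compare the two measurements is to convert them into an average time per request using Little's law (requests in flight = throughput x latency). The sketch below is only an estimate under a stated assumption: that each of the 50 parallel clients keeps exactly one request in flight at a time, which the post does not confirm. The class and method names are hypothetical.

```java
final class ClientLatencyEstimate {

    /** Little's law rearranged: mean latency = requests in flight / throughput (in microseconds). */
    static double meanLatencyMicros(int inFlight, long throughputPerSecond) {
        return inFlight * 1_000_000.0 / throughputPerSecond;
    }

    /** Requests in flight needed to reach a target rate if per-request latency stays unchanged. */
    static long inFlightNeeded(long targetPerSecond, int currentInFlight, long currentPerSecond) {
        long numerator = targetPerSecond * currentInFlight;
        return (numerator + currentPerSecond - 1) / currentPerSecond; // ceiling division
    }

    public static void main(String[] args) {
        // Cluster over Hot Rod: 50 parallel requests, about 15k requests per second in total.
        System.out.println("cluster, microseconds per request: " + meanLatencyMicros(50, 15_000));
        // Hibernate Search used directly: one client thread, about 14k requests per second.
        System.out.println("direct, microseconds per request: " + meanLatencyMicros(1, 14_000));
        // Parallel requests needed for 60k per second at today's per-request latency.
        System.out.println("in-flight requests needed: " + inFlightNeeded(60_000, 50, 15_000));
    }
}
```

Under that assumption a request through the cluster takes a few milliseconds on average, against well under a tenth of a millisecond for the direct query, so a profiler session on both the client and the server side should show where the extra time per request goes.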
If we go down the Hibernate route, there is a configuration option for Lucene to use Infinispan as its back-end store. So, without the MassIndexer, we are hoping that we can preload the index information from the DB and avoid costly index write operations.
Beyond performance, we would also save 50% of memory if we use Store=Yes with compression enabled, as suggested, and query everything using projections only. Do you see any problems with this approach?
Since some applications are C++, we might not be able to use the Hibernate second-level cache as suggested.
Thanks, I still have many questions but this helped.
I don't know what is happening in your system for Hibernate Search "alone" to be significantly faster than when loading results from Infinispan; I have two theories, but only you can confirm them by connecting a profiler to your system and finding out what it is doing.
My first theory is the IndexWriter activity, which we already discussed; you really shouldn't experience index writes when doing just read operations. This could be a bug in Infinispan, but for us to figure that out we would need an example entity and an example of how you're invoking the API. If you could produce that, please open a JIRA and we'll be happy to look. It is very likely due to some safety net which triggers for your specific usage of the API. Even if it turns out to be just a mistake in your usage, don't worry: there's no problem in closing a JIRA. But without some examples I can't figure this out.
The second theory is that each of your queries produces a storm of get operations. How many results does one query normally return? A long-standing issue in Infinispan was the missing capability to fetch multiple values at once, and a large query result would have been affected by that. The great news is that this is now fixed in 7.2.0.Final and 7.2.1.Final; you just have to upgrade and the query engine transparently applies the optimisation as needed.
By going down the "Hibernate route" as you say, do you imply using a relational database for storage? I guess not, since you say that is not possible when mentioning the C++ clients; but then I don't understand what you're planning to do, because if you remove Infinispan you will have to code a custom API for these C++ clients anyway. You might as well implement your core technology using Hibernate and expose that service using whatever API suits you best; for example, it's very easy to create a REST endpoint with a couple of annotations when running on WildFly (just an example).
So since you say you have to keep the Hot Rod C++ clients happy, is your plan to keep Infinispan? You could do that by keeping Infinispan as reference storage, but write all your queries to use Projections only so that they hit the index exclusively.
Thanks for the details.
Each query returns a single result only. 99% of them are point queries, so the second scenario is ruled out.
We have already upgraded to 7.2.0.Final and have seen a 3x improvement compared to 7.0.2, but we are still 5x below our target.
We didn't see many IndexWriters after the upgrade, so something else may be the issue. We will raise a bug as suggested and provide as much detail as possible.
Thanks! Feel free to profile it more and share some profiling logs or screenshots; that was interesting.
Another thing, did you notice the index performance tuning options in this table?
It focuses mostly on improving write (indexing) performance, but some of those options can affect performance for reads too.
The following JIRA has been raised for this.
It contains the queries executed and the data fields as well. I am raising it as a query and configuration issue, as both seem to be contributing to this problem.