1 2 Previous Next 18 Replies Latest reply on May 4, 2015 3:26 AM by prashant.thakur

Anyone in real world using large distributed cluster with querying?

djchapm Feb 15, 2015 11:24 AM

Hi,

Wondering if anyone out there has had success using infinispan distributed mode with lucene querying enabled beyond the simple case?

This doesn't apply at all to 'large' infinispan caches that do not use lucene querying.

We've had limited success doing this with a 2 node cluster and maybe 500K records, where our records are java objects with indexing enabled on many fields as well as fields in sub objects. When we start up, we load the cache using maybe 35 threads each loading 30K records at a time or something like that - we've tried different strategies with # threads, #records at a time, sync vs async, etc. It doesn't take too long before the cache begins seeing timeouts, or the replicated index cache begins getting timeouts. Once that starts there is no recovering in a live production situation.

So discussion topic is who has actually done this successfully. > 4 nodes distribution, > 2Gigs of data, Several Indexes defined on objects and aggregates.

My personal opinion is that any infinispan development on this has been on a much smaller scale, or the querying work is only done on caches that are preloaded and complete, or perhaps local cache only. Would love to see a test case where 4 nodes are all periodically updating a dist cache and running many queries at the same time on a million records or more with significant lucene search indexing. And not all on the same machine.

If you do feel you have this issue "Solved" then your strategy is like gold and would love to hear about it.

Thanks,

Dan C.

1. Re: Anyone in real world using large distributed cluster *with querying*?

prashant.thakur Feb 16, 2015 2:03 AM (in response to djchapm)

Hi Daniel,
we have started going down the query path for our latest project using custom developed Hibernate Store functions .
The benchmarking activity we are starting this week and would share the results once done.
With get/replacewithversion itself we were able to achive around 60k req/sec with 4 nodes cluster, don't know much about queries as of now but would be sharing our results once done.
Indexing as such we have not used till now but your results seem to scare us.
Can you share some of the benchmarking results you already had from your 4 nodes cluster to get an initial idea how the things work with Query ?
Actions
2. Re: Anyone in real world using large distributed cluster *with querying*?

gustavonalle Feb 16, 2015 9:31 AM (in response to djchapm)

I guess the timeout issues you are experiencing are the ones described on Unable to acquire lock exception ? Have you tried the latest Infinispan 7.1.0.Final as suggested?
Actions
3. Re: Anyone in real world using large distributed cluster *with querying*?

djchapm Feb 16, 2015 9:48 AM (in response to gustavonalle)

Hi Fernandes,

We have not upgraded to 7.1.0.Final yet. We released to production this weekend. However - I did note that we've been always upgrading since 5.1.6, and are now on 7.0.0.Final. And there was enough change from 7.0 to 7.1 that we couldn't just slam it in. Anyway - this is not the point of this discussion. I'm wanting to hear and find out about the details of at least one solid success story given the parameters described.

To Prashant - I would say querying is very fast and great, only problem is dealing with a lot of updates to the cache from different nodes while trying to maintain cluster integrity. LOCAL is rock solid too of course. My next thought for our system is to possibly greatly reduce the amount of queryable information in our cache items and perform our queries after the fact using local lucene, or to get rid of the query capability all together and force a model that allows me to get what I need through a series of gets and then massage the data. Brute force anotherwords.
Actions
4. Re: Anyone in real world using large distributed cluster *with querying*?

sannegrinovero Feb 16, 2015 2:34 PM (in response to djchapm)
Hi Daniel, you're correct our tests involving "real time" updates with a mostly-automagic configuration are small sized.

We have some interesting evolution ideas, but while I'd love to get you involved in such discussions, that's for future versions.
In terms of configuration examples which work fine today at non-trivial scale, there are three categories:
[1] Those who disabled real time indexing, and get it done via batch processing (like the MassIndexer)
[2] Configure ad-hoc nodes to serve as index writers; this requires some knowledge of the backends and directory provider details and some integration coding with e.g. your server management tooling
[3] Use the async indexing configuration (not necessarily async cache!) Option 3 really shines only if you upgrade to 7.1.0.Final.

The remaining category [4] being "let it guess the right configuration" but because of various guarantees it needs - such as needing a strong consistency - it is quite inefficient. This model is safe but doesn't scale well today.

For any of the 4 categories the upgrade to 7.1.0.Final is highly recommended: for a long time we haven't worked on this area, and have been very busy actually on this very subject of efficient indexing in the timeframe between 7.0 to 7.1. More great things will come soon, as it still is a hot topic and more users are asking for this now. I'd love to know exactly what backwards compatibility issue you had?

For very large scale we expect you'll need solution 1 or 2 - and I know people are using it in such configurations with good results, but I can't share the references.
But I wouldn't consider your figures "large scale", so that should work fine since 7.1.0.Final even without changing architecture / configuration... although YMMV as while we tested a similar setup, we haven't tested your setup.

I agree it's hard to get those figures to work with 7.0 - especially without using any of the tricks I described above as categories 1 to 3. I'm sorry this wasn't fixed before: I'm contributing as I'm "one of those needing it", but I mostly use it with ad-hoc nodes [as in option 2]; only recently I got interested in making it work in a cluster of symmetrically configured servers.

The problem you're facing is this: the automatic configuration favours out of the box safety but is very inefficient, and when Infinispan gets overwhelmed with the many additional RPCs it creates, that escalates into more complex issues. The super-tuned configurations scale well but are very complex to configure - and unfortunately there is no middle-ground solution today.

The upgrade to Infinispan 7.1 fixes some of the efficiency issues, improving indexing speed of orders of magnitude; that version is actually tested in such sizes.
If you can't upgrade, I would suggest using [1]. If that's not an option because of application requirements to keep the index in synch at all times, then use [2].
Actions
5. Re: Anyone in real world using large distributed cluster *with querying*?

jeetu Feb 19, 2015 2:05 AM (in response to sannegrinovero)

Hi Sanne,
Thanks for the response and options.
while I understand that upgrading to Infinispan 7.1 is the way to go, will it be possible for you to show us that, the improvements in 7.1 indeed will work for our kind of scenario, that Dan has explained in his original post. Let me know, I can provide you all the information needed for you to quickly come up with a test case.
I would not want another embarrassing moment in production or in our load test, when we see the issues re-occurring after the upgrade.

Would like your expert help on this. We would like start this early for next release so that we don't fire fight at the end, when we don't have time or options to work this out.

Regards
Jithendra
Actions
6. Re: Re: Anyone in real world using large distributed cluster *with querying*?

sannegrinovero Feb 19, 2015 5:17 AM (in response to jeetu)

Hi Jithendra, I'm sorry I can't promise that 7.1 resolves all your issues. To do that, I'd need to run an exact replica of your code, on your servers, and with your same data.

All issues I've seen you reporting are symptoms of "excess load" on the servers; considering that 7.1 is orders of magnitude more efficient at indexing, so it will definitely help, and the difference being strong it's likely to fix your case.
But we don't know if the improvements are enough for your use case, there is only one way to find out.

I'm surprised that you don't just try already. The migration from 7.0 to 7.1 should be relatively simple, am I missing something?
You're very welcome to ask on this forum for help to migrate: if you have specific issues please open a new thread for each issue you have, we'll be happy to help.
Actions
7. Re: Anyone in real world using large distributed cluster *with querying*?

jeetu Feb 19, 2015 8:21 AM (in response to sannegrinovero)
Hi Sanne,

We have been busy with the production release and getting some kind of simple configuration work for the infinispan querying. With the index clustering issues on the four nodes we have in production, we had to get back to using local caches and local index caches.

i have started with the upgrade today. I will ask for any issues i face during the upgrade.

I just wanted to be sure that this upgrade can handle our complex indexed objects and it's index distribution, when we load the cache.

Also should we go ahead and use, cache.putAllAsynch after this upgrade or still be a thread pool doing single puts synchronously?

If we are ok to use cache.putAllAsynch, can we rely on the Future<Response> of this method, for the completion of indexing and it's distribution in addition to the actual items put and distribution?

Depending on your expectation, how many of our inventory items can be put using the putAllAsynch at a time? I am attaching our pojo class and also the SearchMappingFactory class for you to give an idea about our pojo and the properties that we are indexing.

Thanks a lot for all the help.

Please bear with me for the number of questions and attachments. I just want to give you a real feel of what we are doing.

InventoryItem is the pojo that has one to many or one to one relation with the other sub pojos. We query the inventoryItem pojo.

Regards
Jithendra

InventoryItem.java 27.5 KB

EVCDetails.java 2.8 KB

InventoryLocation.java 7.8 KB

RelatedInventory.java 2.9 KB

TollFreeFeature.java 2.3 KB

ViprQciSearchMappingFactory.java 5.4 KB
Actions
8. Re: Re: Anyone in real world using large distributed cluster *with querying*?

gustavonalle Feb 19, 2015 9:11 AM (in response to jeetu)

What I'd suggest to get maximum indexing throughput is to carry on using multi-threaded synchronous puts, and configure the indexing worker as async:

"default.worker.execution" "async"

With this setting, by the time each thread returns from a put, data will be available in the cluster for lookups but not for search: the async indexing backend
propagates changes to the index at regular intervals (every 1 second by default). The lag between a 'put' and when data is effectively visible on the searches is 'F + delta' where
F is at most '1 second' and delta is the time it takes to commit the index, which can vary, but it should be in the 'seconds' order of magnitude rather than minutes.

If you know in advance how many entities your cache should end up with after the initial load, you can execute a 'count' at regular intervals to detected index termination:

SearchManager searchManager = Search.getSearchManager(cache);
final CacheQuery query = searchManager.getQuery(new MatchAllDocsQuery(), Entity.class);
int indexedEntities = query.getResultSize();
Actions
9. Re: Anyone in real world using large distributed cluster *with querying*?

jeetu Feb 20, 2015 5:49 AM (in response to gustavonalle)

Hi Gustavo,

Will these intermediate queries on the cache, while the cache is being loaded and index being updated, slow down the process?
We had this idea, but when we tried querying during the load, the queries took for ever to respond or some times, the whole process just hangs.

Let me know.

Regards
Jithendra
Actions
10. Re: Re: Anyone in real world using large distributed cluster *with querying*?

gustavonalle Feb 20, 2015 6:20 AM (in response to jeetu)

Hi, I don't expect a count executed every half a minute or so to affect your process. If you don't want to execute a query to count, here's a more efficient alternative:

IndexReader indexReader = Search.getSearchManager(cache).getSearchFactory().getIndexReaderAccessor().open(YourEntity.class);

int count = indexReader.numDocs()
Actions
11. Re: Anyone in real world using large distributed cluster *with querying*?

jeetu Feb 20, 2015 8:02 AM (in response to gustavonalle)

Thanks Gustavo.

Let me try this. I will update you on how it goes.

Regards
Jithendra
Actions
12. Re: Anyone in real world using large distributed cluster *with querying*?

djchapm Feb 20, 2015 12:53 PM (in response to djchapm)

Considering the direction this conversation is taking - I'm guessing that my orignal post as to real world implementations using distributed search of sizeable data with real time updates and multiple indexing, no one is really doing this successfully in a production scenario. Because I doubt too many people are already in production with 7.1.Final (since that is the 'solution' to this use case) which did have an upgrade to Lucene APIs that are not backwards compatible. And a complex solution like this wouldn't likely be an agile deployment for anyone running a sizeable distributed and queryable cache.

Unless it's highly stable data I guess that doesn't have to endure index upates/rebuilds.
Actions
13. Re: Anyone in real world using large distributed cluster *with querying*?

sannegrinovero Feb 20, 2015 1:22 PM (in response to djchapm)

Hi Daniel,
true others have diverged the discussion onto technical suggestions again.

But my answer above was specifically trying to keep the focus on explaining how - as far as I know - people used it in production so far.
True, I'm not aware of community builds who are not using one of 1)nightly indexing 2)dedicated indexing nodes 3)asynchronous indexing.

I listed these known working configurations above in more detail.

Consider also that any Search-only based platform such as Solr or ElasticSearch only give the asynchronous indexing option, so I don't think we're really missing out there. Those projects are very successfully deployed - although meant as Search only - and always are asynch so I don't think it's unreasonable for us to have suggested to use Infinispan with async indexing so far, or even give the alternative possibilities!

If you're not interested into indexed entries, you also have the option of disabling indexing altogether and use indexless queries.

That said, we also have some customers using the real time synchronous indexing capabilities, as this feature development which we've included in Infinispan 7.1 was "sponsored" by customer requests which have now been satisfied, so it's all backported into JBoss Data Grid as well and used in production successfully already, although obviously since very recently.

HTH
Actions
14. Re: Anyone in real world using large distributed cluster *with querying*?

jeetu Feb 26, 2015 2:34 AM (in response to sannegrinovero)

Hi Sanne,

I know i have diverged and gone into the technical suggestions with Gustavo. I just want to make sure that if we fail, we fail fast. So, wanted to get this upgrade done asap and take a call.

i have upgraded to 7.1.0.Final. There are some updates i have done on making the indexing asynch (option 3) with still using the DIST_SYNC cache mode. All index caches are REPL_SYNC.
Using the "org.infinispan.query.indexmanager.InfinispanIndexManager" class as the index manager. Looks like it's the default, which means that the option 2 that you have, is taken care right??

We could not yet think of off-line indexing, as there are updates that happen through the day and the index needs real-time updates.

I can see improvement in indexing time after the upgrade. That's in my local windows machine with two node cluster though. Need to check on how it performs on a real two machine cluster. Please clear me on the question above.

Regards
Jithendra
Actions