1 Reply Latest reply on Feb 15, 2015 10:56 AM by djchapm

Is Infinispan a good fit for me?

cheetah05 Feb 13, 2015 8:09 AM

At the end of each day I get 2 million records of EVENT_ACTIVITY from about 100 sources. Each record has an EVENT_ID to relate it to an EVENT. I don't actually get any information on the EVENT or know what the events are, but by combining all the EVENT_ACTIVITY records that have the same EVENT_ID I can represent/construct an EVENT.
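To make the "virtual EVENT" idea concrete, here is a minimal sketch of building events by grouping activity records on EVENT_ID. It is plain Python, not Infinispan API, and the `source` and `status` fields are hypothetical stand-ins for the roughly 40 real fields.

```python
from collections import defaultdict

def build_events(activities):
    """Group EVENT_ACTIVITY records (dicts) by their EVENT_ID.

    Returns a dict mapping EVENT_ID -> list of activity records,
    which is the only representation of an EVENT available here.
    """
    events = defaultdict(list)
    for record in activities:
        events[record["EVENT_ID"]].append(record)
    return dict(events)

# Hypothetical sample data; field names other than EVENT_ID are assumptions.
sample = [
    {"EVENT_ID": "E1", "source": "A", "status": "open"},
    {"EVENT_ID": "E2", "source": "B", "status": "open"},
    {"EVENT_ID": "E1", "source": "C", "status": "closed"},
]
events = build_events(sample)
```

Keying on EVENT_ID like this is also what a pure key/value layout would look like, with the list of activities as the value.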

The EVENT_ACTIVITY objects can have different fields, but the majority of fields can be found on all EVENT_ACTIVITY objects. If I were to guess now, I would say that each EVENT_ACTIVITY (about 40 fields) would probably be at most 1000 bytes in size.

I need to expose this data for querying/processing - I would like all processing to be done within my space, so there is no extracting of the data to process elsewhere.

I've been told to maximize the performance of processing/querying all of the EVENT_ACTIVITY for the last x days (where x could be anything, e.g. 1d, 20d, 90d), so we are potentially talking about hundreds of millions of records. However, in terms of querying, people are going to want to search/pivot the data by ANY of the fields, where the "virtual" concept of EVENT is the subject (results will be grouped by EVENT), so I do not want to hinder this.

I have to keep at least the last 2 years of data online but have reasonable access (within a few hours) to the older data.
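A rough sizing sketch from the figures above (2 million records per day, at most 1000 bytes each). It gives an upper bound on raw payload only; index size, object overhead, and replication are not included and would have to be measured.

```python
RECORDS_PER_DAY = 2_000_000   # from the post
MAX_RECORD_BYTES = 1000       # stated maximum per record, so results are upper bounds

def raw_size_gb(days):
    """Upper bound on raw payload for `days` of data, in decimal GB."""
    return days * RECORDS_PER_DAY * MAX_RECORD_BYTES / 1e9

# 730 days is used as an approximation of 2 years.
for label, days in [("1 day", 1), ("90 days", 90), ("2 years", 730)]:
    records = days * RECORDS_PER_DAY
    print(f"{label}: {records:,} records, up to {raw_size_gb(days):,.0f} GB")
```

At the stated maximum, 90 days comes to 180 million records and up to 180 GB, and two years to up to about 1.46 TB; the real total will be lower if the average record is smaller than 1000 bytes.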

Given that we are talking about nearly 1 TB of data, it won't all be in memory. It would be nice to have the last 90 days of EVENT_ACTIVITY in memory and the rest persisted to a disk-based store but still accessible for querying. I would like the query interface to know that something is in memory and query from there if it can, falling back to disk only for the days outside the 90 days.
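The routing I have in mind could be sketched like this: split a requested date range into the part that falls in the 90-day in-memory window and the part that has to come from disk. This is a hypothetical illustration in plain Python; `split_range` and `HOT_DAYS` are names I made up, not anything Infinispan provides.

```python
from datetime import date, timedelta

HOT_DAYS = 90  # assumed size of the in-memory window, in days (including today)

def split_range(start, end, today, hot_days=HOT_DAYS):
    """Split the inclusive date range [start, end] into (disk, memory).

    Each element is an inclusive (first_day, last_day) tuple, or None
    if no part of the range falls in that tier.
    """
    hot_start = today - timedelta(days=hot_days - 1)  # oldest day held in memory
    disk = None
    memory = None
    if start < hot_start:
        disk = (start, min(end, hot_start - timedelta(days=1)))
    if end >= hot_start:
        memory = (max(start, hot_start), end)
    return disk, memory

# Example: a 120-day query ending today touches both tiers.
today = date(2015, 2, 13)
disk_part, memory_part = split_range(today - timedelta(days=119), today, today)
```

A query for the last 20 days would return `None` for the disk part, so only the in-memory tier is touched.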

So based on the information I have given above, is Infinispan a good fit?

1. Re: Is Infinispan a good fit for me?

djchapm Feb 15, 2015 10:56 AM (in response to cheetah05)

We're not working with that much data, but we're working with much larger pieces of data and also have the problem of requiring full query capabilities. We load about 1 to 1.5 million records a day, at about 8 to 10 GB. Once data is in the cache, distributed or local, the querying is sweet and returns very fast. But getting the data into a distributed cache with querying enabled is problematic, especially with live updates. Every time we add something, or add batches, our distributed cluster slows to a crawl and we begin seeing tons of timeouts, either on cache access or on indexing. I wouldn't risk it at this point if I were you. We've spent over a year trying to get this to work, from 5.1.6 all the way to 7.0.0.Final. If you find a way to get what you need using only a key/value store, then it's a great solution. But if you want querying capability on this scale, I think it's just not ready yet.

Dan C.