Weighing Replicated Mode Against Distributed Mode
thartwell Jun 2, 2017 12:59 AM
Currently we have an application that is using Infinispan in replicated mode. We started with replicated mode mainly for the ease of setup, but I'm not sure this mode is best for our layout.
Primary Node
Application 1 using
Cache A - read/write
Cache B - read/write
Cache C - read/write
|
Gossip Router - All communication from application 1 to application 2 happens over gossip router
|
Secondary Node 1 (SN1)
Application 2 using
Cache A - read only
SN2
Application 2 using
Cache B - read only
SN3
Application 2 using
Cache C - read only
...
SN80
Application 2 using
Cache Z - read only
We have fewer than 100 secondary nodes. Each secondary node only needs one cache's worth of data to be present. The cache on the secondary node is the only mechanism we use to access the data. If the data is not in the cache on the secondary node, it is not just a cache miss, it is an error in the application. This allows the secondary nodes' application code to simply use Infinispan for data access, and secondary nodes never call into the primary node's application code for data. This is important, because application 2 cannot afford to make remote calls for this information at lookup time.
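A minimal sketch of the "missing data is an error, not a cache miss" access pattern described above. The class name `StrictCacheReader` and the method `require` are hypothetical and chosen only for illustration. The sketch relies on the fact that an embedded Infinispan `Cache` implements `java.util.concurrent.ConcurrentMap`, so a plain `Map` stands in for the local cache here.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: wraps the node-local cache and treats an absent key
// as an application error instead of falling back to a remote call.
final class StrictCacheReader<K, V> {
    private final Map<K, V> localCache;

    StrictCacheReader(Map<K, V> localCache) {
        this.localCache = localCache;
    }

    // Returns the locally held value, or throws if the entry is not present.
    V require(K key) {
        V value = localCache.get(key);
        if (value == null) {
            throw new IllegalStateException(
                "Required entry missing from local cache: " + key);
        }
        return value;
    }

    public static void main(String[] args) {
        // A ConcurrentHashMap stands in for "Cache A" on a secondary node.
        Map<String, String> cacheA = new ConcurrentHashMap<>();
        cacheA.put("key-1", "value-1");

        StrictCacheReader<String, String> reader = new StrictCacheReader<>(cacheA);
        System.out.println(reader.require("key-1"));
        try {
            reader.require("key-2");
        } catch (IllegalStateException e) {
            System.out.println("error: " + e.getMessage());
        }
    }
}
```

With this shape, an empty or partially populated cache after a failed state transfer surfaces immediately as an exception on the secondary node rather than as a silent miss.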
Secondary nodes run on customer premises, across a WAN from the Primary Node. We've had issues with this Infinispan setup that are hard to debug. For example, Application 1 will fail to start up, apparently because of state transfer. We were under the impression that the Primary Node, since it does all the writes, would always be the primary data owner, even when the Primary Node's application is not running (e.g. during a restart).
The goal of this post is to gain a better understanding of the implications of this setup, and to then understand if we should migrate to a distributed mode setup.
Concerns with replicated mode
1. Is it possible, or even guaranteed, that the Primary Node running application 1 will not always be the primary data owner? If so, does that mean that when application 1 starts up on the primary node, it will attempt a state transfer?
2. Is it possible for state transfers to fail entirely? We are currently seeing application 1 get into a state where Cache A, for example, is empty when it should not be, and it's not clear what is causing the empty cache.
3. The documentation is a bit vague, but it states that with more than 10 nodes, replicated mode is possibly not the best choice. Given that we are using a gossip router and TCP between application 1 and all of the application 2 instances, I suspect replicated mode is not the best fit for our use case.
4. If we move to distributed mode, Application 2 on the secondary nodes would be reaching across a WAN to talk to the Infinispan dedicated server cluster. Does the client/server model support the concept of the secondary nodes being clients, but having a copy of cache data locally so that lookups are from in memory and not from a remote machine? We need all of the data to be resident on the secondary node as it needs all lookups to be in memory for speed purposes.
5. When running network tests against the current setup, we noticed that a Secondary Node actually receives traffic on the wire for caches it is not concerned with. Is this by design? If so, with 80 secondary nodes, each one will see roughly 80x the traffic it needs, since a secondary node receives traffic for all caches on the Primary Node, not just the single cache it is concerned with.
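The 80x figure in question 5 can be checked with some back-of-the-envelope arithmetic. The sketch below assumes one cache per secondary node, the same write volume for every cache, and that every node receives updates for every cache, as the network tests suggest. The class and method names, and the 1 MB write volume, are hypothetical values used only for illustration.

```java
// Hypothetical estimate of replication traffic per secondary node.
final class ReplicationTrafficEstimate {

    // Bytes a node receives if updates for every cache are sent to it.
    static long bytesPerNodeAllCaches(int cacheCount, long writeBytesPerCache) {
        return Math.multiplyExact((long) cacheCount, writeBytesPerCache);
    }

    // Bytes a node would receive if it only got updates for its own cache.
    static long bytesPerNodeOwnCacheOnly(long writeBytesPerCache) {
        return writeBytesPerCache;
    }

    // Ratio of received traffic to needed traffic for a single node.
    static long amplification(int cacheCount, long writeBytesPerCache) {
        if (writeBytesPerCache <= 0) {
            throw new IllegalArgumentException("writeBytesPerCache must be positive");
        }
        return bytesPerNodeAllCaches(cacheCount, writeBytesPerCache)
                / bytesPerNodeOwnCacheOnly(writeBytesPerCache);
    }

    // Total bytes crossing the WAN when every node receives bytesPerNode.
    static long totalBytes(int nodeCount, long bytesPerNode) {
        return Math.multiplyExact((long) nodeCount, bytesPerNode);
    }

    public static void main(String[] args) {
        int nodes = 80;                  // secondary nodes, one cache each
        int caches = 80;                 // assumption: one cache per node
        long writeBytes = 1_000_000L;    // assumed write volume per cache

        long all = bytesPerNodeAllCaches(caches, writeBytes);
        long own = bytesPerNodeOwnCacheOnly(writeBytes);
        System.out.println("per node, all caches: " + all);
        System.out.println("per node, own cache:  " + own);
        System.out.println("amplification:        " + amplification(caches, writeBytes));
        System.out.println("WAN total, all caches: " + totalBytes(nodes, all));
        System.out.println("WAN total, own cache:  " + totalBytes(nodes, own));
    }
}
```

Under these assumptions the per-node amplification equals the number of caches, so the factor grows with every secondary node that brings its own cache.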
Thanks in advance for any insight,
Tom