1 Reply Latest reply on Aug 13, 2010 2:47 AM by johan_andries

    On the dangers of the semantics of "numOwners"

    sriramsrinivasan

      I decided to generalize this earlier query about numOwners as a separate topic.

       

      If you configure "numOwners" to 2, say, Infinispan treats it as the maximum number of copies of a key. If the number of servers drops down to less than that number, that is OK by Infinispan; it isn't going to start rejecting inserts if it is unable to create the stated number of replicas.

       

      The problem here is that if the network gets partitioned, each partition is just fine chugging along on its  own. This status will continue even if the network problem is resolved, since the code doesn't seem to handle  JGroups merge events. I don't see how Infinispan could handle merges anyway, because clients of each partition have been told their transactions are correct, and a subsequent merge may render invalid one or more clients' earlier assumptions.

       

      If this assertion is wrong or mischaracterizes the work of Infinispan and JGroups, I apologize, and look forward to being corrected.

       

      Otherwise, using Infinispan as a data store (as opposed to a cache) is really dangerous. In spite of all the mechanisms for ensuring data consistency, it is rather trivial to get inconsistency due to  network partitioning, however temporary.  The scheme is oriented towards availability at all costs.

       

      In addition, using Infinispan as a cache is dangerous if it front-ends a non-distributed data store (file system, or sqllite), because if a partition is repaired, you may have two separately updated copies of data.

       

      It seems to me that the root problem is that of split brains due to network partitioning, but the numOwners part of it is relevant because a failure saying "couldn't find a quorum to make the update" would be an indication of some massive failure. Otherwise an application would chug along happily without suspecting anything.