8 Replies Latest reply on Aug 21, 2012 6:19 AM by dan.berindei

    Data integrity after cluster merge

    mattw

      Hi,

       

      I'm looking at the issue of data integrity in Infinispan when clusters merge. I have 3 machines which are split into a cluster of 2 and a cluster of 1. When they re-merge, I see a Merged event and data update events for any changes. If there has been an update, each cluster tells the other about it and we end up with a single key pointing to two different values depending on which machine you are on (they effectively swap values). This means we have inconsistent data in the cache.

       

      I've searched the web and the forums and seen that this is a known issue, with no built-in solution expected until at least version 6. I have also seen suggestions to clear the local cache before the clusters merge.

       

      So far, I have listened for the @Merged event and cleared the cache at that point (on the smallest cluster only). The updates are then made in one direction only, so we end up with a consistent cache. However, if the clearing is delayed at all, it clears the entire cache across the cluster, resulting in data loss.

       

      So my question is: how do I ensure I clear the local cache only (without affecting any other nodes) before the data updates are shared between the rejoined clusters?

       

      Many thanks,

      Matt

       

      PS: this is using Infinispan 5.2.0 Alpha and JGroups 3.1 over TCP.
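      For reference, the "clear on the smallest cluster only" decision boils down to comparing the member lists of the two merging partitions. The sketch below is just an illustration of that rule, not Infinispan API: addresses are modelled as plain strings, and in a real @Merged listener the member lists would come from the merge event itself.

```java
import java.util.List;

public class MergePolicy {

    /**
     * After a merge, only nodes in the strictly smaller partition clear
     * their local cache, so the larger partition's data wins.
     * On a tie, nobody clears (existing data is kept).
     */
    static boolean shouldClearLocally(List<String> myPartition,
                                      List<String> otherPartition,
                                      String localAddress) {
        return myPartition.contains(localAddress)
            && myPartition.size() < otherPartition.size();
    }

    public static void main(String[] args) {
        List<String> majority = List.of("nodeA", "nodeB");
        List<String> minority = List.of("nodeC");
        // The lone node in the minority partition clears...
        System.out.println(shouldClearLocally(minority, majority, "nodeC")); // true
        // ...while the majority keeps its data.
        System.out.println(shouldClearLocally(majority, minority, "nodeA")); // false
    }
}
```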

        • 1. Re: Data integrity after cluster merge
          galder.zamarreno

          If you have a reference to the cache, you could call the following to make sure that the clear is only local:

           

          cache.getAdvancedCache().withFlags(Flag.CACHE_MODE_LOCAL).clear()

           

          I'm not 100% sure about the timing of the @Merged event with respect to the state transfer of data, though.

          • 2. Re: Data integrity after cluster merge
            mattw

            Hi there,

             

            Thanks for the response, that works a treat.

             

            I had done the same by getting the data container (getDataContainer()) and clearing that, but your way looks better.

             

            I think you're right about the state transfer: it works fine on caches with few records. However, if I stress my test setup with 90,000 records across 2 caches (one replicated, one distributed), both with backing stores, I sometimes see inconsistent caches where records are missing. I assume the clear is still processing when the state transfer starts (and the transferred records are therefore wiped too). Is there any way to delay the state transfer, or force it to happen at some known point?

             

            Are there any other recommended ways of handling the state merge?

             

            Thanks again for your help,

            Matt

            • 3. Re: Data integrity after cluster merge
              galder.zamarreno

              If you focus only on the distributed cache for a sec, you could implement a @DataRehashed annotated method and do the clear when event.isPre() is true. This event gets sent before any rehashing happens.
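              To see why the pre phase is the right place to clear, here is a self-contained sketch of a two-phase (pre/post) rehash notification, analogous in spirit to @DataRehashed with isPre(). The class and callback names are invented for illustration only; this is not the Infinispan API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

public class RehashSketch {
    // Local store of the minority partition, holding a stale entry.
    final Map<String, String> localStore = new HashMap<>();
    // Called with isPre=true before any state moves, isPre=false after.
    BiConsumer<Boolean, Map<String, String>> dataRehashedListener;

    /** Simulate the merge: notify pre, transfer state in, notify post. */
    void mergeWith(Map<String, String> incomingState) {
        dataRehashedListener.accept(true, localStore);   // pre: nothing transferred yet
        localStore.putAll(incomingState);                // state transfer
        dataRehashedListener.accept(false, localStore);  // post
    }

    public static void main(String[] args) {
        RehashSketch node = new RehashSketch();
        node.localStore.put("key", "staleLocalValue");
        // Clearing only in the pre phase wipes stale local data but keeps
        // everything transferred in afterwards.
        node.dataRehashedListener = (isPre, store) -> {
            if (isPre) store.clear();
        };
        node.mergeWith(Map.of("key", "valueFromLargerPartition"));
        System.out.println(node.localStore); // {key=valueFromLargerPartition}
    }
}
```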

              • 4. Re: Data integrity after cluster merge
                mattw

                Thanks for that suggestion, but I am seeing the @DataRehashed event after a @ViewChanged event, so I don't think we can use that...

                 

                Matt

                • 5. Re: Data integrity after cluster merge
                  galder.zamarreno

                  Did you try doing the clearing in @ViewChanged then?

                  • 6. Re: Data integrity after cluster merge
                    mattw

                    The @ViewChanged event gets fired on any view change, including new nodes joining. I don't want to be clearing the cache when a new node joins... only on a merge.

                     

                    Cheers,

                    Matt
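                    If we had to work from @ViewChanged alone, one idea would be to track every member ever seen: a genuine joiner is brand new, while a node reappearing after a split indicates a merge. The sketch below is just that bookkeeping idea, not Infinispan or JGroups API; plain strings stand in for member addresses.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ViewTracker {
    private final Set<String> everSeen = new HashSet<>();
    private Set<String> currentView = new HashSet<>();

    /** Returns true if this view change looks like a merge, not a fresh join. */
    boolean onViewChanged(List<String> newMembers) {
        boolean merge = false;
        for (String m : newMembers) {
            // A node we saw before, but which is not in the current view,
            // is rejoining: that suggests a partition healing (merge).
            if (everSeen.contains(m) && !currentView.contains(m)) merge = true;
        }
        everSeen.addAll(newMembers);
        currentView = new HashSet<>(newMembers);
        return merge;
    }

    public static void main(String[] args) {
        ViewTracker t = new ViewTracker();
        System.out.println(t.onViewChanged(List.of("a", "b", "c")));      // initial view: false
        System.out.println(t.onViewChanged(List.of("a", "b", "c", "d"))); // d is new: false (join)
        System.out.println(t.onViewChanged(List.of("a", "b")));           // c, d lost: false (split)
        System.out.println(t.onViewChanged(List.of("a", "b", "c", "d"))); // c, d return: true (merge)
    }
}
```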

                    • 7. Re: Data integrity after cluster merge
                      dex80526

                      I think this is also related to ISPN-1586, which is specifically about replicated clusters. FYI.

                      • 8. Re: Data integrity after cluster merge
                        dan.berindei

                        @dex Yes, ISPN-1586 is certainly related conceptually: you could regard each joiner with pre-existing data in its cache store as a partition re-joining the cache. However, handling the two situations is quite different.

                         

                        @Matthew The local node won't send any data before the @DataRehashed notification ends. On the other hand, the local node will still accept data from the other partitions, so your clear() call may delete more than necessary.

                         

                        I don't think we have a proper solution for your problem at the moment. You could probably do it by hacking JGroupsTransport to call your code before the other listeners, so that our internal listener is only ever called after you do the clear(), but that has its own share of risks (and may stop working in 5.2).

                         

                        Please create an issue for this. Although we certainly won't have full conflict resolution for merges in 5.2, we should at least give you the chance to clear the local data store before sending or receiving anything.