3 Replies Latest reply on Jun 17, 2004 6:19 PM by belaban

    Cache is not synchronized in Cluster on reconnect

    silviomatthes

      We're testing with TreeCache regarding cluster-abilities.

      The cache is not synchronized to a node that had no network connection and reconnects to the cluster.

      - We have a cluster of 2 linux machines, jboss 3.2.4 (final) with Treecache that was delivered with jboss 3.2.4.
      - Cache is configured as SYNCRONIZED and REPL_SYNC

      That's our scenario:
      1.) Both machines (say A and B) are connected to each other.
      2.) a Tx is started on machine A and committed that puts 3 members in the cache.
      3.) the printdetails-function called on every machine shows the same values. fine.
      4.) We unplug the network on machine B.
      5.) Both machines recognize the disconnect (viewAccepted():...)
      6.) We start a similar Tx as in 2.) on machine A which does a 'put' on 2 of the 3 members in the cache (so we modify 2 members) and commit the Tx.
      7.) We reconnect machine B to the network.
      8.) Both machines recognize each other and build a cluster.

      --> the printdetails-function called on every machine shows DIFFERENT values!
      So the cache is not re-synchronized to machine B! Why?


      We noticed 2 WARNing messages:

      [NAKACK] [<machine B>] discarded message from non-member <machine A>
      [NAKACK] [<machine B>] discarded message from non-member <machine A>

      These messages appear on machine B after network-reconnect and BEFORE the "viewAccepted()"-message that says that the cluster is rebuilt.
      Maybe these messages should be processed in order to resynchronize the cluster?

        • 1. Re: Cache is not synchronized in Cluster on reconnect

          Since the cache has already started on Machine B, it won't initiate another state transfer from the other members when it re-joins the group, since that can be an expensive operation. If you stop and start the cache on Machine B (say, from the JMX console via the MBean service), then it should sync up.

          But I will discuss with Bela whether we could add this as an option.

          Thanks,

          -Ben
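
The behaviour described in this reply can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyNode` and all of its methods are hypothetical names invented for the illustration and are not part of the TreeCache API. It assumes that a node pulls the full state only when it starts, and that a committed put is replicated only to a peer that is reachable at that moment.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: ToyNode is a hypothetical class, not the TreeCache API.
class ToyNode {
    private final Map<String, String> data = new HashMap<>();
    private ToyNode peer; // the other cluster member, or null while partitioned

    // Assumption: starting is the only point at which the full state is pulled.
    void start(ToyNode existingMember) {
        data.clear();
        if (existingMember != null) {
            data.putAll(existingMember.data); // initial state transfer
            connect(existingMember);
        }
    }

    // Re-forms the two-node group without transferring any state.
    void connect(ToyNode other) {
        this.peer = other;
        other.peer = this;
    }

    // Models unplugging the network cable.
    void disconnect() {
        if (peer != null) {
            peer.peer = null;
            peer = null;
        }
    }

    // A committed put reaches the peer only if it is reachable right now.
    void put(String key, String value) {
        data.put(key, value);
        if (peer != null) {
            peer.data.put(key, value);
        }
    }

    Map<String, String> snapshot() {
        return new HashMap<>(data);
    }
}

class ToyClusterDemo {
    public static void main(String[] args) {
        ToyNode a = new ToyNode();
        ToyNode b = new ToyNode();
        a.start(null);
        b.start(a);
        a.put("m1", "v1"); a.put("m2", "v1"); a.put("m3", "v1"); // step 2
        b.disconnect();                                          // step 4
        a.put("m1", "v2"); a.put("m2", "v2");                    // step 6
        b.connect(a);                                            // steps 7 and 8
        System.out.println(a.snapshot().equals(b.snapshot()));   // prints false
        b.start(a);                                              // restart B
        System.out.println(a.snapshot().equals(b.snapshot()));   // prints true
    }
}
```

In this model the rejoin alone leaves B with the old values, while restarting B triggers a fresh state transfer, which mirrors the stop/start workaround suggested above.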

          • 2. Re: Cache is not synchronized in Cluster on reconnect
            silviomatthes

            Hi,
            thanks for your answer. It would be nice to have such an option, because otherwise we run into problems with data inconsistency.

            To automate the workaround of stopping and starting the cache in such cases, we first need to know when a machine has lost its connection to the cluster, so that we can react to it.
            Is there some kind of callback that is triggered when the cluster member list changes (I mean at the point where the viewAccepted()-message is displayed)?

            Thanks in advance,

            Silvio
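
The automated workaround asked about here could look roughly like the following. This is a minimal sketch, not a confirmed TreeCache feature: `RejoinWatcher` and `onViewChange` are hypothetical names, members are represented as plain strings, and the sketch assumes that some membership callback (of the kind that produces the viewAccepted() messages mentioned above) is available to call it from. Whether the TreeCache shipped with 3.2.4 exposes such a hook is not established in this thread.

```java
import java.util.List;

// Hypothetical helper: none of these names come from TreeCache or JGroups.
class RejoinWatcher {
    private final String localAddress;
    private final Runnable restartCache; // e.g. stop and start the cache service
    private boolean isolated = false;

    RejoinWatcher(String localAddress, Runnable restartCache) {
        this.localAddress = localAddress;
        this.restartCache = restartCache;
    }

    // Assumption: called from whatever membership callback the cluster layer offers.
    void onViewChange(List<String> members) {
        boolean alone = members.size() == 1 && members.contains(localAddress);
        if (alone) {
            // This node is the only member left, so it may now miss updates.
            isolated = true;
        } else if (isolated) {
            // Other members are visible again: trigger the stop/start workaround once.
            isolated = false;
            restartCache.run();
        }
    }
}
```

Note that in the two-machine scenario above both A and B would see a single-member view during the disconnect, so an additional rule (not shown here) would be needed to decide which side restarts and which side keeps its state.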

            • 3. Re: Cache is not synchronized in Cluster on reconnect
              belaban

              What you essentially want is a state-merge function after e.g. a network partition. This is actually on the roadmap, but it involves asking you (the application) how to merge 2 (potentially) different substates back into one. We *cannot* just take the union of the 2 substates, because an application may want to do this differently.
              The final solution will definitely involve a callback into the application to resolve this; we will probably also ship some default strategies.

              Bela
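
To make the idea of an application-level merge callback with default strategies more concrete, here is one possible shape. It is purely illustrative: `StateMerger` and `MergeStrategies` are hypothetical names, substates are modelled as plain maps, and nothing here reflects the design that was eventually implemented.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical callback; not an existing TreeCache or JGroups interface.
interface StateMerger<K, V> {
    // Combines the substates of two partitions into one merged state.
    Map<K, V> merge(Map<K, V> left, Map<K, V> right);
}

// Hypothetical default strategies an application could pick from.
final class MergeStrategies {
    private MergeStrategies() {}

    // Union of both substates; on conflicting keys the entry from 'right' wins.
    static <K, V> StateMerger<K, V> rightWins() {
        return (left, right) -> {
            Map<K, V> out = new HashMap<>(left);
            out.putAll(right);
            return out;
        };
    }

    // Keeps one side unchanged and discards the other side entirely.
    static <K, V> StateMerger<K, V> keepLeft() {
        return (left, right) -> new HashMap<>(left);
    }
}
```

An application with its own rules would supply its own implementation instead, for example one that keeps the larger of two counters per key: `(l, r) -> { Map<String, Integer> out = new HashMap<>(l); r.forEach((k, v) -> out.merge(k, v, Integer::max)); return out; }`. This is why a plain union cannot be the only option.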