17 Replies Latest reply on Jan 7, 2011 7:05 AM by manik

    Node Starting / Rehash Inconsistency

    shane_dev

      After reviewing the code and analyzing some logs, I noticed that there are a couple of inconsistencies that are tripping us up.

       

      Let us assume that we have an existing group with nodes A, B, and C. Now a new node, D, joins.

       

      At this point D is both starting up and participating in a rehash (distributed mode).

       

      If a request is made on A for an entry that is being copied from B/C to D, then D may be in the destination list. The problem is that if the entry was originally on B and C, the list will contain B, C, and D during the rehash. At least, this is what we are seeing in the logs: even though we only use one backup (numOwners="2"), we see three nodes in the JGroups destination list. There are two issues here. First, since D has not finished starting up**, the remote request will time out (our biggest issue) and return a request-ignored response. Second, D may not have the entry yet (not such a big deal); I realize it may simply return an uncertain response in this case.
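
      To illustrate how three nodes could end up as destinations with numOwners="2", here is a minimal sketch. It assumes the destination list during a rehash is the union of the owners under the old view and the owners under the new view. That is only a guess consistent with the logs, not confirmed Infinispan logic, and the class name, the helper name, and the choice of C and D as the new owners are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class RehashDestinations {

    // Hypothetical helper: merge old and new owners, keeping order and dropping duplicates.
    public static List<String> destinationsDuringRehash(List<String> oldOwners,
                                                        List<String> newOwners) {
        Set<String> union = new LinkedHashSet<>(oldOwners);
        union.addAll(newOwners);
        return new ArrayList<>(union);
    }

    public static void main(String[] args) {
        // Assumed ownership before D joined: B and C (numOwners = 2).
        List<String> oldOwners = List.of("B", "C");
        // Assumed ownership after the rehash completes: C and D.
        List<String> newOwners = List.of("C", "D");

        // Two owners in each view, but three distinct destinations while rehashing.
        System.out.println(destinationsDuringRehash(oldOwners, newOwners)); // prints [B, C, D]
    }
}
```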

       

      This brings me to my questions:

       

      1. Should a node end up in the destination list if it is still starting?
      2. Should a node end up in the destination list if it is still rehashing?
      3. Should a rehash be considered part of the startup process? I assume that the cache is still 'starting' because the rehash is not complete.

       

      I also have one other thought, regarding ReplicationTask.call():

       

      I noticed that the destinations are handled in a loop: we send a message to one, check the response, and if necessary try the next. Is it possible to send messages to all of the destinations at once and then return the first acceptable response? In this case, the timeout would not bother us, as we would already have received an acceptable response from an alternate node.
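
      A rough sketch of what I mean, using plain java.util.concurrent rather than the actual Infinispan or JGroups classes. Everything here is hypothetical: the class and method names, the use of strings for node addresses, the "REQUEST_IGNORED" marker, and the simulated slow node D. The idea is simply to submit one request per destination, take responses in completion order, and return the first one that passes the acceptance check.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import java.util.function.Predicate;

public class FirstAcceptableResponse {

    // Sends to every destination in parallel and returns the first acceptable reply,
    // or an empty Optional if none arrives within timeoutMillis.
    // Acceptable responses are assumed to be non-null.
    public static <R> Optional<R> firstAcceptable(List<String> destinations,
                                                  Function<String, R> send,
                                                  Predicate<R> acceptable,
                                                  long timeoutMillis) throws InterruptedException {
        if (destinations.isEmpty()) {
            return Optional.empty();
        }
        ExecutorService pool = Executors.newFixedThreadPool(destinations.size());
        CompletionService<R> completion = new ExecutorCompletionService<>(pool);
        try {
            for (String destination : destinations) {
                completion.submit(() -> send.apply(destination));
            }
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
            for (int i = 0; i < destinations.size(); i++) {
                long remaining = Math.max(deadline - System.nanoTime(), 0);
                Future<R> done = completion.poll(remaining, TimeUnit.NANOSECONDS);
                if (done == null) {
                    break; // overall timeout reached
                }
                try {
                    R response = done.get();
                    if (acceptable.test(response)) {
                        return Optional.of(response);
                    }
                } catch (ExecutionException e) {
                    // This destination failed; keep waiting for the others.
                }
            }
            return Optional.empty();
        } finally {
            pool.shutdownNow(); // interrupt requests that are still pending
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulated transport: D is still starting, so it is slow and ignores the request.
        Function<String, String> send = node -> {
            if (node.equals("D")) {
                try {
                    Thread.sleep(5000);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                return "REQUEST_IGNORED";
            }
            return "value-from-" + node;
        };
        Optional<String> result = firstAcceptable(
                List.of("D", "B", "C"), send, r -> !r.equals("REQUEST_IGNORED"), 1000);
        System.out.println(result.orElse("no acceptable response"));
    }
}
```

      With this approach the slow node costs nothing as long as at least one other owner answers in time. The trade-off is extra network traffic, since every destination receives the request even when the first one would have answered.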

       

      ** [InboundInvocationHandlerImpl] Cache named [XYZ] exists but isn't in a  state to handle invocations.  Its state is INSTANTIATED.


      Thanks,

      Shane
