6 Replies Latest reply on Oct 26, 2006 10:38 AM by jbirkenmaier

    Shunning not working

    jbirkenmaier

      Hi,

      I have 2 JChannels, both of which have shun enabled. One channel is for JBoss itself and the other is for my PojoCache. I have a cluster consisting of {A,B,C}. I unplug the network cable for C. Each node detects the loss of the other node(s). I wait about 60 seconds, then plug the cable back in. What I want is for {A,B} to continue in their own cluster and C to continue in its own (separate) cluster. In other words, A and B reject C's effort to rejoin the cluster.

      What is happening is that C rejoins A and B and my cache is merged, thus becoming corrupted with C's data. In other words, node C is not shunned by A and B, and nodes A and B aren't shunned by C. Each node is notified of the change via a merge view event. I still need to receive this notification that a node tried to rejoin the cluster. I just want the rejoin to fail without touching the cache.

      Here is an excerpt from one XML config file:


      jboss:service=Naming
      jboss:service=TransactionManager
      mycom.prox:type=Connector,transport=proxsocket
      org.jboss.cache.JBossTransactionManagerLookup
      OPTIMISTIC
      true
      true
      ${jboss.partition.name:DefaultPartition}
      ${prox.cluster.mode:LOCAL}


      <UDP mcast_addr="228.1.3.4" mcast_port="48868" ip_ttl="64" ip_mcast="true"
      mcast_send_buf_size="150000" mcast_recv_buf_size="80000"
      ucast_send_buf_size="150000" ucast_recv_buf_size="80000"
      loopback="false" use_local_host="true"/>
      <PING timeout="2000" num_initial_members="3" up_thread="false" down_thread="false"/>
      <MERGE2 min_interval="10000" max_interval="20000"/>
      <FD_SOCK down_thread="false" up_thread="false"/>
      <FD timeout="1000" max_tries="8" down_thread="false" up_thread="false" shun="true"/>
      <VERIFY_SUSPECT timeout="1500" up_thread="false" down_thread="false"/>
      <pbcast.NAKACK gc_lag="50" retransmit_timeout="600,1200,2400,4800" max_xmit_size="8192" up_thread="false" down_thread="false"/>
      <UNICAST timeout="600,1200,2400" window_size="100" min_threshold="10" down_thread="false"/>
      <pbcast.STABLE desired_avg_gossip="700" up_thread="false" down_thread="false"/>
      <VIEW_SYNC avg_send_interval="60000" down_thread="false" up_thread="false" />
      <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="true" print_local_addr="true"/>
      <FRAG2 frag_size="60000" down_thread="false" up_thread="false"/>
      <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>


      5000
      <!-- Configure Marshalling (Both of these should be true together according to TreeCache Docs)-->
      true
      true

        • 1. Re: Shunning not working
          jbirkenmaier

          I am also thinking that AUTO_RECONNECT is true when I want it to be false. Since PojoCache doesn't provide access to its JGroups channel, how would I gain access to it to be able to change this option?

          • 2. Re: Shunning not working
            jbirkenmaier

            I have gained access to the JChannel for the PojoCache and executed the following:

            channel.setOpt(Channel.AUTO_GETSTATE, Boolean.FALSE);
            channel.setOpt(Channel.AUTO_RECONNECT, Boolean.FALSE);

            However, this doesn't seem to have any effect. When I reconnect the network cable (for node 192.168.69.122), the following is logged:

            08:40:00,175 INFO [dragoneyes] (UpHandler (STATE_TRANSFER)) New cluster view for partition dragoneyes: 3 ([192.168.69.122:1099, 192.168.69.230:1099] delta: 1)
            08:40:00,176 INFO [dragoneyes] (AsynchViewChangeHandler Thread) Merging partitions...
            08:40:00,176 INFO [dragoneyes] (AsynchViewChangeHandler Thread) Dead members: 0
            08:40:00,176 INFO [dragoneyes] (AsynchViewChangeHandler Thread) Originating groups: [[192.168.69.122:34399|2] [192.168.69.122:34399], [192.168.69.230:32821|2] [192.168.69.230:32821]]
            08:40:05,753 INFO [CoreTreeCacheListener] (UpHandler (STATE_TRANSFER)) viewChange MergeView::[192.168.69.122:34402|3] [192.168.69.122:34402, 192.168.69.230:32823], subgroups=[[192.168.69.122:34402|2] [192.168.69.122:34402], [192.168.69.230:32823|2] [192.168.69.230:32823]]

            The cache is still being merged for a node that is supposed to be shunned. Any ideas? Thanks.

            • 3. Re: Shunning not working
              belaban

              Shunning *always* rejoins C. To prevent this:
              - disable shunning
              - Remove MERGE2 from the stack

              But I'm not sure I recommend this, unless you have a way of killing C, because {A,B} and {C} will see each other's traffic and discard it.

              • 4. Re: Shunning not working
                jbirkenmaier

                Actually, that is the plan: when node C detected the merge, it would exit with a status code of 10 thus causing a restart of JBoss. C would then join the cluster as a new node. That part of it WAS working. When I commented out the MERGE2 entry in the xml file, it stopped the cache from merging just like I wanted but it had the unwanted effect of no longer sending the merge event. So node C no longer restarts.

                You see, I used the merge event to tell me that the network was reconnected so that the node could restart itself. Is there some way to still get a notification (of any kind) without merging the cache?

                • 5. Re: Shunning not working
                  belaban

                  Well, you can catch the viewAccepted() callback and check whether the argument is a View (no merge) or a MergeView (merge). In the latter case, do what you need to do. To prevent the cache from handling the merge itself, you may need to subclass TreeCache and override the callback.
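
                  A minimal sketch of that check, for illustration only. To keep it compilable without the library, View and MergeView below are hypothetical stand-ins for org.jgroups.View and org.jgroups.MergeView (where MergeView extends View); MergeDetector, onMerge and the events list are made-up names, not part of JGroups or TreeCache.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for org.jgroups.View and org.jgroups.MergeView,
// reduced to what this sketch needs. With JGroups on the classpath you
// would use the real classes instead.
class View {
    final List<String> members;

    View(List<String> members) {
        this.members = members;
    }
}

class MergeView extends View {
    final List<View> subgroups;

    MergeView(List<String> members, List<View> subgroups) {
        super(members);
        this.subgroups = subgroups;
    }
}

// Illustrative listener that mirrors the shape of a viewAccepted() callback.
class MergeDetector {
    // Records what was seen, so the behaviour is easy to inspect.
    final List<String> events = new ArrayList<String>();

    void viewAccepted(View newView) {
        if (newView instanceof MergeView) {
            // Two or more partitions have found each other again.
            onMerge((MergeView) newView);
        } else {
            // Ordinary membership change: a node joined, left, or was suspected.
            events.add("view:" + newView.members);
        }
    }

    // Hook for the application-specific reaction (for example, a restart).
    void onMerge(MergeView mergeView) {
        events.add("merge:" + mergeView.subgroups.size() + " subgroups");
    }
}
```

                  The instanceof test is the whole trick: the same callback receives both kinds of view, and only the merge case triggers the special handling.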

                  • 6. Re: Shunning not working
                    jbirkenmaier

                    What we ended up doing was to turn off the Merge completely for the cache and tap into the JChannel that JBoss uses to detect when the network goes down and comes back. By using an HAMembershipExtendedListener attached to the ClusterPartition MBean we get notification when there is a membership change in the JBoss cluster. Each node then decides whether to restart or not. This works just fine and we get no cache corruption.
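
                    A rough sketch of the per-node restart decision, not the code we actually run. It uses plain Java types rather than the JBoss interfaces: MergeRestartListener only imitates the shape of a merge-time membership callback, and the parameter list is an assumption. The rule in RestartPolicy (the largest pre-merge group survives, ties go to the group with the lexicographically smallest member) is also just an assumed policy for illustration. The restart itself is passed in as a Runnable; earlier in this thread it was an exit with status code 10.

```java
import java.util.Collections;
import java.util.List;

// Assumed policy: nodes outside the "surviving" pre-merge group restart.
class RestartPolicy {
    static boolean shouldRestart(String localNode, List<List<String>> originatingGroups) {
        List<String> survivor = null;
        for (List<String> group : originatingGroups) {
            if (group.isEmpty()) {
                continue;
            }
            boolean larger = survivor == null || group.size() > survivor.size();
            // Deterministic tie-break so every node reaches the same answer.
            boolean winsTie = survivor != null
                    && group.size() == survivor.size()
                    && Collections.min(group).compareTo(Collections.min(survivor)) < 0;
            if (larger || winsTie) {
                survivor = group;
            }
        }
        return survivor != null && !survivor.contains(localNode);
    }
}

// Illustrative listener; the callback name and parameters are assumptions
// modeled loosely on a merge-time membership notification.
class MergeRestartListener {
    private final String localNode;
    private final Runnable restartAction;

    MergeRestartListener(String localNode, Runnable restartAction) {
        this.localNode = localNode;
        this.restartAction = restartAction;
    }

    void membershipChangedDuringMerge(List<String> deadMembers,
                                      List<String> newMembers,
                                      List<String> allMembers,
                                      List<List<String>> originatingGroups) {
        if (RestartPolicy.shouldRestart(localNode, originatingGroups)) {
            restartAction.run();
        }
    }
}
```

                    With originating groups {A,B} and {C}, only C would restart under this rule, and it would then rejoin as a fresh node with an empty cache.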

                    Thanks for your help.