0 Replies Latest reply on Feb 10, 2010 3:57 AM by styyeung

    node is suspected and then new cluster is re-formed

    styyeung

      We found that the JBoss cluster in our production environment regularly suspects and drops a member node, so the 2-node cluster
      is torn down.  Immediately after the removal, the cluster discovery process starts and rebuilds the cluster.


      Each cycle of cluster removal and rebuild consumes some memory for message state, so an out-of-memory error
      eventually occurs (in our experience, around 3 days after a server restart).
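
      To put a number on how often a node is suspected, a small log scanner along these lines could help. This is only a
      sketch: the class name SuspectCounter and the log path argument are hypothetical, and it assumes the short console
      format ("HH:mm:ss,SSS INFO  [partition] Suspected member: ...") seen in the logs in this post. The longer log4j-format
      lines carry the same events, so they are deliberately skipped to avoid counting each suspicion twice.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JBoss or JGroups.
public class SuspectCounter {

    // Matches only the short console format, e.g.
    // "10:36:38,702 INFO  [eugmcpartition] Suspected member: node2:51899 (...)".
    // Lines that start with a date (the log4j format) do not match.
    private static final Pattern SUSPECT = Pattern.compile(
        "^\\s*(\\d{2}:\\d{2}:\\d{2}),\\d{3}\\s+INFO\\s+\\[[^\\]]+\\]\\s+Suspected member:\\s+(\\S+)");

    // Returns one "HH:mm:ss member" entry per suspicion event found.
    public static List<String> findSuspects(List<String> lines) {
        List<String> events = new ArrayList<String>();
        for (String line : lines) {
            Matcher m = SUSPECT.matcher(line);
            if (m.find()) {
                events.add(m.group(1) + " " + m.group(2));
            }
        }
        return events;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java SuspectCounter <path-to-server-log>");
            return;
        }
        List<String> lines = new ArrayList<String>();
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        } finally {
            in.close();
        }
        List<String> events = findSuspects(lines);
        for (String event : events) {
            System.out.println(event);
        }
        System.out.println("suspicion events: " + events.size());
    }
}
```

      Running it against the log of each node and comparing the timestamps would show whether the suspicions come at a
      regular interval, which may help to tell a failure-detection timeout apart from a network problem.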

       

      The problem does not happen in our test environment; it only happens in production.

       

      Our problem is very similar to the one described in the following link:
      http://lists.jboss.org/pipermail/jboss-user/2007-April/050997.html

       

      Any help will be greatly appreciated.

       

      Software versions:
      ===================================
      jboss-4.0.5.GA
      jdk1.5.0_09
      Linux version 2.6.18-53.el5 (brewbuilder@hs20-bc1-7.build.redhat.com) (gcc version 4.1.2 20070626 (Red Hat 4.1.2-14))


      Log content at cluster node 1
      ===================================

      10:36:22,391 INFO  [TreeCache] viewAccepted(): [node1:39395|16] [node1:39395]
      2010-02-09 10:36:22,391 [INFO ] MessageDispatcher up processing thread viewAccepted(): [node1:39395|16] [node1:39395] [org.apache.commons.logging.impl.Log4JLogger]
      10:36:25,285 INFO  [TreeCache] viewAccepted(): [node1:39395|17] [node1:39395, node2:51906]
      10:36:25,354 INFO  [TreeCache] locking the subtree at / to transfer state
      10:36:25,355 INFO  [StateTransferGenerator_140] returning the state for tree rooted in /(1024 bytes)
      2010-02-09 10:36:25,285 [INFO ] MessageDispatcher up processing thread viewAccepted(): [node1:39395|17] [node1:39395, node2:51906] [org.apache.commons.logging.impl.Log4JLogger]
      2010-02-09 10:36:25,354 [INFO ] MessageDispatcher up processing thread locking the subtree at / to transfer state [org.apache.commons.logging.impl.Log4JLogger]
      2010-02-09 10:36:25,355 [DEBUG] MessageDispatcher up processing thread generated the in-memory state (97 bytes) [org.apache.commons.logging.impl.Log4JLogger]
      2010-02-09 10:36:25,355 [DEBUG] MessageDispatcher up processing thread returning the associated state (4 bytes) [org.apache.commons.logging.impl.Log4JLogger]
      2010-02-09 10:36:25,355 [INFO ] MessageDispatcher up processing thread returning the state for tree rooted in /(1024 bytes) [org.apache.commons.logging.impl.Log4JLogger]
      10:36:38,702 INFO  [eugmcpartition] Suspected member: node2:51899 (additional data: 19 bytes)
      10:36:38,703 INFO  [eugmcpartition] New cluster view for partition eugmcpartition (id: 16, delta: -1) : [node1:1099]
      10:36:38,704 INFO  [eugmcpartition] I am (node1:1099) received membershipChanged event:
      10:36:38,704 INFO  [eugmcpartition] Dead members: 1 ([node2:1099])
      10:36:38,704 INFO  [eugmcpartition] New Members : 0 ([])
      10:36:38,704 INFO  [eugmcpartition] All Members : 1 ([node1:1099])
      2010-02-09 10:36:38,702 [INFO ] MessageDispatcher up processing thread Suspected member: node2:51899 (additional data: 19 bytes) [org.jboss.ha.framework.server.HAPartitionImpl]
      2010-02-09 10:36:38,703 [INFO ] MessageDispatcher up processing thread New cluster view for partition eugmcpartition (id: 16, delta: -1) : [node1:1099] [org.jboss.ha.framework.server.HAPartitionImpl]
      2010-02-09 10:36:38,703 [DEBUG] MessageDispatcher up processing thread dead members: [node2:1099] [org.jboss.ha.framework.server.HAPartitionImpl]
      2010-02-09 10:36:38,704 [DEBUG] MessageDispatcher up processing thread membership changed from 1 to 1 [org.jboss.ha.framework.server.HAPartitionImpl]
      2010-02-09 10:36:38,704 [DEBUG] AsynchViewChangeHandler Thread Begin notifyListeners, viewID: 16 [org.jboss.ha.framework.server.HAPartitionImpl]
      2010-02-09 10:36:38,704 [INFO ] AsynchViewChangeHandler Thread I am (node1:1099) received membershipChanged event: [org.jboss.ha.framework.server.DistributedReplicantManagerImpl]
      2010-02-09 10:36:38,704 [INFO ] AsynchViewChangeHandler Thread Dead members: 1 ([node2:1099]) [org.jboss.ha.framework.server.DistributedReplicantManagerImpl]
      2010-02-09 10:36:38,704 [INFO ] AsynchViewChangeHandler Thread New Members : 0 ([]) [org.jboss.ha.framework.server.DistributedReplicantManagerImpl]
      2010-02-09 10:36:38,704 [INFO ] AsynchViewChangeHandler Thread All Members : 1 ([node1:1099]) [org.jboss.ha.framework.server.DistributedReplicantManagerImpl]

      2010-02-09 10:36:38,704 [DEBUG] AsynchViewChangeHandler Thread purgeDeadMembers, [node2:1099] [org.jboss.ha.framework.server.DistributedReplicantManagerImpl]
      2010-02-09 10:36:38,704 [DEBUG] AsynchViewChangeHandler Thread trying to remove deadMember node2:1099 for key DCacheBridge-DefaultJGBridge [org.jboss.ha.framework.server.DistributedReplicantManagerImpl]
      2010-02-09 10:36:38,705 [DEBUG] AsynchViewChangeHandler Thread node2:1099 was removed [org.jboss.ha.framework.server.DistributedReplicantManagerImpl]
      2010-02-09 10:36:38,705 [DEBUG] AsynchViewChangeHandler Thread The list of replicant for the JG bridge has changed, computing and updating local info... [org.jboss.cache.invalidation.bridges.JGCacheInvalidationBridge]
      2010-02-09 10:36:38,705 [DEBUG] AsynchViewChangeHandler Thread ... No bridge info was associated to this node [org.jboss.cache.invalidation.bridges.JGCacheInvalidationBridge]
      2010-02-09 10:36:38,705 [DEBUG] AsynchViewChangeHandler Thread trying to remove deadMember node2:1099 for key jboss.ha:service=HASingletonDeployer [org.jboss.ha.framework.server.DistributedReplicantManagerImpl]
      2010-02-09 10:36:38,705 [DEBUG] AsynchViewChangeHandler Thread node2:1099 was removed [org.jboss.ha.framework.server.DistributedReplicantManagerImpl]
      2010-02-09 10:36:38,705 [DEBUG] AsynchViewChangeHandler Thread partitionTopologyChanged, isElectedNewMaster=true, isMasterNode=true, viewID=-703916242 [org.jboss.ha.singleton.HASingletonSupport]
      2010-02-09 10:36:38,705 [DEBUG] AsynchViewChangeHandler Thread trying to remove deadMember node2:1099 for key HAJNDI [org.jboss.ha.framework.server.DistributedReplicantManagerImpl]
      2010-02-09 10:36:38,705 [DEBUG] AsynchViewChangeHandler Thread node2:1099 was removed [org.jboss.ha.framework.server.DistributedReplicantManagerImpl]
      2010-02-09 10:36:38,705 [DEBUG] AsynchViewChangeHandler Thread replicantsChanged 'HAJNDI' to 1 (intra-view id: -703916242) [org.jboss.ha.framework.server.HATarget]

      2010-02-09 10:36:38,705 [DEBUG] AsynchViewChangeHandler Thread End notifyListeners, viewID: 16 [org.jboss.ha.framework.server.HAPartitionImpl]
      10:36:41,534 INFO  [eugmcpartition] New cluster view for partition eugmcpartition (id: 17, delta: 1) : [node1:1099, node2:1099]
      10:36:41,534 INFO  [eugmcpartition] I am (node1:1099) received membershipChanged event:
      10:36:41,534 INFO  [eugmcpartition] Dead members: 0 ([])
      10:36:41,534 INFO  [eugmcpartition] New Members : 1 ([node2:1099])
      10:36:41,535 INFO  [eugmcpartition] All Members : 2 ([node1:1099, node2:1099])
      2010-02-09 10:36:41,534 [INFO ] MessageDispatcher up processing thread New cluster view for partition eugmcpartition (id: 17, delta: 1) : [node1:1099, node2:1099] [org.jboss.ha.framework.server.HAPartitionImpl]

       

      Log content at cluster node 2
      ===================================

      -------------------------------------------------------
      GMS: address is node2:51878
      -------------------------------------------------------
      10:23:30,311 INFO  [TreeCache] viewAccepted(): [node1:39395|9] [node1:39395, node2:51878]
      10:23:30,376 INFO  [TreeCache] received the state (size=1024 bytes)
      10:23:36,083 INFO  [eugmcpartition] Suspected member: node2:51875 (additional data: 19 bytes)
      10:23:36,084 WARN  [GMS] checkSelfInclusion() failed, node2:51875 (additional data: 19 bytes) is not a member of view [node1:39398
      (additional data: 19 bytes)|8] [node1:39398 (additional data: 19 bytes)]; discarding view
      10:23:36,085 WARN  [GMS] I (node2:51875 (additional data: 19 bytes)) am being shunned, will leave and rejoin group
      (prev_members are [node1:39398 (additional data: 19 bytes) node2:51875 (additional data: 19 bytes) ])
      10:23:36,907 INFO  [STDOUT]