1 Reply Latest reply on Mar 27, 2006 2:38 AM by belaban

    100% CPU causes members view to contain multiple entries per

    eli.konky

      Hi all,

      I have a two-node cluster that normally works fine. But when the CPU on one of the members rises to 100%, problems occur and the membership breaks (which I guess is OK). When the CPU goes back down to normal (10%), however, the cluster membership is not recovered correctly. Instead, it goes into an endless loop of membership events, adding already existing nodes to the view, so after a while I have a member list of 10 nodes.

      thanks
      Eli

      Excuse the long logs.

      This is the log file from the machine with the 100% CPU:

      2006-03-26 16:49:13,650 WARN [org.jgroups.protocols.FD] I was suspected, but will not remove myself from membership (waiting for EXIT message)
      2006-03-26 16:49:16,150 WARN [org.jgroups.protocols.FD] I was suspected, but will not remove myself from membership (waiting for EXIT message)
      2006-03-26 16:50:10,384 INFO [org.jboss.ha.framework.interfaces.HAPartition.BUILD_2_0_0_45] Suspected member: elik:3638 (additional data: 18 bytes)
      2006-03-26 16:51:00,900 WARN [org.jgroups.protocols.pbcast.GMS] checkSelfInclusion() failed, elik:3638 (additional data: 18 bytes) is not a member of view [SHIMI:1096 (additional data: 18 bytes)|6] [SHIMI:1096 (additional data: 18 bytes)]; discarding view
      2006-03-26 16:51:00,915 WARN [org.jgroups.protocols.pbcast.GMS] I (elik:3638 (additional data: 18 bytes)) am being shunned, will leave and rejoin group (prev_members are [SHIMI:1096 (additional data: 18 bytes) elik:3638 (additional data: 18 bytes) ])
      2006-03-26 16:52:27,400 ERROR [org.jgroups.protocols.pbcast.GMS] down_handler thread for GMS was interrupted (in order to be terminated), but is is still alive
      2006-03-26 16:52:40,415 ERROR [org.jgroups.protocols.pbcast.STABLE] down_handler thread for STABLE was interrupted (in order to be terminated), but is is still alive
      2006-03-26 16:52:47,431 ERROR [org.jgroups.protocols.UDP] down_handler thread for UDP was interrupted (in order to be terminated), but is is still alive
      


      And when the CPU is back to normal, members are added to the cluster again and again:

      2006-03-26 16:54:57,134 WARN [org.jgroups.protocols.pbcast.GMS] checkSelfInclusion() failed, elik:3638 (additional data: 18 bytes) is not a member of view [SHIMI:1096 (additional data: 18 bytes)|7] [SHIMI:1096 (additional data: 18 bytes), elik:3860 (additional data: 18 bytes)]; discarding view
      2006-03-26 16:54:57,368 WARN [org.jgroups.protocols.pbcast.GMS] I (elik:3638 (additional data: 18 bytes)) am being shunned, will leave and rejoin group (prev_members are [SHIMI:1096 (additional data: 18 bytes) elik:3638 (additional data: 18 bytes) ])
      2006-03-26 16:54:57,415 INFO [org.jboss.ha.framework.interfaces.HAPartition.BUILD_2_0_0_45] New cluster view for partition BUILD_2_0_0_45: 7 ([192.168.10.92:1099, 192.168.10.49:1099] delta: 0)
      2006-03-26 16:54:57,493 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.BUILD_2_0_0_45] I am (192.168.10.49:1099) received membershipChanged event:
      2006-03-26 16:54:57,493 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.BUILD_2_0_0_45] Dead members: 0 ([])
      2006-03-26 16:54:57,493 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.BUILD_2_0_0_45] New Members : 0 ([])
      2006-03-26 16:54:57,493 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.BUILD_2_0_0_45] All Members : 2 ([192.168.10.92:1099, 192.168.10.49:1099])
      2006-03-26 16:54:57,571 INFO [org.jboss.proxy.ejb.ProxyFactory] Bound EJB Home 'telmapLocationProvider' to jndi 'telmapLocationProvider'
      2006-03-26 16:54:57,571 INFO [org.jboss.proxy.ejb.ProxyFactory] Bound EJB Home 'gateway' to jndi 'gateway'
      2006-03-26 16:54:57,571 INFO [org.jboss.proxy.ejb.ProxyFactory] Bound EJB Home 'fakeLocationProvider' to jndi 'fakeLocationProvider'
      2006-03-26 16:54:57,571 INFO [org.jboss.proxy.ejb.ProxyFactory] Bound EJB Home 'dummyLocationProvider' to jndi 'dummyLocationProvider'
      2006-03-26 16:54:57,743 ERROR [org.jgroups.protocols.UDP] dest address of message is null, and sending to default address fails as mcast_addr is null, too ! Discarding message DistributedReplicantManager._add(jboss.j2ee:jndiName=telmapLocationProvider,service=EJB, 192.168.10.49:1099, JRMPInvoker_Stub[UnicastRef2 [liveRef: [endpoint:[192.168.10.49:4447](remote),objID:[3]]]])
      2006-03-26 16:54:58,275 INFO [STDOUT]
      -------------------------------------------------------
      GMS: address is elik:3866 (additional data: 18 bytes)
      -------------------------------------------------------
      2006-03-26 16:55:00,306 WARN [org.jgroups.protocols.pbcast.GMS] checkSelfInclusion() failed, elik:3638 (additional data: 18 bytes) is not a member of view [SHIMI:1096 (additional data: 18 bytes)|8] [SHIMI:1096 (additional data: 18 bytes), elik:3860 (additional data: 18 bytes), elik:3866 (additional data: 18 bytes)]; discarding view
      2006-03-26 16:55:00,306 WARN [org.jgroups.protocols.pbcast.GMS] I (elik:3638 (additional data: 18 bytes)) am being shunned, will leave and rejoin group (prev_members are [SHIMI:1096 (additional data: 18 bytes) elik:3638 (additional data: 18 bytes) ])
      2006-03-26 16:55:00,306 INFO [org.jboss.ha.framework.interfaces.HAPartition.BUILD_2_0_0_45] Suspected member: SHIMI:1096 (additional data: 18 bytes)
      2006-03-26 16:55:00,306 INFO [org.jboss.ha.framework.interfaces.HAPartition.BUILD_2_0_0_45] New cluster view for partition BUILD_2_0_0_45: 8 ([192.168.10.92:1099, 192.168.10.49:1099, 192.168.10.49:1099] delta: 1)
      2006-03-26 16:55:00,306 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.BUILD_2_0_0_45] I am (192.168.10.49:1099) received membershipChanged event:
      2006-03-26 16:55:00,306 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.BUILD_2_0_0_45] Dead members: 0 ([])
      2006-03-26 16:55:00,337 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.BUILD_2_0_0_45] New Members : 0 ([])
      2006-03-26 16:55:00,337 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.BUILD_2_0_0_45] All Members : 3 ([192.168.10.92:1099, 192.168.10.49:1099, 192.168.10.49:1099])
      2006-03-26 16:55:00,446 INFO [org.jboss.proxy.ejb.ProxyFactory] Bound EJB Home 'telmapLocationProvider' to jndi 'telmapLocationProvider'
      2006-03-26 16:55:00,462 WARN [org.jgroups.protocols.pbcast.NAKACK] [elik:3638 (additional data: 18 bytes)] discarded message from non-member elik:3866 (additional data: 18 bytes)
      2006-03-26 16:55:00,462 INFO [org.jboss.proxy.ejb.ProxyFactory] Bound EJB Home 'gateway' to jndi 'gateway'
      2006-03-26 16:55:00,462 WARN [org.jgroups.protocols.pbcast.NAKACK] [elik:3638 (additional data: 18 bytes)] discarded message from non-member elik:3866 (additional data: 18 bytes)
      2006-03-26 16:55:00,462 INFO [org.jboss.proxy.ejb.ProxyFactory] Bound EJB Home 'fakeLocationProvider' to jndi 'fakeLocationProvider'
      2006-03-26 16:55:00,462 WARN [org.jgroups.protocols.pbcast.NAKACK] [elik:3638 (additional data: 18 bytes)] discarded message from non-member elik:3866 (additional data: 18 bytes)
      2006-03-26 16:55:00,493 INFO [org.jboss.proxy.ejb.ProxyFactory] Bound EJB Home 'dummyLocationProvider' to jndi 'dummyLocationProvider'
      2006-03-26 16:55:00,493 WARN [org.jgroups.protocols.pbcast.NAKACK] [elik:3638 (additional data: 18 bytes)] discarded message from non-member elik:3866 (additional data: 18 bytes)
      2006-03-26 16:55:00,493 WARN [org.jgroups.protocols.pbcast.NAKACK] [elik:3638 (additional data: 18 bytes)] discarded message from non-member elik:3866 (additional data: 18 bytes)
      2006-03-26 16:55:00,493 WARN [org.jgroups.protocols.pbcast.NAKACK] [elik:3638 (additional data: 18 bytes)] discarded message from non-member elik:3866 (additional data: 18 bytes)
      2006-03-26 16:55:00,509 WARN [org.jgroups.protocols.pbcast.NAKACK] [elik:3638 (additional data: 18 bytes)] discarded message from non-member elik:3866 (additional data: 18 bytes)
      2006-03-26 16:55:23,415 INFO [org.jboss.ha.framework.interfaces.HAPartition.BUILD_2_0_0_45] Suspected member: elik:3860 (additional data: 18 bytes)
      2006-03-26 16:55:23,415 ERROR [org.jgroups.protocols.pbcast.CoordGmsImpl] mbr elik:3860 (additional data: 18 bytes) is not a member !
      2006-03-26 16:55:23,415 INFO [org.jboss.ha.framework.interfaces.HAPartition.BUILD_2_0_0_45] Suspected member: elik:3860 (additional data: 18 bytes)
      2006-03-26 16:55:23,712 WARN [org.jgroups.protocols.pbcast.GMS] checkSelfInclusion() failed, elik:3638 (additional data: 18 bytes) is not a member of view [SHIMI:1096 (additional data: 18 bytes)|9] [SHIMI:1096 (additional data: 18 bytes), elik:3866 (additional data: 18 bytes)]; discarding view
      2006-03-26 16:55:23,712 INFO [org.jboss.ha.framework.interfaces.HAPartition.BUILD_2_0_0_45] New cluster view for partition BUILD_2_0_0_45: 9 ([192.168.10.92:1099, 192.168.10.49:1099] delta: -1)
      2006-03-26 16:55:23,712 WARN [org.jgroups.protocols.pbcast.GMS] I (elik:3638 (additional data: 18 bytes)) am being shunned, will leave and rejoin group (prev_members are [SHIMI:1096 (additional data: 18 bytes) elik:3638 (additional data: 18 bytes) ])
      2006-03-26 16:55:23,728 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.BUILD_2_0_0_45] I am (192.168.10.49:1099) received membershipChanged event:
      2006-03-26 16:55:23,728 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.BUILD_2_0_0_45] Dead members: 0 ([])
      2006-03-26 16:55:23,728 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.BUILD_2_0_0_45] New Members : 0 ([])
      2006-03-26 16:55:23,728 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.BUILD_2_0_0_45] All Members : 2 ([192.168.10.92:1099, 192.168.10.49:1099])
      2006-03-26 16:55:24,556 INFO [STDOUT]
      


      And so on, e.g.:
      2006-03-26 17:20:42,446 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.BUILD_2_0_0_45] I am (192.168.10.49:1099) received membershipChanged event:
      2006-03-26 17:20:42,446 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.BUILD_2_0_0_45] Dead members: 0 ([])
      2006-03-26 17:20:42,446 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.BUILD_2_0_0_45] New Members : 0 ([])
      2006-03-26 17:20:42,446 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.BUILD_2_0_0_45] All Members : 5 ([192.168.10.92:1099, 192.168.10.49:1099, 192.168.10.49:1099, 192.168.10.49:1099, 192.168.10.49:1099])
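
For anyone who wants to check their own logs for the same symptom, here is a minimal sketch that scans an "All Members" line for addresses listed more than once. It assumes the line format shown in the logs above; the names `ALL_MEMBERS` and `duplicate_members` are made up for this illustration and are not part of JBoss or JGroups. The sample line is copied from the 16:55:00,337 entry above.

```python
import re
from collections import Counter

# Assumed format, as seen in the logs above:
#   ... All Members : 3 ([192.168.10.92:1099, 192.168.10.49:1099, 192.168.10.49:1099])
ALL_MEMBERS = re.compile(r"All Members : (\d+) \(\[(.*?)\]\)")


def duplicate_members(line):
    """Return {address: count} for addresses listed more than once, or {} if none."""
    match = ALL_MEMBERS.search(line)
    if match is None:
        return {}
    # Split the bracketed list on commas and drop surrounding whitespace.
    members = [m.strip() for m in match.group(2).split(",") if m.strip()]
    counts = Counter(members)
    return {addr: n for addr, n in counts.items() if n > 1}


# Sample line copied from the log above.
sample = (
    "2006-03-26 16:55:00,337 INFO "
    "[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.BUILD_2_0_0_45] "
    "All Members : 3 ([192.168.10.92:1099, 192.168.10.49:1099, 192.168.10.49:1099])"
)
print(duplicate_members(sample))
```

Running the function over every line of a server log would show when the same address first appears twice in the view, which should line up with the "am being shunned, will leave and rejoin group" warnings.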