7 Replies Latest reply on May 10, 2016 10:01 AM by rhusar

    Cluster node not responding and KeyAffinityServiceImpl going infinite loop with high CPU load in Wildfly 8

    sammychu

      We are constantly experiencing cluster node not responding in Wildfly 8 with a high CPU load. After enabling TRACE log on jgroups/ infinispan related component, we found the KeyAffinityServiceImpl going into infinite loop without responding (We can quite confirmed it is dead in the loop because we left it over night and it cannot be recovered...).

      Our clustering environment as following:

      2 VM with CentOS 6.4, located in the same subnet (Node 1: 10.1.3.5, Node 2: 10.1.3.6)

      Wildfly 8.2.1.Final with standalone-full-ha.xml profile, using UDP stack, distribution web cache container (i.e. using default config)

       

      After tracking the TRACE log, we suspected the following problematic log message:

      Node 1 (10.1.3.5):

      2016-04-18 15:51:40,366 TRACE [org.infinispan.topology.ClusterCacheStatus] (transport-thread-13) {} Trying to add node clusternode2/web to cache myear.ear.mywar.war, but it is already a member: members = [clusternode1/web, clusternode2/web], joiners = []
      2016-04-18 15:51:40,366 TRACE [org.infinispan.topology.ClusterCacheStatus] (transport-thread-13) {} Trying to add node clusternode2/web to cache myear.ear.mywar.war, but it is already a member: members = [clusternode1/web, clusternode2/web], joiners = []
      2016-04-18 15:51:40,366 TRACE [org.infinispan.topology.ClusterCacheStatus] (transport-thread-13) {} Trying to add node clusternode2/web to cache dist, but it is already a member: members = [clusternode1/web, clusternode2/web], joiners = []
      2016-04-18 15:51:40,366 TRACE [org.infinispan.topology.ClusterCacheStatus] (transport-thread-13) {} Trying to add node clusternode2/web to cache dist, but it is already a member: members = [clusternode1/web, clusternode2/web], joiners = []
      2016-04-18 15:51:40,366 TRACE [org.infinispan.topology.ClusterCacheStatus] (transport-thread-13) {} Trying to add node clusternode1/web to cache myear.ear.mywar.war, but it is already a member: members = [clusternode1/web, clusternode2/web], joiners = []
      2016-04-18 15:51:40,366 TRACE [org.infinispan.topology.ClusterCacheStatus] (transport-thread-13) {} Trying to add node clusternode2/web to cache myear.ear.mywar.war, but it is already a member: members = [clusternode1/web, clusternode2/web], joiners = []
      2016-04-18 15:51:40,366 TRACE [org.infinispan.topology.ClusterCacheStatus] (transport-thread-13) {} Trying to add node clusternode1/web to cache myear.ear.mywar.war, but it is already a member: members = [clusternode1/web, clusternode2/web], joiners = []
      2016-04-18 15:51:40,366 TRACE [org.infinispan.topology.ClusterCacheStatus] (transport-thread-13) {} Trying to add node clusternode1/web to cache dist, but it is already a member: members = [clusternode1/web, clusternode2/web], joiners = []
      2016-04-18 15:51:40,366 TRACE [org.infinispan.topology.ClusterCacheStatus] (transport-thread-13) {} Trying to add node clusternode2/web to cache dist, but it is already a member: members = [clusternode1/web, clusternode2/web], joiners = []
      2016-04-18 15:51:40,366 TRACE [org.infinispan.topology.ClusterCacheStatus] (transport-thread-13) {} Trying to add node clusternode1/web to cache dist, but it is already a member: members = [clusternode1/web, clusternode2/web], joiners = []
      2016-04-18 15:51:40,367 TRACE [org.infinispan.topology.ClusterTopologyManagerImpl] (transport-thread-13) {} Initializing rebalance policy for cache myear.ear.mywar.war, pre-existing partitions are [CacheTopology{id=4, currentCH=DefaultConsistentHash{numSegments=80, numOwners=2, members=[clusternode2/web]}, pendingCH=null}, CacheTopology{id=2, currentCH=DefaultConsistentHash{numSegments=80, numOwners=2, members=[clusternode1/web, clusternode2/web]}, pendingCH=null}]
      2016-04-18 15:51:40,367 TRACE [org.infinispan.topology.ClusterCacheStatus] (transport-thread-13) {} Cluster members updated for cache myear.ear.mywar.war, no leavers detected: cache members = [clusternode1/web, clusternode2/web]
      2016-04-18 15:51:40,388 TRACE [org.infinispan.topology.ClusterCacheStatus] (transport-thread-13) {} Cache myear.ear.mywar.war topology updated: members = [clusternode1/web, clusternode2/web], joiners = [], topology = CacheTopology{id=6, currentCH=DefaultConsistentHash{numSegments=80, numOwners=2, members=[clusternode1/web, clusternode2/web]}, pendingCH=null}
      2016-04-18 15:51:40,388 DEBUG [org.infinispan.topology.ClusterTopologyManagerImpl] (transport-thread-13) {} Updating cluster-wide consistent hash for cache myear.ear.mywar.war, topology = CacheTopology{id=6, currentCH=DefaultConsistentHash{numSegments=80, numOwners=2, members=[clusternode1/web, clusternode2/web]}, pendingCH=null}
      2016-04-18 15:51:40,388 TRACE [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (transport-thread-13) {} dests=null, command=CacheTopologyControlCommand{cache=myear.ear.mywar.war, type=CH_UPDATE, sender=clusternode1/web, joinInfo=null, topologyId=6, currentCH=DefaultConsistentHash{numSegments=80, numOwners=2, members=[clusternode1/web, clusternode2/web]}, pendingCH=null, throwable=null, viewId=3}, mode=ASYNCHRONOUS_WITH_SYNC_MARSHALLING, timeout=60000
      2016-04-18 15:51:40,388 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (transport-thread-13) {} Replication task sending CacheTopologyControlCommand{cache=myear.ear.mywar.war, type=CH_UPDATE, sender=clusternode1/web, joinInfo=null, topologyId=6, currentCH=DefaultConsistentHash{numSegments=80, numOwners=2, members=[clusternode1/web, clusternode2/web]}, pendingCH=null, throwable=null, viewId=3} to single recipient clusternode2/web with response mode GET_NONE
      2016-04-18 15:51:40,388 TRACE [org.infinispan.topology.ClusterTopologyManagerImpl] (transport-thread-15) {} Attempting to execute command on self: CacheTopologyControlCommand{cache=myear.ear.mywar.war, type=CH_UPDATE, sender=clusternode1/web, joinInfo=null, topologyId=6, currentCH=DefaultConsistentHash{numSegments=80, numOwners=2, members=[clusternode1/web, clusternode2/web]}, pendingCH=null, throwable=null, viewId=3}
      2016-04-18 15:51:40,389 DEBUG [org.infinispan.topology.LocalTopologyManagerImpl] (transport-thread-15) {} Updating local consistent hash(es) for cache myear.ear.mywar.war: new topology = CacheTopology{id=6, currentCH=DefaultConsistentHash{numSegments=80, numOwners=2, members=[clusternode1/web, clusternode2/web]}, pendingCH=null}
      2016-04-18 15:51:40,389 TRACE [org.infinispan.topology.CacheTopology] (transport-thread-15) {} Current consistent hash's routing table: 0: 1 0, 1: 1 0, 2: 1 0, 3: 1 0, 4: 1 0, 5: 1 0, 6: 1 0, 7: 1 0, 8: 1 0, 9: 1 0, 10: 1 0, 11: 1 0, 12: 1 0, 13: 1 0, 14: 1 0, 15: 1 0, 16: 1 0, 17: 1 0, 18: 1 0, 19: 1 0, 20: 1 0, 21: 1 0, 22: 1 0, 23: 1 0, 24: 1 0, 25: 1 0, 26: 1 0, 27: 1 0, 28: 1 0, 29: 1 0, 30: 1 0, 31: 1 0, 32: 1 0, 33: 1 0, 34: 1 0, 35: 1 0, 36: 1 0, 37: 1 0, 38: 1 0, 39: 1 0, 40: 1 0, 41: 1 0, 42: 1 0, 43: 1 0, 44: 1 0, 45: 1 0, 46: 1 0, 47: 1 0, 48: 1 0, 49: 1 0, 50: 1 0, 51: 1 0, 52: 1 0, 53: 1 0, 54: 1 0, 55: 1 0, 56: 1 0, 57: 1 0, 58: 1 0, 59: 1 0, 60: 1 0, 61: 1 0, 62: 1 0, 63: 1 0, 64: 1 0, 65: 1 0, 66: 1 0, 67: 1 0, 68: 1 0, 69: 1 0, 70: 1 0, 71: 1 0, 72: 1 0, 73: 1 0, 74: 1 0, 75: 1 0, 76: 1 0, 77: 1 0, 78: 1 0, 79: 1 0
      2016-04-18 15:51:40,389 TRACE [org.jgroups.protocols.RSVP] (transport-thread-13) {} clusternode1/web: REQ --> clusternode2/web
      2016-04-18 15:51:40,389 TRACE [org.jgroups.protocols.UNICAST3] (transport-thread-13) {} clusternode1/web --> DATA(clusternode2/web: #8, conn_id=2)
      2016-04-18 15:51:40,390 TRACE [org.jgroups.protocols.TCP] (transport-thread-13) {} null: sending msg to clusternode2/web, src=clusternode1/web, headers are RequestCorrelator: id=200, type=REQ, id=18, rsp_expected=false, RSVP: REQ(7), UNICAST3: DATA, seqno=8, conn_id=2, TCP: [channel_name=web]
      2016-04-18 15:51:40,390 TRACE [org.jgroups.protocols.TCP] (transport-thread-13) {} dest=10.1.3.6:7900 (4177 bytes)
      2016-04-18 15:51:40,392 TRACE [org.jgroups.protocols.TCP] (Incoming-16,shared=tcp) {} null: received [dst: clusternode1/web, src: clusternode2/web (3 headers), size=0 bytes, flags=DONT_BUNDLE|RSVP|INTERNAL], headers are RSVP: RSP(7), UNICAST3: DATA, seqno=50, conn_id=1, TCP: [channel_name=web]
      2016-04-18 15:51:40,392 TRACE [org.jgroups.protocols.UNICAST3] (Incoming-16,shared=tcp) {} clusternode1/web <-- DATA(clusternode2/web: #50, conn_id=1)
      2016-04-18 15:51:40,392 TRACE [org.jgroups.protocols.UNICAST3] (Incoming-16,shared=tcp) {} clusternode1/web: delivering #50 - #50 (1 messages)
      2016-04-18 15:51:40,396 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (transport-thread-13) {} Response: null
      2016-04-18 15:51:40,397 TRACE [org.infinispan.topology.DefaultRebalancePolicy] (transport-thread-13) {} Cache myear.ear.mywar.war status changed: joiners=[], topology=CacheTopology{id=6, currentCH=DefaultConsistentHash{numSegments=80, numOwners=2, members=[clusternode1/web, clusternode2/web]}, pendingCH=null}
      2016-04-18 15:51:40,397 TRACE [org.infinispan.topology.DefaultRebalancePolicy] (transport-thread-13) {} Not triggering rebalance for cache myear.ear.mywar.war, no joiners and the current consistent hash is already balanced
      2016-04-18 15:51:40,397 TRACE [org.infinispan.topology.ClusterTopologyManagerImpl] (transport-thread-13) {} Initializing rebalance policy for cache dist, pre-existing partitions are [CacheTopology{id=4, currentCH=DefaultConsistentHash{numSegments=80, numOwners=2, members=[clusternode2/web]}, pendingCH=null}, CacheTopology{id=2, currentCH=DefaultConsistentHash{numSegments=80, numOwners=2, members=[clusternode1/web, clusternode2/web]}, pendingCH=null}]
      2016-04-18 15:51:40,397 TRACE [org.infinispan.affinity.KeyAffinityServiceImpl] (transport-thread-15) {} TopologyChangedEvent received: EventImpl{type=TOPOLOGY_CHANGED, pre=false, cache=Cache 'myear.ear.mywar.war'@clusternode1/web, consistentHashAtStart=GroupingConsistentHash:DefaultConsistentHash{numSegments=80, numOwners=2, members=[clusternode1/web, clusternode2/web]}, consistentHashAtEnd=GroupingConsistentHash:DefaultConsistentHash{numSegments=80, numOwners=2, members=[clusternode1/web, clusternode2/web]}, newTopologyId=6}
      2016-04-18 15:51:40,397 TRACE [org.infinispan.affinity.KeyAffinityServiceImpl] (transport-thread-15) {} Skipping address: clusternode2/web
      2016-04-18 15:51:40,398 TRACE [org.infinispan.affinity.KeyAffinityServiceImpl] (transport-thread-15) {} resetNumberOfKeys ends with: maxNumberOfKeys=10, existingKeyCount=0
      2016-04-18 15:51:40,398 TRACE [org.infinispan.topology.ClusterCacheStatus] (transport-thread-13) {} Cluster members updated for cache dist, no leavers detected: cache members = [clusternode1/web, clusternode2/web]
      2016-04-18 15:51:40,398 TRACE [org.infinispan.topology.ClusterCacheStatus] (transport-thread-13) {} Cache dist topology updated: members = [clusternode1/web, clusternode2/web], joiners = [], topology = CacheTopology{id=6, currentCH=DefaultConsistentHash{numSegments=80, numOwners=2, members=[clusternode1/web, clusternode2/web]}, pendingCH=null}
      2016-04-18 15:51:40,398 DEBUG [org.infinispan.topology.ClusterTopologyManagerImpl] (transport-thread-13) {} Updating cluster-wide consistent hash for cache dist, topology = CacheTopology{id=6, currentCH=DefaultConsistentHash{numSegments=80, numOwners=2, members=[clusternode1/web, clusternode2/web]}, pendingCH=null}
      2016-04-18 15:51:40,399 TRACE [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (transport-thread-13) {} dests=null, command=CacheTopologyControlCommand{cache=dist, type=CH_UPDATE, sender=clusternode1/web, joinInfo=null, topologyId=6, currentCH=DefaultConsistentHash{numSegments=80, numOwners=2, members=[clusternode1/web, clusternode2/web]}, pendingCH=null, throwable=null, viewId=3}, mode=ASYNCHRONOUS_WITH_SYNC_MARSHALLING, timeout=60000
      2016-04-18 15:51:40,399 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (transport-thread-13) {} Replication task sending CacheTopologyControlCommand{cache=dist, type=CH_UPDATE, sender=clusternode1/web, joinInfo=null, topologyId=6, currentCH=DefaultConsistentHash{numSegments=80, numOwners=2, members=[clusternode1/web, clusternode2/web]}, pendingCH=null, throwable=null, viewId=3} to single recipient clusternode2/web with response mode GET_NONE
      2016-04-18 15:51:40,399 TRACE [org.infinispan.topology.ClusterTopologyManagerImpl] (transport-thread-16) {} Attempting to execute command on self: CacheTopologyControlCommand{cache=dist, type=CH_UPDATE, sender=clusternode1/web, joinInfo=null, topologyId=6, currentCH=DefaultConsistentHash{numSegments=80, numOwners=2, members=[clusternode1/web, clusternode2/web]}, pendingCH=null, throwable=null, viewId=3}
      2016-04-18 15:51:40,399 DEBUG [org.infinispan.topology.LocalTopologyManagerImpl] (transport-thread-16) {} Updating local consistent hash(es) for cache dist: new topology = CacheTopology{id=6, currentCH=DefaultConsistentHash{numSegments=80, numOwners=2, members=[clusternode1/web, clusternode2/web]}, pendingCH=null}
      2016-04-18 15:51:40,399 TRACE [org.infinispan.topology.CacheTopology] (transport-thread-16) {} Current consistent hash's routing table: 0: 1 0, 1: 1 0, 2: 1 0, 3: 1 0, 4: 1 0, 5: 1 0, 6: 1 0, 7: 1 0, 8: 1 0, 9: 1 0, 10: 1 0, 11: 1 0, 12: 1 0, 13: 1 0, 14: 1 0, 15: 1 0, 16: 1 0, 17: 1 0, 18: 1 0, 19: 1 0, 20: 1 0, 21: 1 0, 22: 1 0, 23: 1 0, 24: 1 0, 25: 1 0, 26: 1 0, 27: 1 0, 28: 1 0, 29: 1 0, 30: 1 0, 31: 1 0, 32: 1 0, 33: 1 0, 34: 1 0, 35: 1 0, 36: 1 0, 37: 1 0, 38: 1 0, 39: 1 0, 40: 1 0, 41: 1 0, 42: 1 0, 43: 1 0, 44: 1 0, 45: 1 0, 46: 1 0, 47: 1 0, 48: 1 0, 49: 1 0, 50: 1 0, 51: 1 0, 52: 1 0, 53: 1 0, 54: 1 0, 55: 1 0, 56: 1 0, 57: 1 0, 58: 1 0, 59: 1 0, 60: 1 0, 61: 1 0, 62: 1 0, 63: 1 0, 64: 1 0, 65: 1 0, 66: 1 0, 67: 1 0, 68: 1 0, 69: 1 0, 70: 1 0, 71: 1 0, 72: 1 0, 73: 1 0, 74: 1 0, 75: 1 0, 76: 1 0, 77: 1 0, 78: 1 0, 79: 1 0
      2016-04-18 15:51:40,400 TRACE [org.jgroups.protocols.RSVP] (transport-thread-13) {} clusternode1/web: REQ --> clusternode2/web
      2016-04-18 15:51:40,400 TRACE [org.jgroups.protocols.UNICAST3] (transport-thread-13) {} clusternode1/web --> DATA(clusternode2/web: #9, conn_id=2)
      2016-04-18 15:51:40,400 TRACE [org.jgroups.protocols.TCP] (transport-thread-13) {} null: sending msg to clusternode2/web, src=clusternode1/web, headers are RequestCorrelator: id=200, type=REQ, id=19, rsp_expected=false, RSVP: REQ(8), UNICAST3: DATA, seqno=9, conn_id=2, TCP: [channel_name=web]
      2016-04-18 15:51:40,400 TRACE [org.jgroups.protocols.TCP] (transport-thread-13) {} dest=10.1.3.6:7900 (4166 bytes)
      2016-04-18 15:51:40,418 TRACE [org.infinispan.affinity.KeyAffinityServiceImpl] (KeyAffinityService Thread Pool -- 1) {} KeyGeneratorWorker marked as ACTIVE
      2016-04-18 15:51:40,430 TRACE [org.infinispan.affinity.KeyAffinityServiceImpl] (KeyAffinityService Thread Pool -- 1) {} KeyGeneratorWorker marked as INACTIVE
      ... Going into KeyGeneratorWorker marked as ACTIVE/ INACTIVE loop ...
      
      
      

       

      The suspected log message is:

      2016-04-18 15:51:40,389 TRACE [org.infinispan.topology.CacheTopology] (transport-thread-15) {} Current consistent hash's routing table: 0: 1 0, 1: 1 0, 2: 1 0, 3: 1 0, 4: 1 0, 5: 1 0, 6: 1 0, 7: 1 0, 8: 1 0, 9: 1 0, 10: 1 0, 11: 1 0, 12: 1 0, 13: 1 0, 14: 1 0, 15: 1 0, 16: 1 0, 17: 1 0, 18: 1 0, 19: 1 0, 20: 1 0, 21: 1 0, 22: 1 0, 23: 1 0, 24: 1 0, 25: 1 0, 26: 1 0, 27: 1 0, 28: 1 0, 29: 1 0, 30: 1 0, 31: 1 0, 32: 1 0, 33: 1 0, 34: 1 0, 35: 1 0, 36: 1 0, 37: 1 0, 38: 1 0, 39: 1 0, 40: 1 0, 41: 1 0, 42: 1 0, 43: 1 0, 44: 1 0, 45: 1 0, 46: 1 0, 47: 1 0, 48: 1 0, 49: 1 0, 50: 1 0, 51: 1 0, 52: 1 0, 53: 1 0, 54: 1 0, 55: 1 0, 56: 1 0, 57: 1 0, 58: 1 0, 59: 1 0, 60: 1 0, 61: 1 0, 62: 1 0, 63: 1 0, 64: 1 0, 65: 1 0, 66: 1 0, 67: 1 0, 68: 1 0, 69: 1 0, 70: 1 0, 71: 1 0, 72: 1 0, 73: 1 0, 74: 1 0, 75: 1 0, 76: 1 0, 77: 1 0, 78: 1 0, 79: 1 0

      It going into infinite loop because it cannot generate session key which is belongs to the current node (because the consistent hash's routing table pointed that all session key is belongs to another node...).

      And noted that before and after this issue happened, those two nodes are ping-able with each other.

       

      Is this a known issue in Wildfly 8 or any workaround we can applied? Attached the (filtered) server log for checking...