8 Replies Latest reply on Apr 18, 2016 5:42 AM by Eli Z

    Show stopper: Infinispan Hot Rod server gets stuck / deadlocks under high load with a registered client listener in the Hot Rod client - infinispan-server-8.2.1.Final

    Eli Z Newbie

      Hi,

       

      In our application we are using the latest infinispan-server-8.2.1.Final, configured in domain mode with two nodes (one server on each node) and a replicated cache.

       

      Our client application uses a Hot Rod client configured in cluster mode, with a client listener registered to receive cache events from the remote cache on the Infinispan server.

       

      What we experience is a show stopper for us:

      1. Under high load, ~5000-6000 TPS ==> 5000-6000 PUTs per second

      2. For a few minutes everything works fine: all operations succeed and we receive the client events in our application as expected.

      3. After a few minutes the Infinispan cluster becomes unresponsive: all our Hot Rod client connections and threads are stuck for a very long time.

      4. On our server we see the number of connections in CLOSE_WAIT increasing over time (probably due to the client socket timeout).

      5. All stuck connections to the Hot Rod server have bytes pending in the Recv-Q (seen with netstat).
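      For reference, this is roughly how we observe the symptoms above (a sketch; 11222 is the Hot Rod port from our configuration, and Recv-Q is the second netstat column):

      ```shell
      # Count connections on the Hot Rod port (11222) stuck in CLOSE_WAIT
      netstat -ant | awk '$4 ~ /:11222$/ && $6 == "CLOSE_WAIT"' | wc -l

      # List connections with unread bytes pending in the receive queue (Recv-Q > 0)
      netstat -ant | awk '$4 ~ /:11222$/ && $2 > 0'
      ```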


      Our analysis:

      1. We took jstack thread dumps of the Infinispan server (see attached)

      2. We see that all Hot Rod server worker threads are WAITING at:

       

      "HotRodServerWorker-6-500" #914 prio=5 os_prio=0 tid=0x00007ff8643ad000 nid=0x10f1 waiting on condition [0x00007ff8238b8000]
         java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x0000000783417290> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:350)
        at org.infinispan.server.hotrod.ClientListenerRegistry$BaseClientEventSender.sendEvent(ClientListenerRegistry.scala:271)
        at org.infinispan.server.hotrod.ClientListenerRegistry$BaseClientEventSender.onCacheEvent(ClientListenerRegistry.scala:237)
        at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)

       

      3. We see the JGroups Incoming thread is in the WAITING state as well, at the same code:

       

      "Incoming-107,master:server-one" #1204 prio=5 os_prio=0 tid=0x00007ff8d805a000 nid=0x37c5 waiting on condition [0x00007ff82b130000]
         java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x0000000783417290> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:350)
        at org.infinispan.server.hotrod.ClientListenerRegistry$BaseClientEventSender.sendEvent(ClientListenerRegistry.scala:271)
        at org.infinispan.server.hotrod.ClientListenerRegistry$BaseClientEventSender.onCacheEvent(ClientListenerRegistry.scala:237)
        at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)

       

      4. It seems the client-listener event-sender queue is full, so all threads block on the put operation until free space becomes available in the queue.

      5. The problem is that the queue is not being drained: no operation/thread is polling from it.

      6. Maybe due to problems in the connections? A closed channel?

      7. This does not seem to happen when no client listener is registered in the client.
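      The parked put() in the dumps matches the behavior of a bounded LinkedBlockingQueue with no consumer. A minimal plain-JDK sketch reproducing the same WAITING (parking) state (the capacity of 2 is arbitrary, for illustration only):

      ```java
      import java.util.concurrent.LinkedBlockingQueue;

      public class BlockedSenderDemo {
          public static void main(String[] args) throws InterruptedException {
              // Small bounded queue standing in for the per-listener event queue.
              LinkedBlockingQueue<String> events = new LinkedBlockingQueue<>(2);
              events.put("event-1");
              events.put("event-2"); // queue is now full

              // With nobody polling, the next put() parks the calling thread --
              // the WAITING (parking) state seen in the server thread dumps.
              Thread producer = new Thread(() -> {
                  try {
                      events.put("event-3"); // blocks until space frees up
                  } catch (InterruptedException e) {
                      Thread.currentThread().interrupt();
                  }
              });
              producer.start();

              // Wait until the producer has actually parked inside put().
              while (producer.getState() != Thread.State.WAITING) {
                  Thread.sleep(10);
              }
              System.out.println("producer state: " + producer.getState());

              // As soon as a consumer drains one event, the producer unblocks.
              events.take();
              producer.join(5000);
              System.out.println("producer finished: " + !producer.isAlive());
          }
      }
      ```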

       

       

      Please advise on this issue ASAP if possible. We are stuck in our development process.

       

      What capacity can the server withstand when sending client-listener cache events?

       

      Our setup:

      1. VM

      2. Linux SitePDP1 2.6.32-431.el6.x86_64 #1 SMP Sun Nov 10 22:19:54 EST 2013 x86_64 x86_64 x86_64 GNU/Linux

      3. java version "1.8.0_74"

      4. 4  cores

      processor       : 0

      vendor_id       : GenuineIntel

      cpu family      : 6

      model           : 44

      model name      : Intel(R) Xeon(R) CPU           E5645  @ 2.40GHz

      stepping        : 2

      cpu MHz         : 2400.085

      cache size      : 12288 KB

      physical id     : 0

      siblings        : 4

      core id         : 0

      cpu cores       : 4

      apicid          : 0

      initial apicid  : 0

      fpu             : yes

      fpu_exception   : yes

      cpuid level     : 11

      wp              : yes

      flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts xtopology tsc_reliable nonstop_tsc aperfmperf unfair_spinlock pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 popcnt aes hypervisor lahf_lm ida arat epb dts

      bogomips        : 4800.17

      clflush size    : 64

      cache_alignment : 64

      address sizes   : 40 bits physical, 48 bits virtual

      power management:

       

       

      Attached:

      1. infinispan domain configuration files

      2. servers jstack thread dumps (server-one which got stuck & server-three)

       

      Our hot rod client configuration code + listener:

       

      private void initHotRodClient() {
        org.infinispan.client.hotrod.configuration.ConfigurationBuilder cb =
            new org.infinispan.client.hotrod.configuration.ConfigurationBuilder();
        cb.tcpNoDelay(true)
          .tcpKeepAlive(true)
          .connectionTimeout(0)
          .socketTimeout(300000)
          .connectionPool()
            .maxTotal(configuration.getMaxTotal())
            .maxIdle(configuration.getMaxIdle())
            .maxWait(configuration.getMaxWaitMillis())
            //.numTestsPerEvictionRun(3)
            .testOnBorrow(false)
            .testOnReturn(false)
            .testWhileIdle(false)
            //.minEvictableIdleTime(300000L)
            //.timeBetweenEvictionRuns(5000)
            //.minIdle(1)
          //.addServer().host("127.0.0.1").port(11222)
          .addCluster("clustered")
            .addClusterNode("nodeA", 11222)
            .addClusterNode("nodeB", 11222);

        RemoteCacheManager rcm = new RemoteCacheManager(cb.build());
        rcm.start();
        RedisCacheClientManagementServiceHolder.setRcm(rcm);
        RedisCacheClientManagementServiceHolder.setRemoteCache(rcm.getCache());
        RedisCacheClientManagementServiceHolder.getRemoteCache().addClientListener(new RemoteCacheEventListener());
      }

       

      @ClientListener
      public class RemoteCacheEventListener {

         private static final Log LOG = LogFactory.getLog(RemoteCacheEventListener.class);

         @ClientCacheEntryCreated
         public void handleCacheEntryCreatedRemoteEvent(ClientCacheEntryCreatedEvent<String> event) {
            //LOG.info("New entry " + event.getKey() + " created in the cache. Entry value:" + event.getValue());
            LOG.info("New entry " + event.getKey() + " created in the cache.");
         }

         @ClientCacheEntryModified
         public void handleCacheEntryModifiedRemoteEvent(ClientCacheEntryModifiedEvent<String> event) {
            LOG.info("Modified entry " + event.getKey() + " in the cache.");
         }

         @ClientCacheEntryRemoved
         public void handleCacheEntryRemovedRemoteEvent(ClientCacheEntryRemovedEvent<String> event) {
            LOG.info("Removed entry " + event.getKey() + " from the cache.");
         }

         @ClientCacheEntryExpired
         public void handleCacheEntryExpiredRemoteEvent(ClientCacheEntryExpiredEvent<String> event) {
            LOG.info("Expired entry " + event.getKey() + " in the cache.");
         }
      }
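      One mitigation we are considering on the client side (a sketch only, using plain java.util.concurrent rather than any Infinispan API, and not yet verified to resolve the server-side stall): hand each event off to a worker executor so the listener callback returns immediately and the client keeps draining the socket, instead of doing logging/processing inline:

      ```java
      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;
      import java.util.concurrent.TimeUnit;
      import java.util.concurrent.atomic.AtomicInteger;

      public class OffloadDemo {
          // Single worker that does the real processing off the event thread.
          private static final ExecutorService worker = Executors.newSingleThreadExecutor();
          static final AtomicInteger handled = new AtomicInteger();

          // Stand-in for a listener callback: capture what you need from the
          // event, submit the work, and return immediately.
          static void onEvent(String key) {
              worker.execute(() -> handled.incrementAndGet());
          }

          public static void main(String[] args) throws Exception {
              for (int i = 0; i < 100; i++) {
                  onEvent("key-" + i);
              }
              worker.shutdown();
              worker.awaitTermination(5, TimeUnit.SECONDS);
              System.out.println("handled=" + handled.get());
          }
      }
      ```

      Note that with an unbounded executor queue this only moves the backlog into the client process; under sustained 5000-6000 TPS a bounded queue with an explicit overflow policy would likely be safer.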