
    Scaling Infinispan Server 8.2.2 with Remote Event Listeners

    whethsmith

      Greetings community,

       

      We currently have a 5-node Infinispan server cluster running in production, and it is able to handle up to 2,000 requests per second.  Our goal is to scale the cache linearly, up to 10x or more.  In our stress tests, our app servers start getting SocketTimeoutExceptions from Infinispan after 30 minutes under the planned future load.
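
      For context, since the SocketTimeoutExceptions show up on the Hot Rod client side, here is roughly how our client is built (a simplified sketch; the host name, port, and timeout values are placeholders, not our exact settings):

          import org.infinispan.client.hotrod.RemoteCacheManager;
          import org.infinispan.client.hotrod.configuration.ConfigurationBuilder;

          public class ClientSetup {
              public static RemoteCacheManager build() {
                  ConfigurationBuilder builder = new ConfigurationBuilder();
                  builder.addServer().host("ispn-node1").port(11222);  // placeholder server
                  builder.socketTimeout(60000);      // ms; reads beyond this surface as SocketTimeoutException
                  builder.connectionTimeout(60000);  // ms
                  return new RemoteCacheManager(builder.build());
              }
          }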

      Our most active cache runs in distributed mode with 2 owners and 20 segments.  One of the bottlenecks appears to be our pub/sub system built on remote event listeners: whenever a cache entry is modified, our remote event listeners are notified and in turn complete long-polling requests.  The listeners run inside Java servlets on Tomcat; a simplified sketch of the listener follows below.
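
      The listener itself is roughly the following (simplified; GameStateListener and completeLongPolls are placeholder names, and the servlet async plumbing is omitted):

          import org.infinispan.client.hotrod.RemoteCache;
          import org.infinispan.client.hotrod.annotation.ClientCacheEntryModified;
          import org.infinispan.client.hotrod.annotation.ClientListener;
          import org.infinispan.client.hotrod.event.ClientCacheEntryModifiedEvent;

          @ClientListener
          public class GameStateListener {

              // Fired on the client whenever an entry is modified on the server
              @ClientCacheEntryModified
              public void onModified(ClientCacheEntryModifiedEvent<String> event) {
                  completeLongPolls(event.getKey());  // wake up long-poll requests parked on this key
              }

              private void completeLongPolls(String key) {
                  // completes the waiting servlet AsyncContexts (omitted)
              }
          }

          // Registered once against the cache:
          //   RemoteCache<String, byte[]> cache = cacheManager.getCache("gameStateCache");
          //   cache.addClientListener(new GameStateListener());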

       

      About a month ago, we ran into the issue found here:  Show stopper: Infinispan hot rod server gets stuck / dead lock in high load with registered client listener in hot rod client - infinispan-server-8.2.1.Final

      and after patching the Infinispan server with an increased event queue size (now at 1 million), we were able to scale up quite a bit further, just not as far as we'd like.


      In terms of actual errors on the server, we see things like:

      ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (pool-6-thread-1) ISPN000136: Error executing command RemoveExpiredCommand, writing keys [[B0x033e183537626430..[27]]: org.infinispan.util.concurrent.TimeoutException: ISPN000299: Unable to acquire lock after 30 seconds for key [B0x033e183537626430..[27] and requestor CommandUUID{address=XYZ, id=851936}. Lock is held by CommandUUID{address=XYZ, id=851819}

       

      And below is the cache config:

       

            <subsystem xmlns="urn:infinispan:server:core:8.2" default-cache-container="clustered">
                <cache-container name="clustered" default-cache="default" statistics="true">
                    <transport lock-timeout="60000"/>
                    <distributed-cache name="default" mode="SYNC" segments="20" owners="2" remote-timeout="30000" start="EAGER">
                        <locking acquire-timeout="30000" concurrency-level="10000" striping="false"/>
                        <transaction mode="NONE"/>
                        <expiration lifespan="86400000" max-idle="900000" interval="60000"/>
                        <eviction strategy="LIRS" size="1000000"/>
                    </distributed-cache>
                    <distributed-cache name="gameStateCache" mode="SYNC" remote-timeout="30000" start="EAGER">
                        <locking acquire-timeout="30000" concurrency-level="10000" striping="false"/>
                        <expiration lifespan="86400000" max-idle="900000" interval="60000"/>
                        <eviction strategy="LIRS" size="1000000"/>
                    </distributed-cache>
                </cache-container>
            </subsystem>


      Is anyone aware of high-load issues with the remote event listener system, or can anyone suggest an alternative configuration?

       

      Cheers,

      Brian