6 Replies Latest reply on Dec 22, 2015 1:30 PM by mmr11408

    Two issues: caches are not propagated properly at startup, and clusters are combined even though they use different UDP addresses

    mmr11408

       

      Sorry for the long post. Hopefully it provides all the details needed to diagnose the problems.

       

      The first issue was encountered after the upgrade from Infinispan 6.0 to 8.0.1; the second was encountered when the applications were deployed to prod. No code changes were required for the upgrade, but the configuration files had to be changed extensively, so I suspect both issues are related to that change.

       

      Background: I have two Java applications that run in Tomcat 8 on Java 8. One, the cache manager, loads the caches from records in the DB and occasionally updates the caches when DB records change. The other, the app server, uses the caches in read-only mode to respond to client queries. In each environment (dev, QA, PP and prod), two instances of each application are started on two separate hosts: cache manager 1 and app server 1 on host 1, cache manager 2 and app server 2 on host 2.

       

      The caches are replicated, synchronous, clustered caches using JGroups UDP. The cache managers hold all the data in memory, with no eviction or expiration. The app servers hold 20,000 records in memory per cache and use a single-file cache store to hold the overflow.

       

      ========

      Issue one: 

      ========

      A cache manager is started first and loads the caches from the DB. When a second cache manager is started, its caches are preloaded from the first one, so it just sits idle until it detects that the other cache manager has terminated, at which point it takes over responsibility for keeping the caches in sync with the DB. When the app servers are started, the number of entries in their caches does not match: one app server has fewer records in its caches than the other. This issue is observed in all environments (dev, QA, PP and prod). When the code and configuration are switched back to 6.0, this behavior is not observed. There is nothing in the logs to indicate an error.
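      For context, the takeover rule described above is roughly the following (a minimal generic sketch with hypothetical names; the real code reacts to cluster membership changes it receives from Infinispan/JGroups, not to a plain list):

```java
import java.util.List;

// Minimal sketch of the "stand by until the other manager leaves" rule.
// becomesUpdater() is hypothetical; the real application presumably checks
// the cluster view delivered by Infinispan/JGroups on membership changes.
public class TakeoverRule {

    /** A node takes over DB-sync duty when it is the oldest surviving member. */
    public static boolean becomesUpdater(String self, List<String> view) {
        // JGroups views list members oldest-first, so index 0 is the coordinator.
        return !view.isEmpty() && view.get(0).equals(self);
    }

    public static void main(String[] args) {
        // While both managers are up, only the first member updates the caches.
        System.out.println(becomesUpdater("manager-2", List.of("manager-1", "manager-2"))); // false
        // After manager-1 terminates, the new view makes manager-2 the updater.
        System.out.println(becomesUpdater("manager-2", List.of("manager-2"))); // true
    }
}
```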

       

      Any idea what the cause is? If not, what logging can be turned on to diagnose this?
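      For reference, the categories I assume are relevant for state transfer (happy to be corrected) would be enabled with a fragment along these lines, assuming a log4j2-style backend, which may not match everyone's setup:

```
<!-- Hypothetical log4j2 fragment; adjust to whatever backend Tomcat uses. -->
<Loggers>
    <!-- State transfer and cache store activity -->
    <Logger name="org.infinispan.statetransfer" level="TRACE"/>
    <Logger name="org.infinispan.persistence" level="TRACE"/>
    <!-- Cluster membership and message transport -->
    <Logger name="org.jgroups" level="DEBUG"/>
</Loggers>
```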

       

       

       

      Below is a snippet of a cache manager config file:

       

        <jgroups transport="org.infinispan.remoting.transport.jgroups.JGroupsTransport">

            <stack-file name="udp" path="jgroups.xml"/>

         </jgroups>

       

         <cache-container default-cache="default"> 

           <replicated-cache name="MailboxCache" mode="SYNC" remote-timeout="20000" statistics="true">

                <eviction strategy="NONE" max-entries="-1" />

                <expiration max-idle="-1" lifespan="-1" />

                <store-as-binary keys="false" values="false" />

                <persistence passivation="false" />

                <state-transfer enabled="true" timeout="240000" chunk-size="18181" await-initial-transfer="true"/>

           </replicated-cache>

       

      Below is a snippet of an app server config file:

       

         <jgroups transport="org.infinispan.remoting.transport.jgroups.JGroupsTransport">

            <stack-file name="udp" path="jgroups.xml"/>

         </jgroups>

       

       

         <cache-container default-cache="default"> 

           <replicated-cache name="MailboxCache" mode="SYNC" remote-timeout="90000" statistics="true" >

              <locking isolation="READ_COMMITTED" />

              <eviction strategy="LIRS" max-entries="20000" />

              <expiration max-idle="-1" lifespan="-1" />

              <store-as-binary keys="false" values="false" />

              <persistence passivation="false" >

                  <file-store path="Infinispan-SingleFileCacheStore" />

              </persistence>

              <versioning scheme="NONE" /> 

              <indexing index="NONE" />

              <state-transfer enabled="true" timeout="940000" chunk-size="18181" await-initial-transfer="true" />

         </replicated-cache>

       

      The JGroups configuration is the same for both apps and is pasted here in full:

       

      <config xmlns="urn:org:jgroups"
              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
              xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.6.xsd">

         <UDP mcast_addr="${jgroups.udp.mcast_addr:228.6.7.8}"
              mcast_port="${jgroups.udp.mcast_port:46655}"
              ucast_send_buf_size="1m"
              mcast_send_buf_size="1m"
              ucast_recv_buf_size="20m"
              mcast_recv_buf_size="25m"
              ip_ttl="${jgroups.ip_ttl:2}"
              thread_naming_pattern="pl"
              enable_diagnostics="false"

              thread_pool.min_threads="${jgroups.thread_pool.min_threads:2}"
              thread_pool.max_threads="${jgroups.thread_pool.max_threads:30}"
              thread_pool.keep_alive_time="60000"
              thread_pool.queue_enabled="false"

              internal_thread_pool.min_threads="${jgroups.internal_thread_pool.min_threads:5}"
              internal_thread_pool.max_threads="${jgroups.internal_thread_pool.max_threads:20}"
              internal_thread_pool.keep_alive_time="60000"
              internal_thread_pool.queue_enabled="true"
              internal_thread_pool.queue_max_size="500"

              oob_thread_pool.min_threads="${jgroups.oob_thread_pool.min_threads:20}"
              oob_thread_pool.max_threads="${jgroups.oob_thread_pool.max_threads:200}"
              oob_thread_pool.keep_alive_time="60000"
              oob_thread_pool.queue_enabled="false" />

         <PING />
         <MERGE3 min_interval="10000"
                 max_interval="30000" />

         <FD_SOCK />
         <FD_ALL timeout="60000"
                 interval="15000"
                 timeout_check_interval="5000" />

         <VERIFY_SUSPECT timeout="5000" />

         <pbcast.NAKACK2 xmit_interval="1000"
                         xmit_table_num_rows="50"
                         xmit_table_msgs_per_row="1024"
                         xmit_table_max_compaction_time="30000"
                         max_msg_batch_size="100"
                         resend_last_seqno="true" />

         <UNICAST3 xmit_interval="500"
                   xmit_table_num_rows="50"
                   xmit_table_msgs_per_row="1024"
                   xmit_table_max_compaction_time="30000"
                   max_msg_batch_size="100"
                   conn_expiry_timeout="0" />

         <pbcast.STABLE stability_delay="500"
                        desired_avg_gossip="5000"
                        max_bytes="1M" />

         <pbcast.GMS print_local_addr="false"
                     join_timeout="15000" />

         <UFC max_credits="2m"
              min_threshold="0.40" />

         <MFC max_credits="2m"
              min_threshold="0.40" />

         <FRAG2 />
      </config>

       

      ========

      Issue two: 

      ========

       

      In production, even though the mcast_addr in jgroups.xml is different between PP and prod, the cluster members are combined into one cluster and share the caches. In other words, when the first cache manager in prod is started, all of its caches get loaded from the PP legs, so it assumes that another cache manager is already up and running, whereas it should see that it is the first cache manager and take responsibility for loading the caches.

       

      PP:

      <UDP mcast_addr="${jgroups.udp.mcast_addr:226.125.1.1}"

      mcast_port="${jgroups.udp.mcast_addr:46655}"

       

      Prod:

      <UDP mcast_addr="${jgroups.udp.mcast_addr:227.125.1.1}"

      mcast_port="${jgroups.udp.mcast_addr:46655}"

       

      Any idea why the clusters are combined?
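      For reference, my understanding of how the ${name:default} placeholders in jgroups.xml resolve (a minimal stdlib sketch of the semantics, not JGroups' actual parser): the default after the colon is used only when the named system property is unset, so a property set at JVM level would override the per-file defaults on both legs:

```java
import java.util.Map;

// Minimal sketch of ${name:default} resolution as I understand it works;
// an illustration of the semantics, not JGroups' actual implementation.
public class PlaceholderDemo {

    /** Resolve "${name:default}" against a map standing in for system properties. */
    public static String resolve(String placeholder, Map<String, String> sysProps) {
        String inner = placeholder.substring(2, placeholder.length() - 1); // strip ${ and }
        int colon = inner.indexOf(':');
        String name = inner.substring(0, colon);
        String def = inner.substring(colon + 1);
        return sysProps.getOrDefault(name, def);
    }

    public static void main(String[] args) {
        String pp   = "${jgroups.udp.mcast_addr:226.125.1.1}";
        String prod = "${jgroups.udp.mcast_addr:227.125.1.1}";

        // No system property set: the file defaults differ, so the clusters differ.
        System.out.println(resolve(pp, Map.of()));   // 226.125.1.1
        System.out.println(resolve(prod, Map.of())); // 227.125.1.1

        // Property set on both JVMs: both legs resolve to the same address.
        Map<String, String> props = Map.of("jgroups.udp.mcast_addr", "228.0.0.1");
        System.out.println(resolve(pp, props));   // 228.0.0.1
        System.out.println(resolve(prod, props)); // 228.0.0.1
    }
}
```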

       

       

       

       

      Message was edited by: Mehdi Rakhshani