6 Replies Latest reply on Dec 22, 2015 1:30 PM by mmr11408

    Two issues: caches are not propagated properly at startup, and clusters are combined even though they use different UDP addresses

    mmr11408

       

      Sorry for the long post. Hopefully it provides all the details needed to diagnose the problems.

       

      The first issue was encountered after the upgrade from Infinispan 6.0 to 8.0.1; the second was encountered when the applications were deployed to prod. No code changes were required for the upgrade, but the configuration files had to be changed extensively, so I suspect both issues are related to that change.

       

      Background: I have two Java applications that run in Tomcat 8 on Java 8. One, the cache manager, loads the caches from records in the DB and occasionally updates the caches when DB records change. The other, the app server, uses the caches in read-only mode to respond to client queries. In each environment (dev, QA, PP and prod), two instances of each application are started on two separate hosts: cache manager 1 and app server 1 on host 1, cache manager 2 and app server 2 on host 2.

       

      The caches are replicated, synchronous, clustered caches using JGroups UDP. The cache managers hold all the data in memory, with no eviction or expiration. The app servers hold 20,000 records in memory per cache and use a single-file cache store to hold the overflow.

       

      ========

      Issue one: 

      ========

      A cache manager is started first and loads the caches from the DB. When a second cache manager is started, its caches are preloaded from the first one, so it just sits idle until it detects that the other cache manager has terminated, at which point it takes over responsibility for keeping the caches in sync with the DB. When the app servers are started, the number of entries in their caches does not match: one app server has fewer records in its caches than the other. This issue is observed in all environments (dev, QA, PP and prod). When the code and configuration are switched back to 6.0, this behavior is not observed. There is nothing in the logs to indicate an error.
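      For context, the takeover rule described above is roughly the following (a minimal generic sketch with hypothetical names; the real code reacts to cluster membership changes it receives from Infinispan/JGroups, not to a plain list):

```java
import java.util.List;

// Minimal sketch of the "stand by until the other manager leaves" rule.
// becomesUpdater() is hypothetical; the real application presumably checks
// the cluster view delivered by Infinispan/JGroups on membership changes.
public class TakeoverRule {

    /** A node takes over DB-sync duty when it is the oldest surviving member. */
    public static boolean becomesUpdater(String self, List<String> view) {
        // JGroups views list members oldest-first, so index 0 is the coordinator.
        return !view.isEmpty() && view.get(0).equals(self);
    }

    public static void main(String[] args) {
        // While both managers are up, only the first member updates the caches.
        System.out.println(becomesUpdater("manager-2", List.of("manager-1", "manager-2"))); // false
        // After manager-1 terminates, the new view makes manager-2 the updater.
        System.out.println(becomesUpdater("manager-2", List.of("manager-2"))); // true
    }
}
```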

       

      Any idea what the cause is? If not, what logging can be turned on to diagnose this?
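      For reference, the categories I assume are relevant for state transfer (happy to be corrected) would be enabled with a fragment along these lines, assuming a log4j2-style backend, which may not match everyone's setup:

```
<!-- Hypothetical log4j2 fragment; adjust to whatever backend Tomcat uses. -->
<Loggers>
    <!-- State transfer and cache store activity -->
    <Logger name="org.infinispan.statetransfer" level="TRACE"/>
    <Logger name="org.infinispan.persistence" level="TRACE"/>
    <!-- Cluster membership and message transport -->
    <Logger name="org.jgroups" level="DEBUG"/>
</Loggers>
```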

       

       

       

      Below is a snippet of a cache manager config file:

       

        <jgroups transport="org.infinispan.remoting.transport.jgroups.JGroupsTransport">

            <stack-file name="udp" path="jgroups.xml"/>

         </jgroups>

       

         <cache-container default-cache="default"> 

           <replicated-cache name="MailboxCache" mode="SYNC" remote-timeout="20000" statistics="true">

                <eviction strategy="NONE" max-entries="-1" />

                <expiration max-idle="-1" lifespan="-1" />

                <store-as-binary keys="false" values="false" />

                <persistence passivation="false" />

                <state-transfer enabled="true" timeout="240000" chunk-size="18181" await-initial-transfer="true"/>

           </replicated-cache>

       

      Below is a snippet of an app server config file:

       

         <jgroups transport="org.infinispan.remoting.transport.jgroups.JGroupsTransport">

            <stack-file name="udp" path="jgroups.xml"/>

         </jgroups>

       

       

         <cache-container default-cache="default"> 

           <replicated-cache name="MailboxCache" mode="SYNC" remote-timeout="90000" statistics="true" >

              <locking isolation="READ_COMMITTED" />

              <eviction strategy="LIRS" max-entries="20000" />

              <expiration max-idle="-1" lifespan="-1" />

              <store-as-binary keys="false" values="false" />

              <persistence passivation="false" >

                  <file-store path="Infinispan-SingleFileCacheStore" />

              </persistence>

              <versioning scheme="NONE" /> 

              <indexing index="NONE" />

              <state-transfer enabled="true" timeout="940000" chunk-size="18181" await-initial-transfer="true" />

         </replicated-cache>

       

      The JGroups configuration is the same for both apps and is pasted here in full:

       

      <config xmlns="urn:org:jgroups"
              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
              xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.6.xsd">

         <UDP mcast_addr="${jgroups.udp.mcast_addr:228.6.7.8}"
              mcast_port="${jgroups.udp.mcast_port:46655}"
              ucast_send_buf_size="1m"
              mcast_send_buf_size="1m"
              ucast_recv_buf_size="20m"
              mcast_recv_buf_size="25m"
              ip_ttl="${jgroups.ip_ttl:2}"
              thread_naming_pattern="pl"
              enable_diagnostics="false"

              thread_pool.min_threads="${jgroups.thread_pool.min_threads:2}"
              thread_pool.max_threads="${jgroups.thread_pool.max_threads:30}"
              thread_pool.keep_alive_time="60000"
              thread_pool.queue_enabled="false"

              internal_thread_pool.min_threads="${jgroups.internal_thread_pool.min_threads:5}"
              internal_thread_pool.max_threads="${jgroups.internal_thread_pool.max_threads:20}"
              internal_thread_pool.keep_alive_time="60000"
              internal_thread_pool.queue_enabled="true"
              internal_thread_pool.queue_max_size="500"

              oob_thread_pool.min_threads="${jgroups.oob_thread_pool.min_threads:20}"
              oob_thread_pool.max_threads="${jgroups.oob_thread_pool.max_threads:200}"
              oob_thread_pool.keep_alive_time="60000"
              oob_thread_pool.queue_enabled="false" />

         <PING />
         <MERGE3 min_interval="10000"
                 max_interval="30000" />

         <FD_SOCK />
         <FD_ALL timeout="60000"
                 interval="15000"
                 timeout_check_interval="5000" />

         <VERIFY_SUSPECT timeout="5000" />

         <pbcast.NAKACK2 xmit_interval="1000"
                         xmit_table_num_rows="50"
                         xmit_table_msgs_per_row="1024"
                         xmit_table_max_compaction_time="30000"
                         max_msg_batch_size="100"
                         resend_last_seqno="true" />

         <UNICAST3 xmit_interval="500"
                   xmit_table_num_rows="50"
                   xmit_table_msgs_per_row="1024"
                   xmit_table_max_compaction_time="30000"
                   max_msg_batch_size="100"
                   conn_expiry_timeout="0" />

         <pbcast.STABLE stability_delay="500"
                        desired_avg_gossip="5000"
                        max_bytes="1M" />

         <pbcast.GMS print_local_addr="false"
                     join_timeout="15000" />

         <UFC max_credits="2m"
              min_threshold="0.40" />

         <MFC max_credits="2m"
              min_threshold="0.40" />

         <FRAG2 />
      </config>

       

      ========

      Issue two: 

      ========

       

      In production, even though the mcast_addr in jgroups.xml is different between PP and prod, the cluster members are combined into one cluster and share the caches. In other words, when the first cache manager in prod is started, all of its caches get loaded from the PP legs, so it assumes that another cache manager is already up and running, whereas it should see that it is the first cache manager and take responsibility for loading the caches.

       

      PP:

      <UDP mcast_addr="${jgroups.udp.mcast_addr:226.125.1.1}"

      mcast_port="${jgroups.udp.mcast_addr:46655}"

       

      Prod:

      <UDP mcast_addr="${jgroups.udp.mcast_addr:227.125.1.1}"

      mcast_port="${jgroups.udp.mcast_addr:46655}"

       

      Any idea why the clusters are combined?
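      For reference, my understanding of how the ${name:default} placeholders in jgroups.xml resolve (a minimal stdlib sketch of the semantics, not JGroups' actual parser): the default after the colon is used only when the named system property is unset, so a property set at JVM level would override the per-file defaults on both legs:

```java
import java.util.Map;

// Minimal sketch of ${name:default} resolution as I understand it works;
// an illustration of the semantics, not JGroups' actual implementation.
public class PlaceholderDemo {

    /** Resolve "${name:default}" against a map standing in for system properties. */
    public static String resolve(String placeholder, Map<String, String> sysProps) {
        String inner = placeholder.substring(2, placeholder.length() - 1); // strip ${ and }
        int colon = inner.indexOf(':');
        String name = inner.substring(0, colon);
        String def = inner.substring(colon + 1);
        return sysProps.getOrDefault(name, def);
    }

    public static void main(String[] args) {
        String pp   = "${jgroups.udp.mcast_addr:226.125.1.1}";
        String prod = "${jgroups.udp.mcast_addr:227.125.1.1}";

        // No system property set: the file defaults differ, so the clusters differ.
        System.out.println(resolve(pp, Map.of()));   // 226.125.1.1
        System.out.println(resolve(prod, Map.of())); // 227.125.1.1

        // Property set on both JVMs: both legs resolve to the same address.
        Map<String, String> props = Map.of("jgroups.udp.mcast_addr", "228.0.0.1");
        System.out.println(resolve(pp, props));   // 228.0.0.1
        System.out.println(resolve(prod, props)); // 228.0.0.1
    }
}
```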

       

       

       

       

      Message was edited by: Mehdi Rakhshani