9 Replies Latest reply on Oct 12, 2009 10:46 AM by manik

Clustered Cache Synchronization Issue

kringdahl Sep 28, 2009 11:33 AM

We are using POJO Cache 3.0.0.GA with core cache 3.1.0.GA and are seeing an issue where cache nodes are out of sync with each other. We've been using JBoss Cache for quite some time, and this issue appears to be fairly new. It has been happening more often recently, even though we have not changed the cache versions mentioned above for a while. One interesting aspect is that only certain keys in the cache, not the entire cache, become out of sync.

We have implemented a fairly ugly workaround: we evict nodes from the cache frequently, since the cache loader maintains the accurate state of the nodes. This has eased the problem, because the caches come back into sync when the keys are reloaded. We can reproduce the problem on a somewhat regular basis, but we do not have a specific set of steps to make it happen.

I have pasted my cache config below. If someone could give it a once-over to make sure there aren't any misconfigurations, that would be much appreciated. Additionally, is there anything we can do while the clustered cache nodes are out of sync to help get to the bottom of the issue? Thanks in advance!

<?xml version="1.0" encoding="UTF-8"?>
<jbosscache xmlns="urn:jboss:jbosscache-core:config:3.0">

 <!--
 isolation levels supported: READ_COMMITTED and REPEATABLE_READ
 nodeLockingSchemes: mvcc, pessimistic (deprecated), optimistic (deprecated)
 -->
 <locking isolationLevel="READ_COMMITTED" lockAcquisitionTimeout="15000" nodeLockingScheme="mvcc"/>

 <!--
 Used to register a transaction manager and participate in ongoing transactions.
 -->
 <transaction transactionManagerLookupClass="org.jboss.cache.transaction.GenericTransactionManagerLookup"/>

 <!--
 serialization related configuration, used for replication and cache loading
 -->
 <serialization useRegionBasedMarshalling="false"/>

 <!--
 This element specifies that the cache is clustered.
 modes supported: replication (r) or invalidation (i).
 -->
 <clustering mode="invalidation" clusterName="dtFabricCluster">

 <!--
 Network calls are synchronous.
 -->
 <sync replTimeout="120000"/>

 <!--
 Uncomment this for async replication.
 -->
 <!--<async useReplQueue="true" replQueueInterval="10000" replQueueMaxElements="500" serializationExecutorPoolSize="20" serializationExecutorQueueSize="5000000"/>-->

 <buddy enabled="false" poolName="myBuddyPoolReplicationGroup" communicationTimeout="2000">
 <dataGravitation auto="false" removeOnFind="true" searchBackupTrees="true"/>
 <locator class="org.jboss.cache.buddyreplication.NextMemberBuddyLocator">
 <properties>
 numBuddies = 1
 ignoreColocatedBuddies = true
 </properties>
 </locator>
 </buddy>

 <!--
 Defines whether to retrieve state on startup
 -->
 <stateRetrieval fetchInMemoryState="false" timeout="15000"/>

 <!--
 Configures the JGroups channel. Looks up a JGroups config file on the classpath or filesystem. udp.xml
 ships with jgroups.jar and will be picked up by the class loader.
 -->
 <jgroupsConfig>
 <UDP discard_incompatible_packets="true" enable_bundling="false"
 enable_diagnostics="true" ip_ttl="2"
 loopback="false" max_bundle_size="64000" max_bundle_timeout="30"
 mcast_addr="$$MCAST_ADDRESS$$" mcast_port="$$MCAST_PORT$$"
 mcast_recv_buf_size="25000000" mcast_send_buf_size="640000"
 oob_thread_pool.enabled="true" oob_thread_pool.keep_alive_time="10000"
 oob_thread_pool.max_threads="4" oob_thread_pool.min_threads="1"
 oob_thread_pool.queue_enabled="true" oob_thread_pool.queue_max_size="10"
 oob_thread_pool.rejection_policy="Run" thread_naming_pattern="pl"
 thread_pool.enabled="true" thread_pool.keep_alive_time="30000"
 thread_pool.max_threads="25" thread_pool.min_threads="1"
 thread_pool.queue_enabled="true" thread_pool.queue_max_size="10"
 thread_pool.rejection_policy="Run" tos="8" ucast_recv_buf_size="20000000"
 ucast_send_buf_size="640000" use_concurrent_stack="true"
 use_incoming_packet_handler="true"/>
 <PING num_initial_members="3" timeout="2000"/>
 <MERGE2 max_interval="30000" min_interval="10000"/>
 <FD_SOCK/>
 <FD max_tries="5" shun="true" timeout="10000"/>
 <VERIFY_SUSPECT timeout="1500"/>
 <pbcast.NAKACK discard_delivered_msgs="true" gc_lag="0"
 retransmit_timeout="300,600,1200,2400,4800" use_mcast_xmit="false"/>
 <UNICAST timeout="300,600,1200,2400,3600"/>
 <pbcast.STABLE desired_avg_gossip="50000" max_bytes="400000" stability_delay="1000"/>
 <AUTH auth_class="org.jgroups.auth.MD5Token" auth_value="desktone" token_hash="MD5"/>
 <pbcast.GMS join_timeout="5000" print_local_addr="true" shun="false"
 view_ack_collection_timeout="5000" view_bundling="true"/>
 <FRAG2 frag_size="60000"/>
 <pbcast.FLUSH timeout="0"/>
 </jgroupsConfig>

 </clustering>

 <eviction wakeUpInterval="5000">
 <!--
 Cache wide defaults
 default algorithmClass: if an algorithm class is not specified for a region, this one is used by default.
 default eventQueueSize if an event queue size is not specified for a region, this one is used by default.
 -->
 <default algorithmClass="org.jboss.cache.eviction.LRUAlgorithm" eventQueueSize="200000">
 <property name="maxAge" value="300000" />
 </default>

 <!-- Evict element objects every minute to account for element heart beats. -->
 <region name="/dht/element/inventory">
 <property name="maxAge" value="60000" />
 </region>

 <!-- Evict pool manager active node object every minute. -->
 <region name="/dht/fabric/poolmanager/activenode">
 <property name="maxAge" value="60000" />
 </region>

 </eviction>

 <!--
 Cache loaders.

 If passivation is enabled, state is offloaded to the cache loaders ONLY when evicted. Similarly, when the state
 is accessed again, it is removed from the cache loader and loaded into memory.

 Otherwise, state is always maintained in the cache loader as well as in memory.

 Set 'shared' to true if all instances in the cluster use the same cache loader instance, e.g., are talking to the
 same database.
 -->
 <loaders passivation="false" shared="true">
 <loader class="org.jboss.cache.loader.JDBCCacheLoader" async="false" fetchPersistentState="false" ignoreModifications="false" purgeOnStartup="false">
 <properties>
 cache.jdbc.table.name=dht
 cache.jdbc.table.primarykey=dht_pk
 cache.jdbc.table.create=false
 cache.jdbc.table.drop=false
 cache.jdbc.fqn.column=fqn
 cache.jdbc.fqn.type=varchar(255)
 cache.jdbc.node.column=value
 cache.jdbc.node.type=BYTEA
 cache.jdbc.parent.column=parent_fqn
 cache.jdbc.datasource=java:/jdbc/FabricDS
 cache.jdbc.sql-concat=1||2
 </properties>
 </loader>
 </loaders>

</jbosscache>

Note that MCAST_ADDRESS and MCAST_PORT are set dynamically at cache startup time and are consistent across the cache cluster. We store these values in the DB to ensure that the cluster is always consistent. Also note that we have tried both the "replication" and "invalidation" clustering modes, and the issue is reproducible in both.
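
On the diagnostic question above: one low-risk option is to raise logging for the cache and for JGroups on every node while the keys are out of sync, then compare what each node logged for an affected Fqn. The fragment below is a minimal sketch, assuming a log4j 1.2 XML configuration (for example conf/jboss-log4j.xml on JBoss AS 4.2.x) and appenders whose thresholds let these levels through. The category names are simply the top-level packages of the two libraries.

```xml
<!-- Assumption: added inside the existing <log4j:configuration> element. -->
<!-- Cache-level activity, at whatever detail the library logs at TRACE. -->
<category name="org.jboss.cache">
  <priority value="TRACE"/>
</category>
<!-- Transport-level activity such as membership view changes. -->
<category name="org.jgroups">
  <priority value="DEBUG"/>
</category>
```

TRACE output is very verbose, so this is best enabled only for the window in which the problem is being reproduced.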

1. Re: Clustered Cache Synchronization Issue

kringdahl Oct 1, 2009 5:13 PM (in response to kringdahl)

FWIW, we seem to have gotten to the bottom of this. We traced the issue back to our upgrade from core cache 3.0.3.GA to 3.1.0.GA. As soon as we downgraded, the problem went away; with 3.1.0 it was easily reproducible. While there are some bug fixes and improvements in 3.1.0 that we could use (we are not using NBST), a functional clustered cache takes priority over all of that. It's certainly possible that the cause is something about the combination of JBC 3.1.0 with the rest of our stack (JBoss AS 4.2.3, JBoss AOP 2.0 CR1).
2. Re: Clustered Cache Synchronization Issue

manik Oct 2, 2009 5:38 AM (in response to kringdahl)

Have you tried the recently-released 3.2.1.GA?
3. Re: Clustered Cache Synchronization Issue

kringdahl Oct 2, 2009 7:35 AM (in response to kringdahl)

We haven't tried it. If rolling back to 3.0.3 had not worked, we were going to try rolling forward. But we're at a point in our release where we need to mitigate risk, and taking a new release would introduce risk.

In the grand scheme of things, we're re-evaluating our caching solution for obvious reasons. But if we do try 3.2.1, I will post the findings back here. Looking towards the future, Infinispan is an option, but not a desirable one right now, since it will be a while before its POJO cache is ready for prime time. Is there an ETA for the availability of an Infinispan POJO cache?
4. Re: Clustered Cache Synchronization Issue

manik Oct 2, 2009 7:39 AM (in response to kringdahl)

It depends on what you specifically need from POJO Cache. If it is fine-grained replication, a solution for that should be in place by early to mid next year (essentially following a JPA-like approach). If you'd like to help develop this, that would help accelerate it to an earlier release. :)
5. Re: Clustered Cache Synchronization Issue

kringdahl Oct 5, 2009 5:35 PM (in response to kringdahl)

Just a bit more info on the original problem here. We have seen a recurrence of the issue where the cache nodes appear to be out of sync with each other. However, this is looking more and more like a cache loader problem rather than a synchronization issue between nodes in the cluster. I have definitively seen cases where I fetch the value of a node from the cache and what is returned does not match what actually exists in the cache loader. When we turn cache preload off, this is quite reproducible. We have yet to see the problem with preload turned on. Are there any known bugs where the cache loader is not triggered on a cache miss for a particular key?
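
For reference, preloading in the 3.x configuration format is declared inside the loaders element. The fragment below is a minimal sketch rather than the exact config in use: the config posted above contains no preload element, so this assumes the `<preload>` child of `<loaders>` from the 3.x configuration schema, and the "/" Fqn (preload everything) is only illustrative.

```xml
<loaders passivation="false" shared="true">
  <!-- Assumption: preload everything under the root; narrow the Fqn as needed. -->
  <preload>
    <node fqn="/"/>
  </preload>
  <!-- The existing <loader class="org.jboss.cache.loader.JDBCCacheLoader" ...> element stays here unchanged. -->
</loaders>
```

Without the preload element, nodes are only fetched from the loader lazily on a miss, which is the path that appears to misbehave here.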
6. Re: Clustered Cache Synchronization Issue

manik Oct 12, 2009 8:40 AM (in response to kringdahl)

Not that I know of. Is this a shared cache loader?
7. Re: Clustered Cache Synchronization Issue

manik Oct 12, 2009 8:40 AM (in response to kringdahl)

And is the cache loader being run in async mode?
8. Re: Clustered Cache Synchronization Issue

kringdahl Oct 12, 2009 8:53 AM (in response to kringdahl)

It is a shared cache loader in sync mode.
9. Re: Clustered Cache Synchronization Issue

manik Oct 12, 2009 10:46 AM (in response to kringdahl)

I know this may not be of much help to you, but could you try with a non-shared cache loader? Just to help pinpoint where the problem may lie?
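
A minimal sketch of that experiment against the config in the original post: only the shared attribute on the loaders element changes, and everything else stays as posted. This is meant as a short diagnostic run, under the assumption that the nodes may still point at the same database during the test.

```xml
<!-- Diagnostic only: shared="true" in the original config, flipped here for the test. -->
<loaders passivation="false" shared="false">
  <!-- The existing <loader class="org.jboss.cache.loader.JDBCCacheLoader" ...> element stays here unchanged. -->
</loaders>
```

With shared set to false, each cache instance treats the store as its own instead of relying on the originating node to write, so a change in behaviour here would point at the shared-loader code path.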