Clustered Cache Synchronization Issue
kringdahl Sep 28, 2009 11:33 AM

We are using POJO Cache 3.0.0.GA with core cache 3.1.0.GA and are seeing an issue where cache nodes get out of sync with each other. We've been using JBoss Cache for quite some time and this issue appears to be fairly new. It has been happening more often recently, even though we have not changed the cache versions mentioned above for a while. One interesting aspect is that only certain keys of the cache, not the entire cache, become out of sync.

We have implemented a fairly ugly workaround: we evict nodes from the cache on a fairly frequent basis. Because the cache loader maintains the accurate state of the nodes, the caches come back into sync when the keys are reloaded, and this has eased the problem. We can reproduce the problem on a somewhat regular basis, but we do not have a specific set of steps to make it happen.

I have pasted my cache config below. If someone could give it a once-over to make sure there aren't any misconfigurations, that would be much appreciated. Additionally, is there anything we can do while the clustered cache nodes are out of sync to help get to the bottom of the issue? Thanks in advance!
<?xml version="1.0" encoding="UTF-8"?>
<jbosscache xmlns="urn:jboss:jbosscache-core:config:3.0">

   <!--
      isolation levels supported: READ_COMMITTED and REPEATABLE_READ
      nodeLockingSchemes: mvcc, pessimistic (deprecated), optimistic (deprecated)
   -->
   <locking isolationLevel="READ_COMMITTED" lockAcquisitionTimeout="15000" nodeLockingScheme="mvcc"/>

   <!-- Used to register a transaction manager and participate in ongoing transactions. -->
   <transaction transactionManagerLookupClass="org.jboss.cache.transaction.GenericTransactionManagerLookup"/>

   <!-- serialization related configuration, used for replication and cache loading -->
   <serialization useRegionBasedMarshalling="false"/>

   <!--
      This element specifies that the cache is clustered.
      modes supported: replication (r) or invalidation (i).
   -->
   <clustering mode="invalidation" clusterName="dtFabricCluster">

      <!-- Network calls are synchronous. -->
      <sync replTimeout="120000"/>

      <!-- Uncomment this for async replication. -->
      <!--<async useReplQueue="true" replQueueInterval="10000" replQueueMaxElements="500" serializationExecutorPoolSize="20" serializationExecutorQueueSize="5000000"/>-->

      <buddy enabled="false" poolName="myBuddyPoolReplicationGroup" communicationTimeout="2000">
         <dataGravitation auto="false" removeOnFind="true" searchBackupTrees="true"/>
         <locator class="org.jboss.cache.buddyreplication.NextMemberBuddyLocator">
            <properties>
               numBuddies = 1
               ignoreColocatedBuddies = true
            </properties>
         </locator>
      </buddy>

      <!-- Defines whether to retrieve state on startup -->
      <stateRetrieval fetchInMemoryState="false" timeout="15000"/>

      <!--
         Configures the JGroups channel. Looks up a JGroups config file on the classpath or filesystem.
         udp.xml ships with jgroups.jar and will be picked up by the class loader.
      -->
      <jgroupsConfig>
         <UDP discard_incompatible_packets="true" enable_bundling="false" enable_diagnostics="true"
              ip_ttl="2" loopback="false" max_bundle_size="64000" max_bundle_timeout="30"
              mcast_addr="$$MCAST_ADDRESS$$" mcast_port="$$MCAST_PORT$$"
              mcast_recv_buf_size="25000000" mcast_send_buf_size="640000"
              oob_thread_pool.enabled="true" oob_thread_pool.keep_alive_time="10000"
              oob_thread_pool.max_threads="4" oob_thread_pool.min_threads="1"
              oob_thread_pool.queue_enabled="true" oob_thread_pool.queue_max_size="10"
              oob_thread_pool.rejection_policy="Run" thread_naming_pattern="pl"
              thread_pool.enabled="true" thread_pool.keep_alive_time="30000"
              thread_pool.max_threads="25" thread_pool.min_threads="1"
              thread_pool.queue_enabled="true" thread_pool.queue_max_size="10"
              thread_pool.rejection_policy="Run" tos="8"
              ucast_recv_buf_size="20000000" ucast_send_buf_size="640000"
              use_concurrent_stack="true" use_incoming_packet_handler="true"/>
         <PING num_initial_members="3" timeout="2000"/>
         <MERGE2 max_interval="30000" min_interval="10000"/>
         <FD_SOCK/>
         <FD max_tries="5" shun="true" timeout="10000"/>
         <VERIFY_SUSPECT timeout="1500"/>
         <pbcast.NAKACK discard_delivered_msgs="true" gc_lag="0" retransmit_timeout="300,600,1200,2400,4800" use_mcast_xmit="false"/>
         <UNICAST timeout="300,600,1200,2400,3600"/>
         <pbcast.STABLE desired_avg_gossip="50000" max_bytes="400000" stability_delay="1000"/>
         <AUTH auth_class="org.jgroups.auth.MD5Token" auth_value="desktone" token_hash="MD5"/>
         <pbcast.GMS join_timeout="5000" print_local_addr="true" shun="false" view_ack_collection_timeout="5000" view_bundling="true"/>
         <FRAG2 frag_size="60000"/>
         <pbcast.FLUSH timeout="0"/>
      </jgroupsConfig>
   </clustering>

   <eviction wakeUpInterval="5000">
      <!--
         Cache wide defaults
         default algorithmClass: if an algorithm class is not specified for a region, this one is used by default.
         default eventQueueSize if an event queue size is not specified for a region, this one is used by default.
      -->
      <default algorithmClass="org.jboss.cache.eviction.LRUAlgorithm" eventQueueSize="200000">
         <property name="maxAge" value="300000" />
      </default>

      <!-- Evict element objects every minute to account for element heart beats. -->
      <region name="/dht/element/inventory">
         <property name="maxAge" value="60000" />
      </region>

      <!-- Evict pool manager active node object every minute. -->
      <region name="/dht/fabric/poolmanager/activenode">
         <property name="maxAge" value="60000" />
      </region>
   </eviction>

   <!--
      Cache loaders.
      If passivation is enabled, state is offloaded to the cache loaders ONLY when evicted. Similarly, when the state
      is accessed again, it is removed from the cache loader and loaded into memory.
      Otherwise, state is always maintained in the cache loader as well as in memory.
      Set 'shared' to true if all instances in the cluster use the same cache loader instance, e.g., are talking to
      the same database.
   -->
   <loaders passivation="false" shared="true">
      <loader class="org.jboss.cache.loader.JDBCCacheLoader" async="false" fetchPersistentState="false" ignoreModifications="false" purgeOnStartup="false">
         <properties>
            cache.jdbc.table.name=dht
            cache.jdbc.table.primarykey=dht_pk
            cache.jdbc.table.create=false
            cache.jdbc.table.drop=false
            cache.jdbc.fqn.column=fqn
            cache.jdbc.fqn.type=varchar(255)
            cache.jdbc.node.column=value
            cache.jdbc.node.type=BYTEA
            cache.jdbc.parent.column=parent_fqn
            cache.jdbc.datasource=java:/jdbc/FabricDS
            cache.jdbc.sql-concat=1||2
         </properties>
      </loader>
   </loaders>
</jbosscache>
Note that MCAST_ADDRESS and MCAST_PORT are set dynamically at cache startup time and are the same on every member of the cache cluster. We store these values in the DB to ensure that the cluster is always consistent. Also note that we have tried both the "replication" and "invalidation" clustering modes, and the issue is reproducible in both.
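To make the placeholder handling concrete, here is a minimal sketch of the kind of substitution involved, not our exact code. The class name `McastPlaceholders`, the method `resolve`, and the address and port values are hypothetical and only for illustration; in the real setup the values come from the DB. The sketch fails fast if any `$$...$$` token is left unresolved, so a member cannot start with a half-filled JGroups config.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: fills $$NAME$$ tokens in the cache config text before it is parsed.
final class McastPlaceholders {

    // Replaces each $$KEY$$ token with its value; throws if any "$$" remains afterwards.
    public static String resolve(String template, Map<String, String> values) {
        String result = template;
        for (Map.Entry<String, String> e : values.entrySet()) {
            // String.replace is a literal (non-regex) replacement, so "$" needs no escaping.
            result = result.replace("$$" + e.getKey() + "$$", e.getValue());
        }
        if (result.contains("$$")) {
            throw new IllegalStateException("Unresolved placeholder left in cache config");
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative values only; assumed to be read from the shared DB in practice.
        Map<String, String> values = new HashMap<>();
        values.put("MCAST_ADDRESS", "228.10.10.10");
        values.put("MCAST_PORT", "45588");

        String fragment = "<UDP mcast_addr=\"$$MCAST_ADDRESS$$\" mcast_port=\"$$MCAST_PORT$$\"/>";
        System.out.println(resolve(fragment, values));
        // prints: <UDP mcast_addr="228.10.10.10" mcast_port="45588"/>
    }
}
```

Since every member reads the same two values, logging the resolved address and port on each member at startup would be a cheap way to rule out members joining different multicast groups.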