2 Replies Latest reply on Jun 25, 2009 3:39 AM by mircea.markus

    jbosscache/jgroups replicated cache not working

    joachimm

      Using jbosscache-core 3.1.0.GA on jdk 1.6.

      We have a shared map of users that are logged in to our site. This is used by our back end services to send alerts to users if they are _not_ logged in. Each front end web app has a node id, and each user can have multiple sessions per node:

      /loggedInUsers/userId/nodeId/sessionId

      We hook into the servlet session listener lifecycle to remove timed-out sessions from the cache, and to remove sessions when users log off. If a /userId/ node has no remaining children, it is removed and the user is considered logged off the site.
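
      In case it matters, the listener wiring looks roughly like this. This is only a trimmed sketch: the "userId" attribute name, the way we obtain our node id, and the CacheHolder class (sketched after the config at the end) are simplified stand-ins rather than our exact code.

      import javax.servlet.http.HttpSession;
      import javax.servlet.http.HttpSessionEvent;
      import javax.servlet.http.HttpSessionListener;

      import org.jboss.cache.Cache;
      import org.jboss.cache.Fqn;
      import org.jboss.cache.Node;

      public class LoggedInUserSessionListener implements HttpSessionListener {

          // Simplified: how the cache reference and the node id are obtained is not our real code.
          private final Cache<String, Object> cache = CacheHolder.getCache();
          private final String nodeId = System.getProperty("frontend.node.id");

          public void sessionCreated(HttpSessionEvent event) {
              // The entry is added from the login code once the user id is known, roughly:
              // cache.put(Fqn.fromElements("loggedInUsers", userId, nodeId, session.getId()),
              //           "loginTime", System.currentTimeMillis());
          }

          public void sessionDestroyed(HttpSessionEvent event) {
              HttpSession session = event.getSession();
              String userId = (String) session.getAttribute("userId"); // illustrative attribute name
              if (userId == null) {
                  return; // session never belonged to a logged-in user
              }

              // Remove /loggedInUsers/userId/nodeId/sessionId
              cache.removeNode(Fqn.fromElements("loggedInUsers", userId, nodeId, session.getId()));

              // Drop our nodeId node if this was its last session on this front end...
              Node<String, Object> perNode = cache.getNode(Fqn.fromElements("loggedInUsers", userId, nodeId));
              if (perNode != null && perNode.getChildrenNames().isEmpty()) {
                  cache.removeNode(perNode.getFqn());
              }

              // ...and drop the userId node if no front end has any sessions left for this user.
              Node<String, Object> userNode = cache.getNode(Fqn.fromElements("loggedInUsers", userId));
              if (userNode != null && userNode.getChildrenNames().isEmpty()) {
                  cache.removeNode(userNode.getFqn());
              }
          }
      }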

      We are currently deployed with a single front end and a single back end, both on the same linux machine.

      We have a number of problems:


      1. the front end and back end cache states differ - e.g. the front end often has fewer entries than the back end

      2. the cache does not always reflect the correct state - sessions persist after they have expired in the servlet container, and /userId/ nodes with no remaining children are not cleaned up

      3. the eviction policy does not clean up stale entries - we've defined a time to live of 30 minutes to match our session timeout on the front end, yet we see entries in the cache that are many hours old

      4. we see membership warnings in the logs periodically, even though the processes are still running and everything is deployed on a single machine



        In short, it is not working. I'm hoping that there is some configuration issue that someone can spot, or a different way of configuring jgroups that might help. I've included the config at the end.

        I also have a couple of questions:


        1. is there a way to get the last modified/accessed time of an entry through the API without touching the entry and thereby increasing its TTL?

        2. is there a way to remove an Fqn via JMX?

        3. is there work that the application has to do with @NodeEvicted or other annotations to ensure that the cache state is the same on all nodes? (I sort of thought this was the point of jbosscache.) A sketch of the kind of listener I mean follows below.
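
        Just to make question 3 concrete, this is roughly what I have in mind. It is only a sketch - the class name and the printed message are illustrative, not our actual code:

        import org.jboss.cache.notifications.annotation.CacheListener;
        import org.jboss.cache.notifications.annotation.NodeEvicted;
        import org.jboss.cache.notifications.event.NodeEvictedEvent;

        // Registered once at startup with cache.addCacheListener(new EvictionLoggingListener())
        @CacheListener
        public class EvictionLoggingListener {

            @NodeEvicted
            public void nodeEvicted(NodeEvictedEvent event) {
                if (event.isPre()) {
                    return; // only react after the eviction has actually happened
                }
                // Is this where the application is expected to prune empty parents itself,
                // e.g. remove /loggedInUsers/userId once its last child node is gone?
                System.out.println("evicted " + event.getFqn());
            }
        }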



          Thanks for any help you can provide. --Joachim


          Here are the warnings and errors we are seeing periodically:

          server.log.2009-06-18:2009-06-18 22:44:32,279 WARN [org.jgroups.protocols.FD] - <I was suspected by 127.0.0.1:59480; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK>
          server.log.2009-06-18:2009-06-18 22:44:32,279 WARN [org.jgroups.protocols.FD] - <I was suspected by 127.0.0.1:48930; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK>
          server.log.2009-06-18:2009-06-18 22:44:32,795 WARN [org.jgroups.protocols.pbcast.GMS] - <I (127.0.0.1:35383) am not a member of view [127.0.0.1:51683|6] [127.0.0.1:51683, 127.0.0.1:57843, 127.0.0.1:43144, 127.0.0.1:33294, 127.0.0.1:48930]; discarding view>
          server.log.2009-06-18:2009-06-18 22:44:32,971 WARN [org.jgroups.protocols.pbcast.GMS] - <I (127.0.0.1:51031) am not a member of view [127.0.0.1:59480|16] [127.0.0.1:59480]; discarding view>
          server.log.2009-06-18:2009-06-18 22:44:53,056 WARN [org.jgroups.protocols.pbcast.GMS] - <Merge aborted. Merge leader did not get MergeData from all subgroup coordinators [127.0.0.1:51683, 127.0.0.1:35383]>
          server.log.2009-06-18:2009-06-18 22:45:03,648 WARN [org.jgroups.protocols.pbcast.GMS] - <Merge aborted. Merge leader did not get MergeData from all subgroup coordinators [127.0.0.1:59480, 127.0.0.1:51031]>
          server.log.2009-06-18:2009-06-18 22:45:03,648 WARN [org.jgroups.protocols.pbcast.GMS] - <merge was supposed to be cancelled at merge participant 127.0.0.1:51031 (merge_id=[127.0.0.1:51031|1245365098648]), but it is not since merge ids do not match>
          server.log.2009-06-18:2009-06-18 22:45:08,605 WARN [org.jgroups.protocols.pbcast.GMS] - <Merge aborted. Merge leader did not get MergeData from all subgroup coordinators [127.0.0.1:51683, 127.0.0.1:35383]>
          server.log.2009-06-18:2009-06-18 22:45:08,609 WARN [org.jgroups.protocols.pbcast.GMS] - <merge was supposed to be cancelled at merge participant 127.0.0.1:35383 (merge_id=[127.0.0.1:35383|1245365103604]), but it is not since merge ids do not match>
          server.log.2009-06-18:2009-06-18 22:45:11,269 WARN [org.jgroups.protocols.pbcast.GMS] - <GMS flush by coordinator at 127.0.0.1:35383 failed>
          server.log.2009-06-18:2009-06-18 22:45:11,269 WARN [org.jgroups.protocols.pbcast.GMS] - <Since flush failed at 127.0.0.1:35383 rejected merge to 127.0.0.1:35383, merge_id=[127.0.0.1:35383|1245365088055]>
          server.log.2009-06-18:2009-06-18 22:45:11,269 ERROR [org.jgroups.protocols.pbcast.GMS] - <merge_id ([127.0.0.1:35383|1245365088055]) or this.merge_id (null) is null (sender=127.0.0.1:35383).>
          server.log.2009-06-18:2009-06-18 22:45:23,157 WARN [org.jgroups.protocols.pbcast.GMS] - <GMS flush by coordinator at 127.0.0.1:51031 failed>
          server.log.2009-06-18:2009-06-18 22:45:23,157 WARN [org.jgroups.protocols.pbcast.GMS] - <resume([127.0.0.1:51031|1245365098648]) does not match [127.0.0.1:51031|1245365122421]>
          server.log.2009-06-18:2009-06-18 22:45:23,157 WARN [org.jgroups.protocols.pbcast.GMS] - <Since flush failed at 127.0.0.1:51031 rejected merge to 127.0.0.1:51031, merge_id=[127.0.0.1:51031|1245365098648]>
          server.log.2009-06-18:2009-06-18 22:45:23,169 ERROR [org.jgroups.protocols.pbcast.GMS] - <this.merge_id ([127.0.0.1:51031|1245365122421]) is different from merge_id ([127.0.0.1:51031|1245365098648])>
          server.log.2009-06-18:2009-06-18 22:45:24,101 WARN [org.jgroups.protocols.pbcast.GMS] - <Merge aborted. Merge leader did not get MergeData from all subgroup coordinators [127.0.0.1:51683, 127.0.0.1:35383]>
          server.log.2009-06-18:2009-06-18 22:45:26,006 WARN [org.jgroups.protocols.pbcast.GMS] - <GMS flush by coordinator at 127.0.0.1:35383 failed>
          server.log.2009-06-18:2009-06-18 22:45:26,006 WARN [org.jgroups.protocols.pbcast.GMS] - <Since flush failed at 127.0.0.1:35383 rejected merge to 127.0.0.1:35383, merge_id=[127.0.0.1:35383|1245365103604]>
          server.log.2009-06-18:2009-06-18 22:45:26,006 ERROR [org.jgroups.protocols.pbcast.GMS] - <merge_id ([127.0.0.1:35383|1245365103604]) or this.merge_id (null) is null (sender=127.0.0.1:35383).>
          server.log.2009-06-18:2009-06-18 22:45:27,422 WARN [org.jgroups.protocols.pbcast.GMS] - <Merge aborted. Merge leader did not get MergeData from all subgroup coordinators [127.0.0.1:59480, 127.0.0.1:51031]>
          server.log.2009-06-18:2009-06-18 22:45:27,422 WARN [org.jgroups.protocols.pbcast.GMS] - <merge was supposed to be cancelled at merge participant 127.0.0.1:51031 (merge_id=[127.0.0.1:51031|1245365122421]), but it is not since merge ids do not match>
          server.log.2009-06-18:2009-06-18 22:45:33,806 ERROR [org.jgroups.protocols.pbcast.NAKACK] - <sender 127.0.0.1:51031 not found in xmit_table>
          server.log.2009-06-18:2009-06-18 22:45:33,806 ERROR [org.jgroups.protocols.pbcast.NAKACK] - <range is null>
          server.log.2009-06-18:2009-06-18 22:45:33,806 ERROR [org.jgroups.protocols.pbcast.NAKACK] - <sender 127.0.0.1:59480 not found in xmit_table>
          server.log.2009-06-18:2009-06-18 22:45:33,806 ERROR [org.jgroups.protocols.pbcast.NAKACK] - <range is null>
          server.log.2009-06-18:2009-06-18 22:45:36,062 ERROR [org.jgroups.protocols.pbcast.NAKACK] - <sender 127.0.0.1:35383 not found in xmit_table>
          server.log.2009-06-18:2009-06-18 22:45:36,062 ERROR [org.jgroups.protocols.pbcast.NAKACK] - <range is null>
          server.log.2009-06-18:2009-06-18 22:45:36,062 ERROR [org.jgroups.protocols.pbcast.NAKACK] - <sender 127.0.0.1:51683 not found in xmit_table>
          server.log.2009-06-18:2009-06-18 22:45:36,062 ERROR [org.jgroups.protocols.pbcast.NAKACK] - <range is null>
          server.log.2009-06-18:2009-06-18 22:45:36,062 ERROR [org.jgroups.protocols.pbcast.NAKACK] - <sender 127.0.0.1:57843 not found in xmit_table>
          server.log.2009-06-18:2009-06-18 22:45:36,062 ERROR [org.jgroups.protocols.pbcast.NAKACK] - <range is null>
          server.log.2009-06-18:2009-06-18 22:45:36,062 ERROR [org.jgroups.protocols.pbcast.NAKACK] - <sender 127.0.0.1:43144 not found in xmit_table>
          server.log.2009-06-18:2009-06-18 22:45:36,062 ERROR [org.jgroups.protocols.pbcast.NAKACK] - <range is null>
          server.log.2009-06-18:2009-06-18 22:45:36,062 ERROR [org.jgroups.protocols.pbcast.NAKACK] - <sender 127.0.0.1:33294 not found in xmit_table>
          server.log.2009-06-18:2009-06-18 22:45:36,062 ERROR [org.jgroups.protocols.pbcast.NAKACK] - <range is null>
          server.log.2009-06-18:2009-06-18 22:45:36,062 ERROR [org.jgroups.protocols.pbcast.NAKACK] - <sender 127.0.0.1:48930 not found in xmit_table>
          server.log.2009-06-18:2009-06-18 22:45:36,062 ERROR [org.jgroups.protocols.pbcast.NAKACK] - <range is null>
          server.log.2009-06-19:2009-06-19 03:47:12,927 WARN [org.jgroups.protocols.pbcast.GMS] - <I (127.0.0.1:40908) am not a member of view [127.0.0.1:59480|18] [127.0.0.1:59480]; discarding view>
          server.log.2009-06-19:2009-06-19 03:47:12,931 WARN [org.jgroups.protocols.FD] - <I was suspected by 127.0.0.1:59480; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK>
          server.log.2009-06-19:2009-06-19 03:47:12,951 WARN [org.jgroups.protocols.pbcast.GMS] - <I (127.0.0.1:44101) am not a member of view [127.0.0.1:51683|8] [127.0.0.1:51683, 127.0.0.1:57843, 127.0.0.1:43144, 127.0.0.1:33294, 127.0.0.1:48930]; discarding view>
          server.log.2009-06-19:2009-06-19 03:47:12,995 WARN [org.jgroups.protocols.FD] - <I was suspected by 127.0.0.1:48930; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK>



          Here's the config:

          
          <jbosscache xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="urn:jboss:jbosscache-core:config:3.0">
          
           <!-- Configure the TransactionManager -->
           <transaction transactionManagerLookupClass="org.jboss.cache.transaction.GenericTransactionManagerLookup"/>
          
           <clustering mode="replication" clusterName="loggedInUsers">
           <!-- JGroups protocol stack properties. -->
           <!-- changed addr and port (to isolate channels) -->
           <jgroupsConfig>
           <UDP discard_incompatible_packets="true" enable_bundling="false" enable_diagnostics="false" ip_ttl="2"
           loopback="false" max_bundle_size="64000" max_bundle_timeout="30" mcast_addr="228.10.10.11"
           mcast_port="45589" mcast_recv_buf_size="25000000" mcast_send_buf_size="640000"
           oob_thread_pool.enabled="true" oob_thread_pool.keep_alive_time="10000" oob_thread_pool.max_threads="4"
           oob_thread_pool.min_threads="1" oob_thread_pool.queue_enabled="true" oob_thread_pool.queue_max_size="10"
           oob_thread_pool.rejection_policy="Run" thread_naming_pattern="pl" thread_pool.enabled="true"
           thread_pool.keep_alive_time="30000" thread_pool.max_threads="25" thread_pool.min_threads="1"
           thread_pool.queue_enabled="true" thread_pool.queue_max_size="10" thread_pool.rejection_policy="Run"
           tos="8" ucast_recv_buf_size="20000000" ucast_send_buf_size="640000" use_concurrent_stack="true"
           use_incoming_packet_handler="true"/>
           <PING num_initial_members="3" timeout="2000"/>
           <MERGE2 max_interval="30000" min_interval="10000"/>
           <FD_SOCK/>
           <FD max_tries="5" shun="true" timeout="10000"/>
           <VERIFY_SUSPECT timeout="1500"/>
           <pbcast.NAKACK discard_delivered_msgs="true" gc_lag="0" retransmit_timeout="300,600,1200,2400,4800"
           use_mcast_xmit="false"/>
           <UNICAST timeout="300,600,1200,2400,3600"/>
           <pbcast.STABLE desired_avg_gossip="50000" max_bytes="400000" stability_delay="1000"/>
           <pbcast.GMS join_timeout="5000" print_local_addr="true" shun="false" view_ack_collection_timeout="5000"
           view_bundling="true"/>
           <FRAG2 frag_size="60000"/>
           <pbcast.STREAMING_STATE_TRANSFER/>
           <pbcast.FLUSH timeout="0"/>
           </jgroupsConfig>
          
           <sync replTimeout="20000"/>
           <!-- Alternatively, to use async replication, comment out the element above and uncomment the element below. -->
           <!--<async />-->
          
           </clustering>
          
           <eviction wakeUpInterval="60000">
           <region name="/loggedInUsers" algorithmClass="org.jboss.cache.eviction.LRUAlgorithm" eventQueueSize="200000">
           <property name="timeToLive" value="1800001" />
           </region>
           </eviction>
          
          </jbosscache>
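
           For completeness, the cache instance itself (the CacheHolder referred to above) is created from this file roughly as follows - the file name and the holder class are illustrative:

           import org.jboss.cache.Cache;
           import org.jboss.cache.CacheFactory;
           import org.jboss.cache.DefaultCacheFactory;

           public class CacheHolder {

               private static Cache<String, Object> cache;

               public static synchronized Cache<String, Object> getCache() {
                   if (cache == null) {
                       CacheFactory<String, Object> factory = new DefaultCacheFactory<String, Object>();
                       // loggedInUsers-cache.xml is the config shown above, on the classpath
                       cache = factory.createCache("loggedInUsers-cache.xml", true); // true = start the cache
                   }
                   return cache;
               }
           }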
          
          


          Thanks!


        • 1. Re: jbosscache/jgroups replicated cache not working
          joachimm

          I should mention that we have another cache with a separate but similar config that has only one writer, and several of the readers have inconsistent views of that cache as well.

          • 2. Re: jbosscache/jgroups replicated cache not working
            mircea.markus

            The logs indicate that a member gets suspected by FD (failure detection). This can happen for various reasons, e.g. network issues, or members being overloaded and freezing for long periods so that they cannot respond to heartbeat messages in time. A common scenario for the latter is a burst of writes on one or more nodes, which overwhelms another node that the data is replicated to; this can be mitigated by adding flow control to the jgroups stack:

            <FC max_credits="500000" min_threshold="0.20"/>
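
             In the stock udp.xml stacks shipped with JGroups, FC typically sits between pbcast.GMS and FRAG2, so in your jgroupsConfig it would go roughly here:

             <pbcast.GMS join_timeout="5000" print_local_addr="true" shun="false" view_ack_collection_timeout="5000"
              view_bundling="true"/>
             <FC max_credits="500000" min_threshold="0.20"/>
             <FRAG2 frag_size="60000"/>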


            is there a way to get the last modified/accessed time of an entry through the API without touching the entry and thereby increasing its TTL?

            No. The entries are processed by the eviction thread, and the access information is only written to the logs at TRACE level.
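
            If you want to see what the eviction thread decides for each entry, raise the log level for the eviction package, e.g. with a log4j.xml category like this:

            <category name="org.jboss.cache.eviction">
             <priority value="TRACE"/>
            </category>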
            is there a way to remove an Fqn via JMX?

            No.
            is there work that the application has to do with @NodeEvicted or other annotations to ensure that the cache state is the same on all nodes? (I sort of thought this was the point of jbosscache.)

            @NodeEvicted has nothing to do with cluster consistency. One thing that can be tuned on the eviction side is decreasing the wakeUpInterval so eviction runs more frequently - but that only helps if a node fills up quickly and you run out of memory. I suggest you monitor that first: nodes getting 'full' and freezing as a result, since all the effects you describe here could be explained that way.
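
             Even something as simple as periodically logging heap usage will tell you whether that is what is happening. A plain JDK 1.6 sketch (the interval and the use of System.out are just for illustration):

             import java.lang.management.ManagementFactory;
             import java.lang.management.MemoryUsage;
             import java.util.Timer;
             import java.util.TimerTask;

             public class HeapWatcher {

                 public static void start() {
                     Timer timer = new Timer("heap-watcher", true); // daemon thread, won't block shutdown
                     timer.scheduleAtFixedRate(new TimerTask() {
                         public void run() {
                             MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
                             long usedMb = heap.getUsed() / (1024 * 1024);
                             long maxMb = heap.getMax() / (1024 * 1024);
                             System.out.println("heap used: " + usedMb + " MB of " + maxMb + " MB");
                         }
                     }, 0, 60 * 1000); // log once a minute
                 }
             }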