13 Replies Latest reply on Jul 6, 2010 5:07 PM by belaban

    Cluster Stability Issues in a Replicated Cache High Availability Use Case

    cbo_

      Hi,

       

      We have a use case involving high availability between 2 VMs operating on 2 separate machines and using a replicated cache.  We have moved into some more advanced testing and are discovering some issues with the stability of the JGroups cluster.

       

      The first issue occurred while cluster coordination was switching between the 2 VMs as we simulated bringing VMs down and back up.  The 2 VMs appeared to cooperate in one direction (while one was the coordinator), but not when the other became coordinator.  We did not notice at first, but later realized that in our use of TCPPING for cluster discovery we had the IP address incorrect on one side in our jgroups file.  We had recently re-IP'd this machine and forgot to go back and fix this since we had been using MPING before.  The main symptom was that we were only successful if we first started the VM that had the correct TCPPING setting for the other VM (but not vice versa).  Once we fixed this, we were able to start the VMs in either order and the cluster joins occurred correctly.  Then we noticed we had num_initial_members set to 1, which kept us from being able to switch the cluster coordinator back and forth and keep the 2 VMs in the cluster (one VM would choose to isolate itself).  Once we set it to 2, cluster coordination and merging worked well.
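      (For reference, the corrected discovery section looks roughly like this; the host names below are placeholders rather than our real addresses:)

           <TCPPING timeout="3000"
                    initial_hosts="hostA[7820],hostB[7820]"
                    port_range="1"
                    num_initial_members="2"/>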

       

      Until......

       

      We started to push a lot of work to our primary VM.  At some point that VM runs out of memory (old gen) and gets into at least one lengthy garbage collection (GC).  The logs seem to indicate that the problem was due to pings from the other VM not being responded to promptly by the VM stuck in GC.  We modified our logging and set the level to TRACE for the classes mentioned below, and the logging details supported our theory: the pings from one VM to the VM in GC exceed the tolerance levels, and the pinging VM then isolates itself and becomes (in its own mind) the coordinator.  The concern we have is that it never recovers from this scenario even though the GC finishes a bit later.  The (temporary) solution was to set the ping tolerances above the duration of the expected "outage" from things like GC.  Network issues would be another concern.  The ping settings that were changed are:

           <FD timeout="15000" max_tries="3"/>
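      (A rough sketch of how these values combine, assuming FD suspects a member only after max_tries consecutive missed heartbeats; the arithmetic below is our own estimate, not something taken from the logs:)

           <!-- Assumed worst case: timeout * max_tries = 15000 ms * 3 = 45 s before a
                suspicion is raised, plus the VERIFY_SUSPECT timeout (1500 ms in our stack)
                before the member is actually excluded. The GC pause has to finish inside
                that window to avoid a view change. -->
           <FD timeout="15000" max_tries="3"/>
           <VERIFY_SUSPECT timeout="1500"/>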

       

      Any thoughts on why the cluster can never recover once the separation occurs?  The logs are pretty cumbersome right now; we are sifting through them and may add those details tomorrow.  For now we can abbreviate the log details as follows (B is one VM and A is the other):

           1. B maintains a view of A and B.

           2. During the GC, A declares B to be dead.

           3. After the GC, a MERGE is requested.

           4. A declares itself the MERGE leader.

           5. The FLUSH fails because B is discarding the request, since B is not in A’s view.

           6. The MERGE fails.

       

      And, as mentioned above, here are the classes for which we set logging to TRACE while investigating this:

       

         <category name="org.jgroups.protocols.TCPPING">
            <priority value="TRACE"/>
         </category>
         <category name="org.jgroups.protocols.pbcast.GMS">
            <priority value="TRACE"/>
         </category>
         <category name="org.jgroups.protocols.FD_SOCK">
            <priority value="TRACE"/>
         </category>
         <category name="org.jgroups.protocols.pbcast.FLUSH">
            <priority value="TRACE"/>
         </category>
         <category name="org.jgroups.protocols.MERGE2">
            <priority value="TRACE"/>
         </category>
         <category name="org.jgroups.protocols.FD">
            <priority value="TRACE"/>
         </category>
         <category name="org.jgroups.protocols.VERIFY_SUSPECT">
            <priority value="TRACE"/>
         </category>

       

       

       

      Just in case it proves useful to anyone, the complete jgroups xml now has the following contents:

      <config xmlns="urn:org:jgroups"
              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
              xsi:schemaLocation="urn:org:jgroups file:schema/JGroups-2.8.xsd">
          <TCP bind_port="7820"
               loopback="false"
               port_range="30"
               recv_buf_size="20000000"
               send_buf_size="640000"
               discard_incompatible_packets="true"
               max_bundle_size="64000"
               max_bundle_timeout="30"
               enable_bundling="false"
               tcp_nodelay="true"
               use_send_queues="true"
               sock_conn_timeout="300"
               enable_diagnostics="false"
               thread_pool.enabled="true"
               thread_pool.min_threads="2"
               thread_pool.max_threads="30"
               thread_pool.keep_alive_time="5000"
               thread_pool.queue_enabled="false"
               thread_pool.queue_max_size="100"
               thread_pool.rejection_policy="Discard"
               oob_thread_pool.enabled="true"
               oob_thread_pool.min_threads="2"
               oob_thread_pool.max_threads="30"
               oob_thread_pool.keep_alive_time="5000"
               oob_thread_pool.queue_enabled="false"
               oob_thread_pool.queue_max_size="100"
               oob_thread_pool.rejection_policy="Discard"/>

          <TCPPING timeout="3000"
                   initial_hosts="${jgroups.tcpping.initial_hosts:200.137.253.44[7820]}"
                   port_range="1"
                   num_initial_members="2"/>
         <!--MPING receive_on_all_interfaces="true"  break_on_coord_rsp="true"
            mcast_addr="${jgroups.udp.mcast_addr:239.192.0.32}" mcast_port="${jgroups.udp.mcast_port:50220}" ip_ttl="${jgroups.udp.ip_ttl:16}"
            num_initial_members="2" num_ping_requests="1"/-->
         <MERGE2 max_interval="30000"
                 min_interval="10000"/>
         <FD_SOCK/>
         <FD timeout="3000" max_tries="3"/>
         <VERIFY_SUSPECT timeout="1500"/>
         <pbcast.NAKACK
               use_mcast_xmit="false" gc_lag="0"
               retransmit_timeout="300,600,1200,2400,4800"
               discard_delivered_msgs="false"/>
         <!--UNICAST timeout="300,600,1200"/-->
         <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                        max_bytes="400000"/>
         <pbcast.GMS print_local_addr="false" join_timeout="7000" view_bundling="true"/>
         <FC max_credits="2000000"
             min_threshold="0.10"/>
         <FRAG2 frag_size="60000"/>
         <pbcast.STREAMING_STATE_TRANSFER/>
         <!-- <pbcast.STATE_TRANSFER/> -->
         <pbcast.FLUSH/>
      </config>

        • 1. Re: Cluster Stability Issues in a Replicated Cache High Availability Use Case
          mircea.markus
          The first issue occurred while cluster coordination was switching between the 2 VMs as we simulated bringing VMs down and back up.  The 2 VMs appeared to cooperate in one direction (while one was the coordinator), but not when the other became coordinator.  We did not notice at first, but later realized that in our use of TCPPING for cluster discovery we had the IP address incorrect on one side in our jgroups file.  We had recently re-IP'd this machine and forgot to go back and fix this since we had been using MPING before.  The main symptom was that we were only successful if we first started the VM that had the correct TCPPING setting for the other VM (but not vice versa).  Once we fixed this, we were able to start the VMs in either order and the cluster joins occurred correctly.  Then we noticed we had num_initial_members set to 1, which kept us from being able to switch the cluster coordinator back and forth and keep the 2 VMs in the cluster (one VM would choose to isolate itself).  Once we set it to 2, cluster coordination and merging worked well.

          If the instances are collocated or have access to a shared disk, you can use FILE_PING instead of TCPPING; this way you won't depend on hardcoded discovery.
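          A minimal sketch of what that could look like, assuming both members can see a shared directory (the location path below is just an example, not something from your setup):

             <!-- hypothetical shared-directory discovery; replaces TCPPING -->
             <FILE_PING location="/shared/jgroups"
                        timeout="3000"
                        num_initial_members="2"/>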

           

          ...that VM runs out of memory (old gen) and gets into at least one lengthy garbage collection (GC).

          There are a few things you can change to make the distributed garbage collector (STABLE, in JGroups speak) more aggressive:

          - desired_avg_gossip: reduce this to make garbage collection run more frequently; the same goes for max_bytes.

          - pbcast.NAKACK.discard_delivered_msgs: set this to true, which is safe since your cluster only has 2 members.

          These should reduce memory consumption.
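          Roughly along these lines (the lowered numbers are only illustrative; tune them to your load):

             <pbcast.STABLE stability_delay="1000" desired_avg_gossip="20000"
                            max_bytes="200000"/>
             <pbcast.NAKACK use_mcast_xmit="false" gc_lag="0"
                            retransmit_timeout="300,600,1200,2400,4800"
                            discard_delivered_msgs="true"/>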

          • 2. Re: Cluster Stability Issues in a Replicated Cache High Availability Use Case
            shane_dev

            I can help clarify what we are seeing here...

             

            1. A and B are in the same view on both instances and B is the coordinator.
            2. B enters GC.
            3. A declares B dead.
            4. A installs a new view A: [A]. A is now the coordinator of its view.
            5. B leaves GC.
            6. B still considers A to be a part of its view B: [A, B] and that it is the coordinator.
            7. A sends a MERGE request to B.
            8. A is elected the MERGE leader.
            9. A sends a START_FLUSH to B.
            10. B ignores the START_FLUSH because it is not a participant. (This is odd. Even though the merge participants are A and B, the flush participants from A are only A itself.)
            11. B sends a START_FLUSH to A.
            12. A ignores FLUSH messages from B because B is not in the XMIT_TABLE, since the view on A is A: [A].
            13. FLUSH from A fails because A never receives the appropriate START_FLUSH response from B.
            14. FLUSH from B fails because B is not in A's view.
            15. A sends CANCEL_MERGE to B.

             

            Throughout this time, B continues to try to flush its replication queue to A. It never suspects A and believes A and B are in a normal, working view. However, A simply rejects these flush messages.

             

            It seems that the problem is the fact that B still considers A to be in its view while A has already removed B from its view.

             

            A: [A]

            B: [A, B]

             

            This MERGE/FLUSH/FAIL process simply goes round and round failing each time. The view state never changes for A or B.

            • 3. Re: Cluster Stability Issues in a Replicated Cache High Availability Use Case
              vblagojevic

               I'll ask Bela to have a look as well. Until then, can you please try another run with UNICAST enabled? You really need to have UNICAST in your TCP stack.
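               That just means re-enabling the line that is commented out in the config you posted, between NAKACK and STABLE:

                  <UNICAST timeout="300,600,1200"/>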

              • 4. Re: Cluster Stability Issues in a Replicated Cache High Availability Use Case
                shane_dev

                I was thinking the same thing. I did see some actions that appeared out of order. Hopefully we can make that change and get it tested this afternoon.

                • 5. Re: Cluster Stability Issues in a Replicated Cache High Availability Use Case
                  shane_dev

                  Actually, it turns out we had already put UNICAST back in, despite what was copied in the earlier post. So this problem occurs regardless of UNICAST. In addition, we ran this test in another environment and can duplicate the issue on demand.

                   

                  1. Start A and B
                  2. PSTOP B
                  3. Wait for A to declare B dead and make itself the coordinator.
                  4. PRUN B
                  5. Watch merge fail.

                   

                  This seems to be related to the FLUSH process once the MERGE starts. They both send a START_FLUSH and they both ignore each other's responses.

                  • 6. Re: Cluster Stability Issues in a Replicated Cache High Availability Use Case
                    vblagojevic

                    Shane,

                     

                    OK, let's get to the bottom of this one. Can you attach the final JGroups config that was used, along with the relevant logs? You can send the logs directly to me and Bela if you find that more appropriate.

                     

                    Vladimir

                    • 7. Re: Cluster Stability Issues in a Replicated Cache High Availability Use Case
                      cbo_

                      I have attached the files.  In this case the machine that was running as master is the B side.  I ran the test as Shane described earlier.  The outage was created by pstop'ing the process for a bit over a minute.  I then allowed the cluster to try to repair itself for another 40 seconds beyond that.  You should see this in the logs.

                      • 8. Re: Cluster Stability Issues in a Replicated Cache High Availability Use Case
                        shane_dev

                        In case this helps, I have trimmed the log files down to what appear to be the most important statements and indented them, with markers at specific lines.

                         

                        My interpretation is that B tries twice to merge. In the first attempt, it tries 4 times to flush the data. The failure to flush results in cancelling the merge.

                         

                        For its part, A seems to respond by starting and completing its own flush with itself. It never really participates in the flush request from B.

                         

                        I'm a little confused by this log line in A, though: "got merge request from test9-59534, merge_id=test9-59534::1, mbrs=[test16-40270]". Shouldn't both members (9 and 16) be listed?

                        • 9. Re: Cluster Stability Issues in a Replicated Cache High Availability Use Case
                          vblagojevic

                          FYI, Bela and I are still trying to replicate this exact scenario. Even though the first merge might fail, B should eventually install a view {B}, which in turn should enable the next merge to succeed. Will report back with more details tomorrow.

                           

                          Vladimir

                          • 10. Re: Cluster Stability Issues in a Replicated Cache High Availability Use Case
                            belaban

                            Yes, this is an issue: if A has {A,B} and B has {B} as views, then the flush will always fail on A, as A's flush on {A,B} will only get a response from itself and not from B, as B drops the multicast from A... This causes the flush to fail, and therefore the merge fails as well.

                             

                             I created a unit test in GMS_MergeTest (not committed yet), which passes without FLUSH and fails with FLUSH in the config.

                             

                            I'll need to think about this and discuss it with Vladimir, to see how we can fix this.

                             

                             As a workaround, can you remove FLUSH from your config? This might be a problem though, as I seem to remember that Infinispan requires FLUSH for some reason... Vladimir?
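                             That is, something like this at the bottom of your posted stack:

                                <pbcast.STREAMING_STATE_TRANSFER/>
                                <!-- <pbcast.FLUSH/> -->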

                             

                            Bela

                            • 11. Re: Cluster Stability Issues in a Replicated Cache High Availability Use Case
                              mircea.markus
                              As a workaround, can you remove FLUSH from your config? This might be a problem though, as I seem to remember that Infinispan requires FLUSH for some reason... Vladimir?

                              You're right, you cannot start a cluster without FLUSH there; this is related to https://jira.jboss.org/jira/browse/ISPN-83

                              • 12. Re: Cluster Stability Issues in a Replicated Cache High Availability Use Case
                                shane_dev

                                Can we set use_flush_if_present to false in pbcast.GMS? It appears that FLUSH would still be used for state transfer/join, but it would be ignored for view changes and, hopefully, for the merge as a result.
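                                Something like this, assuming the attribute is honored by this JGroups version (untested on our side):

                                   <pbcast.GMS print_local_addr="false" join_timeout="7000"
                                               view_bundling="true" use_flush_if_present="false"/>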

                                • 13. Re: Cluster Stability Issues in a Replicated Cache High Availability Use Case
                                  belaban

                                  Why don't you try it out?

                                   

                                  Meanwhile, I've reopened https://jira.jboss.org/browse/JGRP-1061 and added a test case which passes without FLUSH and fails with FLUSH (GMS_MergeTest). I guess we have to come up with a better solution to handle flushing overlapping partitions before a merge...

                                   

                                  Bela