7 Replies Latest reply on Mar 10, 2015 2:25 AM by hchiorean

    Problems with JGroups loopback in clustered environment

    javagirlie

      Hello modeshape team,

       

      since upgrading to modeshape 4.2 we're getting the following error in our clustered environment:

       

      [org.modeshape.jcr.bus.ClusteredChangeBus] - <Loopback changeset '0e6258e4-a736-4517-b5f3-fc9e77128a47' was never received back on 'ClusteringService[cluster_name='persistence', address=...]'. Make sure your JGroups configuration uses 'loopback=true' and if applicable 'loopback_separate_thread=true'>

       

      Our JGroups config originally contained the entry loopback="true", loopback_separate_thread wasn't configured (it even cannot be configured with jgroups-3.4.3.Final and should not be used anyway as the JGroups manual says).

       

      Taking a look into the code of ClusteredChangeBus we found the link to [MODE-2409], saying that loopback should NOT be disabled to ensure the correct order of the events.

      But even with loopback enabled, we get the above error.

       

      Looking at the JGroups code (org.jgroups.protocols.TP) we're a little confused by the comment of the loopback property.

       

          /**

           * If true, messages sent to self are treated specially: unicast messages are looped back immediately,

           * multicast messages get a local copy first and - when the real copy arrives - it will be discarded. Useful for

           * Window media (non)sense

           */

          @Property(description="Messages to self are looped back immediately if true")

          protected boolean loopback=true;

       

      According to this, multicast messages are copied locally first, whereas the real message will be discarded as soon as it arrives.

      As stated in ClusteredChangeBus.notify, modeshape is waiting until JGroups has dispatched it's own message back - but will it be waiting for the 'right' message as JGroups is going to dispatch a copy instead of the original?

       

      We got rid of the above error message by setting loopback to false in our clustered environment but in fact, we're not really confident with this solution, as [MODE-2409] states that loopback should not be disabled.

       

      Do you have any suggestions what else could be the reason for the error stated above and tips how we can get rid of it other than disabling 'loopback'? Or is disabling 'loopback' not as dangerous as it seems?

       

      Thanks in advance, Susanne.

        • 1. Re: Problems with JGroups loopback in clustered environment
          hchiorean

          Loopback should never be disabled because without it the cluster may end up in inconsitent state.

          Whenever changes are dispatched in a cluster, the current code waits for maximum 15 seconds for the loopback event to return, before printing out the warning. That is a sign that the loopback event was not received within 15 seconds, which would indicate to me that something is broken - loopback messages should always be received on the origin node afaik.

          If you can reproduce this warning in a test case, please open a JIRA. Thanks.

          • 2. Re: Problems with JGroups loopback in clustered environment
            hchiorean

            I've reopened MODE-2409 and if you read my latest comment: [MODE-2409] invoking of the checkin() method in clustered mode throws an org.modeshape.jcr.cache.NodeNotFoundInParentEx… just making sure those settings are present in your transport protocol is enough.

            So for Modeshape 4.3 we'll actually revert the changes made for MODE-2409. In the meantime you should just ignore the error message.

            • 3. Re: Problems with JGroups loopback in clustered environment
              javagirlie

              Thanks a lot for your answer, we'll change those two properties in our JGroups configuration.

              • 4. Re: Problems with JGroups loopback in clustered environment
                javagirlie

                We've changed the following two JGroups properties as suggested:

                     thread_pool.enabled="false"

                     oob_thread_pool.enabled="false"

                 

                Unfortunately we are still experiencing troubles in our clustered environment because we're running into (webserver) timeouts now due to the 15-sec waiting period in ClusteredChangeBus. So simply ignoring the error message is not working for us. For now, we've decided to downgrade to modeshape 4.1.


                Is there already a scheduled release date for modeshape 4.3?

                • 5. Re: Problems with JGroups loopback in clustered environment
                  hchiorean

                  No, there isn't a schedule. I would also suggest you investigate what's happening in your environment with your JGroups configuration, because disabling the tread_pool means JGroups will loopback the message immediately on the same thread where it was posted from.

                  The fact that you're experiencing 15 second timeouts indicates IMO that JGroups isn't working correctly in your environment. There isn't a bug in the current waiting code, it's just that it's unnecessary which is why I reopened the JIRA.

                  • 6. Re: Problems with JGroups loopback in clustered environment
                    javagirlie

                    We've gone through our JGroups config again but could't find any noteworthy differences to the sample config.

                    In fact, we even tried the sample config as stated in the ModeShape 3 documentation (https://docs.jboss.org/author/display/MODE/ModeShape+in+Java+applications#ModeShapeinJavaapplications-ClusternameandJGroups) but got the same errors.

                     

                    While debugging the notify and consume methods of ClusteredChangeBus on two tomcat instances on the same local machine we found out that the loopback message is received and consumed immediately after sending the message (on the instance it was sent from), even before the changeset is added to the map of loopbackLatches.

                    So when consume is called the loopbackLatches map is still empty.

                    After the sendMessage call is completed the changeset is added to the map of loopbackLatches, waiting 15 seconds for the message to be consumed but as the message has already been consumed,

                    we are getting the error indicating the loopback changeset was never received.

                     

                    (1) ClusteredChangeBus.notify  -> clusteringService.sendMessage

                    (2) ClusteredChangeBus.consume -> loopbackLatches == empty

                    (3) ClusteredChangeBus.notify  -> loopbackLatches.put(uuid, waitForLoopback)

                    (4) ClusteredChangeBus.notify  -> waitForLoopback.await(15, TimeUnit.SECONDS)

                     

                    We've now tried to modify the order of the calls in the notify method so the changeset is added to the loopbackLatches before the sendMessage call - and everything works fine.

                     

                    public void notify( ChangeSet changeSet ) {
                      ...
                      // Since JGroups will always dispatch incoming message off a separate thread, we need to wait here until JGroups
                      // has dispatched our own message back. This is because we have the semantics of in-thread listeners which must
                      // work correctly. See MODE-2409 for details.
                      CountDownLatch waitForLoopback = new CountDownLatch(1);
                      String uuid = changeSet.getUUID();
                      loopbackLatches.put(uuid, waitForLoopback);
                    
                      // There are multiple participants in the cluster, so send all changes out to JGroups,
                      // letting JGroups do the ordering of messages...
                      // note that JGroups will dispatch our own changeset *in a separate thread* (see below)
                      logSendOperation(changeSet);
                      clusteringService.sendMessage(changeSet);
                      try {
                        if (LOGGER.isTraceEnabled()) {
                          LOGGER.trace("Waiting for loopback changeset {0} to be received back on {1}", uuid, clusteringService.toString());
                        }
                        if (!waitForLoopback.await(15, TimeUnit.SECONDS)) {
                          loopbackLatches.remove(uuid);
                          LOGGER.error(BusI18n.loopbackMessageNotReceived, uuid, clusteringService.toString());
                        }
                      } catch (InterruptedException e) {
                        loopbackLatches.remove(uuid);
                        Thread.interrupted();
                        LOGGER.error(e, BusI18n.loopbackMessageNotReceived, uuid, clusteringService.toString());
                      }
                    }
                    
                    
                    • 7. Re: Problems with JGroups loopback in clustered environment
                      hchiorean

                      nice catch (bug), thanks for investigating. This code will go away in 4.3.