8 Replies Latest reply on Sep 2, 2005 11:14 AM by belaban

    Clustering issue with JGroups on JBoss 3.2.5

    rcostanzo

      I'm running a cluster on JBoss 3.2.5 and am having issues with my instances joining. It seems to only happen when starting up a 3rd or 4th instance to join the cluster. For example, say I have instance A and B running already, and go to start instance C:

      1. serverA will acknowledge serverC as a member and update its cluster view properly
      2. serverB will spit out the following warning:
      2004-08-26 16:52:38,999 WARN [org.jgroups.protocols.pbcast.NAKACK] [serverC:32794 (additional data: 17 bytes)] discarded message from non-member serverC:32799 (additional data: 17 bytes)
      3. serverC will hang at this point

      Why does serverB think serverC is a non-member when serverA is fine with serverC? Since serverA and serverB are in the same cluster, that makes me think it's not a config issue.

      Is there any way to hardcode who your members are to avoid this issue? I found that the issue doesn't happen when using TCP rather than multicast, but it is way too slow in comparison (just clicking around as a single user on my site I saw the page load times go up 3 seconds or so).

      Any help/suggestions would be greatly appreciated. I've included the jgroups settings for one of my servers below:

      <UDP bind_addr="XX.XX.XX.XXX" mcast_addr="228.1.2.1" mcast_port="45566"
      ip_ttl="32" ip_mcast="true"
      mcast_send_buf_size="800000" mcast_recv_buf_size="150000"
      ucast_send_buf_size="800000" ucast_recv_buf_size="150000"
      loopback="false" />
      <PING timeout="2000" num_initial_members="4"
      up_thread="true" down_thread="true" />
      <MERGE2 min_interval="10000" max_interval="20000" />
      <FD shun="true" up_thread="true" down_thread="true"
      timeout="2500" max_tries="5" />
      <VERIFY_SUSPECT timeout="3000" num_msgs="3"
      up_thread="true" down_thread="true" />
      <pbcast.NAKACK gc_lag="50" retransmit_timeout="300,600,1200,2400,4800"
      max_xmit_size="8192"
      up_thread="true" down_thread="true" />
      <UNICAST timeout="300,600,1200,2400,4800" window_size="100" min_threshold="10"
      down_thread="true" />
      <pbcast.STABLE desired_avg_gossip="20000"
      up_thread="true" down_thread="true" />
      <FRAG frag_size="8192"
      down_thread="true" up_thread="true" />
      <pbcast.GMS join_timeout="5000" join_retry_timeout="2000"
      shun="true" print_local_addr="true" />
      <pbcast.STATE_TRANSFER up_thread="true" down_thread="true" />

        • 1. Re: Clustering issue with JGroups on JBoss 3.2.5
          rcostanzo

          I also sometimes see serverC (the last server up) spit out this error after receiving the proper cluster view, but then still hang for a while:

          2004-08-26 17:20:54,393 INFO [org.jboss.ha.framework.interfaces.HAPartition.ProductionPartition] Fetching state (will wait for 60000 milliseconds):
          2004-08-26 17:20:54,401 INFO [ProductionPartition:ReplicantManager] Dead members: 0
          2004-08-26 17:21:21,926 ERROR [org.jgroups.protocols.pbcast.GMS] [mateso-webapp2:32819 (additional data: 17 bytes)] received view <= current view; discarding it (current vid: [mateso-webapp3:32796 (additional data: 17 bytes)|3], new vid: [mateso-webapp3:32796 (additional data: 17 bytes)|2])
          2004-08-26 17:21:22,469 WARN [org.jgroups.stack.UpHandler] UpHandler (STABLE) exception: java.lang.NullPointerException
          2004-08-26 17:21:22,469 ERROR [STDERR] java.lang.NullPointerException
          2004-08-26 17:21:22,470 ERROR [STDERR] at org.jgroups.protocols.pbcast.Digest.toString(Digest.java:374)
          2004-08-26 17:21:22,470 ERROR [STDERR] at java.lang.String.valueOf(String.java(Inlined Compiled Code))
          2004-08-26 17:21:22,470 ERROR [STDERR] at java.lang.StringBuffer.append(StringBuffer.java(Compiled Code))
          2004-08-26 17:21:22,470 ERROR [STDERR] at org.jgroups.protocols.pbcast.STABLE.handleStabilityMessage(STABLE.java:518)
          2004-08-26 17:21:22,470 ERROR [STDERR] at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:222)
          2004-08-26 17:21:22,471 ERROR [STDERR] at org.jgroups.stack.UpHandler.run(Protocol.java:59)

          • 2. Re: Clustering issue with JGroups on JBoss 3.2.5
            rcostanzo

            I've found some networking issues in the environment. serverC is in a physically different location than serverA and serverB (though logically on the same LAN). I just ran a test with tcpdump and found that serverC could transmit multicast packets to serverA and serverB, but those servers could NOT transmit multicast packets to serverC. I'll post an update on whether my JBoss issue goes away once this network issue is resolved.

            Also, I found that the pages can't load on serverA or serverB, because the multicast send to serverC is synchronous. Is there a setting to make this send asynchronous? I'm only using clustering for cache invalidation, and with my schema it wouldn't be the end of the world if one or two messages were lost.

            • 3. Re: Clustering issue with JGroups on JBoss 3.2.5
              belaban

              Yes, you can make this asynchronous, depending on what you use. E.g. in HTTP session clustering you can switch from "instant" replication to "interval" replication.
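
A rough sketch of that switch for HTTP session replication: this assumes the Tomcat service descriptor shipped with JBoss 3.2.x (the jboss-service.xml inside the Tomcat SAR) and that its attributes are named SnapshotMode and SnapshotInterval. Both names and the 2000 ms value are assumptions to check against your own descriptor.

```xml
<!-- Fragment only: these attributes belong inside the existing Tomcat <mbean> element. -->
<!-- Attribute names are assumed from JBoss 3.2.x; verify them in your descriptor. -->
<attribute name="SnapshotMode">interval</attribute>
<!-- Illustrative value: replicate modified sessions every 2000 ms. -->
<attribute name="SnapshotInterval">2000</attribute>
```

This only affects HTTP session replication. With interval mode, session changes made since the last snapshot can be lost if the node fails before the next one.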

              For the TCP-based config, check out jgroups.org -> User's Guide for an example
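
For the earlier question about hard-coding members, a minimal sketch of the transport and discovery part of a TCP-based stack might look like the following. It assumes the JGroups 2.x TCP and TCPPING protocols with the attribute names shown. The host names are the ones used in this thread, and port 7800, port_range and the timeouts are placeholders rather than recommendations.

```xml
<!-- Sketch only: replaces <UDP .../> and <PING .../> in the stack posted above. -->
<!-- Port 7800, port_range and the timeouts are placeholder values. -->
<TCP bind_addr="XX.XX.XX.XXX" start_port="7800" />
<!-- initial_hosts is the fixed list of members contacted during discovery. -->
<TCPPING initial_hosts="serverA[7800],serverB[7800],serverC[7800]" port_range="1"
         timeout="3000" num_initial_members="3"
         up_thread="true" down_thread="true" />
<!-- The remaining protocols (MERGE2 ... pbcast.STATE_TRANSFER) would follow as before. -->
```

With TCP, each group message is sent once per member instead of once as a multicast, which may account for part of the slowdown reported above.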

              Bela

              • 4. Re: Clustering issue with JGroups on JBoss 3.2.5
                rcostanzo

                So for entity EJB cache invalidation events, how do I make it asynchronous? I see how to do it for the HTTP session cache (in Tomcat's config) and for the stateful session EJB cache (with the settings in jboss-web.xml), but I can't find anywhere how to make the entity EJB cache invalidation events asynchronous (i.e. not take part in the same transaction as the actual DB update).

                • 5. Re: Clustering issue with JGroups on JBoss 3.2.5
                  belaban

                  1. If you want to use the InvalidationManager directly, all invalidateXXX() calls have a 'async' boolean parameter.

                  2. If you use the EntityBeanCacheBatchInvalidatorInterceptor, then all invalidations are synchronous by default (i.e. hard-coded).

                  Bela

                  • 6. Re: Clustering issue with JGroups on JBoss 3.2.5
                    belaban

                    Sorry, incorrect information: you set the invalidation mode in the InvalidationManager MBean; it is called "AsynchronousInvalidation". By default it is *asynchronous* not synchronous.

                    Bela

                    • 7. Re: Clustering issue with JGroups on JBoss 3.2.5
                      rcostanzo

                      I did some debugging through the 3.2.7 code, and found that the invalidation manager operates in SYNCHRONOUS mode by default. In order to get it to work in Asynchronous mode, you need to change the invalidation manager mbean declaration at the beginning of the deploy/cache-invalidation-service.xml file to:

                      <mbean code="org.jboss.cache.invalidation.InvalidationManager"
                             name="jboss.cache:service=InvalidationManager">
                        <attribute name="IsAsynchByDefault">true</attribute>
                      </mbean>

                      Has anybody tried out the invalidation manager in asynchronous mode? Are there any pitfalls I should know of before trying this out in my production environment? FYI, I really think this will solve my production issue of one hung JBoss instance bringing down the whole cluster. It makes sense that if the invalidation is waiting for a response which it never gets, each JBoss instance would grind to a halt.

                      Thanks.

                      -Rob

                      • 8. Re: Clustering issue with JGroups on JBoss 3.2.5
                        belaban

                        That would certainly work, because JGroups guarantees message delivery to all non-faulty members.
                        However, a Group RPC should never block forever (unless 0 is used as the timeout): when a member crashes before the caller receives its reply, the response list returned to the caller will have that member marked as 'suspected'.