5 Replies Latest reply on Dec 28, 2005 4:15 PM by knatarajan

    TreeCache error when node leaves network

    knatarajan

      Hi all,
      Each developer on my team is running a single-node cluster instance (JBoss 4.0.1 on Windows XP) on his/her machine, specifying a unique JBoss partition name and a unique JBoss cluster name. Messages on each developer's JBoss console confirm that the number of cluster members = 1. We also see messages on node A saying that it is discarding messages from node B, and vice versa.

      However, when any machine on the network that is running such a JBoss instance disconnects from the network, all the other servers see the following exception, where the sender is the machine that left:

      19:28:43,300 INFO [STDOUT] CacheException while treeCache.put rsp=sender=:4387, retval=null, received=false, suspected=false
      19:28:43,300 INFO [STDOUT] org.jboss.cache.lock.TimeoutException: rsp=sender=:4387, retval=null, received=false, suspected=false
      19:28:43,300 INFO [STDOUT] at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:2235)
      19:28:43,300 INFO [STDOUT] at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:2257)
      19:28:43,300 INFO [STDOUT] at org.jboss.cache.interceptors.ReplicationInterceptor.invoke(ReplicationInterceptor.java:103)
      19:28:43,300 INFO [STDOUT] at org.jboss.cache.TreeCache.invokeMethod(TreeCache.java:3132)
      19:28:43,300 INFO [STDOUT] at org.jboss.cache.TreeCache.put(TreeCache.java:1812)
      19:28:43,300 INFO [STDOUT] at org.jboss.cache.TreeCache.put(TreeCache.java:1795)
      19:28:43,300 INFO [STDOUT] at edu.yale.its.tp.cas.ticket.ServiceTicketCache.storeTicket(ServiceTicketCache.java:139)
      19:28:43,300 INFO [STDOUT] at edu.yale.its.tp.cas.ticket.ActiveTicketCache.addTicket(ActiveTicketCache.java:33)
      19:28:43,300 INFO [STDOUT] at edu.yale.its.tp.cas.servlet.Login.grantForService(Login.java:201)
      19:28:43,300 INFO [STDOUT] at edu.yale.its.tp.cas.servlet.Login.doGet(Login.java:167)
      19:28:43,300 INFO [STDOUT] at edu.yale.its.tp.cas.servlet.Login.doPost(Login.java:86)
      19:28:43,300 INFO [STDOUT] at javax.servlet.http.HttpServlet.service(HttpServlet.java:717)

      Our cluster configuration is as follows:


      <!-- UDP: if you have a multihomed machine,
      set the bind_addr attribute to the appropriate NIC IP address, e.g bind_addr="192.168.0.2"
      -->
      <!-- UDP: On Windows machines, because of the media sense feature
      being broken with multicast (even after disabling media sense)
      set the loopback attribute to true -->
      <UDP mcast_addr="230.1.2.3" mcast_port="45577"
      ip_ttl="64" ip_mcast="true"
      mcast_send_buf_size="150000" mcast_recv_buf_size="80000"
      ucast_send_buf_size="150000" ucast_recv_buf_size="80000"
      loopback="false"/>
      <PING timeout="2000" num_initial_members="3"
      up_thread="false" down_thread="false"/>
      <MERGE2 min_interval="10000" max_interval="20000"/>
      <FD_SOCK/>
      <VERIFY_SUSPECT timeout="1500"
      up_thread="false" down_thread="false"/>
      <pbcast.NAKACK gc_lag="50" retransmit_timeout="600,1200,2400,4800"
      max_xmit_size="8192" up_thread="false" down_thread="false"/>
      <UNICAST timeout="600,1200,2400" window_size="100" min_threshold="10"
      down_thread="false"/>
      <pbcast.STABLE desired_avg_gossip="20000"
      up_thread="false" down_thread="false"/>
      <FRAG frag_size="8192"
      down_thread="false" up_thread="false"/>
      <pbcast.GMS join_timeout="5000" join_retry_timeout="2000"
      shun="true" print_local_addr="true"/>
      <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>




      Any pointers on why this would happen (even though each JBoss instance uses a different partition and a different cluster)?

      Thanks,
      K

        • 1. Re: TreeCache error when node leaves network
          belaban

          If you see messages being discarded from other nodes, then that means you *haven't* cleanly separated your clusters. By default JBoss starts 3 clusters, so make sure you separate *all* of them.

          • 2. Re: TreeCache error when node leaves network
            knatarajan

            Thanks for your response.
            From what I understand from the JBoss clustering documentation, I need to separate the "partition" and the "cluster". Please clarify what you mean by "3 clusters".
            This is how I start JBoss:
            run -c all -Djboss.partition.name=<> -Djboss.cluster.name=<>

            • 3. Re: TreeCache error when node leaves network
              knatarajan

              I am posting the solution that I put in place for the problem described here. I changed the cluster config to use <FD> instead of <FD_SOCK> and am no longer experiencing the timeout issue.
              The clustering documentation mentions that when FD_SOCK is used, a member is declared dead only when its socket is closed. This is why, when a node was "unplugged" from the network without a graceful shutdown, we were experiencing problems on the other nodes. With <FD>, heartbeat messages are used for failure detection, so node exits can be detected even without a graceful shutdown.

              My config entry now uses:
              <FD shun="true" up_thread="true" down_thread="true"/>
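
              For completeness, the FD protocol also accepts timeout and max_tries attributes, which control how long each heartbeat is waited on and how many missed heartbeats it takes before a member is suspected. The entry below is only an illustrative sketch, not values we have actually tuned or tested:

              <!-- illustrative only: suspect a member after max_tries missed
                   heartbeats, waiting "timeout" ms for each one -->
              <FD timeout="2500" max_tries="5" shun="true"
                  up_thread="true" down_thread="true"/>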

              Bela, I am still interested in finding out what I am missing in terms of the clusters not being separate. I'd really appreciate any pointers you may have in that regard.

              Thanks!

              • 4. Re: TreeCache error when node leaves network
                brian.stansberry

                In the /server/all/deploy directory there is a cluster-service.xml file with a section that configures the JGroups protocol stack. It has a UDP element with attributes mcast_addr and mcast_port (multicast address and port). If different machines on the network have the same mcast_addr and mcast_port, your JGroups channel will see messages from the other machines. JGroups will see that the packets are intended for a cluster with a different name and will discard them, but it will complain about it. It's best to use a different mcast_addr and mcast_port for machines that are not meant to be part of the same cluster.
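
                As a sketch, two developers' cluster-service.xml files could then differ only in those two attributes (the addresses and ports below are made-up examples, not values you have to use):

                <!-- developer A's cluster-service.xml -->
                <UDP mcast_addr="230.10.10.1" mcast_port="45601" ... />

                <!-- developer B's cluster-service.xml -->
                <UDP mcast_addr="230.10.10.2" mcast_port="45602" ... />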

                When I was working in a team environment we generally gave each developer their own port number to use.

                This can be a bit of a pain if you have to edit the cluster-service.xml file to give each developer their own port. I've used system property substitution to manage this in 4.0.1SP1 and later; I haven't tried it in 4.0.1. It looks like this:

                <UDP mcast_addr="${jboss.partition.udpGroup:228.1.2.3}" mcast_port="${jboss.partition.udpPort:45566}" ...


                Then on your command line you use -D to set the jboss.partition.udpGroup and jboss.partition.udpPort properties. Beginning in 4.0.3 you can use the -u command line switch to set the udpGroup.
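
                For example (the property values here are arbitrary illustrations, not ones you need to copy):

                run -c all -Djboss.partition.udpGroup=228.1.2.10 -Djboss.partition.udpPort=45601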

                We then set an environment variable for the port on each machine and passed the values of the environment variables through the command line.
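
                For instance, on each developer's Windows machine (DEV_UDP_PORT is just a name we picked for the variable, not anything JBoss looks for):

                set DEV_UDP_PORT=45601
                run -c all -Djboss.partition.udpPort=%DEV_UDP_PORT%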

                Beginning with 4.0.1SP1 there is also a tc5-cluster-service.xml file that creates a JGroups channel used for HTTP session replication; a similar procedure needs to be followed for it as well.
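
                That is, the JGroups <UDP> element inside tc5-cluster-service.xml should also get a per-developer multicast address and port. A rough sketch (again, the values are only illustrations):

                <!-- the JGroups <UDP> element in tc5-cluster-service.xml -->
                <UDP mcast_addr="230.10.10.1" mcast_port="45611" ... />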

                • 5. Re: TreeCache error when node leaves network
                  knatarajan

                  Brian, thanks a lot!