2 Replies Latest reply on Dec 14, 2010 12:00 PM by banksider

    JBoss 4.3 NAKACK sent_items explodes

    banksider

      Hi folks

       

      I'm facing a problem with our JBoss 4.3 cluster (3 or 4 nodes). Occasionally, with no obvious trigger, the sent_items TreeMap in NAKACK grows uncontrollably on all nodes, and eventually a node dies with OutOfMemoryErrors. Normally, watching it in JConsole, sent_items bumps along at 5-20 items. Attached is a screenshot of it hitting 40,000. In that case we killed one of the nodes, and all the other nodes immediately reclaimed everything in sent_items and things returned to normal.

       

      It looks like the JGroups version is 2.4.5GA.

       

      We don't think it's hardware related: it happens on multiple nodes, and restarting JBoss stops the problem - we might not see it again for a couple of weeks (JBoss is bounced twice weekly).

       

      Any ideas?  A configuration problem?

       

      Thanks

       

      Guy

       

      jboss_sent_items.jpg

       

      Here's (what I hope is) the basic clustering config:

       

      <mbean code="org.jboss.ha.framework.server.ClusterPartition"
             name="jboss:service=${jboss.cluster.partitionname}">

         <!-- Name of the partition being built -->
         <attribute name="PartitionName">${jboss.cluster.partitionname}</attribute>

         <!-- The address used to determine the node name -->
         <attribute name="NodeAddress">${jboss.bind.address}</attribute>

         <!-- Determine if deadlock detection is enabled -->
         <attribute name="DeadlockDetection">False</attribute>

         <!-- Max time (in ms) to wait for state transfer to complete. Increase for large states -->
         <attribute name="StateTransferTimeout">30000</attribute>

         <!-- The JGroups protocol configuration -->
         <attribute name="PartitionConfig">
            <!--
            The default UDP stack:
            - If you have a multihomed machine, set the UDP protocol's bind_addr attribute to the
              appropriate NIC IP address, e.g bind_addr="192.168.0.2".
            - On Windows machines, because of the media sense feature being broken with multicast
              (even after disabling media sense) set the UDP protocol's loopback attribute to true
            -->
            <Config>
               <UDP mcast_addr="${jboss.cluster.multicastaddress}"
                    mcast_port="${jboss.cluster.multicastport}"
                    tos="8"
                    ucast_recv_buf_size="20000000"
                    ucast_send_buf_size="640000"
                    mcast_recv_buf_size="25000000"
                    mcast_send_buf_size="640000"
                    loopback="false"
                    discard_incompatible_packets="true"
                    enable_bundling="false"
                    max_bundle_size="64000"
                    max_bundle_timeout="30"
                    use_incoming_packet_handler="true"
                    use_outgoing_packet_handler="false"
                    ip_ttl="${jgroups.udp.ip_ttl:10}"
                    down_thread="false" up_thread="false"/>
               <PING timeout="2000"
                     down_thread="false" up_thread="false" num_initial_members="3"/>
               <MERGE2 max_interval="100000"
                       down_thread="false" up_thread="false" min_interval="20000"/>
               <FD_SOCK down_thread="false" up_thread="false"/>
               <FD timeout="10000" max_tries="5" down_thread="false" up_thread="false" shun="true"/>
               <VERIFY_SUSPECT timeout="1500" down_thread="false" up_thread="false"/>
               <pbcast.NAKACK max_xmit_size="60000"
                              use_mcast_xmit="false" gc_lag="0"
                              retransmit_timeout="300,600,1200,2400,4800"
                              down_thread="false" up_thread="false"
                              discard_delivered_msgs="true"/>
               <UNICAST timeout="300,600,1200,2400,3600"
                        down_thread="false" up_thread="false"/>
               <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                              down_thread="false" up_thread="false"
                              max_bytes="400000"/>
               <pbcast.GMS print_local_addr="true" join_timeout="3000"
                           down_thread="false" up_thread="false"
                           join_retry_timeout="2000" shun="true"
                           view_bundling="true"/>
               <FRAG2 frag_size="60000" down_thread="false" up_thread="false"/>
               <pbcast.STATE_TRANSFER down_thread="false" up_thread="false" use_flush="false"/>
            </Config>
         </attribute>

         <depends>jboss:service=Naming</depends>

      </mbean>

        • 1. Re: JBoss 4.3 NAKACK sent_items explodes
          belaban

          Hard to say. Your STABLE protocol runs every 50 seconds, or whenever 400KB have accumulated. It could be that sent_table grows to 40,000 (that *would* be possible with 10-byte messages, since 40,000 x 10 bytes is only 400KB, right at the max_bytes threshold) and is then purged by STABLE...

           

          You could try to reduce STABLE.max_bytes to 40000, but that's going to generate more traffic...
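
          In the config above that would just be the max_bytes attribute on the pbcast.STABLE element, keeping the other attributes as they are, e.g. something like:

             <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                            down_thread="false" up_thread="false"
                            max_bytes="40000"/>

          The trade-off is that stability rounds then run roughly ten times as often (by accumulated bytes), which is where the extra traffic comes from.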

           

          The other option is to replace JGroups 2.4.x with 2.6.x, which should be backwards compatible.

          • 2. Re: JBoss 4.3 NAKACK sent_items explodes
            banksider

            Hi Bela.

             

            Thanks for the (very!) swift response.

             

            This kind of thing tends to build up over the course of hours, or occasionally days, in a busy cluster with what I assume must be large messages (badly written webapp - gigantic objects being put in the HTTP session). I neglected to mention that the app runs with a 3 GB heap, and when it goes down the TreeMap$Entry objects in sent_items are consuming about 2.5 GB (from analysis of the resulting heap dump). So it behaves like a leak, though it could be brought about by a config problem or, less likely, hardware.

             

            The one thing I'm wondering about (I know very little about JBoss clustering config) is the significance of the two MBean entries under jboss.groups.protocol - we appear to have [partition name] and Tomcat-[partition name] - is this normal? It is the Tomcat- version that misbehaves.

             

            Regards,

             

            Guy