5 Replies Latest reply on Jan 21, 2013 10:48 AM by clebert.suconic

    2.2.5.Final sticks at 100% cpu when failback under load

    davidmac

      When performing failure mode tests for a production deployment, we have run into a problem with failback.

       

      We have 2 vms, jms1 primary and jms2 backup, with a shared storage configuration on GFS2 against an external raid array.  We are using grinder to generate load from a separate machine, sending PERSISTENT text messages (a canned 3K text msg) through 5 queues, to another grinder agent doing receives so no messages are getting backlogged during this test.  The specific test sequence is:

       

      1. Start load receivers, then senders
      2. Wait for system to reach steady state throughput
      3. Hard kill jms1 vm
      4. Verify failover to jms2
      5. Wait for traffic to reach steady state throughput on jms2
      6. Bring jms1 vm back up and online so that traffic will fail back
      7. Verify failback and steady throughput

       

      We ran the above test sequence many times with varying load.  Pushing 500 messages/sec and executing the sequence causes no problems; everything works as expected.  Pushing 1000 messages/sec and executing the sequence causes problems.  In step 2, jms1 cpu is less than 50%, and at step 5 the same is true of jms2.  However, in step 6 when we boot jms1 back up and start hornetq, the traffic moves back and flows for maybe 5-10 seconds, and then the jms1 jvm pegs at 100% cpu and neither the senders nor the receivers are able to process messages.

       

      I can stop all load, killing the agent processes, and the jms1 jvm stays pegged at 100%, for hours if we leave it running.  The clients didn't fail back to jms2 after 5 minutes (then they were killed).  An strace against the process only yielded the following line:

       

      futex(0x5f9a8cc4, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
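
A note on the trace above: `strace -p <pid>` without `-f` attaches only to the single thread with that ID, so a lone futex wait most likely shows a parked thread rather than the one burning CPU. One common way to find the busy thread is to read per-thread CPU from something like `top -H -p <pid>` and match the thread ID against the `nid=0x...` field of a `jstack <pid>` dump. Below is a minimal sketch of that matching step, assuming the dump has been saved as text. The helper names and the sample dump are made up for illustration and are not output from this system.

```python
import re

def lwp_to_nid(lwp_id):
    """Convert a decimal Linux thread ID (as shown by top -H) to jstack's nid form."""
    return "0x%x" % lwp_id

def find_thread(dump_text, lwp_id):
    """Return the stack block of the thread whose nid matches lwp_id, or None."""
    nid = lwp_to_nid(lwp_id)
    # Thread blocks in a HotSpot thread dump are separated by blank lines.
    for block in re.split(r"\n\s*\n", dump_text):
        block = block.strip()
        if not block:
            continue
        header = block.splitlines()[0]
        if re.search(r"\bnid=%s\b" % re.escape(nid), header):
            return block
    return None

# Hypothetical two-thread dump, only to exercise the helpers.
SAMPLE_DUMP = '''"main" prio=10 tid=0x00007f0a nid=0x1a2b waiting on condition [0x00007f]
   java.lang.Thread.State: WAITING (parking)

"worker-1" prio=10 tid=0x00007f0b nid=0x1a2c runnable [0x00007f]
   java.lang.Thread.State: RUNNABLE
	at example.Busy.loop(Busy.java:10)
'''

if __name__ == "__main__":
    # 6700 is the decimal form of 0x1a2c
    print(find_thread(SAMPLE_DUMP, 6700))
```

Taking two or three dumps a few seconds apart and checking whether the same thread stays RUNNABLE in the same frames usually narrows down a spin loop.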

       

      We have repeated this test several times with the same results.  Also, killing the jms1 process and bringing it back up again (while jms2 keeps running) sends it straight back to being stuck at 100% within 20 seconds.

       

      Below are the specifics of the environment and the configurations.  Please sound off if you see something wrong with the configuration or have any ideas on how to resolve this.  ANY help would be greatly appreciated.

       

      This is the only app running on the vms.  In the config below, I have changed the network addresses for security.

       

      -----

      CONFIGURATION AND ENVIRONMENT DETAILS:

      -----

       

      VMWare ESXi 4.1

       

      java version "1.6.0_29"

      Java(TM) SE Runtime Environment (build 1.6.0_29-b11)

      Java HotSpot(TM) 64-Bit Server VM (build 20.4-b02, mixed mode)

       

      Red Hat Enterprise Linux Server release 5.8 (Tikanga)

       

      Shared over GFS2 mount against a Dell external storage array

       

      hornetq 2.2.5.Final started with start.sh->run.sh:

      {code}   

      export JVM_ARGS="$CLUSTER_PROPS -XX:+UseParallelGC -XX:+AggressiveOpts -XX:+UseFastAccessorMethods -Xms2048M -Xmx4096M -Dhornetq.config.dir=$CONFIG_DIR -Djava.util.logging.config.file=$CONFIG_DIR/logging.properties -Djava.library.path=."

       

      java $JVM_ARGS -classpath $CLASSPATH -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=6000 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false org.hornetq.integration.bootstrap.HornetQBootstrapServer $FILENAME

      {code}

       

      hornetq is the only application running on the vms, apart from system processes such as the gfs2 daemons.

       

      Note that we have discovery disabled because the clients wouldn't fail back with it enabled, so we use static connectors instead.

       

      hornetq-beans.xml (jms1 and jms2 are identical except for the IPs, which I have obfuscated except for the last octet)

      {code:xml}

      <?xml version="1.0" encoding="UTF-8"?>

       

      <deployment xmlns="urn:jboss:bean-deployer:2.0">

         <bean name="Naming" class="org.jnp.server.NamingBeanImpl"/>

       

         <bean name="JNDIServer" class="org.jnp.server.Main">

            <property name="namingInfo">

               <inject bean="Naming"/>

            </property>

            <property name="port">${jnp.port:1099}</property>

            <property name="bindAddress">${jnp.host:999.999.999.19}</property>

            <property name="rmiPort">${jnp.rmiPort:1098}</property>

            <property name="rmiBindAddress">${jnp.host:999.999.999.19}</property>

         </bean>

       

         <bean name="MBeanServer" class="javax.management.MBeanServer">

            <constructor factoryClass="java.lang.management.ManagementFactory"

                         factoryMethod="getPlatformMBeanServer"/>

         </bean>

       

         <bean name="Configuration" class="org.hornetq.core.config.impl.FileConfiguration">

         </bean>

       

         <bean name="HornetQSecurityManager" class="org.hornetq.spi.core.security.HornetQSecurityManagerImpl">

            <start ignored="true"/>

            <stop ignored="true"/>

         </bean>

       

         <bean name="HornetQServer" class="org.hornetq.core.server.impl.HornetQServerImpl">

            <constructor>

               <parameter>

                  <inject bean="Configuration"/>

               </parameter>

               <parameter>

                  <inject bean="MBeanServer"/>

               </parameter>

               <parameter>

                  <inject bean="HornetQSecurityManager"/>

               </parameter>       

            </constructor>

            <start ignored="true"/>

            <stop ignored="true"/>

         </bean>

       

         <bean name="JMSServerManager" class="org.hornetq.jms.server.impl.JMSServerManagerImpl">

            <constructor>        

               <parameter>

                  <inject bean="HornetQServer"/>

               </parameter>        

            </constructor>

         </bean>

      </deployment>

      {code}

       

       

       

      jms1 hornetq-configuration.xml

      {code:xml}

      <configuration xmlns="urn:hornetq"

                     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

                     xsi:schemaLocation="urn:hornetq /schema/hornetq-configuration.xsd">

       

         <clustered>true</clustered>

       

         <failover-on-shutdown>true</failover-on-shutdown>

         <jmx-management-enabled>true</jmx-management-enabled>

         <allow-failback>true</allow-failback>

       

         <paging-directory>${data.dir:/jmsdata/hornetq}/paging</paging-directory>

       

         <bindings-directory>${data.dir:/jmsdata/hornetq}/bjournal</bindings-directory>

       

         <journal-directory>${data.dir:/jmsdata/hornetq}/mjournal</journal-directory>

       

         <journal-min-files>10</journal-min-files>

       

         <large-messages-directory>${data.dir:/jmsdata/hornetq}/large-messages</large-messages-directory>

       

         <cluster-user>BLAH.USER</cluster-user>

         <cluster-password>somepassword</cluster-password>

       

         <shared-store>true</shared-store>

         <connectors>     

            <connector name="netty">

               <factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory</factory-class>

               <param key="host"  value="${hornetq.remoting.netty.host:999.999.999.19}"/>

               <param key="port"  value="${hornetq.remoting.netty.port:5445}"/>

               <param key="use-nio" value="true"/>

            </connector>

       

            <connector name="netty-throughput">

               <factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory</factory-class>

               <param key="host"  value="${hornetq.remoting.netty.host:999.999.999.19}"/>

               <param key="port"  value="${hornetq.remoting.netty.batch.port:5455}"/>

               <param key="batch-delay" value="50"/>

               <param key="use-nio" value="true"/>

            </connector>

       

           <connector name="backup-connector">

             <factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory</factory-class>

             <param key="host" value="999.999.999.20"/>

             <param key="port" value="5445"/>

             <param key="use-nio" value="true"/>

           </connector>

       

         </connectors>

       

         <acceptors>

            <acceptor name="netty">

               <factory-class>org.hornetq.core.remoting.impl.netty.NettyAcceptorFactory</factory-class>

               <param key="host"  value="${hornetq.remoting.netty.host:999.999.999.19}"/>

               <param key="port"  value="${hornetq.remoting.netty.port:5445}"/>

               <param key="use-nio" value="true"/>

            </acceptor>

       

            <acceptor name="netty-throughput">

               <factory-class>org.hornetq.core.remoting.impl.netty.NettyAcceptorFactory</factory-class>

               <param key="host"  value="${hornetq.remoting.netty.host:999.999.999.19}"/>

               <param key="port"  value="${hornetq.remoting.netty.batch.port:5455}"/>

               <param key="batch-delay" value="50"/>

               <param key="direct-deliver" value="false"/>

               <param key="use-nio" value="true"/>

            </acceptor>

         </acceptors>

       

         <broadcast-groups>

            <broadcast-group name="bg-group1">

               <group-address>231.7.7.7</group-address>

               <group-port>9876</group-port>

               <broadcast-period>5000</broadcast-period>

               <connector-ref>netty</connector-ref>

            </broadcast-group>

         </broadcast-groups>

       

         <discovery-groups>

            <discovery-group name="dg-group1">

               <group-address>231.7.7.7</group-address>

               <group-port>9876</group-port>

               <refresh-timeout>10000</refresh-timeout>

            </discovery-group>

         </discovery-groups>

       

         <cluster-connections>

            <cluster-connection name="my-cluster">

               <address>jms</address>    

               <connector-ref>netty</connector-ref>

            <!--    <discovery-group-ref discovery-group-name="dg-group1"/> -->

               <retry-interval>500</retry-interval>

               <use-duplicate-detection>true</use-duplicate-detection>

               <forward-when-no-consumers>false</forward-when-no-consumers>

               <max-hops>1</max-hops>

               <static-connectors>

                  <!-- Without this the connection factory won't be able to reconnect on failback -->

                  <connector-ref>backup-connector</connector-ref>      

               </static-connectors>

            </cluster-connection>

         </cluster-connections>

       

         <security-settings>

            <security-setting match="#">

               <permission type="createNonDurableQueue" roles="guest"/>

               <permission type="deleteNonDurableQueue" roles="guest"/>

               <permission type="consume" roles="guest"/>

               <permission type="send" roles="guest"/>

            </security-setting>

         </security-settings>

       

         <address-settings>

            <!--default for catch all-->

            <address-setting match="#">

               <dead-letter-address>jms.queue.DLQ</dead-letter-address>

               <expiry-address>jms.queue.ExpiryQueue</expiry-address>

               <redelivery-delay>0</redelivery-delay>

               <max-size-bytes>20485760</max-size-bytes>

           <page-size-bytes>10485760</page-size-bytes>

               <message-counter-history-day-limit>10</message-counter-history-day-limit>

               <address-full-policy>PAGE</address-full-policy>

            </address-setting>

         </address-settings>

      </configuration>

      {code}
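
For reference, the `address-settings` above determine how much unconsumed traffic one address can hold in memory before the PAGE policy starts writing to the paging directory. The sketch below works through that arithmetic for the 1000 messages/sec run. It assumes each message counts as exactly 3 KiB (ignoring headers and in-memory overhead), that load is spread evenly over the 5 queues, and that nothing is consumed. The function name is hypothetical, and this is a rough bound, not a diagnosis.

```python
def seconds_until_paging(max_size_bytes, msg_bytes, total_rate, queues):
    """Seconds of fully unconsumed traffic one address can hold in memory."""
    per_queue_rate = float(total_rate) / queues    # msgs/sec on one address
    msgs_in_memory = max_size_bytes // msg_bytes   # whole messages that fit
    return msgs_in_memory / per_queue_rate

if __name__ == "__main__":
    # max-size-bytes=20485760, 3 KiB messages, 1000 msg/s over 5 queues
    print(seconds_until_paging(20485760, 3 * 1024, 1000, 5))
```

Under those assumptions each address has room for roughly half a minute of backlog, so paging would only come into play if consumers lag well behind producers after failback.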

       

      jms2 hornetq-configuration.xml

      {code:xml}

      <configuration xmlns="urn:hornetq"

                     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

                     xsi:schemaLocation="urn:hornetq /schema/hornetq-configuration.xsd">

       

         <clustered>true</clustered>

       

         <failover-on-shutdown>true</failover-on-shutdown>

         <jmx-management-enabled>true</jmx-management-enabled>

         <allow-failback>true</allow-failback>

       

         <paging-directory>${data.dir:/jmsdata/hornetq}/paging</paging-directory>

       

         <bindings-directory>${data.dir:/jmsdata/hornetq}/bjournal</bindings-directory>

       

         <journal-directory>${data.dir:/jmsdata/hornetq}/mjournal</journal-directory>

       

         <journal-min-files>10</journal-min-files>

       

         <large-messages-directory>${data.dir:/jmsdata/hornetq}/large-messages</large-messages-directory>

       

        <cluster-user>BLAH.USER</cluster-user>

        <cluster-password>somepassword</cluster-password>

       

         <backup>true</backup>

         <shared-store>true</shared-store>

         <connectors>     

            <connector name="netty">

               <factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory</factory-class>

               <param key="host"  value="${hornetq.remoting.netty.host:999.999.999.20}"/>

               <param key="port"  value="${hornetq.remoting.netty.port:5445}"/>

               <param key="use-nio" value="true"/>

            </connector>

       

            <connector name="netty-throughput">

               <factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory</factory-class>

               <param key="host"  value="${hornetq.remoting.netty.host:999.999.999.20}"/>

               <param key="port"  value="${hornetq.remoting.netty.batch.port:5455}"/>

               <param key="batch-delay" value="50"/>

               <param key="use-nio" value="true"/>

            </connector>

       

            <connector name="live-connector">

               <factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory</factory-class>

               <param key="host" value="${hornetq.remoting.netty.host:999.999.999.19}"/>

               <param key="port" value="${hornetq.remoting.netty.port:5445}"/>

               <param key="use-nio" value="true"/>

            </connector>

       

         </connectors>

       

         <acceptors>

            <acceptor name="netty">

               <factory-class>org.hornetq.core.remoting.impl.netty.NettyAcceptorFactory</factory-class>

               <param key="host"  value="${hornetq.remoting.netty.host:999.999.999.20}"/>

               <param key="port"  value="${hornetq.remoting.netty.port:5445}"/>

               <param key="use-nio" value="true"/>

            </acceptor>

       

            <acceptor name="netty-throughput">

               <factory-class>org.hornetq.core.remoting.impl.netty.NettyAcceptorFactory</factory-class>

               <param key="host"  value="${hornetq.remoting.netty.host:999.999.999.20}"/>

               <param key="port"  value="${hornetq.remoting.netty.batch.port:5455}"/>

               <param key="batch-delay" value="50"/>

               <param key="direct-deliver" value="false"/>

               <param key="use-nio" value="true"/>

            </acceptor>

         </acceptors>

       

         <broadcast-groups>

            <broadcast-group name="bg-group1">

               <group-address>231.7.7.7</group-address>

               <group-port>9876</group-port>

               <broadcast-period>5000</broadcast-period>

               <connector-ref>netty</connector-ref>

            </broadcast-group>

         </broadcast-groups>

       

         <discovery-groups>

            <discovery-group name="dg-group1">

               <group-address>231.7.7.7</group-address>

               <group-port>9876</group-port>

               <refresh-timeout>10000</refresh-timeout>

            </discovery-group>

         </discovery-groups>

       

         <cluster-connections>

            <cluster-connection name="my-cluster">

               <address>jms</address>    

               <connector-ref>netty</connector-ref>

              <!--  <discovery-group-ref discovery-group-name="dg-group1"/> -->

               <retry-interval>500</retry-interval>

               <use-duplicate-detection>true</use-duplicate-detection>

               <forward-when-no-consumers>false</forward-when-no-consumers>

               <max-hops>1</max-hops>

               <static-connectors>

                  <connector-ref>live-connector</connector-ref>

               </static-connectors>

            </cluster-connection>

         </cluster-connections>

       

         <security-settings>

            <security-setting match="#">

               <permission type="createNonDurableQueue" roles="guest"/>

               <permission type="deleteNonDurableQueue" roles="guest"/>

               <permission type="consume" roles="guest"/>

               <permission type="send" roles="guest"/>

            </security-setting>

         </security-settings>

       

         <address-settings>

            <!--default for catch all-->

            <address-setting match="#">

               <dead-letter-address>jms.queue.DLQ</dead-letter-address>

               <expiry-address>jms.queue.ExpiryQueue</expiry-address>

               <redelivery-delay>0</redelivery-delay>

               <max-size-bytes>20485760</max-size-bytes>

           <page-size-bytes>10485760</page-size-bytes>

               <message-counter-history-day-limit>10</message-counter-history-day-limit>

               <address-full-policy>PAGE</address-full-policy>

            </address-setting>

         </address-settings>

      </configuration>

      {code}
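
Since the two broker configurations are meant to differ only in their addresses, connector names and the backup flag, a quick structural diff can confirm that nothing else has drifted between them. Here is a minimal sketch using only the Python standard library. The `flatten` and `diff_configs` helpers and the trimmed inline documents are illustrative and not part of HornetQ; repeated sibling elements with the same tag and no `name` or `key` attribute would collapse into one entry in this simple version.

```python
import xml.etree.ElementTree as ET

def flatten(xml_text):
    """Map each leaf element's path to its value attribute or text."""
    out = {}

    def walk(node, path):
        tag = node.tag.split('}')[-1]              # drop the namespace part
        name = node.get('name') or node.get('key')
        here = path + '/' + tag + ('[%s]' % name if name else '')
        children = list(node)
        if children:
            for child in children:
                walk(child, here)
        else:
            out[here] = node.get('value') or (node.text or '').strip()

    walk(ET.fromstring(xml_text), '')
    return out

def diff_configs(a_text, b_text):
    """Return {path: (value_in_a, value_in_b)} for every differing entry."""
    a, b = flatten(a_text), flatten(b_text)
    return dict((k, (a.get(k), b.get(k)))
                for k in sorted(set(a) | set(b)) if a.get(k) != b.get(k))

# Trimmed stand-ins for the two files; real use would read them from disk.
JMS1 = '''<configuration xmlns="urn:hornetq">
  <shared-store>true</shared-store>
  <connectors><connector name="netty">
    <param key="host" value="999.999.999.19"/>
  </connector></connectors>
</configuration>'''

JMS2 = '''<configuration xmlns="urn:hornetq">
  <backup>true</backup>
  <shared-store>true</shared-store>
  <connectors><connector name="netty">
    <param key="host" value="999.999.999.20"/>
  </connector></connectors>
</configuration>'''

if __name__ == "__main__":
    for path, (v1, v2) in diff_configs(JMS1, JMS2).items():
        print(path, v1, v2)
```

On the trimmed samples this reports only the `backup` element and the `netty` host parameter.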

       

      hornetq-jms.xml

      {code:xml}

      <configuration xmlns="urn:hornetq"

                  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

                  xsi:schemaLocation="urn:hornetq /schema/hornetq-jms.xsd">

       

         <connection-factory name="NettyConnectionFactory">

            <xa>true</xa>

            <connectors>

               <connector-ref connector-name="netty"/>

            </connectors>

            <entries>

          <entry name="/ConnectionFactory"/>

      <!--

               <entry name="/XAConnectionFactory"/>

      -->

            </entries>

          <ha>true</ha>

          <retry-interval>2000</retry-interval>

          <retry-interval-multiplier>1.6</retry-interval-multiplier>

          <max-retry-interval>60000</max-retry-interval>

          <reconnect-attempts>-1</reconnect-attempts>

          <confirmation-window-size>1048576</confirmation-window-size>

         </connection-factory>

       

         <queue name="DLQ">

            <entry name="/queue/DLQ"/>

         </queue>

       

         <queue name="ExpiryQueue">

            <entry name="/queue/ExpiryQueue"/>

         </queue>

       

         <queue name="Blah">

            <entry name="/queue/Blah"/>

         </queue>

       

         ... more queue defs

       

      </configuration>

      {code}

       

      Client JNDI Properties

      {code}

      java.naming.factory.initial=org.jnp.interfaces.NamingContextFactory

      java.naming.provider.url=jnp://jms1:1099,jms2:1099

      java.naming.factory.url.pkgs=org.jboss.naming:org.jnp.interfaces

      {code}

       

      Grinder sender script excerpts

      {code}

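      # Imports needed by the excerpts below (Jython)

      from javax.naming import InitialContext

      from javax.jms import DeliveryMode, Session

      # Process initialization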

      initialContext = InitialContext(properties)

      connectionFactory = initialContext.lookup("/ConnectionFactory")

      queue = initialContext.lookup(queueName)

      jmsConnection = connectionFactory.createConnection()

      jmsConnection.start()

      initialContext.close()

       

      # Per thread at thread initialization

      self.jmsSession = jmsConnection.createSession(False, Session.AUTO_ACKNOWLEDGE)

      self.jmsSender = self.jmsSession.createProducer(queue)

      self.jmsSender.setDeliveryMode(DeliveryMode.PERSISTENT)

       

      # Per run call on the thread

      # messageText is 3k of static text for the test

      jmsMessage = self.jmsSession.createTextMessage(messageText)

      self.jmsSender.send(jmsMessage)

      {code}

       

      Thanks!

       

      Message was edited by: davidmac Trying to fix use of the code tag