3 Replies Latest reply on Aug 26, 2014 11:05 AM by jbertram

    In a collocated topology sometimes load balancing is not working after sometime with a huge load

    veenaonnet

      Hi,


      I am using HornetQ 2.4 with JBoss with collocated topology with replication in paging mode. In a cluster, one node is getting more than 10k messages and initially it distributes some messages to the other node. After some time load balancing stops completely.

       

      Got multiple WARN messages

      Internal error! Delivery logic has identified a non delivery and still handled a consumer


      Failed to receive datagram: java.lang.IndexOutOfBoundsException: readerIndex(8) + length(1735549293) exceeds writerIndex(65535): UnpooledHeapByteBuf(ridx: 8, widx: 65535, cap: 65535/65535)


      On the working node:

      2014-08-25 02:35:59,562 WARN  [org.hornetq.core.server]
      (Thread-19535
      (HornetQ-remoting-threads-HornetQServerImpl::ServerUUID=2a019c17-2923-11e4-ab62-6d3b047b31a9-132330334-1430348893))
      HQ224062: Queue jms.queue.FileChangeNotificationJobs was busy for more than
      1,000 milliseconds. There are possibly consumers hanging on a network operation

       

      Thread dump showed that the thread is blocked while creating a JMS connection. While the working node could get the connection and can receive message without blocking.

      Erroneous Node:
      Name: JobsManager-469     Id: 1416
      State: BLOCKED
      Lock name: org.hornetq.jms.client.HornetQJMSConnectionFactory@745a0e96
      Lock owner: updateQCStructureJobsManager-188    Lock owner id: 1135
      Total blocked: 6358629   Total waited: 35086553
      Stack trace:
      org.hornetq.jms.client.HornetQConnectionFactory.createConnectionInternal(HornetQConnectionFactory.java:666)
      org.hornetq.jms.client.HornetQConnectionFactory.createConnection(HornetQConnectionFactory.java:115)


      Working node:
      Name: JobsManager-455     Id: 1485
      State: TIMED_WAITING
      Lock name: org.hornetq.core.client.impl.ClientConsumerImpl@3d618d35
      Lock owner: null        Lock owner id: -1
      Total blocked: 4605476   Total waited: 29145178
      Stack trace:
      java.lang.Object.wait(Native Method)
      org.hornetq.core.client.impl.ClientConsumerImpl.receive(ClientConsumerImpl.java:259)
      org.hornetq.core.client.impl.ClientConsumerImpl.receive(ClientConsumerImpl.java:401)

       

      Please let me know a way to solve this issue.

       
      Below is my configuration. Please let me know how to solve this issue?

      <subsystem xmlns="urn:jboss:domain:messaging:1.3">
         <hornetq-server>
          <clustered>true</clustered>
        <persistence-enabled>true</persistence-enabled>
          <security-enabled>true</security-enabled>
          <cluster-user>xyz</cluster-user>
          <cluster-password>xyz</cluster-password>
          <failover-on-shutdown>true</failover-on-shutdown>
          <shared-store>false</shared-store>
          <journal-type>ASYNCIO</journal-type>
          <journal-file-size>10485760</journal-file-size>
          <journal-min-files>2</journal-min-files>
          <check-for-live-server>true</check-for-live-server>
          <backup-group-name>x.x.x.1</backup-group-name>
          <paging-directory path="x.x.x.1/paging"/>
          <bindings-directory path="x.x.x.1/bindings"/>
          <journal-directory path="x.x.x.1/journal"/>
          <large-messages-directory path="x.x.x.1/large-messages"/>
          <max-saved-replicated-journals-size>40
          </max-saved-replicated-journals-size>
          <connectors>
           <connector name="netty">
            <factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory
            </factory-class>
            <param key="host" value="x.x.x.1"/>
            <param key="port" value="5445"/>
           </connector>
           <netty-connector name="netty-throughput" socket-binding="messaging-throughput">
            <param key="batch-delay" value="50"/>
           </netty-connector>
           <in-vm-connector name="in-vm" server-id="0"/>
          </connectors>

          <acceptors>
           <acceptor name="netty">
                              <factory-class>org.hornetq.core.remoting.impl.netty.NettyAcceptorFactory</factory-class>
                              <param key="host" value="0.0.0.0"/>
                              <param key="port" value="5445"/>
                          </acceptor>
           <netty-acceptor name="netty-throughput" socket-binding="messaging-throughput">
            <param key="batch-delay" value="50"/>
            <param key="direct-deliver" value="false"/>
           </netty-acceptor>
           <in-vm-acceptor name="in-vm" server-id="0"/>
          </acceptors>

          <broadcast-groups>
           <broadcast-group name="bg-group">
            <group-address>228.x.x.99</group-address>
            <group-port>9876</group-port>
            <broadcast-period>5000</broadcast-period>
            <connector-ref>netty</connector-ref>
           </broadcast-group>
          </broadcast-groups>
          <discovery-groups>
           <discovery-group name="dg-group">
            <group-address>228.x.x.99</group-address>
            <group-port>9876</group-port>
            <refresh-timeout>10000</refresh-timeout>
           </discovery-group>
          </discovery-groups>
          <cluster-connections>
           <cluster-connection name="msf-cluster-x.x.x.99">
            <address>jms</address>
            <connector-ref>netty</connector-ref>
            <discovery-group-ref discovery-group-name="dg-group"/>
            <retry-interval>1000</retry-interval>
            <use-duplicate-detection>true</use-duplicate-detection>
            <forward-when-no-consumers>false</forward-when-no-consumers>
            <max-hops>1</max-hops>
           </cluster-connection>
          </cluster-connections>
          <security-settings>
           <security-setting match="#">
            <permission roles="guest" type="send"/>
            <permission roles="guest" type="consume"/>
            <permission roles="guest" type="createDurableQueue"/>
            <permission roles="guest" type="deleteDurableQueue"/>
            <permission roles="guest" type="createNonDurableQueue"/>
            <permission roles="guest" type="deleteNonDurableQueue"/>
            <permission roles="guest" type="manage"/>
           </security-setting>
          </security-settings>
          <address-settings>
          
           <address-setting match="#">
            <dead-letter-address>jms.queue.DLQ</dead-letter-address>
            <expiry-address>jms.queue.ExpiryQueue</expiry-address>
            <redelivery-delay>0</redelivery-delay>
            <redistribution-delay>0</redistribution-delay>
            <page-size-bytes>10485760</page-size-bytes>
            <max-size-bytes>104857600</max-size-bytes>
            <address-full-policy>PAGE</address-full-policy>
            <message-counter-history-day-limit>1
            </message-counter-history-day-limit>
           </address-setting>
          </address-settings>
          <jms-connection-factories>
           <connection-factory name="InVmConnectionFactory">
            <connectors>
             <connector-ref connector-name="in-vm"/>
            </connectors>
            <entries>
             <entry name="java:/ConnectionFactory"/>
            </entries>
            <ha>true</ha>
                     <block-on-acknowledge>true</block-on-acknowledge>
                              <retry-interval>1000</retry-interval>
                     <retry-interval-multiplier>1.0</retry-interval-multiplier>
                              <reconnect-attempts>3</reconnect-attempts>
            <client-failure-check-period>-1</client-failure-check-period>
                              <connection-ttl>-1</connection-ttl>
                              <confirmation-window-size>1000000</confirmation-window-size>
            <call-timeout>180000</call-timeout>
           </connection-factory>
           <connection-factory name="RemoteConnectionFactory">
            <connectors>
             <connector-ref connector-name="netty"/>
            </connectors>
            <entries>
             <entry name="java:jboss/exported/jms/RemoteConnectionFactory"/>
            </entries>
            <ha>true</ha>
            <block-on-acknowledge>true</block-on-acknowledge>
            <retry-interval>1000</retry-interval>
            <retry-interval-multiplier>1.0</retry-interval-multiplier>
            <reconnect-attempts>3</reconnect-attempts>
            <client-failure-check-period>60000</client-failure-check-period>
                              <connection-ttl>600000</connection-ttl>
            <confirmation-window-size>1000000</confirmation-window-size>
            <call-timeout>180000</call-timeout>
           </connection-factory>
           <pooled-connection-factory name="hornetq-ra">
            <transaction mode="xa"/>
            <connectors>
             <connector-ref connector-name="in-vm"/>
            </connectors>
            <entries>
             <entry name="java:/JmsXA"/>
            </entries>
           </pooled-connection-factory>
          </jms-connection-factories>
          <jms-destinations>
           <jms-queue name="Jobs">
            <entry name="queue/Jobs"/>
            <durable>true</durable>
           </jms-queue>
           <jms-topic name="Notification">
            <entry name="topic/Notification"/>
           </jms-topic>
          </jms-destinations>
         </hornetq-server>
         <hornetq-server name="backup">
          <clustered>true</clustered>
          <backup>true</backup>
          <persistence-enabled>true</persistence-enabled>
          <security-enabled>true</security-enabled>
          <cluster-user>xyz</cluster-user>
          <cluster-password>xyz</cluster-password>
          <allow-failback>true</allow-failback>
          <failover-on-shutdown>true</failover-on-shutdown>
          <shared-store>false</shared-store>
          <journal-type>ASYNCIO</journal-type>
          <journal-file-size>102400</journal-file-size>
          <journal-min-files>2</journal-min-files>
          <check-for-live-server>true</check-for-live-server>
          <backup-group-name>x.x.x.2
          </backup-group-name>
          <paging-directory path="x.x.x.2/paging"/>
          <bindings-directory path="x.x.x.2/bindings"/>
          <journal-directory path="x.x.x.2/journal"/>
          <large-messages-directory path="x.x.x.2/large-messages"/>
          <max-saved-replicated-journals-size>40
          </max-saved-replicated-journals-size>
          <live-connector-ref>netty</live-connector-ref>
          <connectors>
           <connector name="netty-backup">
            <factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory
            </factory-class>
            <param key="host" value="x.x.x.1"/>
            <param key="port" value="5446"/>
           </connector>
           <in-vm-connector name="in-vm" server-id="1"/>
          </connectors>

          <acceptors>
           <in-vm-acceptor name="in-vm" server-id="1"/>
                          <acceptor name="netty-backup">
                              <factory-class>org.hornetq.core.remoting.impl.netty.NettyAcceptorFactory</factory-class>
                              <param key="host" value="0.0.0.0"/>
                              <param key="port" value="5446"/>
                          </acceptor>
          </acceptors>

          <broadcast-groups>
           <broadcast-group name="bg-group-backup">
            <group-address>228.x.x.99</group-address>
            <group-port>9876</group-port>
            <broadcast-period>5000</broadcast-period>
            <connector-ref>netty-backup</connector-ref>
           </broadcast-group>
          </broadcast-groups>
          <discovery-groups>
           <discovery-group name="dg-group-backup">
            <group-address>228.x.x.99</group-address>
            <group-port>9876</group-port>
            <refresh-timeout>10000</refresh-timeout>
           </discovery-group>
          </discovery-groups>
          <cluster-connections>
           <cluster-connection name="msf-cluster-x.x.x.99">
            <address>jms</address>
            <connector-ref>netty-backup</connector-ref>
            <discovery-group-ref discovery-group-name="dg-group-backup"/>
            <retry-interval>1000</retry-interval>
            <use-duplicate-detection>true</use-duplicate-detection>
            <forward-when-no-consumers>false</forward-when-no-consumers>
            <max-hops>1</max-hops>
           </cluster-connection>
          </cluster-connections>
          <security-settings>
           <security-setting match="#">
            <permission roles="guest" type="send"/>
            <permission roles="guest" type="consume"/>
            <permission roles="guest" type="createDurableQueue"/>
            <permission roles="guest" type="deleteDurableQueue"/>
            <permission roles="guest" type="createNonDurableQueue"/>
            <permission roles="guest" type="deleteNonDurableQueue"/>
            <permission roles="guest" type="manage"/>
           </security-setting>
          </security-settings>
          <address-settings>
           <address-setting match="#">
            <dead-letter-address>jms.queue.DLQ</dead-letter-address>
            <expiry-address>jms.queue.ExpiryQueue</expiry-address>
            <redelivery-delay>0</redelivery-delay>
            <redistribution-delay>0</redistribution-delay>
            <page-size-bytes>10485760</page-size-bytes>
            <max-size-bytes>104857600</max-size-bytes>
            <address-full-policy>PAGE</address-full-policy>
            <message-counter-history-day-limit>1
            </message-counter-history-day-limit>
           </address-setting>
          </address-settings>
         </hornetq-server>
        </subsystem>

       

      Regards,

      Veena

        • 1. Re: In a collocated topology sometimes load balancing is not working after sometime with a huge load
          jbertram

          It would be useful to have a full thread dump (or at least a full stack-trace for the threads involved in the blockage) as well as a full log.  Also, can you elaborate on what the JobsManager-* threads are and their role in your application?

          • 2. Re: In a collocated topology sometimes load balancing is not working after sometime with a huge load
            veenaonnet

            Hi Justin,

             

            JobsManager-* threads are used to consume the messages from a queue and then process those messages. This threads creates a new JMS connection before consuming the message every time and closes it once consumed.

            Code Snippet:

            session = connect.createSession(false, Session.AUTO_ACKNOWLEDGE);
                            destination = (Destination) MdlObjectFactory.getJBossInitialContext().lookup(queueName);
                            consumer = session.createConsumer(destination, selector);
                            connect.start();

                            msg = null;
                            if (timeout == -1) {
                                msg = (ObjectMessage) consumer.receive();
                            } else if (timeout == 0) {
                                msg = (ObjectMessage) consumer.receiveNoWait();
                            } else {
                                msg = (ObjectMessage) consumer.receive(timeout);
                            }
                        } finally {
                            if (consumer != null) {
                                try {
                                    consumer.close();
                                } catch (Throwable e) {
                                }
                            }
                            if (session != null) {
                                try {
                                    session.close();
                                } catch (Throwable e) {
                                }
                            }
                            if (!providedConnection && connect != null) {
                                try {
                                    connect.close();
                                } catch (Throwable e) {
                                }
                            }

             

             

            Complete stack trace is:

            Name: JobsManager-469     Id: 1416
            State: BLOCKED
            Lock name: org.hornetq.jms.client.HornetQJMSConnectionFactory@745a0e96
            Lock owner: fileChangeNotificationJobsManager-406       Lock owner id: 1353
            Total blocked: 6331673   Total waited: 34939224
            Stack trace:
            org.hornetq.jms.client.HornetQConnectionFactory.createConnectionInternal(HornetQConnectionFactory.java:666)
            org.hornetq.jms.client.HornetQConnectionFactory.createConnection(HornetQConnectionFactory.java:115)
            com.omneon.common.msg.MessageFactory.getConnection(MessageFactory.java:124)
            com.omneon.common.msg.MessageFactory.consumeMessage(MessageFactory.java:696)
            com.omneon.dam.msg.QueueMessageManager.run(QueueMessageManager.java:264)
            java.lang.Thread.run(Thread.java:744)
            com.omneon.common.msg.MsfThread.run(MsfThread.java:42)

             

            66 threads at this time waiting for fileChangeNotificationJobsManager-406 to finish.  The situation remained same for couple of seconds. So every time when one thread is trying to create a connection, all other threads are waiting. On other node, did not see any occurrence of such blocking.

             

            Name: JobsManager-406     Id: 1353
            State: RUNNABLE
            Lock name: null
            Lock owner: null        Lock owner id: -1
            Total blocked: 6330122   Total waited: 34938801
            Stack trace:
            java.util.concurrent.SynchronousQueue$TransferStack.casHead(SynchronousQueue.java:304)
            java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:381)
            java.util.concurrent.SynchronousQueue.offer(SynchronousQueue.java:914)
            java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1364)
            org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor.execute(OrderedExecutorFactory.java:136)
            org.hornetq.core.remoting.impl.invm.InVMConnection.write(InVMConnection.java:147)
            org.hornetq.core.protocol.core.impl.ChannelImpl.send(ChannelImpl.java:267)
            org.hornetq.core.protocol.core.impl.ChannelImpl.send(ChannelImpl.java:194)
            org.hornetq.core.client.impl.ClientSessionFactoryImpl.getConnection(ClientSessionFactoryImpl.java:1389)
            org.hornetq.core.client.impl.ClientSessionFactoryImpl.getConnectionWithRetry(ClientSessionFactoryImpl.java:1072)
            org.hornetq.core.client.impl.ClientSessionFactoryImpl.connect(ClientSessionFactoryImpl.java:249)
            org.hornetq.core.client.impl.ServerLocatorImpl.createSessionFactory(ServerLocatorImpl.java:885)
            org.hornetq.jms.client.HornetQConnectionFactory.createConnectionInternal(HornetQConnectionFactory.java:672)
            org.hornetq.jms.client.HornetQConnectionFactory.createConnection(HornetQConnectionFactory.java:115)
            com.omneon.common.msg.MessageFactory.getConnection(MessageFactory.java:124)
            com.omneon.common.msg.MessageFactory.consumeMessage(MessageFactory.java:696)
            com.omneon.dam.msg.QueueMessageManager.run(QueueMessageManager.java:264)
            java.lang.Thread.run(Thread.java:744)
            com.omneon.common.msg.MsfThread.run(MsfThread.java:42)

             

            Log only contains the WARN messages

            Internal error! Delivery logic has identified a non delivery and still handled a consumer

             

            Working node:

            2014-08-25 02:35:59,562 WARN  [org.hornetq.core.server]
            (Thread-19535
            (HornetQ-remoting-threads-HornetQServerImpl::ServerUUID=2a019c17-2923-11e4-ab62-6d3b047b31a9-132330334-1430348893))
            HQ224062: Queue jms.queue.FileChangeNotificationJobs was busy for more than
            1,000 milliseconds. There are possibly consumers hanging on a network operation

             

            Regards,

            Veena

            • 3. Re: In a collocated topology sometimes load balancing is not working after sometime with a huge load
              jbertram

              JobsManager-* threads are used to consume the messages from a queue and then process those messages. This threads creates a new JMS connection before consuming the message every time and closes it once consumed.

              This is a huge anti-pattern and almost certainly the source of your problem.  The easiest way to receive messages in the container using pooled resources is via message-driven beans.  If you don't want to use container-managed resources then you need to manage them yourself (i.e. do your own connection pooling, cache your JNDI look-ups, etc.). 

               

              To repeat, creating and destroying a connection for every message you receive is strongly discouraged as it will hurt performance.