3 Replies Latest reply on Aug 26, 2014 11:05 AM by jbertram

In a collocated topology sometimes load balancing is not working after sometime with a huge load

veenaonnet Aug 25, 2014 8:38 AM

Hi,

I am using HornetQ 2.4 with JBoss with collocated topology with replication in paging mode. In a cluster, one node is getting more than 10k messages and initially it distributes some messages to the other node. After some time load balancing stops completely.

Got multiple WARN messages

Internal error! Delivery logic has identified a non delivery and still handled a consumer

Failed to receive datagram: java.lang.IndexOutOfBoundsException: readerIndex(8) + length(1735549293) exceeds writerIndex(65535): UnpooledHeapByteBuf(ridx: 8, widx: 65535, cap: 65535/65535)

On the working node:

2014-08-25 02:35:59,562 WARN [org.hornetq.core.server]
(Thread-19535
(HornetQ-remoting-threads-HornetQServerImpl::ServerUUID=2a019c17-2923-11e4-ab62-6d3b047b31a9-132330334-1430348893))
HQ224062: Queue jms.queue.FileChangeNotificationJobs was busy for more than
1,000 milliseconds. There are possibly consumers hanging on a network operation

Thread dump showed that the thread is blocked while creating a JMS connection. While the working node could get the connection and can receive message without blocking.

Erroneous Node:
Name: JobsManager-469     Id: 1416
State: BLOCKED
Lock name: org.hornetq.jms.client.HornetQJMSConnectionFactory@745a0e96
Lock owner: updateQCStructureJobsManager-188    Lock owner id: 1135
Total blocked: 6358629   Total waited: 35086553
Stack trace:
org.hornetq.jms.client.HornetQConnectionFactory.createConnectionInternal(HornetQConnectionFactory.java:666)
org.hornetq.jms.client.HornetQConnectionFactory.createConnection(HornetQConnectionFactory.java:115)

Working node:
Name: JobsManager-455     Id: 1485
State: TIMED_WAITING
Lock name: org.hornetq.core.client.impl.ClientConsumerImpl@3d618d35
Lock owner: null        Lock owner id: -1
Total blocked: 4605476   Total waited: 29145178
Stack trace:
java.lang.Object.wait(Native Method)
org.hornetq.core.client.impl.ClientConsumerImpl.receive(ClientConsumerImpl.java:259)
org.hornetq.core.client.impl.ClientConsumerImpl.receive(ClientConsumerImpl.java:401)

Please let me know a way to solve this issue.

Below is my configuration. Please let me know how to solve this issue?

<subsystem xmlns="urn:jboss:domain:messaging:1.3">
   <hornetq-server>
    <clustered>true</clustered>
<persistence-enabled>true</persistence-enabled>
    <security-enabled>true</security-enabled>
    <cluster-user>xyz</cluster-user>
    <cluster-password>xyz</cluster-password>
    <failover-on-shutdown>true</failover-on-shutdown>
    <shared-store>false</shared-store>
    <journal-type>ASYNCIO</journal-type>
    <journal-file-size>10485760</journal-file-size>
    <journal-min-files>2</journal-min-files>
    <check-for-live-server>true</check-for-live-server>
    <backup-group-name>x.x.x.1</backup-group-name>
    <paging-directory path="x.x.x.1/paging"/>
    <bindings-directory path="x.x.x.1/bindings"/>
    <journal-directory path="x.x.x.1/journal"/>
    <large-messages-directory path="x.x.x.1/large-messages"/>
    <max-saved-replicated-journals-size>40
    </max-saved-replicated-journals-size>
    <connectors>
     <connector name="netty">
      <factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory
      </factory-class>
      <param key="host" value="x.x.x.1"/>
      <param key="port" value="5445"/>
     </connector>
     <netty-connector name="netty-throughput" socket-binding="messaging-throughput">
      <param key="batch-delay" value="50"/>
     </netty-connector>
     <in-vm-connector name="in-vm" server-id="0"/>
    </connectors>

    <acceptors>
     <acceptor name="netty">
                        <factory-class>org.hornetq.core.remoting.impl.netty.NettyAcceptorFactory</factory-class>
                        <param key="host" value="0.0.0.0"/>
                        <param key="port" value="5445"/>
                    </acceptor>
     <netty-acceptor name="netty-throughput" socket-binding="messaging-throughput">
      <param key="batch-delay" value="50"/>
      <param key="direct-deliver" value="false"/>
     </netty-acceptor>
     <in-vm-acceptor name="in-vm" server-id="0"/>
    </acceptors>

    <broadcast-groups>
     <broadcast-group name="bg-group">
      <group-address>228.x.x.99</group-address>
      <group-port>9876</group-port>
      <broadcast-period>5000</broadcast-period>
      <connector-ref>netty</connector-ref>
     </broadcast-group>
    </broadcast-groups>
    <discovery-groups>
     <discovery-group name="dg-group">
      <group-address>228.x.x.99</group-address>
      <group-port>9876</group-port>
      <refresh-timeout>10000</refresh-timeout>
     </discovery-group>
    </discovery-groups>
    <cluster-connections>
     <cluster-connection name="msf-cluster-x.x.x.99">
      <address>jms</address>
      <connector-ref>netty</connector-ref>
      <discovery-group-ref discovery-group-name="dg-group"/>
      <retry-interval>1000</retry-interval>
      <use-duplicate-detection>true</use-duplicate-detection>
      <forward-when-no-consumers>false</forward-when-no-consumers>
      <max-hops>1</max-hops>
     </cluster-connection>
    </cluster-connections>
    <security-settings>
     <security-setting match="#">
      <permission roles="guest" type="send"/>
      <permission roles="guest" type="consume"/>
      <permission roles="guest" type="createDurableQueue"/>
      <permission roles="guest" type="deleteDurableQueue"/>
      <permission roles="guest" type="createNonDurableQueue"/>
      <permission roles="guest" type="deleteNonDurableQueue"/>
      <permission roles="guest" type="manage"/>
     </security-setting>
    </security-settings>
    <address-settings>

     <address-setting match="#">
      <dead-letter-address>jms.queue.DLQ</dead-letter-address>
      <expiry-address>jms.queue.ExpiryQueue</expiry-address>
      <redelivery-delay>0</redelivery-delay>
      <redistribution-delay>0</redistribution-delay>
      <page-size-bytes>10485760</page-size-bytes>
      <max-size-bytes>104857600</max-size-bytes>
      <address-full-policy>PAGE</address-full-policy>
      <message-counter-history-day-limit>1
      </message-counter-history-day-limit>
     </address-setting>
    </address-settings>
    <jms-connection-factories>
     <connection-factory name="InVmConnectionFactory">
      <connectors>
       <connector-ref connector-name="in-vm"/>
      </connectors>
      <entries>
       <entry name="java:/ConnectionFactory"/>
      </entries>
      <ha>true</ha>
               <block-on-acknowledge>true</block-on-acknowledge>
                        <retry-interval>1000</retry-interval>
               <retry-interval-multiplier>1.0</retry-interval-multiplier>
                        <reconnect-attempts>3</reconnect-attempts>
      <client-failure-check-period>-1</client-failure-check-period>
                        <connection-ttl>-1</connection-ttl>
                        <confirmation-window-size>1000000</confirmation-window-size>
      <call-timeout>180000</call-timeout>
     </connection-factory>
     <connection-factory name="RemoteConnectionFactory">
      <connectors>
       <connector-ref connector-name="netty"/>
      </connectors>
      <entries>
       <entry name="java:jboss/exported/jms/RemoteConnectionFactory"/>
      </entries>
      <ha>true</ha>
      <block-on-acknowledge>true</block-on-acknowledge>
      <retry-interval>1000</retry-interval>
      <retry-interval-multiplier>1.0</retry-interval-multiplier>
      <reconnect-attempts>3</reconnect-attempts>
      <client-failure-check-period>60000</client-failure-check-period>
                        <connection-ttl>600000</connection-ttl>
      <confirmation-window-size>1000000</confirmation-window-size>
      <call-timeout>180000</call-timeout>
     </connection-factory>
     <pooled-connection-factory name="hornetq-ra">
      <transaction mode="xa"/>
      <connectors>
       <connector-ref connector-name="in-vm"/>
      </connectors>
      <entries>
       <entry name="java:/JmsXA"/>
      </entries>
     </pooled-connection-factory>
    </jms-connection-factories>
    <jms-destinations>
     <jms-queue name="Jobs">
      <entry name="queue/Jobs"/>
      <durable>true</durable>
     </jms-queue>
     <jms-topic name="Notification">
      <entry name="topic/Notification"/>
     </jms-topic>
    </jms-destinations>
   </hornetq-server>
   <hornetq-server name="backup">
    <clustered>true</clustered>
    <backup>true</backup>
    <persistence-enabled>true</persistence-enabled>
    <security-enabled>true</security-enabled>
    <cluster-user>xyz</cluster-user>
    <cluster-password>xyz</cluster-password>
    <allow-failback>true</allow-failback>
    <failover-on-shutdown>true</failover-on-shutdown>
    <shared-store>false</shared-store>
    <journal-type>ASYNCIO</journal-type>
    <journal-file-size>102400</journal-file-size>
    <journal-min-files>2</journal-min-files>
    <check-for-live-server>true</check-for-live-server>
    <backup-group-name>x.x.x.2
    </backup-group-name>
    <paging-directory path="x.x.x.2/paging"/>
    <bindings-directory path="x.x.x.2/bindings"/>
    <journal-directory path="x.x.x.2/journal"/>
    <large-messages-directory path="x.x.x.2/large-messages"/>
    <max-saved-replicated-journals-size>40
    </max-saved-replicated-journals-size>
    <live-connector-ref>netty</live-connector-ref>
    <connectors>
     <connector name="netty-backup">
      <factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory
      </factory-class>
      <param key="host" value="x.x.x.1"/>
      <param key="port" value="5446"/>
     </connector>
     <in-vm-connector name="in-vm" server-id="1"/>
    </connectors>

    <acceptors>
     <in-vm-acceptor name="in-vm" server-id="1"/>
                    <acceptor name="netty-backup">
                        <factory-class>org.hornetq.core.remoting.impl.netty.NettyAcceptorFactory</factory-class>
                        <param key="host" value="0.0.0.0"/>
                        <param key="port" value="5446"/>
                    </acceptor>
    </acceptors>

    <broadcast-groups>
     <broadcast-group name="bg-group-backup">
      <group-address>228.x.x.99</group-address>
      <group-port>9876</group-port>
      <broadcast-period>5000</broadcast-period>
      <connector-ref>netty-backup</connector-ref>
     </broadcast-group>
    </broadcast-groups>
    <discovery-groups>
     <discovery-group name="dg-group-backup">
      <group-address>228.x.x.99</group-address>
      <group-port>9876</group-port>
      <refresh-timeout>10000</refresh-timeout>
     </discovery-group>
    </discovery-groups>
    <cluster-connections>
     <cluster-connection name="msf-cluster-x.x.x.99">
      <address>jms</address>
      <connector-ref>netty-backup</connector-ref>
      <discovery-group-ref discovery-group-name="dg-group-backup"/>
      <retry-interval>1000</retry-interval>
      <use-duplicate-detection>true</use-duplicate-detection>
      <forward-when-no-consumers>false</forward-when-no-consumers>
      <max-hops>1</max-hops>
     </cluster-connection>
    </cluster-connections>
    <security-settings>
     <security-setting match="#">
      <permission roles="guest" type="send"/>
      <permission roles="guest" type="consume"/>
      <permission roles="guest" type="createDurableQueue"/>
      <permission roles="guest" type="deleteDurableQueue"/>
      <permission roles="guest" type="createNonDurableQueue"/>
      <permission roles="guest" type="deleteNonDurableQueue"/>
      <permission roles="guest" type="manage"/>
     </security-setting>
    </security-settings>
    <address-settings>
     <address-setting match="#">
      <dead-letter-address>jms.queue.DLQ</dead-letter-address>
      <expiry-address>jms.queue.ExpiryQueue</expiry-address>
      <redelivery-delay>0</redelivery-delay>
      <redistribution-delay>0</redistribution-delay>
      <page-size-bytes>10485760</page-size-bytes>
      <max-size-bytes>104857600</max-size-bytes>
      <address-full-policy>PAGE</address-full-policy>
      <message-counter-history-day-limit>1
      </message-counter-history-day-limit>
     </address-setting>
    </address-settings>
   </hornetq-server>
</subsystem>

Regards,

Veena

1. Re: In a collocated topology sometimes load balancing is not working after sometime with a huge load

jbertram Aug 25, 2014 9:21 AM (in response to veenaonnet)

It would be useful to have a full thread dump (or at least a full stack-trace for the threads involved in the blockage) as well as a full log. Also, can you elaborate on what the JobsManager-* threads are and their role in your application?
Actions
2. Re: In a collocated topology sometimes load balancing is not working after sometime with a huge load

veenaonnet Aug 26, 2014 8:20 AM (in response to jbertram)

Hi Justin,

JobsManager-* threads are used to consume the messages from a queue and then process those messages. This threads creates a new JMS connection before consuming the message every time and closes it once consumed.
Code Snippet:
session = connect.createSession(false, Session.AUTO_ACKNOWLEDGE);
                destination = (Destination) MdlObjectFactory.getJBossInitialContext().lookup(queueName);
                consumer = session.createConsumer(destination, selector);
                connect.start();
                msg = null;
                if (timeout == -1) {
                    msg = (ObjectMessage) consumer.receive();
                } else if (timeout == 0) {
                    msg = (ObjectMessage) consumer.receiveNoWait();
                } else {
                    msg = (ObjectMessage) consumer.receive(timeout);
                }
            } finally {
                if (consumer != null) {
                    try {
                        consumer.close();
                    } catch (Throwable e) {
                    }
                }
                if (session != null) {
                    try {
                        session.close();
                    } catch (Throwable e) {
                    }
                }
                if (!providedConnection && connect != null) {
                    try {
                        connect.close();
                    } catch (Throwable e) {
                    }
                }

Complete stack trace is:
Name: JobsManager-469     Id: 1416
State: BLOCKED
Lock name: org.hornetq.jms.client.HornetQJMSConnectionFactory@745a0e96
Lock owner: fileChangeNotificationJobsManager-406       Lock owner id: 1353
Total blocked: 6331673   Total waited: 34939224
Stack trace:
org.hornetq.jms.client.HornetQConnectionFactory.createConnectionInternal(HornetQConnectionFactory.java:666)
org.hornetq.jms.client.HornetQConnectionFactory.createConnection(HornetQConnectionFactory.java:115)
com.omneon.common.msg.MessageFactory.getConnection(MessageFactory.java:124)
com.omneon.common.msg.MessageFactory.consumeMessage(MessageFactory.java:696)
com.omneon.dam.msg.QueueMessageManager.run(QueueMessageManager.java:264)
java.lang.Thread.run(Thread.java:744)
com.omneon.common.msg.MsfThread.run(MsfThread.java:42)

66 threads at this time waiting for fileChangeNotificationJobsManager-406 to finish. The situation remained same for couple of seconds. So every time when one thread is trying to create a connection, all other threads are waiting. On other node, did not see any occurrence of such blocking.

Name: JobsManager-406     Id: 1353
State: RUNNABLE
Lock name: null
Lock owner: null        Lock owner id: -1
Total blocked: 6330122   Total waited: 34938801
Stack trace:
java.util.concurrent.SynchronousQueue$TransferStack.casHead(SynchronousQueue.java:304)
java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:381)
java.util.concurrent.SynchronousQueue.offer(SynchronousQueue.java:914)
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1364)
org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor.execute(OrderedExecutorFactory.java:136)
org.hornetq.core.remoting.impl.invm.InVMConnection.write(InVMConnection.java:147)
org.hornetq.core.protocol.core.impl.ChannelImpl.send(ChannelImpl.java:267)
org.hornetq.core.protocol.core.impl.ChannelImpl.send(ChannelImpl.java:194)
org.hornetq.core.client.impl.ClientSessionFactoryImpl.getConnection(ClientSessionFactoryImpl.java:1389)
org.hornetq.core.client.impl.ClientSessionFactoryImpl.getConnectionWithRetry(ClientSessionFactoryImpl.java:1072)
org.hornetq.core.client.impl.ClientSessionFactoryImpl.connect(ClientSessionFactoryImpl.java:249)
org.hornetq.core.client.impl.ServerLocatorImpl.createSessionFactory(ServerLocatorImpl.java:885)
org.hornetq.jms.client.HornetQConnectionFactory.createConnectionInternal(HornetQConnectionFactory.java:672)
org.hornetq.jms.client.HornetQConnectionFactory.createConnection(HornetQConnectionFactory.java:115)
com.omneon.common.msg.MessageFactory.getConnection(MessageFactory.java:124)
com.omneon.common.msg.MessageFactory.consumeMessage(MessageFactory.java:696)
com.omneon.dam.msg.QueueMessageManager.run(QueueMessageManager.java:264)
java.lang.Thread.run(Thread.java:744)
com.omneon.common.msg.MsfThread.run(MsfThread.java:42)

Log only contains the WARN messages
Internal error! Delivery logic has identified a non delivery and still handled a consumer

Working node:
2014-08-25 02:35:59,562 WARN [org.hornetq.core.server]
(Thread-19535
(HornetQ-remoting-threads-HornetQServerImpl::ServerUUID=2a019c17-2923-11e4-ab62-6d3b047b31a9-132330334-1430348893))
HQ224062: Queue jms.queue.FileChangeNotificationJobs was busy for more than
1,000 milliseconds. There are possibly consumers hanging on a network operation

Regards,
Veena
Actions
3. Re: In a collocated topology sometimes load balancing is not working after sometime with a huge load

jbertram Aug 26, 2014 11:05 AM (in response to veenaonnet)

JobsManager-* threads are used to consume the messages from a queue and then process those messages. This threads creates a new JMS connection before consuming the message every time and closes it once consumed.

This is a huge anti-pattern and almost certainly the source of your problem. The easiest way to receive messages in the container using pooled resources is via message-driven beans. If you don't want to use container-managed resources then you need to manage them yourself (i.e. do your own connection pooling, cache your JNDI look-ups, etc.).

To repeat, creating and destroying a connection for every message you receive is strongly discouraged as it will hurt performance.
Actions

Go to original post