HornetQ clustering issues/questions
vincent.kirsch Apr 20, 2016 8:10 AM
Hi,
Sorry to be so vague in the title, but as you will see, it's not a simple case.
The short version first:
We have a situation where we run 4 HornetQ servers in cluster mode, using broadcast for discovery, and with client load balancing enabled.
The cluster works fine most of the time, and we do see messages being balanced between the servers. However, at some point, and for an as-yet undetermined reason, 1 or 2 nodes in the cluster stop receiving messages directly and instead only consume messages from the other nodes. The problem is that once this behavior appears, it does not go away: the nodes in question never get messages sent to them directly until they are rebooted. It seems that restarting the software isn't always sufficient. After rebooting the server, it acts "normally" again, but any of the 4 servers is likely to show the same behavior again later.
At this point I have no clue where to look to find out why this behavior occurs.
More details:
* We're using HornetQ 2.2.21. We can upgrade to newer 2.2.X versions, but not beyond that. It's a customer constraint.
* HornetQ is embedded in our own software
* Our application uses Spring 3.2 (again, we can't upgrade beyond minor revisions, 3.2.X)
* As said earlier, the broadcast works fine, servers discover each other and can load balance, until one (or two) of the cluster nodes starts acting as described
Here are some configuration file snippets. The configuration is identical (except for node IDs) on all 4 servers.
1. HornetQ main config file hornetq-configuration.xml
<configuration xmlns="urn:hornetq"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="urn:hornetq /schema/hornetq-configuration.xsd">
<management-address>jms.queue.hornetq.management</management-address>
<jmx-management-enabled>true</jmx-management-enabled>
<message-counter-enabled>true</message-counter-enabled>
<message-counter-sample-period>2000</message-counter-sample-period>
<message-counter-max-day-history>1</message-counter-max-day-history>
<clustered>true</clustered>
<cluster-user>HORNETQ.CLUSTER.ADMIN.USER</cluster-user>
<cluster-password>SOMEPASSWORD</cluster-password>
<paging-directory>${basedir.data.dir}/paging</paging-directory>
<bindings-directory>${basedir.data.dir}/bindings</bindings-directory>
<journal-directory>${basedir.data.dir}/journal</journal-directory>
<journal-min-files>10</journal-min-files>
<large-messages-directory>${basedir.data.dir}/large-messages</large-messages-directory>
<connectors>
<connector name="netty-connector">
<factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory</factory-class>
<param key="host" value="${renditionserver.hornetq.remoting.netty.host:localhost}"/>
<param key="port" value="${renditionserver.hornetq.remoting.netty.port:5445}"/>
</connector>
<connector name="in-vm">
<factory-class>org.hornetq.core.remoting.impl.invm.InVMConnectorFactory</factory-class>
<param key="server-id" value="${renditionserver.hornetq.server.id:0}"/>
</connector>
</connectors>
<acceptors>
<acceptor name="netty-acceptor">
<factory-class>org.hornetq.core.remoting.impl.netty.NettyAcceptorFactory</factory-class>
<param key="host" value="${renditionserver.hornetq.remoting.netty.host:localhost}"/>
<param key="port" value="${renditionserver.hornetq.remoting.netty.port:5445}"/>
</acceptor>
<acceptor name="in-vm">
<factory-class>org.hornetq.core.remoting.impl.invm.InVMAcceptorFactory</factory-class>
<param key="server-id" value="${renditionserver.hornetq.server.id:0}"/>
</acceptor>
</acceptors>
<!-- Clustering configuration -->
<broadcast-groups>
<broadcast-group name="rds-broadcast-group">
<local-bind-address>${renditionserver.hornetq.remoting.netty.host:127.0.0.1}</local-bind-address>
<group-address>${renditionserver.hornetq.cluster.discovery.multicastip:231.7.7.7}</group-address>
<group-port>${renditionserver.hornetq.cluster.discovery.port:9876}</group-port>
<broadcast-period>2000</broadcast-period>
<connector-ref>netty-connector</connector-ref>
</broadcast-group>
</broadcast-groups>
<discovery-groups>
<discovery-group name="rds-discovery-group">
<local-bind-address>${renditionserver.hornetq.remoting.netty.host:127.0.0.1}</local-bind-address>
<group-address>${renditionserver.hornetq.cluster.discovery.multicastip:231.7.7.7}</group-address>
<group-port>${renditionserver.hornetq.cluster.discovery.port:9876}</group-port>
<refresh-timeout>10000</refresh-timeout>
</discovery-group>
</discovery-groups>
<cluster-connections>
<cluster-connection name="${renditionserver.hornetq.cluster.name}">
<address>jms</address>
<connector-ref>netty-connector</connector-ref>
<retry-interval>500</retry-interval>
<use-duplicate-detection>true</use-duplicate-detection>
<forward-when-no-consumers>false</forward-when-no-consumers>
<discovery-group-ref discovery-group-name="rds-discovery-group"/>
</cluster-connection>
</cluster-connections>
<security-settings>
<security-setting match="#">
<permission type="createNonDurableQueue" roles="guest"/>
<permission type="deleteNonDurableQueue" roles="guest"/>
<permission type="consume" roles="guest"/>
<permission type="send" roles="guest"/>
</security-setting>
<security-setting match="jms.queue.hornetq.management">
<permission type="manage" roles="guest" />
</security-setting>
</security-settings>
<address-settings>
<!--default for catch all-->
<address-setting match="#">
<dead-letter-address>jms.queue.DLQ</dead-letter-address>
<!-- Default redelivery settings -->
<redelivery-delay>5000</redelivery-delay>
<max-delivery-attempts>-1</max-delivery-attempts>
<max-size-bytes>10485760</max-size-bytes>
<message-counter-history-day-limit>10</message-counter-history-day-limit>
<address-full-policy>BLOCK</address-full-policy>
</address-setting>
</address-settings>
</configuration>
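The file above does not show where client load balancing is declared. As a minimal sketch only, and not a copy of our actual file, this is how a discovery-based connection factory with a load-balancing policy is usually declared in hornetq-jms.xml on HornetQ 2.2.x. The factory name `ConnectionFactory` matches the bean the Spring configuration refers to, and the discovery group name is the one defined above. The JNDI entry name and the choice of the round-robin policy are assumptions made for illustration.

```xml
<!-- Sketch of a hornetq-jms.xml fragment; illustrative, not the actual file -->
<connection-factory name="ConnectionFactory">
   <!-- Locate cluster nodes through the discovery group declared in hornetq-configuration.xml -->
   <discovery-group-ref discovery-group-name="rds-discovery-group"/>
   <entries>
      <!-- Hypothetical JNDI entry name -->
      <entry name="/ConnectionFactory"/>
   </entries>
   <!-- Assumed policy: spread new connections over the known nodes in turn -->
   <connection-load-balancing-policy-class-name>org.hornetq.api.core.client.loadbalance.RoundRobinConnectionLoadBalancingPolicy</connection-load-balancing-policy-class-name>
</connection-factory>
```

With a declaration like this, the policy decides which node each new connection goes to. Connections that Spring's CachingConnectionFactory keeps open are not rebalanced afterwards.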
2. Spring configuration
<!-- declare spring properties as system properties (for hornetq config files) -->
<bean id="systemPrereqs" class="org.springframework.beans.factory.config.MethodInvokingFactoryBean" >
<property name="targetObject" value="#{@systemProperties}" />
<property name="targetMethod" value="putAll" />
<property name="arguments">
<util:properties location="classpath:/myapp.properties" />
</property>
</bean>
<!-- use placeholder for properties file -->
<bean id="propertyConfigurer" class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">
<property name="location" value="classpath:/myapp.properties" />
</bean>
<bean id="EmbeddedJms" class="org.hornetq.integration.spring.SpringJmsBootstrap" init-method="start" destroy-method="stop"/>
<bean id="cachedConnectionFactory" class="org.springframework.jms.connection.CachingConnectionFactory">
<property name="targetConnectionFactory" ref="ConnectionFactory" />
<property name="sessionCacheSize" value="10" />
</bean>
<!-- Transaction Manager - used for message redelivery -->
<bean id="transactionManager" class="org.springframework.jms.connection.JmsTransactionManager">
<property name="connectionFactory" ref="cachedConnectionFactory" />
</bean>
<!-- Message listener -->
<bean id="batchListenerIn" class="some.packges.BatchMessageListenerIn">
....Irrelevant config of the bean....
</bean>
...
<bean id="batchRenditionContainerIn" class="org.springframework.jms.listener.DefaultMessageListenerContainer">
<property name="concurrentConsumers" value="20"/>
<property name="connectionFactory" ref="cachedConnectionFactory" />
<property name="destination" ref="queue.request" />
<property name="messageListener" ref="batchListenerIn" />
<property name="transactionManager" ref="transactionManager"/>
</bean>
</beans>
<!-- Listener definition - used to receive asynchronous messages -->
<bean id="batchRenditionContainerIn" class="org.springframework.jms.listener.DefaultMessageListenerContainer">
<property name="concurrentConsumers" value="${renditionserver.rendition.concurrentconsumers}"/>
<property name="connectionFactory" ref="cachedConnectionFactory" />
<property name="destination" ref="queue.request" />
<property name="messageListener" ref="batchListenerIn" />
<property name="transactionManager" ref="transactionManager"/>
</bean>
I would like to know whether something is blatantly wrong in the above configuration, or whether someone could at least give pointers as to where to look. The logs aren't really helpful: they show us what happens, but not why.
Another strange thing we noticed is that when a server is rebooted, we find lines in the logs about node IDs not being unique (I don't have the exact line at hand right now). I've read that this shouldn't be a real issue and that it should "appear exactly once" after a server restart. However, what we noticed is that this line appears 6 times, all with the same ID. I would have understood if it appeared 4 times, but not 6.
This kind of "phantom servers" phenomenon has also been seen when connecting with JMX. We saw 6 HornetQ instances where 4 were expected. After rebooting one of those, the number was 5 and remained so.
Hopefully someone has an idea of what might be happening.
Thanks,
Vincent