4 Replies Latest reply on Apr 21, 2016 10:08 AM by jbertram

    HornetQ clustering issues/questions

    vincent.kirsch

      Hi,

       

      Sorry to be so vague in the title, but as you will see, it's not a simple case.

       

      The short version first:

      We have a situation where we run 4 HornetQ servers in cluster mode, using broadcast for discovery, and with client load balancing enabled.

      The cluster works fine most of the time, and we can see that messages are indeed balanced between servers. However, at some point, for an as-yet-undetermined reason, one or two nodes in the cluster will stop receiving messages directly and will instead only consume messages from the other nodes. The problem is that once this behavior appears, it does not go away: the affected nodes never again receive messages sent directly to them until they are rebooted. Restarting the software alone isn't always sufficient. After a reboot the server acts "normally" again, but any of the 4 servers may later start acting as described.

       

      At this point I have no clue where to look to find out why this behavior occurs.
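
       

      The only programmatic probe I can think of is to poll each node's queue depth through the management address that is enabled in the configuration below. Here is a rough sketch of such a check in plain JMS; the host/port and the queue name jms.queue.request are placeholders, and I haven't verified this exact code against 2.2.21:

      import java.util.HashMap;
      import java.util.Map;

      import javax.jms.Message;
      import javax.jms.Queue;
      import javax.jms.QueueConnection;
      import javax.jms.QueueConnectionFactory;
      import javax.jms.QueueRequestor;
      import javax.jms.QueueSession;
      import javax.jms.Session;

      import org.hornetq.api.core.TransportConfiguration;
      import org.hornetq.api.jms.HornetQJMSClient;
      import org.hornetq.api.jms.JMSFactoryType;
      import org.hornetq.api.jms.management.JMSManagementHelper;

      public class QueueDepthCheck {
         public static void main(String[] args) throws Exception {
            // Placeholder host/port: run this once against each of the 4 nodes.
            Map<String, Object> params = new HashMap<String, Object>();
            params.put("host", "node1.example.com");
            params.put("port", 5445);

            QueueConnectionFactory cf = (QueueConnectionFactory) HornetQJMSClient
                  .createConnectionFactoryWithoutHA(JMSFactoryType.QUEUE_CF,
                        new TransportConfiguration(
                              "org.hornetq.core.remoting.impl.netty.NettyConnectorFactory", params));

            // Credentials may be needed depending on the security settings.
            QueueConnection connection = cf.createQueueConnection();
            try {
               QueueSession session = connection.createQueueSession(false, Session.AUTO_ACKNOWLEDGE);
               // Resolves to jms.queue.hornetq.management, the <management-address> below.
               Queue managementQueue = HornetQJMSClient.createQueue("hornetq.management");
               QueueRequestor requestor = new QueueRequestor(session, managementQueue);
               connection.start();

               Message request = session.createMessage();
               // "jms.queue.request" is a placeholder for the real queue name.
               JMSManagementHelper.putAttribute(request, "jms.queue.request", "messageCount");
               Message reply = requestor.request(request);
               System.out.println("messageCount = " + JMSManagementHelper.getResult(reply));
            } finally {
               connection.close();
            }
         }
      }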

       

      More details:

      * We're using HornetQ 2.2.21. We can't upgrade to a more recent version; we can move to newer 2.2.x releases, but no further. It's a customer constraint.

      * HornetQ is embedded in our own software

      * Our application uses Spring 3.2 (again, we can't upgrade beyond minor revisions, 3.2.X)

      * As said earlier, the broadcast discovery works fine: servers discover each other and load balance, until one (or two) of the cluster nodes starts acting as described.

       

      Here are some configuration files snippets. The configuration is identical (except node ids) on all 4 servers.

       

      1. HornetQ main config file hornetq-configuration.xml

       

      <configuration xmlns="urn:hornetq"

                     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

                     xsi:schemaLocation="urn:hornetq /schema/hornetq-configuration.xsd">

       

       

         <management-address>jms.queue.hornetq.management</management-address>

         <jmx-management-enabled>true</jmx-management-enabled>

         <message-counter-enabled>true</message-counter-enabled>

         <message-counter-sample-period>2000</message-counter-sample-period>

         <message-counter-max-day-history>1</message-counter-max-day-history>

         <clustered>true</clustered>

         <cluster-user>HORNETQ.CLUSTER.ADMIN.USER</cluster-user>

         <cluster-password>SOMEPASSWORD</cluster-password>

         

         <paging-directory>${basedir.data.dir}/paging</paging-directory>

         <bindings-directory>${basedir.data.dir}/bindings</bindings-directory>

         <journal-directory>${basedir.data.dir}/journal</journal-directory>

         <journal-min-files>10</journal-min-files>

         <large-messages-directory>${basedir.data.dir}/large-messages</large-messages-directory>

        

         <connectors>

             <connector name="netty-connector">

               <factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory</factory-class>

               <param key="host"  value="${renditionserver.hornetq.remoting.netty.host:localhost}"/>

               <param key="port"  value="${renditionserver.hornetq.remoting.netty.port:5445}"/>

            </connector>

            <connector name="in-vm">

               <factory-class>org.hornetq.core.remoting.impl.invm.InVMConnectorFactory</factory-class>

               <param key="server-id" value="${renditionserver.hornetq.server.id:0}"/>

            </connector>

         </connectors>

       

       

         <acceptors>

            <acceptor name="netty-acceptor">

               <factory-class>org.hornetq.core.remoting.impl.netty.NettyAcceptorFactory</factory-class>

               <param key="host"  value="${renditionserver.hornetq.remoting.netty.host:localhost}"/>

               <param key="port"  value="${renditionserver.hornetq.remoting.netty.port:5445}"/>

            </acceptor>

       

       

             <acceptor name="in-vm">

              <factory-class>org.hornetq.core.remoting.impl.invm.InVMAcceptorFactory</factory-class>

              <param key="server-id" value="${renditionserver.hornetq.server.id:0}"/>

            </acceptor>

         </acceptors>

        

         <!-- Clustering configuration -->

          <broadcast-groups>

            <broadcast-group name="rds-broadcast-group">

               <local-bind-address>${renditionserver.hornetq.remoting.netty.host:127.0.0.1}</local-bind-address>

               <group-address>${renditionserver.hornetq.cluster.discovery.multicastip:231.7.7.7}</group-address>

               <group-port>${renditionserver.hornetq.cluster.discovery.port:9876}</group-port>

               <broadcast-period>2000</broadcast-period>

               <connector-ref>netty-connector</connector-ref>

           </broadcast-group>

         </broadcast-groups>

        

         <discovery-groups>

            <discovery-group name="rds-discovery-group">

               <local-bind-address>${renditionserver.hornetq.remoting.netty.host:127.0.0.1}</local-bind-address>

               <group-address>${renditionserver.hornetq.cluster.discovery.multicastip:231.7.7.7}</group-address>

               <group-port>${renditionserver.hornetq.cluster.discovery.port:9876}</group-port>

               <refresh-timeout>10000</refresh-timeout>

            </discovery-group>

         </discovery-groups>

         

         <cluster-connections>

            <cluster-connection name="${renditionserver.hornetq.cluster.name}">

               <address>jms</address>

               <connector-ref>netty-connector</connector-ref>

               <retry-interval>500</retry-interval>

               <use-duplicate-detection>true</use-duplicate-detection>

               <forward-when-no-consumers>false</forward-when-no-consumers>

               <discovery-group-ref discovery-group-name="rds-discovery-group"/>

            </cluster-connection>

         </cluster-connections>

         <security-settings>

            <security-setting match="#">

               <permission type="createNonDurableQueue" roles="guest"/>

               <permission type="deleteNonDurableQueue" roles="guest"/>

               <permission type="consume" roles="guest"/>

               <permission type="send" roles="guest"/>

            </security-setting>

            <security-setting match="jms.queue.hornetq.management">

          <permission type="manage" roles="guest" />

         </security-setting>

         </security-settings>

        

         <address-settings>

            <!--default for catch all-->

            <address-setting match="#">

               <dead-letter-address>jms.queue.DLQ</dead-letter-address>

               <!-- Default redelivery settings -->

               <redelivery-delay>5000</redelivery-delay>

               <max-delivery-attempts>-1</max-delivery-attempts>

               <max-size-bytes>10485760</max-size-bytes>      

               <message-counter-history-day-limit>10</message-counter-history-day-limit>

               <address-full-policy>BLOCK</address-full-policy>

            </address-setting>

         </address-settings>

       

       

      </configuration>

       

      2. Spring configuration

       

        <!-- declare spring properties as system properties (for hornetq config files) -->

          <bean id="systemPrereqs" class="org.springframework.beans.factory.config.MethodInvokingFactoryBean" >

              <property name="targetObject" value="#{@systemProperties}" />

              <property name="targetMethod" value="putAll" />

              <property name="arguments">

                  <util:properties location="classpath:/myapp.properties" />

              </property>

          </bean>

          <!-- use placeholder for properties file -->

        <bean id="propertyConfigurer" class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">

           <property name="location" value="classpath:/myapp.properties" />

        </bean>

        

          <bean id="EmbeddedJms" class="org.hornetq.integration.spring.SpringJmsBootstrap" init-method="start" destroy-method="stop"/>

       

       

          <bean id="cachedConnectionFactory" class="org.springframework.jms.connection.CachingConnectionFactory">

              <property name="targetConnectionFactory" ref="ConnectionFactory" />

              <property name="sessionCacheSize" value="10" />

          </bean>

       

       

          <!-- Transaction Manager - used for message redelivery -->

          <bean id="transactionManager" class="org.springframework.jms.connection.JmsTransactionManager">

              <property name="connectionFactory" ref="cachedConnectionFactory" />

          </bean>

       

        <!-- Message listener -->

        <bean id="batchListenerIn" class="some.packges.BatchMessageListenerIn">

          ....Irrelevant config of the bean....

        </bean>

      ...

       

      <bean id="batchRenditionContainerIn" class="org.springframework.jms.listener.DefaultMessageListenerContainer">

      <property name="concurrentConsumers" value="20"/>

      <property name="connectionFactory" ref="cachedConnectionFactory" />

      <property name="destination" ref="queue.request" />

      <property name="messageListener" ref="batchListenerIn" />

              <property name="transactionManager" ref="transactionManager"/>

      </bean>


      /beans>

       

      <!-- Listener definition - used to receive asynchronous messages -->

      <bean id="batchRenditionContainerIn" class="org.springframework.jms.listener.DefaultMessageListenerContainer">

        <property name="concurrentConsumers" value="${renditionserver.rendition.concurrentconsumers}"/>

         <property name="connectionFactory" ref="cachedConnectionFactory" />

         <property name="destination" ref="queue.request" />

          <property name="messageListener" ref="batchListenerIn" />     

          <property name="transactionManager" ref="transactionManager"/>

      </bean>



      I would like to know if something is blatantly wrong in the above configuration, or if someone could at least give pointers as to where to look. Logs aren't really helpful; they show us what happens, but not why.


      Another strange thing we noticed is that when a server is rebooted, we find lines in the logs about node IDs not being unique (I don't have the exact message at hand right now). I've read that this shouldn't be a real issue and that the warning should appear exactly once after a server restart; however, we noticed that the line appears 6 times, all with the same ID. I would have understood 4 occurrences, but not 6.

      This kind of "phantom servers" phenomenon has also been seen when connecting with JMX: we saw 6 HornetQ instances where 4 were expected. After rebooting one of them, the number dropped to 5 and stayed there.
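
      A way to count what is actually registered is to enumerate the HornetQ server MBeans over JMX, roughly as below. The JMX service URL is a placeholder, and the query assumes HornetQ's default org.hornetq:module=Core,type=Server object-name layout:

      import java.util.Set;

      import javax.management.MBeanServerConnection;
      import javax.management.ObjectName;
      import javax.management.remote.JMXConnector;
      import javax.management.remote.JMXConnectorFactory;
      import javax.management.remote.JMXServiceURL;

      public class PhantomServerCheck {
         public static void main(String[] args) throws Exception {
            // Placeholder URL: point this at each machine's JMX port in turn.
            JMXServiceURL url = new JMXServiceURL(
                  "service:jmx:rmi:///jndi/rmi://node1.example.com:9999/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
               MBeanServerConnection mbsc = connector.getMBeanServerConnection();
               // One MBean per live HornetQ server instance in that JVM.
               Set<ObjectName> servers = mbsc.queryNames(
                     new ObjectName("org.hornetq:module=Core,type=Server,*"), null);
               for (ObjectName name : servers) {
                  System.out.println(name + " started=" + mbsc.getAttribute(name, "Started"));
               }
            } finally {
               connector.close();
            }
         }
      }

      If a single JVM ever reported more than one server MBean, that would point at the embedded broker being started twice rather than at the network.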


      Hopefully someone has an idea of what might be happening.



      Thanks,

      Vincent


        • 1. Re: HornetQ clustering issues/questions
          vincent.kirsch

          A few more things that could be relevant:

          * The JMS message payloads are very small; each contains a couple of URLs that are used to download/upload files.

          * Something that could help us would be a fail-safe method of knowing, in the clustering scenario, where a message comes from (the server itself or another server in the cluster). For the moment we must enable debug or trace log levels on Spring or HornetQ, which is impractical given the size of the log files it generates. A possible workaround is sketched after this list.

          * The servers are Windows machines; I can find out the exact version if relevant.

          * No network issues were detected

          * There isn't a very big workload; we're talking 4000-8000 messages per day.
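
           

          The workaround mentioned above could be to stamp every message on send with the id of the node that produced it, so a consumer can compare it against its own node id; if they differ, the message must have travelled over the cluster bridge. Here is a sketch of the producer side using Spring's MessagePostProcessor (the originNodeId property name is our own invention, and this assumes producers always talk to their local embedded broker, which is our setup):

          import javax.jms.JMSException;
          import javax.jms.Message;

          import org.springframework.jms.core.JmsTemplate;
          import org.springframework.jms.core.MessagePostProcessor;

          public class OriginStampingSender {

             // Hypothetical property carrying the producing node's id. We already pass
             // renditionserver.hornetq.server.id to each JVM as a system property.
             private static final String ORIGIN_PROPERTY = "originNodeId";

             private final JmsTemplate jmsTemplate;
             private final String nodeId = System.getProperty("renditionserver.hornetq.server.id", "0");

             public OriginStampingSender(JmsTemplate jmsTemplate) {
                this.jmsTemplate = jmsTemplate;
             }

             public void send(String destination, String text) {
                jmsTemplate.convertAndSend(destination, text, new MessagePostProcessor() {
                   public Message postProcessMessage(Message message) throws JMSException {
                      // Stamp the producing node; a consumer whose own node id differs
                      // knows the message travelled over the cluster bridge.
                      message.setStringProperty(ORIGIN_PROPERTY, nodeId);
                      return message;
                   }
                });
             }
          }

          On the listener side, message.getStringProperty("originNodeId") could then be compared with the local node id and logged at info level.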

           

          Thanks!

          • 2. Re: HornetQ clustering issues/questions
            jbertram

            Couple of things:

            1. HornetQ 2.2.21 was tagged almost 4 years ago now.  Lots of work has been done since then.
            2. You can try building the 2.2.x branch and using that instead of 2.2.21.  Perhaps there was a bug fixed in there since 2.2.21 that would fix your issue.
            3. Nothing in your configuration strikes me as wrong.
            4. Clusters are typically used to deal with high message volume, but yours is so low that you shouldn't need a cluster for that.  You might be better served by a single broker instance running on the network with a broadcast group, so that all the clients can use discovery to find it (see the sketch below).
            5. HornetQ is no longer under active development.  The HornetQ code base was donated to Apache ActiveMQ over a year ago and is continuing life as the ActiveMQ Artemis broker.
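
            To illustrate #4: with a single broker broadcasting its location, a 2.2.x client can build its connection factory purely from the discovery group, along these lines (group address and port taken from your configuration; exact constructor signatures may vary slightly between 2.2.x releases):

            import javax.jms.Connection;
            import javax.jms.ConnectionFactory;

            import org.hornetq.api.core.DiscoveryGroupConfiguration;
            import org.hornetq.api.jms.HornetQJMSClient;
            import org.hornetq.api.jms.JMSFactoryType;

            public class DiscoveryClient {
               public static void main(String[] args) throws Exception {
                  // Same multicast group the broker broadcasts on (231.7.7.7:9876 above).
                  DiscoveryGroupConfiguration discovery =
                        new DiscoveryGroupConfiguration("231.7.7.7", 9876);

                  // Without HA there is a single broker; the client only discovers its location.
                  ConnectionFactory cf =
                        HornetQJMSClient.createConnectionFactoryWithoutHA(discovery, JMSFactoryType.CF);

                  Connection connection = cf.createConnection();
                  connection.close();
               }
            }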
            • 3. Re: HornetQ clustering issues/questions
              vincent.kirsch

              Hi Justin,

               

              Thanks for your reply.

               

              I know we're using old versions etc., but as I said it's too complicated at this point to do upgrades.

               

              Does this mean I shouldn't hope for more support in this particular case? Not even an idea of why it could happen, based on past posts or issues? I looked around for such things, of course, but never found anything matching our problem exactly.

               

              Thanks,

              Vincent.

              • 4. Re: HornetQ clustering issues/questions
                jbertram

                I know we're using old versions etc., but as I said it's too complicated at this point to do upgrades.

                That's a recipe for a support nightmare.

                 

                Does this mean I shouldn't hope for more support in this particular case?

                I personally wouldn't hope for much more support.  Long-term support is provided to Red Hat clients who run Red Hat commercial, open-source software like JBoss EAP.  The free, community side of things moves pretty fast and usually only provides short-term support.  Resources are obviously limited.

                 

                Not even an idea of why it could happen, based on past post or issues?

                Nothing comes to mind after reading through your description.