4 Replies Latest reply on Apr 21, 2016 10:08 AM by jbertram

    HornetQ clustering issues/questions

    vincent.kirsch

      Hi,

       

      Sorry to be so vague in the title, but as you will see, it's not a simple case.

       

      The short version first:

      We have a situation where we run 4 HornetQ servers in cluster mode, using broadcast for discovery, and with client load balancing enabled.

      The cluster works fine most of the time, and we can see that messages are indeed balanced between servers. However, at some point, for an as-yet-undetermined reason, one or two nodes in the cluster will stop receiving messages directly and will instead only consume messages from the other nodes. The problem is that once this behavior appears, it does not go away: the affected nodes never again receive messages sent directly to them until they are rebooted. Restarting the software alone isn't always sufficient. After a reboot the server acts "normally" again, but any of the 4 servers may later start acting as described.

       

      At this point I have no clue where to look to find out why this behavior occurs.
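
       

      The only programmatic probe I can think of is to poll each node's queue depth through the management address that is enabled in the configuration below. Here is a rough sketch of such a check in plain JMS; the host/port and the queue name jms.queue.request are placeholders, and I haven't verified this exact code against 2.2.21:

      import java.util.HashMap;
      import java.util.Map;

      import javax.jms.Message;
      import javax.jms.Queue;
      import javax.jms.QueueConnection;
      import javax.jms.QueueConnectionFactory;
      import javax.jms.QueueRequestor;
      import javax.jms.QueueSession;
      import javax.jms.Session;

      import org.hornetq.api.core.TransportConfiguration;
      import org.hornetq.api.jms.HornetQJMSClient;
      import org.hornetq.api.jms.JMSFactoryType;
      import org.hornetq.api.jms.management.JMSManagementHelper;

      public class QueueDepthCheck {
         public static void main(String[] args) throws Exception {
            // Placeholder host/port: run this once against each of the 4 nodes.
            Map<String, Object> params = new HashMap<String, Object>();
            params.put("host", "node1.example.com");
            params.put("port", 5445);

            QueueConnectionFactory cf = (QueueConnectionFactory) HornetQJMSClient
                  .createConnectionFactoryWithoutHA(JMSFactoryType.QUEUE_CF,
                        new TransportConfiguration(
                              "org.hornetq.core.remoting.impl.netty.NettyConnectorFactory", params));

            // Credentials may be needed depending on the security settings.
            QueueConnection connection = cf.createQueueConnection();
            try {
               QueueSession session = connection.createQueueSession(false, Session.AUTO_ACKNOWLEDGE);
               // Resolves to jms.queue.hornetq.management, the <management-address> below.
               Queue managementQueue = HornetQJMSClient.createQueue("hornetq.management");
               QueueRequestor requestor = new QueueRequestor(session, managementQueue);
               connection.start();

               Message request = session.createMessage();
               // "jms.queue.request" is a placeholder for the real queue name.
               JMSManagementHelper.putAttribute(request, "jms.queue.request", "messageCount");
               Message reply = requestor.request(request);
               System.out.println("messageCount = " + JMSManagementHelper.getResult(reply));
            } finally {
               connection.close();
            }
         }
      }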

       

      More details:

      * We're using HornetQ 2.2.21. We can't upgrade to a more recent version; we can move to newer 2.2.x releases, but no further. It's a customer constraint.

      * HornetQ is embedded in our own software

      * Our application uses Spring 3.2 (again, we can't upgrade beyond minor revisions, 3.2.X)

      * As said earlier, the broadcast discovery works fine: servers discover each other and load balance, until one (or two) of the cluster nodes starts acting as described.

       

      Here are some configuration files snippets. The configuration is identical (except node ids) on all 4 servers.

       

      1. HornetQ main config file hornetq-configuration.xml

       

      <configuration xmlns="urn:hornetq"

                     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

                     xsi:schemaLocation="urn:hornetq /schema/hornetq-configuration.xsd">

       

       

         <management-address>jms.queue.hornetq.management</management-address>

         <jmx-management-enabled>true</jmx-management-enabled>

         <message-counter-enabled>true</message-counter-enabled>

         <message-counter-sample-period>2000</message-counter-sample-period>

         <message-counter-max-day-history>1</message-counter-max-day-history>

         <clustered>true</clustered>

         <cluster-user>HORNETQ.CLUSTER.ADMIN.USER</cluster-user>

         <cluster-password>SOMEPASSWORD</cluster-password>

         

         <paging-directory>${basedir.data.dir}/paging</paging-directory>

         <bindings-directory>${basedir.data.dir}/bindings</bindings-directory>

         <journal-directory>${basedir.data.dir}/journal</journal-directory>

         <journal-min-files>10</journal-min-files>

         <large-messages-directory>${basedir.data.dir}/large-messages</large-messages-directory>

        

         <connectors>

             <connector name="netty-connector">

               <factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory</factory-class>

               <param key="host"  value="${renditionserver.hornetq.remoting.netty.host:localhost}"/>

               <param key="port"  value="${renditionserver.hornetq.remoting.netty.port:5445}"/>

            </connector>

            <connector name="in-vm">

               <factory-class>org.hornetq.core.remoting.impl.invm.InVMConnectorFactory</factory-class>

               <param key="server-id" value="${renditionserver.hornetq.server.id:0}"/>

            </connector>

         </connectors>

       

       

         <acceptors>

            <acceptor name="netty-acceptor">

               <factory-class>org.hornetq.core.remoting.impl.netty.NettyAcceptorFactory</factory-class>

               <param key="host"  value="${renditionserver.hornetq.remoting.netty.host:localhost}"/>

               <param key="port"  value="${renditionserver.hornetq.remoting.netty.port:5445}"/>

            </acceptor>

       

       

             <acceptor name="in-vm">

              <factory-class>org.hornetq.core.remoting.impl.invm.InVMAcceptorFactory</factory-class>

              <param key="server-id" value="${renditionserver.hornetq.server.id:0}"/>

            </acceptor>

         </acceptors>

        

         <!-- Clustering configuration -->

          <broadcast-groups>

            <broadcast-group name="rds-broadcast-group">

               <local-bind-address>${renditionserver.hornetq.remoting.netty.host:127.0.0.1}</local-bind-address>

               <group-address>${renditionserver.hornetq.cluster.discovery.multicastip:231.7.7.7}</group-address>

               <group-port>${renditionserver.hornetq.cluster.discovery.port:9876}</group-port>

               <broadcast-period>2000</broadcast-period>

               <connector-ref>netty-connector</connector-ref>

           </broadcast-group>

         </broadcast-groups>

        

         <discovery-groups>

            <discovery-group name="rds-discovery-group">

               <local-bind-address>${renditionserver.hornetq.remoting.netty.host:127.0.0.1}</local-bind-address>

               <group-address>${renditionserver.hornetq.cluster.discovery.multicastip:231.7.7.7}</group-address>

               <group-port>${renditionserver.hornetq.cluster.discovery.port:9876}</group-port>

               <refresh-timeout>10000</refresh-timeout>

            </discovery-group>

         </discovery-groups>

         

         <cluster-connections>

            <cluster-connection name="${renditionserver.hornetq.cluster.name}">

               <address>jms</address>

               <connector-ref>netty-connector</connector-ref>

               <retry-interval>500</retry-interval>

               <use-duplicate-detection>true</use-duplicate-detection>

               <forward-when-no-consumers>false</forward-when-no-consumers>

               <discovery-group-ref discovery-group-name="rds-discovery-group"/>

            </cluster-connection>

         </cluster-connections>

         <security-settings>

            <security-setting match="#">

               <permission type="createNonDurableQueue" roles="guest"/>

               <permission type="deleteNonDurableQueue" roles="guest"/>

               <permission type="consume" roles="guest"/>

               <permission type="send" roles="guest"/>

            </security-setting>

            <security-setting match="jms.queue.hornetq.management">

          <permission type="manage" roles="guest" />

         </security-setting>

         </security-settings>

        

         <address-settings>

            <!--default for catch all-->

            <address-setting match="#">

               <dead-letter-address>jms.queue.DLQ</dead-letter-address>

               <!-- Default redelivery settings -->

               <redelivery-delay>5000</redelivery-delay>

               <max-delivery-attempts>-1</max-delivery-attempts>

               <max-size-bytes>10485760</max-size-bytes>      

               <message-counter-history-day-limit>10</message-counter-history-day-limit>

               <address-full-policy>BLOCK</address-full-policy>

            </address-setting>

         </address-settings>

       

       

      </configuration>

       

      2. Spring configuration

       

        <!-- declare spring properties as system properties (for hornetq config files) -->

          <bean id="systemPrereqs" class="org.springframework.beans.factory.config.MethodInvokingFactoryBean" >

              <property name="targetObject" value="#{@systemProperties}" />

              <property name="targetMethod" value="putAll" />

              <property name="arguments">

                  <util:properties location="classpath:/myapp.properties" />

              </property>

          </bean>

          <!-- use placeholder for properties file -->

        <bean id="propertyConfigurer" class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">

           <property name="location" value="classpath:/myapp.properties" />

        </bean>

        

          <bean id="EmbeddedJms" class="org.hornetq.integration.spring.SpringJmsBootstrap" init-method="start" destroy-method="stop"/>

       

       

          <bean id="cachedConnectionFactory" class="org.springframework.jms.connection.CachingConnectionFactory">

              <property name="targetConnectionFactory" ref="ConnectionFactory" />

              <property name="sessionCacheSize" value="10" />

          </bean>

       

       

          <!-- Transaction Manager - used for message redelivery -->

          <bean id="transactionManager" class="org.springframework.jms.connection.JmsTransactionManager">

              <property name="connectionFactory" ref="cachedConnectionFactory" />

          </bean>

       

        <!-- Message listener -->

        <bean id="batchListenerIn" class="some.packges.BatchMessageListenerIn">

          ....Irrelevant config of the bean....

        </bean>

      ...

       

      <bean id="batchRenditionContainerIn" class="org.springframework.jms.listener.DefaultMessageListenerContainer">

      <property name="concurrentConsumers" value="20"/>

      <property name="connectionFactory" ref="cachedConnectionFactory" />

      <property name="destination" ref="queue.request" />

      <property name="messageListener" ref="batchListenerIn" />

              <property name="transactionManager" ref="transactionManager"/>

      </bean>


      /beans>

       

      <!-- Listener definition - used to receive asynchronous messages -->

      <bean id="batchRenditionContainerIn" class="org.springframework.jms.listener.DefaultMessageListenerContainer">

        <property name="concurrentConsumers" value="${renditionserver.rendition.concurrentconsumers}"/>

         <property name="connectionFactory" ref="cachedConnectionFactory" />

         <property name="destination" ref="queue.request" />

          <property name="messageListener" ref="batchListenerIn" />     

          <property name="transactionManager" ref="transactionManager"/>

      </bean>



      I would like to know if something is blatantly wrong in the above configuration, or if someone could at least give pointers as to where to look. Logs aren't really helpful; they show us what happens, but not why.


      Another strange thing we noticed is that when a server is rebooted, we find lines in the logs about node IDs not being unique (I don't have the exact message at hand right now). I've read that this shouldn't be a real issue and that the warning should appear exactly once after a server restart; however, we noticed that the line appears 6 times, all with the same ID. I would have understood 4 occurrences, but not 6.

      This kind of "phantom servers" phenomenon has also been seen when connecting with JMX: we saw 6 HornetQ instances where 4 were expected. After rebooting one of them, the number dropped to 5 and stayed there.
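
      A way to count what is actually registered is to enumerate the HornetQ server MBeans over JMX, roughly as below. The JMX service URL is a placeholder, and the query assumes HornetQ's default org.hornetq:module=Core,type=Server object-name layout:

      import java.util.Set;

      import javax.management.MBeanServerConnection;
      import javax.management.ObjectName;
      import javax.management.remote.JMXConnector;
      import javax.management.remote.JMXConnectorFactory;
      import javax.management.remote.JMXServiceURL;

      public class PhantomServerCheck {
         public static void main(String[] args) throws Exception {
            // Placeholder URL: point this at each machine's JMX port in turn.
            JMXServiceURL url = new JMXServiceURL(
                  "service:jmx:rmi:///jndi/rmi://node1.example.com:9999/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
               MBeanServerConnection mbsc = connector.getMBeanServerConnection();
               // One MBean per live HornetQ server instance in that JVM.
               Set<ObjectName> servers = mbsc.queryNames(
                     new ObjectName("org.hornetq:module=Core,type=Server,*"), null);
               for (ObjectName name : servers) {
                  System.out.println(name + " started=" + mbsc.getAttribute(name, "Started"));
               }
            } finally {
               connector.close();
            }
         }
      }

      If a single JVM ever reported more than one server MBean, that would point at the embedded broker being started twice rather than at the network.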


      Hopefully someone has an idea of what might be happening.



      Thanks,

      Vincent


        • 1. Re: HornetQ clustering issues/questions
          vincent.kirsch

          A few more things that could be relevant:

          * The JMS message payloads are very small; each contains a couple of URLs that are used to download/upload files.

          * Something that could help us would be a fail-safe method of knowing, in the clustering scenario, where a message comes from (the server itself or another server in the cluster). For the moment we must enable debug or trace log levels on Spring or HornetQ, which is impractical given the size of the log files it generates. A possible workaround is sketched after this list.

          * The servers are Windows machines; I can find out the exact version if relevant.

          * No network issues were detected

          * There isn't a very big workload; we're talking 4000-8000 messages per day.
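
           

          The workaround mentioned above could be to stamp every message on send with the id of the node that produced it, so a consumer can compare it against its own node id; if they differ, the message must have travelled over the cluster bridge. Here is a sketch of the producer side using Spring's MessagePostProcessor (the originNodeId property name is our own invention, and this assumes producers always talk to their local embedded broker, which is our setup):

          import javax.jms.JMSException;
          import javax.jms.Message;

          import org.springframework.jms.core.JmsTemplate;
          import org.springframework.jms.core.MessagePostProcessor;

          public class OriginStampingSender {

             // Hypothetical property carrying the producing node's id. We already pass
             // renditionserver.hornetq.server.id to each JVM as a system property.
             private static final String ORIGIN_PROPERTY = "originNodeId";

             private final JmsTemplate jmsTemplate;
             private final String nodeId = System.getProperty("renditionserver.hornetq.server.id", "0");

             public OriginStampingSender(JmsTemplate jmsTemplate) {
                this.jmsTemplate = jmsTemplate;
             }

             public void send(String destination, String text) {
                jmsTemplate.convertAndSend(destination, text, new MessagePostProcessor() {
                   public Message postProcessMessage(Message message) throws JMSException {
                      // Stamp the producing node; a consumer whose own node id differs
                      // knows the message travelled over the cluster bridge.
                      message.setStringProperty(ORIGIN_PROPERTY, nodeId);
                      return message;
                   }
                });
             }
          }

          On the listener side, message.getStringProperty("originNodeId") could then be compared with the local node id and logged at info level.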

           

          Thanks!

          • 2. Re: HornetQ clustering issues/questions
            jbertram

            Couple of things:

            1. HornetQ 2.2.21 was tagged almost 4 years ago now.  Lots of work has been done since then.
            2. You can try building the 2.2.x branch and using that instead of 2.2.21.  Perhaps there was a bug fixed in there since 2.2.21 that would fix your issue.
            3. Nothing in your configuration strikes me as wrong.
            4. Clusters are typically used to deal with high message volume, but yours is so low that you shouldn't need a cluster for that.  You might be better served by a single broker instance running on the network with a broadcast group, so that all the clients can use discovery to find it (see the sketch below).
            5. HornetQ is no longer under active development.  The HornetQ code base was donated to Apache ActiveMQ over a year ago and is continuing life as the ActiveMQ Artemis broker.
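
            To illustrate #4: with a single broker broadcasting its location, a 2.2.x client can build its connection factory purely from the discovery group, along these lines (group address and port taken from your configuration; exact constructor signatures may vary slightly between 2.2.x releases):

            import javax.jms.Connection;
            import javax.jms.ConnectionFactory;

            import org.hornetq.api.core.DiscoveryGroupConfiguration;
            import org.hornetq.api.jms.HornetQJMSClient;
            import org.hornetq.api.jms.JMSFactoryType;

            public class DiscoveryClient {
               public static void main(String[] args) throws Exception {
                  // Same multicast group the broker broadcasts on (231.7.7.7:9876 above).
                  DiscoveryGroupConfiguration discovery =
                        new DiscoveryGroupConfiguration("231.7.7.7", 9876);

                  // Without HA there is a single broker; the client only discovers its location.
                  ConnectionFactory cf =
                        HornetQJMSClient.createConnectionFactoryWithoutHA(discovery, JMSFactoryType.CF);

                  Connection connection = cf.createConnection();
                  connection.close();
               }
            }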
            • 3. Re: HornetQ clustering issues/questions
              vincent.kirsch

              Hi Justin,

               

              Thanks for your reply.

               

              I know we're using old versions etc., but as I said it's too complicated at this point to do upgrades.

               

              Does this mean I shouldn't hope for more support in this particular case? Not even an idea of why it could happen, based on past posts or issues? I looked around for such things, of course, but never found anything matching our problem exactly.

               

              Thanks,

              Vincent.

              • 4. Re: HornetQ clustering issues/questions
                jbertram

                I know we're using old versions etc., but as I said it's too complicated at this point to do upgrades.

                That's a recipe for a support nightmare.

                 

                Does this mean I shouldn't hope for more support in this particular case?

                I personally wouldn't hope for much more support.  Long-term support is provided to Red Hat clients who run Red Hat commercial, open-source software like JBoss EAP.  The free, community side of things moves pretty fast and usually only provides short-term support.  Resources are obviously limited.

                 

                Not even an idea of why it could happen, based on past post or issues?

                Nothing comes to mind after reading through your description.