6 Replies Latest reply on Dec 11, 2012 2:36 AM by Alexander Hartner

    Cluster configuration and missing messages

    Alexander Hartner Expert

I have a cluster of two active HornetQ installations, as well as a corresponding backup node for each of the active servers. These are distributed across two systems:

       

      System 1             System 2
      ----------------     ----------------
      Active Server 1      Active Server 2
      Backup Server 2      Backup Server 1

      Shared NFS directories

       

      Active Server 1 and Backup Server 1 share a common directory on an NFS share. Similarly, Active Server 2 and Backup Server 2 share a different directory.
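For completeness, the live/backup pairing is configured in each server's hornetq-configuration.xml along these lines (the directory paths below are placeholders for my actual NFS mount points, not the real values):

```xml
<!-- Backup Server 1 (running on System 2); Active Server 1 points at the
     same shared-store directories but with backup set to false -->
<backup>true</backup>
<shared-store>true</shared-store>

<paging-directory>/mnt/nfs/server1/paging</paging-directory>
<bindings-directory>/mnt/nfs/server1/bindings</bindings-directory>
<journal-directory>/mnt/nfs/server1/journal</journal-directory>
<large-messages-directory>/mnt/nfs/server1/large-messages</large-messages-directory>
```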

       

      My aim is to achieve high availability and to distribute the workload between the two systems.

       

      The connection factory is configured as follows:

        <connection-factory name="NettyConnectionFactory">

          <xa>true</xa>

          <ha>true</ha>

          <!-- Pause 1 second between connect attempts -->

          <retry-interval>1000</retry-interval>

          <!-- Multiply subsequent reconnect pauses by this multiplier. This can be used to

          implement an exponential back-off. For our purposes we just set to 1.0 so each reconnect

          pause is the same length -->

          <retry-interval-multiplier>1.0</retry-interval-multiplier>

          <!-- Try reconnecting an unlimited number of times (-1 means "unlimited") -->

          <reconnect-attempts>-1</reconnect-attempts>

          <client-failure-check-period>20000</client-failure-check-period>

          <failover-on-server-shutdown>true</failover-on-server-shutdown>

          <failover-on-initial-connection>true</failover-on-initial-connection>

          <discovery-group-ref discovery-group-name="dg-group1"/>

          <connectors>

            <connector-ref connector-name="netty"/>

          </connectors>

          <entries>

            <entry name="/SpecialConnectionFactory"/>

          </entries>

          <connection-load-balancing-policy-class-name>org.hornetq.api.core.client.loadbalance.RandomConnectionLoadBalancingPolicy</connection-load-balancing-policy-class-name>

        </connection-factory>

      ..

        <queue name="ExpiryQueue">

          <entry name="/queue/ExpiryQueue"/>

        </queue>

      In the configuration I set the redistribution delay to 0:

        <address-settings>

            <!--default for catch all-->

          <address-setting match="#">

            <dead-letter-address>jms.queue.DLQ</dead-letter-address>

            <expiry-address>jms.queue.ExpiryQueue</expiry-address>

            <redistribution-delay>0</redistribution-delay>

            <redelivery-delay>0</redelivery-delay>

            <max-size-bytes>1048576000</max-size-bytes>

            <message-counter-history-day-limit>10</message-counter-history-day-limit>

            <address-full-policy>BLOCK</address-full-policy>

          </address-setting>

          <address-setting match="jms.#">

            <redistribution-delay>0</redistribution-delay>

          </address-setting>

        </address-settings>

      I am busy testing that the configuration is stable and resilient, but I have run into problems with messages not being delivered.

       

      My JMS client application sends a series of messages to the queue.

       

      InitialContext initialContext = new InitialContext();

      ConnectionFactory connectionFactory = (ConnectionFactory) initialContext.lookup("/SpecialConnectionFactory");

      connection = connectionFactory.createConnection();

      connection.setExceptionListener(this);

      connection.start();

      destination = (Destination) initialContext.lookup("/queue/ExpiryQueue");

      session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);

      producer = session.createProducer(destination);

      ..//Loop for X iterations

      TextMessage message = session.createTextMessage("Message :" + index);

      producer.send(message);

      ..

      producer.close();

      initialContext.close();

      connection.close();

      Running this application results in the messages being distributed equally across both nodes.

       

      However, when the receiving application connects to the server it only receives the messages from one node, plus one message from the other. So let's say I send 3000 messages initially; these are distributed with 1500 on each node. When the receiving application connects and pulls messages off, it is given 1501 on the first attempt and 1499 on the second, but never all 3000 in a single pass. To get around this I implemented a small delay: when no more messages appear on the queue, the consumer waits one second and checks again.

      InitialContext initialContext = new InitialContext(props);

      ConnectionFactory connectionFactory = (ConnectionFactory) initialContext.lookup("/SpecialConnectionFactory");

      connection = connectionFactory.createConnection();

      connection.start();

      Destination queue = (Destination) initialContext.lookup("/queue/ExpiryQueue");

      Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);

      MessageConsumer consumer = session.createConsumer(queue);

      TextMessage message = null;

      boolean almostDone=false;

      boolean reallyDone=false;

      do {

              message = (TextMessage) consumer.receive(5000);

              if (!almostDone && message==null){

                //we reached the end of the available messages. try again in one second

                almostDone = true;

                Thread.sleep(1000);

              }

              else if (almostDone && message==null){

                //now we are really done as there are no more messages

                reallyDone = true;

              }

              else if (almostDone && message!=null){

                //got another message; resetting done status

                almostDone = false;

              }       

      }

      while (!reallyDone);

       

      Now here are my questions:

       

      1.) Is this the best way to deal with message redistribution in a cluster? What other options are there? I really don't like the idea of having to check twice whether the end of the queue has been reached.

       

      2.) In the JNDI configuration I include a list of all 4 servers (active and stand-by):

      java.naming.provider.url=192.168.0.21:1099,192.168.0.24:2099,192.168.0.24:1099,192.168.0.21:2099

      192.168.0.21:1099 - Server 1 Active

      192.168.0.24:2099 - Server 2 Active

      192.168.0.24:1099 - Server 1 Backup

      192.168.0.21:2099 - Server 2 Backup
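The same settings, built programmatically for the `new InitialContext(props)` call in the receiver above (the context-factory and URL-package-prefix values are the standard JBoss ones, included here as assumptions for completeness):

```java
import java.util.Properties;
import javax.naming.Context;

public class JndiProps {

    // Builds the client-side JNDI properties, equivalent to jndi.properties
    public static Properties clientJndiProps() {
        Properties props = new Properties();
        // Standard JBoss JNDI client settings
        props.put(Context.INITIAL_CONTEXT_FACTORY, "org.jnp.interfaces.NamingContextFactory");
        props.put(Context.URL_PKG_PREFIXES, "org.jboss.naming:org.jnp.interfaces");
        // All four servers: active servers first, then the stand-by servers
        props.put(Context.PROVIDER_URL,
                "192.168.0.21:1099,192.168.0.24:2099,192.168.0.24:1099,192.168.0.21:2099");
        return props;
    }
}
```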

       

      In the sender example above I am setting the exception listener on the connection "connection.setExceptionListener(this);"

      public void onException(final JMSException exception) {

          try {

              System.err.println("Exception Handling " + exception.getMessage());

              producer.close();

              initialContext.close();

              connection.close();

              Thread.sleep(10000);

              ConnectionFactory connectionFactory = (ConnectionFactory) initialContext.lookup("/SpecialConnectionFactory");

              connection = connectionFactory.createConnection();

              connection.setExceptionListener(this);

              connection.start();

          } catch (Exception e) {

              System.err.println("Exception Handling : Failed to handle failover");

              e.printStackTrace();

          }

      }

      This doesn't seem to work, as the following exception occurs; for some reason a working session is never re-established:

      javax.jms.IllegalStateException: Session is closed

          at org.hornetq.jms.client.HornetQSession.checkClosed(HornetQSession.java:1008)

          at org.hornetq.jms.client.HornetQSession.createTextMessage(HornetQSession.java:194)

          at com.abc.ClientSender.runTest(ClientSender.java:63)

          at com.abc.ClientSender.main(ClientSender.java:23)

      3.) In this configuration only two servers have the JNDI resources available at any one time. So should active Server 1 fail, its stand-by component takes over, and when it is restored it becomes the active server again. My problem is that when I look up resources in JNDI, depending on the state of the servers, I get a not-found exception. I understand that the resources are not available on stand-by servers; however, given that I don't have, nor want, control over which server is active, is there a way to ensure the client only queries active servers for JNDI resources? Querying a server in stand-by mode will inevitably result in a not-found exception. Any suggestions on how best to deal with this situation?
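For question 3, the only workaround I can think of (this is just a sketch; the factory and prefix values are the standard JBoss ones) is to try each provider URL individually and fall through to the next one whenever a NamingException comes back, whether the server is down or merely in stand-by:

```java
import java.util.Properties;
import javax.naming.Context;
import javax.naming.InitialContext;
import javax.naming.NamingException;

public class FailoverJndiLookup {

    // All four JNDI providers, active servers first.
    // These mirror the java.naming.provider.url list above.
    public static final String[] PROVIDERS = {
        "jnp://192.168.0.21:1099", // Server 1 active
        "jnp://192.168.0.24:2099", // Server 2 active
        "jnp://192.168.0.24:1099", // Server 1 backup
        "jnp://192.168.0.21:2099"  // Server 2 backup
    };

    // Try each provider in turn. A server that is unreachable or in
    // stand-by throws a NamingException, which is swallowed until one
    // lookup succeeds; if none do, the last exception is rethrown.
    public static Object lookup(String name) throws NamingException {
        NamingException last = null;
        for (String url : PROVIDERS) {
            Properties props = new Properties();
            props.put(Context.INITIAL_CONTEXT_FACTORY, "org.jnp.interfaces.NamingContextFactory");
            props.put(Context.URL_PKG_PREFIXES, "org.jboss.naming:org.jnp.interfaces");
            props.put(Context.PROVIDER_URL, url);
            try {
                InitialContext ctx = new InitialContext(props);
                try {
                    return ctx.lookup(name);
                } finally {
                    ctx.close();
                }
            } catch (NamingException e) {
                last = e; // dead or stand-by server; try the next one
            }
        }
        throw last;
    }
}
```

I don't particularly like this either, since it turns one lookup into up to four network round-trips; caching the first context that answers would soften that.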