6 Replies Latest reply on Dec 11, 2012 2:36 AM by Alexander Hartner

    Cluster configuration and missing messages

    Alexander Hartner Expert

I have a cluster of two active HornetQ installations, as well as a corresponding backup node for each of the active servers. These are distributed across two systems:

       

      System 1             System 2
      ----------------     ----------------
      Active Server 1      Active Server 2
      Backup Server 2      Backup Server 1

      Shared NFS directories

       

      Active Server 1 and Backup Server 1 share a common directory on an NFS share. Similarly, Active Server 2 and Backup Server 2 share a different directory.
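For completeness, the live/backup pairing is configured in each server's hornetq-configuration.xml along these lines (the directory paths below are placeholders for my actual NFS mount points, not the real values):

```xml
<!-- Backup Server 1 (running on System 2); Active Server 1 points at the
     same shared-store directories but with backup set to false -->
<backup>true</backup>
<shared-store>true</shared-store>

<paging-directory>/mnt/nfs/server1/paging</paging-directory>
<bindings-directory>/mnt/nfs/server1/bindings</bindings-directory>
<journal-directory>/mnt/nfs/server1/journal</journal-directory>
<large-messages-directory>/mnt/nfs/server1/large-messages</large-messages-directory>
```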

       

      My aim is to achieve high availability and to distribute the workload between the two systems.

       

      The connection factory is configured as follows:

        <connection-factory name="NettyConnectionFactory">

          <xa>true</xa>

          <ha>true</ha>

          <!-- Pause 1 second between connect attempts -->

          <retry-interval>1000</retry-interval>

          <!-- Multiply subsequent reconnect pauses by this multiplier. This can be used to

          implement an exponential back-off. For our purposes we just set to 1.0 so each reconnect

          pause is the same length -->

          <retry-interval-multiplier>1.0</retry-interval-multiplier>

          <!-- Try reconnecting an unlimited number of times (-1 means "unlimited") -->

          <reconnect-attempts>-1</reconnect-attempts>

          <client-failure-check-period>20000</client-failure-check-period>

          <failover-on-server-shutdown>true</failover-on-server-shutdown>

          <failover-on-initial-connection>true</failover-on-initial-connection>

          <discovery-group-ref discovery-group-name="dg-group1"/>

          <connectors>

            <connector-ref connector-name="netty"/>

          </connectors>

          <entries>

            <entry name="/SpecialConnectionFactory"/>

          </entries>

          <connection-load-balancing-policy-class-name>org.hornetq.api.core.client.loadbalance.RandomConnectionLoadBalancingPolicy</connection-load-balancing-policy-class-name>

        </connection-factory>

      ..

        <queue name="ExpiryQueue">

          <entry name="/queue/ExpiryQueue"/>

        </queue>

      In the configuration I set the redistribution delay to 0:

        <address-settings>

            <!--default for catch all-->

          <address-setting match="#">

            <dead-letter-address>jms.queue.DLQ</dead-letter-address>

            <expiry-address>jms.queue.ExpiryQueue</expiry-address>

            <redistribution-delay>0</redistribution-delay>

            <redelivery-delay>0</redelivery-delay>

            <max-size-bytes>1048576000</max-size-bytes>

            <message-counter-history-day-limit>10</message-counter-history-day-limit>

            <address-full-policy>BLOCK</address-full-policy>

          </address-setting>

          <address-setting match="jms.#">

            <redistribution-delay>0</redistribution-delay>

          </address-setting>

        </address-settings>

      I am busy testing that the configuration is stable and resilient, but I have run into problems with messages not being delivered.

       

      My JMS client application sends a series of messages to the queue.

       

      InitialContext initialContext = new InitialContext();

      ConnectionFactory connectionFactory = (ConnectionFactory) initialContext.lookup("/SpecialConnectionFactory");

      connection = connectionFactory.createConnection();

      connection.setExceptionListener(this);

      connection.start();

      destination = (Destination) initialContext.lookup("/queue/ExpiryQueue");

      session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);

      producer = session.createProducer(destination);

      ..//Loop for X iterations

      TextMessage message = session.createTextMessage("Message :" + index);

      producer.send(message);

      ..

      producer.close();

      initialContext.close();

      connection.close();

      Running this application results in the messages being distributed equally across both nodes.

       

      However, when the receiving application connects to the server it only receives the messages from one node, plus one message from the other. So let's say I send 3000 messages initially; these are distributed with 1500 on each node. When the receiving application connects and pulls messages off, it is given 1501 on the first attempt and 1499 on the second, but never all 3000 in a single pass. To get around this I implemented a small delay: when no more messages appear on the queue, the consumer waits one second and checks again.

      InitialContext initialContext = new InitialContext(props);

      ConnectionFactory connectionFactory = (ConnectionFactory) initialContext.lookup("/SpecialConnectionFactory");

      connection = connectionFactory.createConnection();

      connection.start();

      Destination queue = (Destination) initialContext.lookup("/queue/ExpiryQueue");

      Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);

      MessageConsumer consumer = session.createConsumer(queue);

      TextMessage message = null;

      boolean almostDone=false;

      boolean reallyDone=false;

      do {

              message = (TextMessage) consumer.receive(5000);

              if (!almostDone && message==null){

                //we reached the end of the available messages. try again in one second

                almostDone = true;

                Thread.sleep(1000);

              }

              else if (almostDone && message==null){

                //now we are really done as there are no more messages

                reallyDone = true;

              }

              else if (almostDone && message!=null){

                //got another message; resetting done status

                almostDone = false;

              }       

      }

      while (!reallyDone);

       

      Now here are my questions:

       

      1.) Is this the best way to deal with message redistribution in a cluster? What other options are there? I really don't like the idea of having to check twice whether the end of the queue has been reached.

       

      2.) In the JNDI configuration I include a list of all 4 servers (active and stand-by):

      java.naming.provider.url=192.168.0.21:1099,192.168.0.24:2099,192.168.0.24:1099,192.168.0.21:2099

      192.168.0.21:1099 - Server 1 Active

      192.168.0.24:2099 - Server 2 Active

      192.168.0.24:1099 - Server 1 Backup

      192.168.0.21:2099 - Server 2 Backup
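The same settings, built programmatically for the `new InitialContext(props)` call in the receiver above (the context-factory and URL-package-prefix values are the standard JBoss ones, included here as assumptions for completeness):

```java
import java.util.Properties;
import javax.naming.Context;

public class JndiProps {

    // Builds the client-side JNDI properties, equivalent to jndi.properties
    public static Properties clientJndiProps() {
        Properties props = new Properties();
        // Standard JBoss JNDI client settings
        props.put(Context.INITIAL_CONTEXT_FACTORY, "org.jnp.interfaces.NamingContextFactory");
        props.put(Context.URL_PKG_PREFIXES, "org.jboss.naming:org.jnp.interfaces");
        // All four servers: active servers first, then the stand-by servers
        props.put(Context.PROVIDER_URL,
                "192.168.0.21:1099,192.168.0.24:2099,192.168.0.24:1099,192.168.0.21:2099");
        return props;
    }
}
```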

       

      In the sender example above I am setting the exception listener on the connection "connection.setExceptionListener(this);"

      public void onException(final JMSException exception) {

          try {

              System.err.println("Exception Handling " + exception.getMessage());

              producer.close();

              initialContext.close();

              connection.close();

              Thread.sleep(10000);

              ConnectionFactory connectionFactory = (ConnectionFactory) initialContext.lookup("/SpecialConnectionFactory");

              connection = connectionFactory.createConnection();

              connection.setExceptionListener(this);

              connection.start();

          } catch (Exception e) {

              System.err.println("Exception Handling : Failed to handle failover");

              e.printStackTrace();

          }

      }

      This doesn't seem to work, as the following exception occurs; for some reason a working session is never re-established:

      javax.jms.IllegalStateException: Session is closed

          at org.hornetq.jms.client.HornetQSession.checkClosed(HornetQSession.java:1008)

          at org.hornetq.jms.client.HornetQSession.createTextMessage(HornetQSession.java:194)

          at com.abc.ClientSender.runTest(ClientSender.java:63)

          at com.abc.ClientSender.main(ClientSender.java:23)

      3.) In this configuration only two servers have the JNDI resources available at any one time. So should active Server 1 fail, its stand-by component takes over, and when it is restored it becomes the active server again. My problem is that when I look up resources in JNDI, depending on the state of the servers, I get a not-found exception. I understand that the resources are not available on stand-by servers; however, given that I don't have, nor want, control over which server is active, is there a way to ensure the client only queries active servers for JNDI resources? Querying a server in stand-by mode will inevitably result in a not-found exception. Any suggestions on how best to deal with this situation?
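For question 3, the only workaround I can think of (this is just a sketch; the factory and prefix values are the standard JBoss ones) is to try each provider URL individually and fall through to the next one whenever a NamingException comes back, whether the server is down or merely in stand-by:

```java
import java.util.Properties;
import javax.naming.Context;
import javax.naming.InitialContext;
import javax.naming.NamingException;

public class FailoverJndiLookup {

    // All four JNDI providers, active servers first.
    // These mirror the java.naming.provider.url list above.
    public static final String[] PROVIDERS = {
        "jnp://192.168.0.21:1099", // Server 1 active
        "jnp://192.168.0.24:2099", // Server 2 active
        "jnp://192.168.0.24:1099", // Server 1 backup
        "jnp://192.168.0.21:2099"  // Server 2 backup
    };

    // Try each provider in turn. A server that is unreachable or in
    // stand-by throws a NamingException, which is swallowed until one
    // lookup succeeds; if none do, the last exception is rethrown.
    public static Object lookup(String name) throws NamingException {
        NamingException last = null;
        for (String url : PROVIDERS) {
            Properties props = new Properties();
            props.put(Context.INITIAL_CONTEXT_FACTORY, "org.jnp.interfaces.NamingContextFactory");
            props.put(Context.URL_PKG_PREFIXES, "org.jboss.naming:org.jnp.interfaces");
            props.put(Context.PROVIDER_URL, url);
            try {
                InitialContext ctx = new InitialContext(props);
                try {
                    return ctx.lookup(name);
                } finally {
                    ctx.close();
                }
            } catch (NamingException e) {
                last = e; // dead or stand-by server; try the next one
            }
        }
        throw last;
    }
}
```

I don't particularly like this either, since it turns one lookup into up to four network round-trips; caching the first context that answers would soften that.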