26 Replies Latest reply on May 31, 2007 4:11 AM by timfox

    1.2.0.GA transparent node failover does not always work

    bander

      Following on from this thread: http://www.jboss.org/index.html?module=bb&op=viewtopic&t=102491

      I'm currently experiencing multiple failover issues with the 1.2.0.GA release. I'm running two clustered nodes on my local machine (JB4.0.4, Win XP, JVM1.4.2) using all the default settings, following the clustered node instructions in the user guide.

      After starting both messaging-node0 and messaging-node1 I start my test case (attached).

      The first problem with my test case is that the message listener never receives any of the dispatched messages. The test case creates a message dispatcher and a message listener; the dispatcher sends messages to a queue that the listener is attached to. This happens regardless of the queue type, i.e. clustered or non-clustered (testDistributedQueue or testQueue).
      The only way I can get the listener to start receiving messages is to kill one of the nodes, e.g. kill node0.

      Initially I thought my listener might have ended up on a different node from the dispatcher, so it could not see the messages being dispatched - but I thought JBoss Messaging handled this scenario?

      The second issue is that it's pretty easy to stop messages being dispatched and received altogether by randomly stopping and starting the individual nodes. For example, after stopping both nodes and bringing one back up, my test case was unable to get a connection.

      I'm interested to know if anyone is seeing similar behaviour.

      Ben

import java.util.Hashtable;

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.ExceptionListener;
import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.MessageListener;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.naming.Context;
import javax.naming.InitialContext;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class ReconnectTest {

    /**
     * Sends a message to the queue once a second. On any failure
     * (initialisation or send) the JMS objects are recycled: closed and
     * recreated from the connection factory.
     */
    class DispatcherThread extends Thread {

        private ConnectionFactory connectionFactory;

        private String id;

        private boolean initialised = false;

        private Queue queue;

        // Set from the JMS exception listener thread and from shutdown(),
        // read by this thread - hence volatile.
        private volatile boolean recycle = false;

        private volatile boolean shutdown = false;

        public DispatcherThread(ConnectionFactory connectionFactory, Queue queue, String id) {
            super();
            this.connectionFactory = connectionFactory;
            this.queue = queue;
            this.id = id;
            this.setName(id);
        }

        private boolean isRecycle() {
            return recycle;
        }

        public void run() {
            Connection connection = null;
            Session session = null;
            MessageProducer producer = null;
            ExceptionListener exceptionListener = null;

            while (!shutdown) {
                if (!initialised) {
                    try {
                        connection = connectionFactory.createConnection();
                        exceptionListener = new ExceptionListener() {
                            public void onException(JMSException ex) {
                                LOG.error("Received connection exception", ex);
                                recycle = true;
                            }
                        };
                        connection.setExceptionListener(exceptionListener);
                        session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
                        producer = session.createProducer(queue);
                        LOG.info(id + " initialised");
                        initialised = true;
                    } catch (JMSException ex) {
                        LOG.error("Caught exception during initialisation", ex);
                        recycle = true;
                        // Pause so a dead server is retried at intervals
                        // rather than in a tight loop.
                        try {
                            Thread.sleep(1000);
                        } catch (InterruptedException ie) {
                            // ignore and retry
                        }
                    }
                }
                if (isRecycle()) {
                    // Discard the broken JMS objects; the next iteration
                    // re-initialises from scratch.
                    JMSHelper.close(producer);
                    JMSHelper.close(session);
                    JMSHelper.close(connection);
                    initialised = false;
                    recycle = false;
                }
                if (initialised && !recycle && !shutdown) {
                    try {
                        Thread.sleep(1000);
                        Message message = session.createTextMessage("This is a test");
                        producer.send(message);
                        LOG.info(id + " dispatched message");
                    } catch (Exception ex) {
                        LOG.error("Caught exception during send", ex);
                        recycle = true;
                    }
                }
            }
        }

        public void shutdown() {
            LOG.info(id + " is shutting down");
            recycle = true;
            shutdown = true;
        }
    }

    /** Quiet close helpers: close each JMS object, logging any exception. */
    static class JMSHelper {

        public static void close(Connection connection) {
            if (connection != null) {
                try {
                    connection.close();
                } catch (Exception ex) {
                    LOG.error("Caught exception when closing connection", ex);
                }
            }
        }

        public static void close(MessageConsumer consumer) {
            if (consumer != null) {
                try {
                    consumer.close();
                } catch (Exception ex) {
                    LOG.error("Caught exception when closing consumer", ex);
                }
            }
        }

        public static void close(MessageProducer producer) {
            if (producer != null) {
                try {
                    producer.close();
                } catch (Exception ex) {
                    LOG.error("Caught exception when closing producer", ex);
                }
            }
        }

        public static void close(Session session) {
            if (session != null) {
                try {
                    session.close();
                } catch (Exception ex) {
                    LOG.error("Caught exception when closing session", ex);
                }
            }
        }
    }

    /**
     * Keeps a consumer (with the supplied MessageListener) attached to the
     * queue, recycling the connection whenever a failure is reported.
     */
    class ListenerManagerThread extends Thread {

        private ConnectionFactory connectionFactory;

        private String id;

        private boolean initialised = false;

        private MessageListener messageListener;

        private Queue queue;

        // Volatile for the same reason as in DispatcherThread.
        private volatile boolean recycle = false;

        private volatile boolean shutdown = false;

        public ListenerManagerThread(ConnectionFactory connectionFactory,
                Queue queue, MessageListener messageListener, String id) {
            super();
            this.connectionFactory = connectionFactory;
            this.queue = queue;
            this.messageListener = messageListener;
            this.id = id;
            this.setName(id);
        }

        private boolean isRecycle() {
            return recycle;
        }

        public void run() {
            Connection connection = null;
            Session session = null;
            MessageConsumer consumer = null;
            ExceptionListener exceptionListener = null;

            while (!shutdown) {
                if (!initialised) {
                    try {
                        connection = connectionFactory.createConnection();
                        exceptionListener = new ExceptionListener() {
                            public void onException(JMSException ex) {
                                LOG.error("Received connection exception", ex);
                                recycle = true;
                            }
                        };
                        connection.setExceptionListener(exceptionListener);
                        session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
                        consumer = session.createConsumer(queue);
                        consumer.setMessageListener(messageListener);
                        connection.start();
                        LOG.info(id + " initialised");
                        initialised = true;
                    } catch (JMSException ex) {
                        LOG.error("Caught exception during initialisation", ex);
                        recycle = true;
                    }
                }
                if (isRecycle()) {
                    JMSHelper.close(consumer);
                    JMSHelper.close(session);
                    JMSHelper.close(connection);
                    initialised = false;
                    recycle = false;
                }
                // Messages arrive asynchronously via the listener; this
                // thread only polls the recycle/shutdown flags.
                try {
                    Thread.sleep(1000);
                } catch (InterruptedException ex) {
                    LOG.error("Caught exception during sleep", ex);
                }
            }
        }

        public void shutdown() {
            LOG.info(id + " is shutting down");
            recycle = true;
            shutdown = true;
        }
    }

    /** Trivial listener that logs each message it receives. */
    class SimpleListener implements MessageListener {

        private String id;

        public SimpleListener(String id) {
            super();
            this.id = id;
        }

        public void onMessage(Message message) {
            LOG.info(id + " received message");
        }
    }

    private static final Log LOG = LogFactory.getLog(ReconnectTest.class);

    public static void main(String[] args) {
        ReconnectTest test = new ReconnectTest();

        try {
            test.start();
        } catch (Throwable ex) {
            LOG.error("Caught exception in main", ex);
        }
    }

    private void start() throws Exception {
        // JNDI environment for the cluster. Raw Hashtable because the test
        // targets JVM 1.4, which predates generics.
        Hashtable properties1 = new Hashtable();
        properties1.put(Context.INITIAL_CONTEXT_FACTORY,
                "org.jnp.interfaces.NamingContextFactory");
        properties1.put(Context.URL_PKG_PREFIXES,
                "org.jboss.naming:org.jnp.interfaces");
        properties1.put(Context.PROVIDER_URL, "jnp://localhost:1099");
        properties1.put(Context.SECURITY_PRINCIPAL, "admin");
        properties1.put(Context.SECURITY_CREDENTIALS, "admin");

        Context context1 = new InitialContext(properties1);
        ConnectionFactory connectionFactory1 =
                (ConnectionFactory) context1.lookup("ConnectionFactory");
        Queue queue1 = (Queue) context1.lookup("/queue/testDistributedQueue");

        MessageListener listener1 = new SimpleListener("Listener.1");
        ListenerManagerThread manager1 = new ListenerManagerThread(
                connectionFactory1, queue1, listener1, "ListenerManager.1");
        manager1.start();

        DispatcherThread dispatcher1 = new DispatcherThread(connectionFactory1,
                queue1, "Dispatcher.1");
        dispatcher1.start();

        // Run until the process is killed; the shutdown sequence below is
        // effectively unreachable.
        Thread.sleep(Long.MAX_VALUE);

        manager1.shutdown();
        manager1.join();

        dispatcher1.shutdown();
        dispatcher1.join();

        context1.close();
    }
}
      


        • 1. Re: 1.2.0.GA transparent node failover does not always work
          timfox

          Ben-

          We're not currently seeing the problems you describe, but I'll try and run your test some time this week to see what's going on.

          • 2. Re: 1.2.0.GA transparent node failover does not always work
            timfox


            "bander" wrote:
            Following on from this thread: http://www.jboss.org/index.html?module=bb&op=viewtopic&t=102491

            I'm currently experiencing multiple failover issues with the 1.2.0.GA release. I'm running two clustered nodes on my local machine (JB4.0.4, Win XP, JVM1.4.2) using all the default settings, following the clustered node instructions in the user guide.

            After starting both messaging-node0 and messaging-node1 I start my test case (attached).

            The first problem I have with the test case that I created is that the message listener does not receive any of the dispatched messages (the test case creates a message dispatcher and message listener - the dispatcher sends a message to a queue that the listener is attached to). This happens regardless of the queue type (i.e. clustered/non-clustered - in this case testDistributedQueue or testQueue).
            The only way I can get the listener to start receiving messages is to kill one of the nodes e.g. kill node0.


             Looking at your code, I see you are creating the first dispatcher connection to node 0 and the first listener connection to node 1. The clustered connection factory will create subsequent connections on different nodes according to (by default) a round-robin policy.
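             To make that concrete, creating two connections from the clustered factory looks like the sketch below (the JNDI settings mirror the test case above; which node each connection actually lands on is decided by the factory - the comments just show the round-robin expectation):

import java.util.Hashtable;

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.naming.Context;
import javax.naming.InitialContext;

// Sketch: successive createConnection() calls on the clustered factory are
// load-balanced across the nodes (round-robin by default).
public class RoundRobinDemo {

    public static void main(String[] args) throws Exception {
        Hashtable env = new Hashtable();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "org.jnp.interfaces.NamingContextFactory");
        env.put(Context.URL_PKG_PREFIXES, "org.jboss.naming:org.jnp.interfaces");
        env.put(Context.PROVIDER_URL, "jnp://localhost:1099");

        Context ctx = new InitialContext(env);
        ConnectionFactory cf = (ConnectionFactory) ctx.lookup("ConnectionFactory");

        Connection first = cf.createConnection();  // e.g. node 0
        Connection second = cf.createConnection(); // e.g. node 1 (next in the round robin)

        first.close();
        second.close();
        ctx.close();
    }
}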

            JBossMessaging clustering can be configured in different ways according to the type of application you are running.

             The most common type of clustered app is a bank of servers with homogeneous MDBs deployed on each node (i.e. each node has the same set of MDBs) and producers evenly distributed across nodes sending messages. In such a configuration it makes sense for the local queue to always get the message - i.e. there is no point redistributing it to another node. This is the default config.

            So in your case you are not initially seeing your messages being consumed since your consumer is on a different node to your producer.

            When one of the nodes is killed, both connections end up on the same node hence you see the messages being consumed.

             There are several common application "types", and JBM can be configured for all of them. Check out the section on clustering configuration in the 1.2 user guide for more info. The documentation is also due to be fleshed out in more detail soon.

             In short, if your producers are well distributed across the cluster then you should choose the default cluster routing policy, which always favours the local queue; otherwise you can use a round-robin cluster router policy.

             If your consumers are well distributed across the cluster then you do not need message redistribution (i.e. you should use the NullMessagePullPolicy); otherwise you can use the DefaultMessagePullPolicy.

             Also bear in mind that a topology with just one producer on one node and a single consumer on a different node, like your test case, is probably not much of a real-world scenario (why would you deploy your application this way?), although we should of course cope with it (and we do).

             I have successfully run your test case and I am seeing the expected behaviour so far. I have killed alternating servers many times and failover occurs fine. We also have a test that does this as part of the CruiseControl run, and it seems to be working.


            Can you give me any more details as to what errors you are seeing?




             The second issue is that it's pretty easy to stop messages being dispatched and received altogether by randomly stopping and starting the individual nodes. For example, after stopping both nodes and bringing one back up, my test case was unable to get a connection.



            I don't understand this. In order for the client to successfully send/receive messages you need at least one node in the cluster to be operational.

             If you shut down all the nodes in the cluster then clearly nothing is going to work; the client needs to talk to a server.


            • 3. Re: 1.2.0.GA transparent node failover does not always work
              bander


              "timfox" wrote:

               Looking at your code, I see you are creating the first dispatcher connection to node 0 and the first listener connection to node 1.


              My code is completely unaware of node 0 and node 1, but I know what you meant.

              "timfox" wrote:
               The clustered connection factory will create subsequent connections on different nodes according to (by default) a round-robin policy.


              Ok - so that bit is a configuration issue. Fair enough.

              "timfox" wrote:
               Also bear in mind that a topology with just one producer on one node and a single consumer on a different node, like your test case, is probably not much of a real-world scenario (why would you deploy your application this way?), although we should of course cope with it (and we do).


               My test case does not know about the different nodes - it has simply requested multiple connections from the connection factory. It sounds like I'm not using the correct JBoss configuration for my test case. In the scenario where the messaging client has connected consumers to one node and producers to another, we would want JBoss Messaging to ensure that queued messages are distributed to the nodes with active consumers.

              "timfox" wrote:
               I have successfully run your test case and I am seeing the expected behaviour so far. I have killed alternating servers many times and failover occurs fine. We also have a test that does this as part of the CruiseControl run, and it seems to be working.


              Ok - this is the more critical issue for us. I can break it consistently and within a very short time. For our testing we're using WinXP and JVM 1.4.2-b28. Which JVM are you using? Have you changed any of the configuration defaults?

              "timfox" wrote:
              Can you give me any more details as to what errors you are seeing?


              Ok - I'll capture the log output next time and post it.

              "timfox" wrote:
              In order for the client to successfully send/receive messages you need at least one node in the cluster to be operational.
               If you shut down all the nodes in the cluster then clearly nothing is going to work; the client needs to talk to a server.


               Obviously. What I meant was that after shutting both nodes down for a period and then restarting one, the test case should reconnect and start dispatching/receiving again. Shutting down both servers seemed to cause problems in the client, i.e. it failed to detect that one of the nodes had been restarted.

              • 4. Re: 1.2.0.GA transparent node failover does not always work
                timfox


                "bander" wrote:

                 My test case does not know about the different nodes - it has simply requested multiple connections from the connection factory. It sounds like I'm not using the correct JBoss configuration for my test case. In the scenario where the messaging client has connected consumers to one node and producers to another, we would want JBoss Messaging to ensure that queued messages are distributed to the nodes with active consumers.


                Basically JBM clustering needs to be tuned according to your application topology. Different types of applications have different requirements for load balancing and redistribution.

                So first you need to determine what kind of application you have then tune based on that.

                The most common use of clustering is a bank of servers with the same MDBs on each node. In such a case moving messages from one node to another is not required. It always makes sense to favour the local node. There are other topologies where there are more consumers on one node than another or consumers with different throughput on different nodes. In these cases redistribution can be configured.

                There is a chapter in the user guide on this which I am going to expand upon in the future.


                Ok - this is the more critical issue for us. I can break it consistently and within a very short time. For our testing we're using WinXP and JVM 1.4.2-b28. Which JVM are you using? Have you changed any of the configuration defaults?


                 I am using WinXP and JDK 1.5 with all the default settings. The CruiseControl run uses Linux on a multi-processor Intel box.


                 Obviously. What I meant was that after shutting both nodes down for a period and then restarting one, the test case should reconnect and start dispatching/receiving again. Shutting down both servers seemed to cause problems in the client, i.e. it failed to detect that one of the nodes had been restarted.


                The cluster requires at least one node to be up to remain a cluster. If all nodes fail then the clients will fail.

                If you're expecting the client to keep on retrying if the *entire* cluster disappears in the hope it will eventually come up, then this is currently not supported. I'm not sure which other JMS providers support this either - JBoss MQ certainly doesn't.

                Typically you would design the cluster with enough nodes such that the probability of all servers simultaneously failing is vanishingly small.

                 Client retrying is not something we normally consider under the banner of "clustering" - but we have a task for it which is due to be completed in 1.2.1. This feature would be useful in scenarios such as ATMs or POS terminals where the connection to the "main office" is unstable.

                If you feel this is an important feature then you can vote for and bump the priority of the feature request.

                 Right now we have a message bridge, which is resilient to failure, so you can always use it to bridge your "unstable" connection.

                • 5. Re: 1.2.0.GA transparent node failover does not always work
                  bander


                  "timfox" wrote:
                  The cluster requires at least one node to be up to remain a cluster. If all nodes fail then the clients will fail.


                   Of course, because there is nothing to connect to! I'm talking about the situation where both nodes are shut down and then one node (or both, it doesn't matter) is restarted. Surely we should be able to connect to that node then!

                  "timfox" wrote:

                  If you're expecting the client to keep on retrying if the *entire* cluster disappears in the hope it will eventually come up, then this is currently not supported. I'm not sure which other JMS providers support this either - JBoss MQ certainly doesn't.


                   I'm not expecting the jboss-messaging-client.jar to keep retrying! We do the retrying ourselves. If we do not get a connection to the JMS server in our test case (because it's down, crashed, whatever) we loop and try to get a connection from the connection factory again.

                  What I'm seeing is an inability to obtain a successful connection to the JMS server after both are shut down and brought back up. It's as if the jboss-messaging-client.jar can no longer see the restarted JMS server(s).

                   This is an extremely simple test case. The JMS server is either there or it isn't. We should not have to reboot our application if the JMS server is shut down. If we manually try to reconnect to the JMS server at regular intervals then we should reconnect successfully when the JMS server is restarted. This test case works under SunMQ, ActiveMQ, OpenJMS and OracleJMS. It is not consistently working under JBoss Messaging.
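                   For reference, the retry we do boils down to the sketch below (a stripped-down illustration, not the exact code from the attached test case; the class and method names are made up):

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.JMSException;

// Sketch: ask the factory for a new connection at regular intervals until
// a server accepts it (e.g. after the whole cluster has been restarted).
public class ConnectRetry {

    public static Connection connectWithRetry(ConnectionFactory factory, long intervalMs)
            throws InterruptedException {
        while (true) {
            try {
                Connection connection = factory.createConnection();
                connection.start();
                return connection;
            } catch (JMSException ex) {
                // Server (or the whole cluster) is still down; wait and retry.
                Thread.sleep(intervalMs);
            }
        }
    }
}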

                  Here is the output of my latest test run.

                  http://ben.customer.netspace.net.au/reconnectTestOutput.zip


                  • 6. Re: 1.2.0.GA transparent node failover does not always work
                    timfox

                    How are you "killing" your servers?

                     From your log output my first impression is that you're shutting them down cleanly rather than simulating a crash.

                    Shutting down cleanly won't trigger failover.

                    • 7. Re: 1.2.0.GA transparent node failover does not always work
                      bander


                      "timfox" wrote:
                      How are you "killing" your servers?


                      They're running in a DOS window (or whatever MS are calling them now). I've just been issuing a Control-C.


                      • 8. Re: 1.2.0.GA transparent node failover does not always work
                        bander


                        "timfox" wrote:
                        How are you "killing" your servers?
                        Shutting down cleanly won't trigger failover.


                        But surely the connections associated with that node will start to fail? At that point our app catches the exceptions and requests a new connection from the connection factory - I would assume we would then receive a connection to one of the nodes that was still operational...

                        • 9. Re: 1.2.0.GA transparent node failover does not always work
                          timfox


                          "bander" wrote:
                          "timfox" wrote:
                          How are you "killing" your servers?


                          They're running in a DOS window (or whatever MS are calling them now). I've just been issuing a Control-C.


                          Ok, that explains a lot. CTRL-C will cause a clean shutdown of the server, and the client won't attempt failover.

                           Try killing the JBoss server process in Task Manager and you'll see a big difference.

                          • 10. Re: 1.2.0.GA transparent node failover does not always work
                            timfox


                            "bander" wrote:
                            "timfox" wrote:
                            How are you "killing" your servers?
                            Shutting down cleanly won't trigger failover.


                            But surely the connections associated with that node will start to fail? At that point our app catches the exceptions and requests a new connection from the connection factory - I would assume we would then receive a connection to one of the nodes that was still operational...


                             You probably will get an exception, but I wouldn't bet on it. A particular JMS implementation may choose to close its client connections, or move them cleanly to a different server, when a node is cleanly shut down.

                             Having said that, you should receive an exception in this case with JBM (in the future this may not be the case), so why you're not getting one I don't know - I need to investigate.

                            • 11. Re: 1.2.0.GA transparent node failover does not always work
                              bander

                              I have reconfigured the nodes to get the message distribution working the way we need it to, so now the listener receives messages even if the producer is on a different node. I'm happy with that bit.

                              The whole reconnect/failover issue is another matter though (as we're having a very similar issue with 1.0.1.SP4).

                              I've done another test, this time killing the JBM processes instead of using Ctrl-C.

                               It took a little longer, but I was able to get into the same state, i.e. one node running but the test case unable to get a connection to it.

                              Log output is here http://ben.customer.netspace.net.au/reconnectTestOutputKilledServers.zip

                              Just on the subject of killing JBoss instead of shutting it down - what happens when the server hosting JBoss Messaging gets rebooted? Is that considered a 'kill' or 'shutdown'?

                              It's going to make our production support very difficult if rebooting the machine hosting JBM prevents us from reconnecting to it!

                              • 12. Re: 1.2.0.GA transparent node failover does not always work
                                timfox

                                I think the root of this problem is you are not using JBoss Messaging's transparent failover abilities but are catching exceptions yourself and recreating connections manually (which is unnecessary for JBM but necessary for other messaging systems which don't support transparent failover), and this is interacting badly with the transparent failover.

                                 This is a somewhat unusual use case, but I admit it should work. Actually we are grateful for these kinds of cases, since they exercise the darker edge cases which wouldn't otherwise get exercised. Nevertheless I will try to get to the root of the problem.

                                 BTW reconnecting to a failed server should work fine, so I am baffled as to what is going on there. In fact this is exactly what the message bridge does - if one of its servers goes down it retries at intervals and reconnects when the node comes back up - and it works a treat.

                                • 13. Re: 1.2.0.GA transparent node failover does not always work
                                  timfox

                                   Since it seems you're not interested in the load-balancing/automatic-failover abilities of JBM, you could just use the non-clustered connection factory at /NonClusteredConnectionFactory to create connections.

                                  If you use this then JBM won't attempt automatic failover. If you want to connect to different servers you need to connect to the JNDI for each server of interest and use the appropriate connection factory from each.

                                   See connection-factories-service.xml to see how this is configured.
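                                   For illustration, a per-node lookup would look something like the sketch below. The 1199 JNDI port for the second node is an assumption based on the standard ports-01 service binding used when running two nodes on one box - check your own binding configuration:

import java.util.Hashtable;

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.naming.Context;
import javax.naming.InitialContext;

// Sketch: look up the non-clustered factory from one specific node's JNDI,
// so JBM performs no load balancing or automatic failover.
public class PerNodeLookup {

    public static Connection connectTo(String jndiUrl) throws Exception {
        Hashtable env = new Hashtable();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "org.jnp.interfaces.NamingContextFactory");
        env.put(Context.URL_PKG_PREFIXES, "org.jboss.naming:org.jnp.interfaces");
        env.put(Context.PROVIDER_URL, jndiUrl);

        Context ctx = new InitialContext(env);
        try {
            ConnectionFactory factory =
                    (ConnectionFactory) ctx.lookup("/NonClusteredConnectionFactory");
            return factory.createConnection();
        } finally {
            ctx.close();
        }
    }

    public static void main(String[] args) throws Exception {
        Connection node0 = connectTo("jnp://localhost:1099"); // node 0
        Connection node1 = connectTo("jnp://localhost:1199"); // node 1 (assumed port)
        node0.close();
        node1.close();
    }
}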

                                  • 14. Re: 1.2.0.GA transparent node failover does not always work
                                    bander


                                    "timfox" wrote:
                                    I think the root of this problem is you are not using JBoss Messaging's transparent failover abilities but are catching exceptions yourself and recreating connections manually (which is unnecessary for JBM but necessary for other messaging systems which don't support transparent failover), and this is interacting badly with the transparent failover.


                                    If failover is transparent I would have thought there would be no exceptions to catch (in our code), unless the failover itself failed and the requested operation could not be performed by any node in the cluster.

                                    We're catching the exceptions thrown by the JMS API calls we make. Isn't it part of the JMS spec that we have to catch exceptions ourselves?

                                     The JMS API documentation for Connection says:

                                    If a JMS provider detects a serious problem with a connection it will inform the connection's ExceptionListener if one has been registered. It does this by calling the listener's onException() method passing it a JMSException describing the problem.

                                    This allows a client to be asynchronously notified of a problem. Some connections only consume messages so they would have no other way to learn their connection has failed.

                                    A Connection serializes execution of its ExceptionListener.

                                    A JMS provider should attempt to resolve connection problems itself prior to notifying the client of them.


                                    What action can we realistically take if the connection exception listener tells us there is a problem? The simplest thing to do is attempt to close the connection and obtain a new one.
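                                     In other words, our generic handling amounts to the sketch below (illustrative names, not our production code; a retry loop like the one I posted earlier would drive connect()):

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.ExceptionListener;
import javax.jms.JMSException;

// Sketch of provider-agnostic connection recovery: when the provider
// reports a serious problem, close the broken connection so a fresh one
// can be obtained from the (possibly clustered) connection factory.
public class RecoveringClient implements ExceptionListener {

    private final ConnectionFactory factory;
    private Connection connection;

    public RecoveringClient(ConnectionFactory factory) {
        this.factory = factory;
    }

    public synchronized void connect() throws JMSException {
        connection = factory.createConnection();
        connection.setExceptionListener(this);
        connection.start();
    }

    public synchronized void onException(JMSException ex) {
        // The provider has detected a serious problem with the connection:
        // discard it; the owner re-invokes connect() at intervals.
        if (connection != null) {
            try {
                connection.close();
            } catch (JMSException ignored) {
                // Already broken; nothing useful to do.
            }
            connection = null;
        }
    }
}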

                                    If you're saying that we should not be recreating connections manually with JBM then our JMS code will no longer be generic - it will be tied to JBM. This is not desirable.

                                    "timfox" wrote:

                                    This is a somewhat unusual use case but I admit it should work.


                                     I'm surprised you consider it unusual. I would have thought our main requirement is a common one: we need to be able to restart our messaging servers in a production environment independently of our web application (which is likely to be running on a separate server). When the messaging servers are stopped, the messaging infrastructure in our web application must be able to disconnect. When the messaging servers are restarted, our messaging infrastructure must be able to reconnect and begin processing messages again (i.e. recreate the consumers and producers).

                                     Maybe it would be easier if you showed me what the test case should look like, i.e. your challenge is to write a test case that will successfully dispatch and receive messages when a JMS server is available - and continue to run when a JMS server is not available!

