Problem with live/backup pairs in a cluster
grantlittle Nov 11, 2010 11:28 PM
Hi all,
I have attached a zip file containing a test scenario of what we are trying to do in a "real" environment.
The scenario is basically this.
- Start up a HornetQ cluster with 2 live/backup pairs in it.
- Have a single sender (in a separate thread) which sends messages to a queue (ExampleQueue).
- Have a single consumer (in a separate thread) which consumes messages from that queue.
- Even though there are 2 live servers and the only consumer is registered at one live/backup pair, all the messages should be received by that consumer.
- I have a class (Reconciler) that registers the message ids of messages that have been sent. The consumer uses the same class to tick off the messages that have been received, thereby showing which ones have not been consumed.
- Shut down one of the live/backup pairs and ensure it can rejoin the cluster with no loss of data and no duplicates.
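For readers without the attachment, the bookkeeping the Reconciler does can be sketched roughly as below. This is a minimal illustration and not the class from the zip file: the method names (messageSent, messageReceived, missing, unexpected) are hypothetical, it assumes message ids are plain strings and that the sender registers an id before the message is actually sent, and it uses Java 8 collections for brevity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the Reconciler idea; not the class in the attached zip.
final class Reconciler {
    // Ids that have been sent but not yet ticked off by the consumer.
    private final Set<String> outstanding = ConcurrentHashMap.newKeySet();
    // Ids ticked off while not outstanding: duplicates, or ids never registered.
    private final List<String> unexpected =
            Collections.synchronizedList(new ArrayList<String>());

    // Called by the sender thread (assumed to happen before the actual send).
    void messageSent(String id) {
        outstanding.add(id);
    }

    // Called by the consumer thread for every message it receives.
    void messageReceived(String id) {
        if (!outstanding.remove(id)) {
            // Asked to remove the same id more than once (or an unknown id).
            unexpected.add(id);
        }
    }

    // Snapshot of ids that were sent but never received.
    Set<String> missing() {
        return new TreeSet<String>(outstanding);
    }

    // Snapshot of ids that were received more than once or never registered.
    List<String> unexpected() {
        synchronized (unexpected) {
            return new ArrayList<String>(unexpected);
        }
    }

    // True only when nothing is missing and nothing was received twice.
    boolean isReconciled() {
        return outstanding.isEmpty() && unexpected.isEmpty();
    }
}
```

In the test, each check point would then amount to asserting that isReconciled() returns true once the consumer has had time to catch up.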
The test takes the following steps.
1. Start up all 4 servers.
2. Send some messages through and make sure all are received.
3. Shut down the first live server (server1). Allow time for connections to re-establish.
4. Make sure that no messages have been lost and that no duplicates have been received.
5. Continue to send messages for a period and again check that none go missing and no duplicates are received.
6. Shut down the backup server.
7. Make sure all connections fail over to the remaining live/backup pair. (Although they may fail over, it is possible that some messages are not received by the consumer, as there is no replication out from the backup server. However, it should be possible to recover these messages later by syncing the backup's data directory to its live server and then starting the backup and live servers again. At that point those messages should be delivered to the consumer.)
8. Send some more messages to make sure processing continues without any further message loss.
9. Sync the backup and live server data.
10. Restart the backup and then the live server.
11. Allow some time for messages to be consumed.
12. Send some more messages.
13. Ensure that all messages have been received and that there are no duplicates.
It should be possible to run the test by doing the following.
- Download a distribution of HornetQ (I'm using 2.1.2).
- Unzip the attached zip file to $HORNETQ_HOME/examples/jms
- Open a command prompt and cd into $HORNETQ_HOME/examples/jms/clustered-standalone-failover
- Run ./build.sh
- Watch for the outcome of the test. You may need to run the test a number of times, as you don't (well, I don't) always get the same outcome.
I ran this scenario 20 times, with a number of different outcomes:
1. Sometimes not all of the messages are received by the consumer at step 4.
2. Sometimes the consumer has received duplicate messages at step 4.
3. Sometimes the live server cannot start again at step 10. I think this is due to the cluster bridge trying to make a connection before the live server has established a connection to the backup.
There is no exception for outcomes 1 & 2 as such. It is simply a case of the Reconciler instance either being asked to remove an object from its cache more than once or items being left in the Reconciler's cache.
The resultant exception for outcome 3 is:
[java] HornetQServer_1 err:DEPLOYMENTS IN ERROR:
[java] HornetQServer_1 err: Deployment "JMSServerManager" is in error due to: HornetQException[errorCode=104 message=Connected server is not a backup server]
[java] HornetQServer_1 err:
[java] HornetQServer_1 err: at org.jboss.kernel.plugins.deployment.AbstractKernelDeployer.internalValidate(AbstractKernelDeployer.java:278)
[java] HornetQServer_1 err: at org.jboss.kernel.plugins.deployment.AbstractKernelDeployer.validate(AbstractKernelDeployer.java:174)
[java] HornetQServer_1 err: at org.hornetq.integration.bootstrap.HornetQBootstrapServer.bootstrap(HornetQBootstrapServer.java:158)
[java] HornetQServer_1 err: at org.jboss.kernel.plugins.bootstrap.AbstractBootstrap.run(AbstractBootstrap.java:83)
[java] HornetQServer_1 err: at org.hornetq.integration.bootstrap.HornetQBootstrapServer.run(HornetQBootstrapServer.java:116)
[java] HornetQServer_1 err: at org.hornetq.common.example.SpawnedHornetQServer.main(SpawnedHornetQServer.java:35)
[java] HornetQServer_1 out:DEPLOYMENTS IN ERROR:
[java] HornetQServer_1 out: Deployment "JMSServerManager" is in error due to: HornetQException[errorCode=104 message=Connected server is not a backup server]
[java] HornetQServer_1 out:
[java] java.lang.RuntimeException: server failed to start
[java] at org.hornetq.common.example.SpawnedVMSupport.spawnVM(SpawnedVMSupport.java:154)
[java] at org.hornetq.common.example.HornetQExample.startServer(HornetQExample.java:144)
[java] at org.hornetq.jms.example.ClusteredStandaloneWithNoConsumersExample.runExample(ClusteredStandaloneWithNoConsumersExample.java:148)
[java] at org.hornetq.common.example.HornetQExample.run(HornetQExample.java:71)
[java] at org.hornetq.jms.example.ClusteredStandaloneWithNoConsumersExample.main(ClusteredStandaloneWithNoConsumersExample.java:62)
[java]
NOTE: I have also tried this scenario using the TRUNK version with the same outcome.
I have yet to be able to run this scenario through to completion.
I'm guessing my issues are due to configuration, some misunderstanding of HornetQ or an issue with the test scenario itself.
I would appreciate any help in trying to get this scenario to work.
Thanks,
Grant