0 Replies Latest reply on Nov 11, 2010 11:28 PM by grantlittle

    Problem with live/backup pairs in a cluster

    grantlittle

      Hi all,


      I have attached a zip file containing a test scenario of what we are trying to do in a "real" environment.

      The scenario is basically this.

      1. Start up a HornetQ cluster with 2 live/backup pairs in it.
      2. Have a single sender (in a separate thread) which sends messages to a  queue (ExampleQueue).
      3. Have a single consumer (in a separate thread) which consumes messages  from that queue.
      4. Even though there are 2 live servers and only one consumer, registered at one live/backup pair, all the messages should be received by the consumer.
      5. I have a class (Reconciler) that registers the message ids of messages that have been sent. The same class is used in the consumer to tick off the messages that have been received, thereby showing which ones have not been consumed.
      6. Shut down one of the live/backup pairs and ensure it can rejoin the cluster with no loss of data and no duplicates.
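
      To make item 5 concrete, here is a minimal sketch of how such a Reconciler could work. This is not the class from the attached zip; the method names (registerSent, registerReceived, missing, duplicates) are assumptions made for illustration. It assumes the sender registers each id before the message is actually sent, otherwise a fast consumer would be wrongly flagged as having received a duplicate.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the Reconciler described above (names are assumptions). */
final class Reconciler {
    // Ids that have been sent but not yet ticked off by the consumer.
    private final Set<String> outstanding = ConcurrentHashMap.newKeySet();
    // Ids the consumer reported when they were no longer (or never) outstanding.
    private final Set<String> duplicates = ConcurrentHashMap.newKeySet();

    /** Called by the sender thread BEFORE the message is sent. */
    void registerSent(String messageId) {
        outstanding.add(messageId);
    }

    /** Called by the consumer thread; returns false if the id was already ticked off. */
    boolean registerReceived(String messageId) {
        boolean firstDelivery = outstanding.remove(messageId);
        if (!firstDelivery) {
            duplicates.add(messageId);
        }
        return firstDelivery;
    }

    /** Snapshot of ids sent but never received (lost messages). */
    Set<String> missing() {
        return new TreeSet<>(outstanding);
    }

    /** Snapshot of ids received more than once. */
    Set<String> duplicates() {
        return new TreeSet<>(duplicates);
    }

    public static void main(String[] args) {
        Reconciler reconciler = new Reconciler();
        reconciler.registerSent("m1");
        reconciler.registerSent("m2");
        reconciler.registerReceived("m1");
        reconciler.registerReceived("m1"); // second delivery of the same id
        System.out.println("missing=" + reconciler.missing());
        System.out.println("duplicates=" + reconciler.duplicates());
    }
}
```

      With a structure like this, a lost message shows up as an id left in missing(), and a duplicate delivery shows up as a second attempt to remove the same id.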


      The test takes the following steps.

      1. Start up all 4 servers
      2. Send some messages through and make sure all are received
      3. Shut down the first live server (server1). Allow time for connections to re-establish.
      4. Make sure no messages have been lost and that no duplicates have been received.
      5. Continue to send some messages for a period and again check that none go missing and no duplicates are received.
      6. Shut down the backup server.
      7. Make sure all connections fail over to the remaining live/backup pair. (Although they may fail over, it is possible that some messages are not received by the consumer, as there is no replication out from the backup server. However, it should be possible to recover these messages later by syncing the data directory to its live server and then starting the backup and live server again. At this point those messages should be delivered to the consumer.)
      8. Send some more messages to make sure processing continues without any  further message loss.
      9. Sync the backup and live server data
      10. Restart the backup and then live servers
      11. Allow some time for messages to be consumed
      12. Send some more messages.
      13. Ensure that all messages have been received and that there are no  duplicates.


      It should be possible to run the test by doing the following.

      1. Download a distribution of hornetq (I'm using 2.1.2).
      2. Unzip the attached zip file to $HORNETQ_HOME/examples/jms
      3. Open a command prompt and cd into  $HORNETQ_HOME/examples/jms/clustered-standalone-failover
      4. Run ./build.sh
      5. Watch for the outcome of the test. You may need to run the test a number of times as you don't (well, I don't) always get the same outcome.


      I ran this scenario 20 times, with a number of different outcomes:

      1. Sometimes not all of the messages are received by the consumer at step 4.
      2. Sometimes the consumer has received duplicate messages at step 4.
      3. Sometimes the live server cannot start again at step 10. I think this is due to the cluster bridge trying to make a connection before the live server has established a connection to the backup.

       

      There is no exception for outcomes 1 & 2 as such. It is simply a case of the Reconciler instance either being asked to remove an object from its cache more than once (a duplicate) or items being left in the Reconciler's cache (a lost message).

       

      The resulting exception for outcome 3 is:

       

      [java] HornetQServer_1 err:DEPLOYMENTS IN ERROR:
           [java] HornetQServer_1 err:  Deployment "JMSServerManager" is in error due to: HornetQException[errorCode=104 message=Connected server is not a backup server]
           [java] HornetQServer_1 err:
           [java] HornetQServer_1 err:    at org.jboss.kernel.plugins.deployment.AbstractKernelDeployer.internalValidate(AbstractKernelDeployer.java:278)
           [java] HornetQServer_1 err:    at org.jboss.kernel.plugins.deployment.AbstractKernelDeployer.validate(AbstractKernelDeployer.java:174)
           [java] HornetQServer_1 err:    at org.hornetq.integration.bootstrap.HornetQBootstrapServer.bootstrap(HornetQBootstrapServer.java:158)
           [java] HornetQServer_1 err:    at org.jboss.kernel.plugins.bootstrap.AbstractBootstrap.run(AbstractBootstrap.java:83)
           [java] HornetQServer_1 err:    at org.hornetq.integration.bootstrap.HornetQBootstrapServer.run(HornetQBootstrapServer.java:116)
           [java] HornetQServer_1 err:    at org.hornetq.common.example.SpawnedHornetQServer.main(SpawnedHornetQServer.java:35)
           [java] HornetQServer_1 out:DEPLOYMENTS IN ERROR:
           [java] HornetQServer_1 out:  Deployment "JMSServerManager" is in error due to: HornetQException[errorCode=104 message=Connected server is not a backup server]
           [java] HornetQServer_1 out:
           [java] java.lang.RuntimeException: server failed to start
           [java]     at org.hornetq.common.example.SpawnedVMSupport.spawnVM(SpawnedVMSupport.java:154)
           [java]     at org.hornetq.common.example.HornetQExample.startServer(HornetQExample.java:144)
           [java]     at org.hornetq.jms.example.ClusteredStandaloneWithNoConsumersExample.runExample(ClusteredStandaloneWithNoConsumersExample.java:148)
           [java]     at org.hornetq.common.example.HornetQExample.run(HornetQExample.java:71)
           [java]     at org.hornetq.jms.example.ClusteredStandaloneWithNoConsumersExample.main(ClusteredStandaloneWithNoConsumersExample.java:62)
           [java]

       

      NOTE: I have also tried this scenario using the TRUNK version, with the same outcome.

       


      I have yet to be able to run this scenario through to completion.

       

      I'm guessing my issues are due to configuration, some misunderstanding of HornetQ on my part, or an issue with the test scenario itself.

       

      I would appreciate any help in trying to get this scenario to work.

       

      Thanks,

      Grant