1 Reply Latest reply on Mar 11, 2009 6:57 AM by ubhole

    Shared Filesystem Master/Slave: missing messages on failover

    ariekenb

      I posted a JIRA issue here (http://issues.apache.org/activemq/browse/AMQ-2149) but I'd also like to see if anyone in the FUSE community has any ideas on this issue.

       

      I've been testing a shared filesystem master/slave setup and manually killing the master broker to watch the slave take over.  Usually this works fine, but occasionally when a slave broker becomes the master, clients reconnect to the new master and immediately miss messages.  Sometimes I also see clients receive old, already-acknowledged messages after a failover.

       

      To test this more thoroughly, I wrote a little shell script to force failover of the master broker every 40 seconds.  The script is attached to the JIRA issue.

       

      In the JIRA issue I also have two test programs that each start 10 sender/receiver pairs on 10 queues using standard JMS APIs and persistent delivery.  One test program uses transactions and commits after each send and each receive; the other uses AUTO_ACKNOWLEDGE.  The senders and receivers use the failover transport to reconnect automatically to the new master broker after failover.  Both programs send 75 KB text messages every 25 ms to each queue.

       

      I run the script to cause failovers and one of the test programs to create senders/receivers.  I can reproduce missing and out-of-order messages on failover with either transactions or AUTO_ACKNOWLEDGE.  I've also tried setting the "syncOnWrite" parameter of the amqPersistenceAdapter to both true and false, thinking synchronous writes might make things more reliable.  Unfortunately the problem happens with syncOnWrite enabled or disabled.  Every combination of syncOnWrite with either transactions or AUTO_ACKNOWLEDGE gives less than 100% reliability on failover.
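
For anyone who wants to check a receiver for the same symptoms, here is a minimal sketch of a per-queue check. It is not the code attached to the JIRA issue. It assumes each sender numbers its messages 0, 1, 2, ... and that the receiver extracts that number from every message it gets; the class name SequenceChecker and its methods are hypothetical. It uses only the Java standard library, so the JMS wiring is left out.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Tracks the sequence numbers received on one queue and counts anomalies.
 * Assumption: the sender numbers its messages 0, 1, 2, ... with no gaps.
 */
public class SequenceChecker {
    private final Set<Long> seen = new HashSet<>();
    private long highest = -1;   // highest sequence number received so far
    private int duplicates = 0;  // messages delivered more than once
    private int outOfOrder = 0;  // first-time messages that arrived after a later one

    /** Record one received sequence number. */
    public void record(long seq) {
        if (!seen.add(seq)) {
            duplicates++;        // already received, e.g. a redelivery after failover
        } else if (seq < highest) {
            outOfOrder++;        // new message, but a higher number arrived first
        }
        highest = Math.max(highest, seq);
    }

    public int duplicates() { return duplicates; }

    public int outOfOrder() { return outOfOrder; }

    /** Sequence numbers in [0, highest] that have not been received (so far). */
    public List<Long> missing() {
        List<Long> gaps = new ArrayList<>();
        for (long s = 0; s <= highest; s++) {
            if (!seen.contains(s)) {
                gaps.add(s);
            }
        }
        return gaps;
    }

    public static void main(String[] args) {
        SequenceChecker checker = new SequenceChecker();
        long[] received = {0, 1, 3, 2, 2, 6};   // made-up sample, not real test output
        for (long seq : received) {
            checker.record(seq);
        }
        System.out.println("missing=" + checker.missing()
                + " duplicates=" + checker.duplicates()
                + " outOfOrder=" + checker.outOfOrder());
    }
}
```

A gap reported by missing() may only mean the message has not arrived yet, so it is worth calling it again once the senders have stopped and the queues have drained. The duplicate count only covers redeliveries seen by the same checker instance; a receiver that restarts would need to persist what it has seen.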

       

      I cannot find any documentation warning of the possibility of missing or out-of-order messages on shared filesystem failover, so I believe this must be a bug.  I am able to reproduce the problems on both Apache ActiveMQ 5.2.0 and FUSE Message Broker 5.3.0.0.

       

      Are there folks out there using shared filesystem master/slave for HA in production systems?  Thanks for any help or suggestions.

       

      Edited by: ariekenb on Mar 9, 2009 2:51 AM