2 Replies Latest reply on Sep 8, 2010 6:19 PM by qwang_qwang

    ActiveMQ failover does not work with Windows Cluster

    qwang_qwang

      Hi,

       

      We have configured an ActiveMQ (5.3.2) failover cluster through Windows Cluster Manager, without using ActiveMQ's native failover mechanism.

       

      Below is a brief description of the system setup:

       

      We installed ActiveMQ on servers A and B, with A being the primary server and B the secondary. Both instances point to the same folder on a SAN drive as their persistence store:

       

       

                  <kahaDB directory="tpndc9ac01\tpn-data$\QueueStore\data\kahadb"/>

              </persistenceAdapter>

              *************

      </broker>

       

      We defined a cluster group through Windows Cluster Administrator, and created an AMQ service inside this cluster group. This AMQ service is mapped to the ActiveMQ instances on both server A and B. The Windows Cluster Admin will ensure that only one instance is running.

       

      At runtime, we first turned on AMQ on server A, and everything worked smoothly. Then we turned off AMQ on server A, which caused the cluster to fail over to AMQ on server B. There were some messages accumulated in the persistent queue during the failover. However, I could not see those messages in AMQ on server B. On the other hand, messages sent to the AMQ cluster after B was brought up were visible to AMQ on server B.

       

      If I turn off AMQ on server B and fail over to server A, I can see those previously unseen messages in AMQ on server A.

       

      Since both AMQ instances refer to the same KahaDB as persistence storage, I am wondering why the same batch of messages is visible only to AMQ on server A but not on server B. Any thoughts or suggestions?

       

      Thanks,

      Qilong

        • 1. Re: ActiveMQ failover does not work with Windows Cluster
          garytully

          This looks like the SAN is not really shared.

           

          The KahaDB store uses a shared file lock (an NIO channel lock) to ensure exclusive access to the file system store directory. It would be useful to validate whether this mechanism works with your SAN. Start both BrokerA and BrokerB; only one of them should get the lock and start up successfully. If both get a lock, there is a sharing/sync problem, because the two brokers are seeing different versions of the directory.
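
          If starting two full brokers is inconvenient, the same check can be approximated with a small standalone program. The following is only a minimal sketch of an NIO channel lock test, not ActiveMQ's own locking code: the class name SharedLockCheck, the method tryAcquire and the file name "lock-check" are hypothetical, chosen for illustration. Run it on server A against a scratch directory on the SAN, leave it waiting, then run it on server B against the same directory.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for testing exclusive locking on a shared directory.
public class SharedLockCheck {

    // Tries to take an exclusive NIO lock on a marker file inside storeDir.
    // Returns the lock, or null if the lock is already held.
    static FileLock tryAcquire(Path storeDir) throws IOException {
        Files.createDirectories(storeDir);
        RandomAccessFile file =
                new RandomAccessFile(storeDir.resolve("lock-check").toFile(), "rw");
        FileChannel channel = file.getChannel();
        FileLock lock;
        try {
            lock = channel.tryLock(); // null when another process holds the lock
        } catch (OverlappingFileLockException e) {
            lock = null;              // already held inside this same JVM
        }
        if (lock == null) {
            file.close();             // nothing acquired, so release the file handle
        }
        return lock;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java SharedLockCheck <directory-on-shared-storage>");
            return;
        }
        FileLock lock = tryAcquire(Paths.get(args[0]));
        if (lock == null) {
            System.out.println("Lock is held elsewhere: exclusive locking is being enforced.");
            return;
        }
        System.out.println("Lock acquired. Press Enter to release it.");
        System.in.read();
        lock.release();
        lock.channel().close();
    }
}
```

          If the second run also reports that it acquired the lock while the first is still waiting, the two servers are not seeing the same file, which matches the symptom described above.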

           

          This appears to be what is happening in your test. Does the SAN volume need to be unmounted on serverA and remounted on serverB as part of failover so that the SAN state is consistent?

          • 2. Re: ActiveMQ failover does not work with Windows Cluster
            qwang_qwang

            We were able to get this issue resolved.

             

            The root cause was that we forgot to add a dependency between the SAN drive's logical name and the physical disk resource in Windows Cluster Administrator. As a result, after failover the secondary broker did not have access to the physical disk resource that held the persisted messages. We have added that dependency, and persistence now appears to work correctly during failover.

             

            Thanks for pointing us in the right direction.