Failover and Fail BACK

smays@edmunds.com Apr 9, 2010 6:10 PM

Hello there! We are using 5.3.0.5 deployed as an HA failover cluster via an NFS mount.

We are able to kill a broker (which we are calling the HOT broker) and get a second broker (the COLD broker) to take over with only a few messages lost in transition (353 out of 1.3 million), which we are chalking up to "was in transit in memory on the way to the NFS server". However, if we then restart the HOT broker, it waits for the exclusive lock. When we kill the COLD broker (which is the live broker at that point), we get what looks like either KahaDB or persistent-store corruption, and we are unable to continue until we stop both brokers, rm -rf the store/data directory, and then restart the HOT broker.
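
To make the lock handoff described above concrete, here is a minimal sketch of two would-be masters contending for one exclusive lock file. It assumes the store lock behaves like a plain java.nio file lock on a local disk; the class LockFileSketch and its methods are hypothetical names for illustration, not broker APIs, and the sketch does not model NFS lock semantics, which depend on the NFS version and mount options.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration only: not part of any broker distribution.
public class LockFileSketch {

    /** Returns the exclusive lock if it could be taken, or null if it is already held. */
    public static FileLock tryAcquire(Path lockFile) throws IOException {
        FileChannel channel = FileChannel.open(lockFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        try {
            // tryLock() returns null when another process holds the lock.
            FileLock lock = channel.tryLock();
            if (lock == null) {
                channel.close();
            }
            return lock;
        } catch (OverlappingFileLockException e) {
            // Thrown when this same JVM already holds a lock on the file.
            channel.close();
            return null;
        }
    }

    /** Polls until the lock is free, the way a standby waits to take over. */
    public static FileLock waitForLock(Path lockFile, long pollMillis)
            throws IOException, InterruptedException {
        FileLock lock;
        while ((lock = tryAcquire(lockFile)) == null) {
            Thread.sleep(pollMillis);
        }
        return lock;
    }

    /** Releases the lock and closes the channel it was acquired on. */
    public static void release(FileLock lock) throws IOException {
        lock.release();
        lock.channel().close();
    }

    public static void main(String[] args) throws Exception {
        Path lockFile = Files.createTempFile("broker", ".lock");

        FileLock live = tryAcquire(lockFile);          // first broker becomes live
        System.out.println("first holder acquired: " + (live != null));

        FileLock standby = tryAcquire(lockFile);       // second broker must wait
        System.out.println("second attempt acquired: " + (standby != null));

        release(live);                                 // live broker goes away
        FileLock next = waitForLock(lockFile, 100);    // standby takes over
        System.out.println("after release acquired: " + (next != null));
        release(next);
    }
}
```

In this sketch a clean release always hands the lock to the waiter. Whether a killed process on an NFS client releases its lock promptly is a separate matter and is not something this example can show.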

Has anyone else seen this issue? Does anyone know what we could do to make it fail BACK to HOT without failure?

Next up for us will be HA failover with network of brokers and we'll post how we did it with an NFS mount when we get it to work!

Thank you all!

Steve Mays

Edmunds.com