I implemented the retry loop to find the connection factory. The MBean does start up on the other node, but the failover now takes about 5 minutes, which seems too long.
We are wondering whether that is because startSingleton in our MBeans is blocking. I am going to change startSingleton so that it returns immediately and does the work in a separate thread. I will see if that works.
Does that sound right to you? Do you have any other suggestions?
That could very well be the issue. When there's a topology change, a single thread loops through all the services that are monitoring the cluster and notifies each of them of the change. Eventually those calls reach your singletons. If each of those singletons then takes a long time to start, the whole process will be slow. If the startup of a singleton is going to take a long time and it can be done asynchronously, it's definitely better to do it that way.
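As a rough illustration of the asynchronous approach, here is a minimal sketch in plain Java. Only the method names startSingleton and stopSingleton come from this thread; the class name AsyncSingletonService, the slowStartup task and the awaitStartupFinished helper are made up for the example, and the JBoss MBean wiring is left out entirely.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical service class: a sketch, not the actual MBean from this thread.
class AsyncSingletonService {
    private final Runnable slowStartup;   // the long-running startup work (assumed)
    private ExecutorService worker;
    private volatile CountDownLatch startupFinished = new CountDownLatch(1);

    AsyncSingletonService(Runnable slowStartup) {
        this.slowStartup = slowStartup;
    }

    // Called on the notification thread: hand the work off and return at once.
    public synchronized void startSingleton() {
        final CountDownLatch latch = new CountDownLatch(1);
        startupFinished = latch;
        worker = Executors.newSingleThreadExecutor();
        worker.submit(() -> {
            try {
                slowStartup.run();
            } finally {
                // Signals that the startup attempt ended, whether or not it succeeded.
                latch.countDown();
            }
        });
    }

    // Interrupts any startup still in progress and releases the worker thread.
    public synchronized void stopSingleton() {
        if (worker != null) {
            worker.shutdownNow();
            worker = null;
        }
    }

    // Helper for callers or tests that need to wait for the background startup.
    boolean awaitStartupFinished(long timeout, TimeUnit unit) throws InterruptedException {
        return startupFinished.await(timeout, unit);
    }
}
```

With this shape the thread that walks the list of services is held up only for as long as it takes to submit the task. Note that an exception thrown by the startup work is swallowed by the executor here, so a real service would want to log it inside the task.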
Thank you for your suggestions. We implemented the singletons so that startSingleton returns immediately and the actual work each singleton is supposed to do runs asynchronously. We also implemented the retry loop to find the connection factory.
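For anyone reading later, a retry loop of the kind mentioned above might look roughly like the sketch below. It is not our actual code: the class name RetryLookup, the retry method, and the attempt and delay parameters are all invented for the example, and the lookup itself is passed in as a Callable so the sketch does not assume any particular JNDI name.

```java
import java.util.concurrent.Callable;

// Hypothetical helper: retries a lookup until it succeeds or attempts run out.
class RetryLookup {
    static <T> T retry(Callable<T> lookup, int maxAttempts, long delayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return lookup.call();
            } catch (InterruptedException e) {
                // Do not keep retrying if the thread was asked to stop.
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    // Wait before the next attempt; the resource may not be bound yet.
                    Thread.sleep(delayMillis);
                }
            }
        }
        // Every attempt failed: rethrow the most recent failure.
        throw last;
    }
}
```

In a setup like ours the Callable would wrap the JNDI lookup of the connection factory, which can fail for a short while after failover until the factory is bound on the new node. Run from the background thread started by startSingleton, the waiting between attempts does not hold up the failover.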
We just tested it and the initial results look very promising. The failover happens immediately and all our services come back up in a minute. We are going to be doing more testing over the next week to make sure everything is OK.