Hi, should I apply this fix in the createSession method of org.hornetq.core.client.impl.FailoverManagerImpl.java? I'm wondering whether I can build the code myself with this fix from the 2.0.0 GA Java source. Please advise.
It isn't fixed yet. We're still discussing it.
Just a few thoughts. Obviously, if we allow clients to fail over when an initial connection fails, we may create a split-brain situation if the original server is still actually active. One thing we could do is have the secondary server return a specific code if it hasn't failed over yet. In that case the client would then retry the first server. The problem here is that if there were no current live clients to force the failover, it would still never connect. Is it possible somehow for the secondary server to check the status of the first server? I know at some point there was a discussion about the master server writing its status to a shared file. This would work on the shared store configuration, but would we be able to detect when replication was being used?
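To make the "specific code" idea concrete, here is a rough sketch of the client-side handshake. Everything in it is hypothetical (the `Reply` values, the `connect` method, the class name); none of it exists in HornetQ. It only illustrates the flow described above and the gap in it.

```java
import java.util.function.Supplier;

// Illustrative only: hypothetical names, not HornetQ classes.
final class BackupHandshakeSketch {

    // What a server might answer to an initial connection attempt.
    enum Reply { OK, NOT_FAILED_OVER, UNREACHABLE }

    enum Outcome { CONNECTED_LIVE, CONNECTED_BACKUP, NOT_CONNECTED }

    static Outcome connect(Supplier<Reply> live, Supplier<Reply> backup) {
        if (live.get() == Reply.OK) {
            return Outcome.CONNECTED_LIVE;
        }
        Reply fromBackup = backup.get();
        if (fromBackup == Reply.OK) {
            return Outcome.CONNECTED_BACKUP;
        }
        if (fromBackup == Reply.NOT_FAILED_OVER && live.get() == Reply.OK) {
            // The backup has not taken over, so the live node is retried once.
            return Outcome.CONNECTED_LIVE;
        }
        return Outcome.NOT_CONNECTED;
    }

    public static void main(String[] args) {
        // Live is really down, but nothing has forced the backup to fail over.
        System.out.println(connect(() -> Reply.UNREACHABLE, () -> Reply.NOT_FAILED_OVER));
    }
}
```

The `main` case is the problem scenario: the live node is gone, the backup has not failed over, and the client ends up with `NOT_CONNECTED`.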
"if an initial connection fails we may create a split brain situation if the original server is still actually active. One thing we could do is for the secondary server to return a specific code if it hasn't failed over yet. In this instance the client would then retry the first server"
Isn't this the split brain quorum task? https://jira.jboss.org/jira/browse/HORNETQ-66 ?
Also, if we make any changes to failover and backups, keep in mind that we will need to install backup nodes in the middle of the operation, in order to support adding a backup to a live node:
It's related to HORNETQ-66 but not the same. It's more to do with the order in which nodes are initially chosen to connect to. I don't think it's related at all to the second.
Sorry for coming so late to this discussion.
Andy, I think what you say makes sense.
We introduce a new param, failoverOnInitialConnection.
When a client starts, it will attempt to connect to the live node up to reconnectAttempts times. If it has not connected after reconnectAttempts attempts then, if failoverOnInitialConnection = true, it will attempt to connect to the backup, if specified (do we try the backup reconnectAttempts times as well?); otherwise it will fail.
By default failoverOnInitialConnection will be false.
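A minimal sketch of the proposed behaviour, to make the flow easier to discuss. The class, the `connect` method and the `backupAttempts` parameter are hypothetical and not part of HornetQ; `backupAttempts` is there only because the number of backup tries is still an open question. Attempt counts are assumed to be non-negative, and retry intervals are left out.

```java
import java.util.function.BooleanSupplier;

// Illustrative only: hypothetical names, not the HornetQ API.
final class InitialConnectSketch {

    enum Result { LIVE, BACKUP, FAILED }

    static Result connect(BooleanSupplier tryLive,
                          BooleanSupplier tryBackup, // null when no backup is specified
                          int reconnectAttempts,
                          int backupAttempts,
                          boolean failoverOnInitialConnection) {
        // First try the live node reconnectAttempts times.
        for (int i = 0; i < reconnectAttempts; i++) {
            if (tryLive.getAsBoolean()) {
                return Result.LIVE;
            }
        }
        // Only fall through to the backup when the flag is set and a backup exists.
        if (!failoverOnInitialConnection || tryBackup == null) {
            return Result.FAILED;
        }
        for (int i = 0; i < backupAttempts; i++) {
            if (tryBackup.getAsBoolean()) {
                return Result.BACKUP;
            }
        }
        return Result.FAILED;
    }

    public static void main(String[] args) {
        // Live is down, backup is reachable.
        System.out.println(connect(() -> false, () -> true, 3, 1, false)); // FAILED
        System.out.println(connect(() -> false, () -> true, 3, 1, true));  // BACKUP
    }
}
```

With the flag at its default of false, the first call fails even though the backup is reachable, which is the current behaviour.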
We have to be a bit careful with failoverOnInitialConnection=true since in an environment where you have a symmetric cluster of nodes, each with backups, each live node on startup will try to make cluster connections to each other node.
When the cluster is brought up, if live nodes aren't brought up in time, node N could instead make connections to backup nodes putting the cluster in an inconsistent state. This is actually the reason why we currently don't always try the backup if the live is not available. But I think if this is guarded by a flag it should be ok.
Regarding the split brain stuff, I think that is a bit off topic and is handled by a different JIRA.
I think we should probably try the backup only once. If it doesn't work then it's probably a client-side network issue or the whole intranet is down.
Your point about bringing the nodes up in time is a good one. I will make sure that this is well documented and, like you say, we will default to false.