8 Replies Latest reply on Feb 18, 2010 5:06 AM by ataylor

Initial connection may fail if node already failed over

ataylor Feb 1, 2010 1:45 PM

This can happen for statically configured session factories. If you have a session factory configured to connect to node A and fail over to node B, and node A has gone down, then any session created will either try to reconnect to node A forever (if reconnectAttempts=-1) or simply fail. This is an issue for managed connections in the app server which are configured like this.

We could add a new flag, say failoverOnInitialConnection, which, if set to true, makes us try to connect to the backup server. The code in ClientSessionFactory.createSession(...) would be something like this:

   theConnection = getConnectionWithRetry(reconnectAttempts);

   if (theConnection == null)
   {
      if (failoverOnInitialConnection && backupConnectorFactory != null)
      {
         // We have failed connecting to the main server, so promote the
         // backup settings to primary and clear the backup slots:
         connectorFactory = backupConnectorFactory;
         transportParams = backupTransportParams;
         backupConnectorFactory = null;
         backupTransportParams = null;

         theConnection = getConnectionWithRetry(reconnectAttempts);
      }

      // Still no connection, either because the flag is off or the backup failed too
      if (theConnection == null)
      {
         if (exitLoop)
         {
            return null;
         }

         throw new HornetQException(HornetQException.NOT_CONNECTED,
                                    "Unable to connect to server using configuration " + connectorConfig);
      }
   }

The user can decide how long to wait before doing this by configuring reconnectAttempts, maxRetryInterval, etc.
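To get a feel for how long a client would wait before falling back, here is a minimal sketch that adds up the pauses between attempts. It assumes a simple backoff model: the first pause is retryInterval, each later pause is multiplied by retryIntervalMultiplier, and every pause is capped at maxRetryInterval. The class, the method names and the backoff model itself are assumptions for illustration, not a statement of how the client actually schedules its retries, and the time spent inside each connection attempt is ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HornetQ: estimates the pauses between
// connection attempts under an assumed capped exponential backoff.
final class RetryBudget {

    /** Pauses (ms) between consecutive attempts; N attempts are separated by N - 1 pauses. */
    static List<Long> delays(int attempts, long retryInterval, double multiplier, long maxRetryInterval) {
        if (attempts < 0) {
            // reconnectAttempts = -1 means "retry forever", so the fallback is never reached.
            throw new IllegalArgumentException("unbounded retries never fall back");
        }
        List<Long> result = new ArrayList<>();
        double next = retryInterval;
        for (int i = 1; i < attempts; i++) {
            long capped = Math.min((long) next, maxRetryInterval);
            result.add(capped);
            next = capped * multiplier;
        }
        return result;
    }

    /** Total pause time (ms) before the last attempt on the live node has been made. */
    static long totalWait(int attempts, long retryInterval, double multiplier, long maxRetryInterval) {
        long total = 0;
        for (long delay : delays(attempts, retryInterval, multiplier, maxRetryInterval)) {
            total += delay;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(delays(5, 1000, 2.0, 5000));    // prints [1000, 2000, 4000, 5000]
        System.out.println(totalWait(5, 1000, 2.0, 5000)); // prints 12000
    }
}
```

Under these assumptions, five attempts starting at one second with a five second cap would delay the switch to the backup by at least twelve seconds.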

1. Re: Initial connection may fail if node already failed over

ataylor Feb 3, 2010 4:51 AM (in response to ataylor)

Comments, anyone?
2. Re: Initial connection may fail if node already failed over

radhikasivaraj Feb 8, 2010 4:51 AM (in response to ataylor)

Hi, should I apply this fix in the createSession method of org.hornetq.core.client.impl.FailoverManagerImpl.java? I am wondering whether I can build the code myself with this fix from the 2.0.0 GA Java source. Please advise.
3. Re: Initial connection may fail if node already failed over

timfox Feb 8, 2010 4:55 AM (in response to ataylor)

It isn't fixed yet. We're still discussing it.
4. Re: Initial connection may fail if node already failed over

ataylor Feb 9, 2010 3:58 AM (in response to timfox)

Just a few thoughts. Obviously, if we allow clients to fail over when an initial connection fails, we may create a split brain situation if the original server is actually still active. One thing we could do is have the secondary server return a specific code if it hasn't failed over yet. In that case the client would then retry the first server. The problem here is that if there were no current live clients to force the failover, the client would still never connect. Is it possible somehow for the secondary server to check the status of the first server? I know at some point there was a discussion about the master server writing its status to a shared file. This would work in the shared store configuration, but would we be able to detect it when replication is being used?
5. Re: Initial connection may fail if node already failed over

clebert.suconic Feb 9, 2010 12:42 PM (in response to ataylor)

"if an initial connection fails we may create a split brain situation if the original server is still actually active. One thing we could do is for the secondary server to return a specific code if it hasn't failed over yet. In this instance the client would then retry the first server"

Isn't this the split brain quorum task, https://jira.jboss.org/jira/browse/HORNETQ-66?

Also, if we make any changes to failover and backups, keep in mind that we will need to install backup nodes in the middle of operation, in order to support adding a backup to a live node:

https://jira.jboss.org/jira/browse/HORNETQ-194
6. Re: Initial connection may fail if node already failed over

ataylor Feb 9, 2010 1:26 PM (in response to clebert.suconic)

It's related to HORNETQ-66 but not the same. It's more to do with the order in which nodes are initially chosen to connect to. I don't think it's related at all to the second one.
7. Re: Initial connection may fail if node already failed over

timfox Feb 17, 2010 4:15 AM (in response to ataylor)

Sorry for coming so late to this discussion.

Andy, I think what you say makes sense.

To summarise:

We introduce a new param, failoverOnInitialConnection.

When a client starts, it will attempt reconnectAttempts times to connect to the live node. If it has not connected after reconnectAttempts attempts then, if failoverOnInitialConnection = true, it will attempt to connect to the backup, if one is specified (do we also try the backup reconnectAttempts times?); otherwise it will fail.

By default failoverOnInitialConnection will be false.
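The behaviour summarised above could be sketched roughly as follows. This is only an illustration under stated assumptions: the Connector interface, the class and method names, and the separate backupAttempts parameter are all hypothetical (how many times to try the backup is the open question raised above), and the pauses between attempts are left out.

```java
// Hypothetical sketch of the proposed initial-connection behaviour; none of
// these types exist in HornetQ.
final class InitialConnectSketch {

    /** Stand-in for a connector: returns a connection label, or null on failure. */
    interface Connector {
        String tryConnect();
    }

    /** Tries the connector up to the given number of times; returns null if every attempt fails. */
    static String connectWithRetry(Connector connector, int attempts) {
        for (int i = 0; i < attempts; i++) {
            String connection = connector.tryConnect();
            if (connection != null) {
                return connection;
            }
        }
        return null;
    }

    static String createInitialConnection(Connector live,
                                          Connector backup,
                                          int reconnectAttempts,
                                          boolean failoverOnInitialConnection,
                                          int backupAttempts) {
        // Always exhaust the attempts against the live node first.
        String connection = connectWithRetry(live, reconnectAttempts);

        // Only fall back when the flag is set and a backup is configured.
        if (connection == null && failoverOnInitialConnection && backup != null) {
            connection = connectWithRetry(backup, backupAttempts);
        }
        if (connection == null) {
            throw new IllegalStateException("Unable to connect to live or backup server");
        }
        return connection;
    }

    public static void main(String[] args) {
        Connector deadLive = () -> null;        // live node is down
        Connector backup = () -> "backup";      // backup accepts connections

        System.out.println(createInitialConnection(deadLive, backup, 3, true, 1)); // prints backup
        try {
            createInitialConnection(deadLive, backup, 3, false, 1);
        } catch (IllegalStateException e) {
            System.out.println("flag off: " + e.getMessage());
        }
    }
}
```

With the flag left at its default of false, the second call fails without ever contacting the backup, which matches the current behaviour.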

We have to be a bit careful with failoverOnInitialConnection=true since in an environment where you have a symmetric cluster of nodes, each with backups, each live node on startup will try to make cluster connections to each other node.

When the cluster is brought up, if live nodes aren't brought up in time, node N could instead make connections to backup nodes putting the cluster in an inconsistent state. This is actually the reason why we currently don't always try the backup if the live is not available. But I think if this is guarded by a flag it should be ok.

Regarding the split brain stuff, I think that is a bit off topic and is handled by a different JIRA.
8. Re: Initial connection may fail if node already failed over

ataylor Feb 18, 2010 5:06 AM (in response to timfox)

I think we should probably try the backup only once. If that doesn't work, then it's probably a client-side network issue or the whole intranet is down.

Your point about bringing the nodes up in time is a good one. I will make sure that this is well documented and, like you say, we will default to false.