8 Replies Latest reply on Feb 18, 2010 5:06 AM by ataylor

    initial connection may fail if node already failed over

    ataylor

      This can happen for statically configured session factories. If a session factory is configured to connect to node A and fail over to node B, and node A has gone down, then any session created will either try to reconnect to node A forever (if reconnectAttempts=-1) or simply fail. This is an issue for managed connections in the app server that are configured like this.

       

      We could add a new flag, say failoverOnInitialConnection, which, if set to true, makes the client try to connect to the backup server. The code would be something like this in ClientSessionFactory:createSession(...):

       

                        // First try the configured (live) server.
                        theConnection = getConnectionWithRetry(reconnectAttempts);

                        if (theConnection == null)
                        {
                           // Flag named failoverOnInitialConnection to match the proposal above.
                           if (failoverOnInitialConnection && backupConnectorFactory != null)
                           {
                              // We have failed connecting to the main server, so promote
                              // the backup settings to primary and try again.
                              connectorFactory = backupConnectorFactory;
                              transportParams = backupTransportParams;
                              backupConnectorFactory = null;
                              backupTransportParams = null;

                              theConnection = getConnectionWithRetry(reconnectAttempts);
                           }
                           if (theConnection == null)
                           {
                              if (exitLoop)
                              {
                                 return null;
                              }

                              throw new HornetQException(HornetQException.NOT_CONNECTED,
                                                         "Unable to connect to server using configuration " + connectorConfig);
                           }
                        }

       

       

      The user can decide how long to wait before doing this by configuring reconnectAttempts, maxRetryInterval, etc.
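
      To get a feel for how those settings translate into a delay before the client falls back to the backup, here is a rough, self-contained sketch. It assumes the wait grows by a multiplier after each failed attempt and is capped at maxRetryInterval, with one wait between consecutive attempts. The class and method names are hypothetical and this is not HornetQ code.

```java
// Hypothetical helper, not part of HornetQ: estimates the total time spent
// waiting between connection attempts before the client would give up on the
// live node and (with the proposed flag) try the backup.
final class RetryBudgetSketch {

    // Assumption: N attempts means N-1 waits, each wait capped at maxRetryInterval,
    // and the interval is multiplied by 'multiplier' after every failed attempt.
    static long totalWaitMillis(int attempts, long retryInterval,
                                double multiplier, long maxRetryInterval) {
        if (attempts < 0) {
            // reconnectAttempts = -1 means "retry forever", so there is no finite budget.
            throw new IllegalArgumentException("unbounded retries have no finite wait");
        }
        long total = 0;
        long interval = retryInterval;
        for (int i = 1; i < attempts; i++) {
            total += Math.min(interval, maxRetryInterval);
            interval = (long) Math.min(interval * multiplier, (double) maxRetryInterval);
        }
        return total;
    }

    public static void main(String[] args) {
        // 4 attempts, 1s initial interval, doubling, capped at 3s: 1000 + 2000 + 3000
        System.out.println("Worst-case wait before falling back: "
                + totalWaitMillis(4, 1000L, 2.0, 3000L) + " ms");
    }
}
```

      With reconnectAttempts=-1 the fallback is never reached, which is why the sketch rejects negative values instead of returning a number.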

        • 1. Re: initial connection may fail if node already failed over
          ataylor
          Comments, anyone?
          • 2. Re: initial connection may fail if node already failed over
            radhikasivaraj

            Hi, should I apply this fix in the createSession method of org.hornetq.core.client.impl.FailoverManagerImpl.java? I'm wondering whether I can build the code myself with this fix from the 2.0.0 GA Java source. Please advise.

            • 3. Re: initial connection may fail if node already failed over
              timfox
              It isn't fixed yet. We're still discussing it.
              • 4. Re: initial connection may fail if node already failed over
                ataylor
                Just a few thoughts. Obviously, if we allow clients to fail over when an initial connection fails, we may create a split-brain situation if the original server is still actually active. One thing we could do is have the secondary server return a specific code if it hasn't failed over yet. In that case the client would then retry the first server. The problem here is that if there were no current live clients to force the failover, it would still never connect. Is it possible somehow for the secondary server to check the status of the first server? I know at some point there was a discussion about the master server writing its status to a shared file. This would work in the shared store configuration, but would we be able to detect it when replication is being used?
                • 5. Re: initial connection may fail if node already failed over
                  clebert.suconic

                  "if an initial connection fails we may create a split brain situation if the original server is still actually active. One thing we could do is for the secondary server to return a specific code if it hasn't failed over yet. In this instance the client would then retry the first server"

                   

                  Isn't this the split brain quorum task? https://jira.jboss.org/jira/browse/HORNETQ-66

                   

                   

                  Also, if we make any changes to failover and backups, keep in mind that we will need to be able to install backup nodes in the middle of operation, in order to support adding a backup to a live node:

                   

                  https://jira.jboss.org/jira/browse/HORNETQ-194

                  • 6. Re: initial connection may fail if node already failed over
                    ataylor
                    It's related to HORNETQ-66 but not the same. It's more to do with the order in which nodes are initially chosen to connect to. I don't think it's related to the second one at all.
                    • 7. Re: initial connection may fail if node already failed over
                      timfox

                      Sorry for coming so late to this discussion.

                       

                      Andy, I think what you say makes sense.

                       

                      To summarise:

                       

                      We introduce a new param, failoverOnInitialConnection.

                       

                      When a client starts, it will attempt to connect to the live node reconnectAttempts times. If it has not connected after reconnectAttempts attempts then, if failoverOnInitialConnection = true, it will attempt to connect to the backup, if specified (do we also try the backup reconnectAttempts times?); otherwise it will fail.

                       

                      By default failoverOnInitialConnection will be false.

                       

                      We have to be a bit careful with failoverOnInitialConnection=true since in an environment where you have a symmetric cluster of nodes, each with backups, each live node on startup will try to make cluster connections to each other node.

                       

                      When the cluster is brought up, if live nodes aren't brought up in time, node N could instead make connections to backup nodes, putting the cluster in an inconsistent state. This is actually the reason why we currently don't always try the backup if the live node is not available. But I think if this is guarded by a flag it should be OK.

                       

                      Regarding the split brain stuff, I think that is a bit off topic and is handled by a different JIRA.

                      • 8. Re: initial connection may fail if node already failed over
                        ataylor

                        I think we should probably try the backup only once. If that doesn't work, then it's probably a client-side network issue or the whole intranet is down.

                         

                        Your point about bringing the nodes up in time is a good one. I will make sure that this is well documented and, like you say, we will default to false.
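
                        For reference, the behaviour discussed in this thread (try the live node reconnectAttempts times, then try the backup once if failoverOnInitialConnection is true) could be sketched roughly as below. This is a standalone illustration, not the HornetQ implementation: the class name, the use of Supplier as a stand-in connector that returns null on failure, and the exception type are all assumptions, and the waits between retries are left out.

```java
import java.util.function.Supplier;

// Hypothetical illustration of the proposed initial-connection logic.
final class InitialConnectSketch {

    // A connector is modelled as a Supplier that returns null when the attempt fails.
    static <C> C connect(Supplier<C> live, Supplier<C> backup,
                         int reconnectAttempts, boolean failoverOnInitialConnection) {
        C connection = tryConnect(live, reconnectAttempts);

        // Only fall back to the backup when the flag is set, and only try it once.
        if (connection == null && failoverOnInitialConnection && backup != null) {
            connection = tryConnect(backup, 1);
        }
        if (connection == null) {
            throw new IllegalStateException("Unable to connect to live or backup server");
        }
        return connection;
    }

    // A negative attempt count means "retry forever", mirroring reconnectAttempts=-1.
    private static <C> C tryConnect(Supplier<C> connector, int attempts) {
        for (int i = 0; attempts < 0 || i < attempts; i++) {
            C connection = connector.get();
            if (connection != null) {
                return connection;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Live node is down, backup is reachable, flag is enabled.
        String result = connect(() -> null, () -> "backup-connection", 3, true);
        System.out.println(result);
    }
}
```

                        With the flag left at its default of false, the same call throws instead of touching the backup, which keeps the cluster start-up case described above out of trouble.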