1 Reply Latest reply on Nov 27, 2009 10:46 AM by jmesnil

    connection to a backup in a cluster

    jmesnil

      I found an issue while working on the cluster with backup failover test.

      the setup is as follows:

      * 3 live servers (0, 1, 2) and 3 corresponding backup servers (3, 4, 5)
      * cluster connections between all the nodes

      the test:
      * start all servers
      * wait for all bindings
      * send/receive messages
      * fail node #0
      => backup #3 is activated and starts its cluster conns to live nodes #1 & #2
      * wait for all bindings
      * send/receive messages
      * fail node #1
      => backup #4 is activated and starts its cluster conns to live nodes #0 and #2

      And there is a problem here: backup #4 has a cluster conn configured with live #0 and backup #3.
      It will try to connect to #0 again and again, and it will never connect to backup #3.

      This cluster conn issue can be reproduced with a client that is configured with static connectors and infinite reconnect attempts while the live server is down. The client goes into an infinite loop when creating the session, even though a backup server is configured:

       // #0 is the live node, #1 is its backup
       setupClusters();
      
       startServers(0, 1);
      
       stopServers(0);
      
       TransportConfiguration liveTC = new TransportConfiguration(InVMConnectorFactory.class.getName());
       liveTC.getParams().put(TransportConstants.SERVER_ID_PROP_NAME, 0);
      
       TransportConfiguration backupTC = new TransportConfiguration(InVMConnectorFactory.class.getName());
       backupTC.getParams().put(TransportConstants.SERVER_ID_PROP_NAME, 1);
      
       ClientSessionFactoryImpl sf = new ClientSessionFactoryImpl(liveTC, backupTC);
       sf.setReconnectAttempts(-1);
      
       // => infinite loop trying to connect to server #0, which is down
       ClientSession session = sf.createSession();
       assertNotNull(session);
      


      I'll have to change the ClusterConnection code to support that use case: something like trying to connect to the live server a finite number of times and, if that does not succeed, opening a connection to the backup server instead. I need to think more about it, as it can introduce another set of problems (e.g. while starting a cluster, a cluster conn could connect to a backup before the corresponding live server is started, and activate it).
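
      A minimal sketch of that fallback idea, independent of the real ClusterConnection code. Everything in it is hypothetical: the class name FailoverSketch, the method connectWithFallback, the maxLiveAttempts parameter, and the tryConnect predicate that stands in for an actual connection attempt. It only illustrates "try the live node a bounded number of times, then try the backup" and does not address the startup problem mentioned above.

```java
import java.util.function.IntPredicate;

public class FailoverSketch {

    /**
     * Hypothetical helper: tries the live server up to maxLiveAttempts times,
     * then falls back to a single attempt on the backup server.
     *
     * @param tryConnect stand-in for a real connection attempt; returns true
     *                   if the server with the given id accepts the connection
     * @return the id of the server that accepted the connection, or -1 if none did
     */
    public static int connectWithFallback(int liveId, int backupId,
                                          int maxLiveAttempts, IntPredicate tryConnect) {
        // Bounded retries on the live server instead of retrying forever.
        for (int attempt = 0; attempt < maxLiveAttempts; attempt++) {
            if (tryConnect.test(liveId)) {
                return liveId;
            }
        }
        // The live server never answered: try the backup once.
        return tryConnect.test(backupId) ? backupId : -1;
    }

    public static void main(String[] args) {
        // Server #0 (live) is down, server #1 (its backup) is up.
        IntPredicate onlyBackupUp = id -> id == 1;
        System.out.println(connectWithFallback(0, 1, 3, onlyBackupUp)); // prints 1
    }
}
```

      With a bound on the live attempts, the client in the test above would end up on server #1 instead of looping on server #0. Whether the bound should be a fixed count or a timeout is left open here.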