1 Reply Latest reply on Jan 19, 2011 7:44 AM by flume

    Message broker won't come up again when database goes down

      Regarding AMQ-2497, I would like to ask about a possible fix. We are quite baffled that this issue doesn't get the attention it deserves, since it is marked as a blocker and is a show stopper for many people going to production.

       

      We have the same issue with a PostgreSQL database. It seems that the protected Connection in the DefaultDatabaseLocker becomes stale when the DB goes down, but it is still used whenever the keepAlive method is called. The connection is retrieved from the datasource (which in our case has a validationQuery) only when the DefaultDatabaseLocker starts, so keepAlive catches an Exception every time it uses the stale connection. When this happens keepAlive returns false, which results in a stopBroker() call on the JDBCPersistenceAdapter.

       

      Maybe it would be better to differentiate the catch statement in the keepAlive method and, whenever this specific exception occurs (in our case org.postgresql.util.PSQLException: FATAL: terminating connection due to administrator command, but I think any SQLException will do), take the necessary precautions:

       

           

      • stop all the transport connectors so that no further messages are sent into ActiveMQ

      • schedule a different keepAlive that simply checks whether the DB is up again

            -> if it is up again: re-initialize the DefaultDatabaseLocker and start the transport connectors (but which?) again
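
      The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a patch: Locker, Connectors and RecoveringKeepAlive are hypothetical stand-ins and not ActiveMQ classes, and the "is the DB up again" check is folded into the attempt to re-initialize the locker.

```java
import java.sql.SQLException;

/** Hypothetical stand-in for the database locker (not the ActiveMQ class). */
interface Locker {
    /** Touches the lock row; assumed to throw when the connection is stale. */
    void keepAlive() throws SQLException;

    /** Assumed to fetch a fresh connection from the datasource and re-acquire the lock. */
    void reinitialize() throws SQLException;
}

/** Hypothetical stand-in for the broker's transport connectors. */
interface Connectors {
    void stopAll();
    void startAll();
}

final class RecoveringKeepAlive {
    enum State { RUNNING, WAITING_FOR_DB }

    private final Locker locker;
    private final Connectors connectors;
    private State state = State.RUNNING;

    RecoveringKeepAlive(Locker locker, Connectors connectors) {
        this.locker = locker;
        this.connectors = connectors;
    }

    State state() {
        return state;
    }

    /** One keep-alive cycle; meant to be called periodically by a scheduler. */
    void tick() {
        if (state == State.RUNNING) {
            try {
                locker.keepAlive();
            } catch (SQLException e) {
                // DB is gone: stop accepting messages instead of stopping the broker.
                connectors.stopAll();
                state = State.WAITING_FOR_DB;
            }
        } else {
            try {
                // Succeeds only once the DB is reachable again.
                locker.reinitialize();
                connectors.startAll();
                state = State.RUNNING;
            } catch (SQLException e) {
                // DB still down; try again on the next tick.
            }
        }
    }
}
```

      The tick() method could be driven by something like ScheduledExecutorService.scheduleWithFixedDelay; the open question of which transport connectors to restart is hidden behind the hypothetical Connectors interface.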

       

      I haven't gotten around to implementing a patch against the 5.5-SNAPSHOT yet, but I am willing to do so if my assumptions stated above are more or less correct.

        • 1. Re: Message broker won't come up again when database goes down

          Finally had some time to do some work on this.

          It turns out that the activemq-core module contains a test, org.apache.activemq.broker.ft.DbRestartJDBCQueueMasterSlaveTest, that covers the scenario where the DB goes down and comes up again in a master/slave configuration.

          The outcome of this test is that the master goes down and never comes up again (a state that shouldn't occur). The slave, on the other hand, becomes the master once the DB comes up again (which happens after the master went down).

          The test doesn't check whether the master or the slave is up or down, nor which master/slave state each of them is in.

           

          So basically the problem is that when a master sees its DB store fail, it should revert to the slave state and not to the dead state. When the DB comes up again, the faster of the two slaved brokers becomes master. This way we don't have to restart the dead master after each DB failure.
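
          To make the intended behaviour concrete, here is a small toy model of it. It is a sketch under assumptions, not ActiveMQ code: SharedDbLock and BrokerNode are hypothetical names, and the database lock is simulated in memory (at most one holder, and no holder while the DB is down).

```java
/** Hypothetical in-memory model of the exclusive database lock. */
final class SharedDbLock {
    private boolean dbUp = true;
    private String holder; // null while nobody holds the lock

    synchronized void setDbUp(boolean up) {
        dbUp = up;
        if (!up) {
            holder = null; // a DB outage drops the lock
        }
    }

    /** Returns true if the named broker holds the lock after this call. */
    synchronized boolean tryAcquire(String name) {
        if (!dbUp) {
            return false;
        }
        if (holder == null) {
            holder = name;
        }
        return holder.equals(name);
    }

    synchronized boolean stillHeldBy(String name) {
        return dbUp && name.equals(holder);
    }
}

/** Hypothetical broker that falls back to SLAVE instead of shutting down. */
final class BrokerNode {
    enum Role { SLAVE, MASTER }

    private final String name;
    private final SharedDbLock lock;
    private Role role = Role.SLAVE;

    BrokerNode(String name, SharedDbLock lock) {
        this.name = name;
        this.lock = lock;
    }

    Role role() {
        return role;
    }

    /** One keep-alive cycle. */
    void tick() {
        if (role == Role.MASTER) {
            if (!lock.stillHeldBy(name)) {
                role = Role.SLAVE; // demote rather than stop the broker
            }
        } else if (lock.tryAcquire(name)) {
            role = Role.MASTER; // first slave to get the lock wins
        }
    }
}
```

          In this model a DB outage leaves both brokers in the slave state, and whichever one ticks first after the DB returns becomes the new master, so nobody has to restart a dead broker by hand.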