23 Replies Latest reply on Jun 28, 2010 3:27 AM by Andy Taylor

    JMS Bridge stops retrying even with unlimited retry configured

    Yong Hao Gao Master

      We know that a JMS bridge can be used to reliably pass messages between two separate messaging systems. On server or network failures the bridge can keep retrying to re-establish communication with the source and/or target messaging servers until it is fully restored. The retry process basically includes:

       

      1. Clean up resources.
      2. Retry looking up the JMS objects (the connection factories and destinations of the source and target JMS servers) via JNDI.
      3. Retry creating the JMS connections to the source and target.
      4. Retry creating the JMS sessions/producers/consumers.

       

      This process can be repeated over and over again until it succeeds. Once a retry succeeds, the bridge resumes work as normal.

       

      However, each of the above steps may require remoting invocations and may hang due to network problems. If it hangs, the whole retry process will be stuck there and the next retry will never happen.

       

      We have seen the bridge hang at the JNDI lookup: the remote lookup call seems to be issued over the network, but the response never returns. The JBoss JNDI implementation has some parameters to control timeouts, so this case may be solved by setting a proper timeout. But what if users use another JNDI implementation?
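
      As a rough illustration of the timeout route for the JBoss case, the sketch below builds a JNDI environment with timeouts. It assumes the JBoss JNP client property names `jnp.timeout` (connect timeout) and `jnp.sotimeout` (socket read timeout), both in milliseconds; please check them against the JBoss version in use. The class name `JndiTimeoutEnv` and the provider URL are made up for the example.

```java
import java.util.Properties;
import javax.naming.Context;

// Hypothetical helper, not part of the bridge: builds a JNDI environment
// that asks the JBoss JNP client to time out instead of blocking forever.
final class JndiTimeoutEnv {

    static Properties build(String providerUrl, int connectTimeoutMs, int readTimeoutMs) {
        Properties env = new Properties();
        env.setProperty(Context.INITIAL_CONTEXT_FACTORY,
                "org.jnp.interfaces.NamingContextFactory");
        env.setProperty(Context.PROVIDER_URL, providerUrl);
        // Assumed JNP property names; other JNDI providers use different keys
        // or may offer no timeout setting at all.
        env.setProperty("jnp.timeout", Integer.toString(connectTimeoutMs));
        env.setProperty("jnp.sotimeout", Integer.toString(readTimeoutMs));
        return env;
    }
}
```

      The result would be passed to `new InitialContext(env)` before the lookup. This only covers the JNDI step with one provider, which is why it does not answer the general question.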

       

      And we can't say that JNDI is the only place where a hang can happen, because the other remoting invocations are of a similar nature (synchronous calls over the network). This could matter more with HornetQ than with JBM: HornetQ has a pluggable transport architecture, so you cannot rely on the behaviour of one specific implementation.

       

      Usually restarting the server will solve the problem, but restarts should be very rare in a production system.

       

      So I think we need some mechanism in the bridge itself to correctly handle the issue, i.e. whenever the network is back, the bridge will be restored by retrying.

       

      My suggestion is that we create a timer task to bound each retry. When a new retry is about to begin, it resets the timer. If the timer expires before the retry finishes, it cleans up the current retry and starts a new one.
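
      To make the idea concrete, here is a minimal sketch, not the bridge's actual code. Instead of a separate timer task it uses `Future.get` with a timeout, which gives the same effect: each attempt runs on a worker thread, and if it does not finish in time it is cancelled and the next attempt starts. `TimedRetry`, `retryUntilSuccess` and all parameters are hypothetical names; the `attempt` callable stands for steps 1 to 4 above.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch of a per-attempt timeout around a retry loop.
final class TimedRetry {

    /**
     * Runs the attempt until it returns true.
     * maxAttempts < 0 means retry without limit.
     * Returns the number of attempts used, or -1 if maxAttempts was exhausted.
     */
    static int retryUntilSuccess(Callable<Boolean> attempt, long attemptTimeoutMs,
                                 long pauseMs, int maxAttempts) throws InterruptedException {
        // Cached pool of daemon threads: a hung attempt that ignores the
        // interrupt keeps its thread, and the next attempt gets a fresh one.
        ExecutorService worker = Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r, "bridge-retry");
            t.setDaemon(true);
            return t;
        });
        try {
            for (int n = 1; maxAttempts < 0 || n <= maxAttempts; n++) {
                Future<Boolean> result = worker.submit(attempt);
                try {
                    if (Boolean.TRUE.equals(result.get(attemptTimeoutMs, TimeUnit.MILLISECONDS))) {
                        return n; // this attempt succeeded
                    }
                } catch (TimeoutException e) {
                    result.cancel(true); // attempt hung: interrupt it and move on
                } catch (ExecutionException e) {
                    // attempt failed with an exception: fall through and retry
                }
                Thread.sleep(pauseMs); // wait before the next attempt
            }
            return -1;
        } finally {
            worker.shutdownNow();
        }
    }
}
```

      One caveat: `cancel(true)` only interrupts the worker thread. A call blocked in plain socket I/O may not react to the interrupt, so the hung thread can linger until the underlying socket is closed. The clean-up step of the next attempt would probably need to close those resources explicitly.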
