6 Replies Latest reply on Feb 1, 2013 10:59 AM by alex75

    ClientSessionfactoryImpl: inconsistent delay for scheduleWithFixedDelay

    michael10

      We are using HornetQ 2.2.14 version.

      In our logfiles we found unexpected reconnection log entries, although there were no network connection problems.

       

      We found out that the problem is in java.util.concurrent.ScheduledThreadPoolExecutor, which you use in ClientSessionFactoryImpl.


      It seems that the ScheduledExecutorService doesn't execute the ping runnables in time (we configured the default of 30 sec).

      So the server doesn't receive the pings in time and, after 60 sec (connectionTTL), closes the client connection. The client then has to reconnect, and so on.

      It was only reproducible on specific operating systems (Windows Server 2003 SP2 Build 3790) and is maybe hardware specific. We are also trying the newest Java version.

       

      In our bugfix we replaced the ScheduledExecutorService in the ClientSessionFactory code with Timer.schedule() and a TimerTask.
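      To illustrate, a minimal sketch of such a Timer-based replacement (hypothetical names only — sendPing(), TimerPinger, and the counter are placeholders, not the actual HornetQ client internals):

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a Timer/TimerTask-based pinger; not HornetQ code.
public class TimerPinger {
    final AtomicInteger pingCount = new AtomicInteger();
    private final Timer pingTimer = new Timer("client-ping", true); // daemon thread
    private final long checkPeriodMillis;

    public TimerPinger(long checkPeriodMillis) {
        this.checkPeriodMillis = checkPeriodMillis;
    }

    public void start() {
        // Timer.schedule(task, delay, period) gives fixed-delay execution,
        // comparable to scheduleWithFixedDelay on the executor.
        pingTimer.schedule(new TimerTask() {
            @Override
            public void run() {
                sendPing();
            }
        }, 0, checkPeriodMillis);
    }

    // Cancelling the timer stops its thread, so the Timer and its task
    // can be garbage collected.
    public void stop() {
        pingTimer.cancel();
    }

    void sendPing() {
        pingCount.incrementAndGet(); // placeholder for the real ping
    }
}
```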

      In the method where the connection failover is handled, you call cancelScheduledTasks() only if reconnectAttempts is not equal to 0.

      Is there any reason why you don't call it when reconnectAttempts is set to 0?

      We set reconnectAttempts=0 and we get into trouble when cancelScheduledTasks() is not called: the Timer will not be garbage collected if the TimerTask is not cancelled.

      So I added cancelScheduledTasks() in the else case as well, and it works fine.

       

      The original code of ClientSessionFactoryImpl:

      private void failoverOrReconnect(final Object connectionID, final HornetQException me)
      {
         Set<ClientSessionInternal> sessionsToClose = null;

         synchronized (failoverLock)
         {
            if (connection == null || connection.getID() != connectionID)
            {
               // We already failed over/reconnected - probably the first failure came in, all the connections were failed
               // over then an async connection exception or disconnect
               // came in for one of the already exitLoop connections, so we return - we don't want to call the
               // listeners again

               return;
            }

            if (ClientSessionFactoryImpl.isTrace)
            {
               ClientSessionFactoryImpl.log.trace("Client Connection failed, calling failure listeners and trying to reconnect, reconnectAttempts=" + reconnectAttempts);
            }

            // We call before reconnection occurs to give the user a chance to do cleanup, like cancel messages
            callFailureListeners(me, false, false);

            // Now get locks on all channel 1s, whilst holding the failoverLock - this makes sure
            // there are either no threads executing in createSession, or one is blocking on a createSession
            // result.

            // Then interrupt the channel 1 that is blocking (could just interrupt them all)

            // Then release all channel 1 locks - this allows the createSession to exit the monitor

            // Then get all channel 1 locks again - this ensures any createSession thread has executed the section and
            // returned all its connections to the connection manager (the code to return connections to connection manager
            // must be inside the lock)

            // Then perform failover

            // Then release failoverLock

            // The other side of the bargain - during createSession:
            // The calling thread must get the failoverLock and get its connections when this is locked.
            // While this is still locked it must then get the channel1 lock
            // It can then release the failoverLock
            // It should catch HornetQException.INTERRUPTED in the call to channel.sendBlocking
            // It should then return its connections, with channel 1 lock still held
            // It can then release the channel 1 lock, and retry (which will cause locking on failoverLock
            // until failover is complete)

            if (reconnectAttempts != 0)
            {
               lockChannel1();

               final boolean needToInterrupt;

               synchronized (exitLock)
               {
                  needToInterrupt = inCreateSession;
               }

               unlockChannel1();

               if (needToInterrupt)
               {
                  // Forcing return all channels won't guarantee that any blocked thread will return immediately
                  // So we need to wait for it
                  forceReturnChannel1();

                  // Now we need to make sure that the thread has actually exited and returned its connections
                  // before failover occurs

                  synchronized (exitLock)
                  {
                     while (inCreateSession)
                     {
                        try
                        {
                           exitLock.wait(5000);
                        }
                        catch (InterruptedException e)
                        {
                        }
                     }
                  }
               }

               // Now we absolutely know that no threads are executing in or blocked in createSession, and no
               // more will execute it until failover is complete

               // So.. do failover / reconnection

               CoreRemotingConnection oldConnection = connection;

               connection = null;

               try
               {
                  connector.close();
               }
               catch (Exception ignore)
               {
               }

               cancelScheduledTasks();

               connector = null;

               reconnectSessions(oldConnection, reconnectAttempts);

               if (oldConnection != null)
               {
                  oldConnection.destroy();
               }
            }
            else
            {
               CoreRemotingConnection connectionToDestory = connection;
               if (connectionToDestory != null)
               {
                  connectionToDestory.destroy();
               }

               connection = null;
            }

            if (connection == null)
            {
               synchronized (sessions)
               {
                  sessionsToClose = new HashSet<ClientSessionInternal>(sessions);
               }
               callFailureListeners(me, true, false);
            }
         }

         // This needs to be outside the failover lock to prevent deadlock
         if (connection != null)
         {
            callFailureListeners(me, true, true);
         }
         if (sessionsToClose != null)
         {
            // If connection is null it means we didn't succeed in failing over or reconnecting
            // so we close all the sessions, so they will throw exceptions when attempted to be used

            for (ClientSessionInternal session : sessionsToClose)
            {
               try
               {
                  session.cleanUp(true);
               }
               catch (Exception e)
               {
                  ClientSessionFactoryImpl.log.error("Failed to cleanup session");
               }
            }
         }
      }
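      The proposed change can be modeled as a small standalone sketch (hypothetical names; not the actual HornetQ classes): cancel the scheduled ping task in the reconnectAttempts == 0 branch too, so the task no longer keeps its scheduler alive.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Standalone model of the proposed fix; the real method is
// ClientSessionFactoryImpl.failoverOrReconnect and the real task is the
// client ping runnable.
public class FailoverModel {
    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
    private final ScheduledFuture<?> pingFuture;
    private final int reconnectAttempts;

    FailoverModel(int reconnectAttempts) {
        this.reconnectAttempts = reconnectAttempts;
        pingFuture = scheduler.scheduleWithFixedDelay(
                () -> { /* send ping */ }, 0, 30, TimeUnit.SECONDS);
    }

    void cancelScheduledTasks() {
        pingFuture.cancel(false);
    }

    void failoverOrReconnect() {
        if (reconnectAttempts != 0) {
            cancelScheduledTasks(); // what the original code already does
            // ... close connector, reconnect sessions ...
        } else {
            cancelScheduledTasks(); // the proposed addition for the else branch
            // ... destroy the connection ...
        }
    }

    boolean pingCancelled() {
        return pingFuture.isCancelled();
    }

    void shutdown() {
        scheduler.shutdownNow();
    }
}
```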

       

       

       

      Greetings Michael

        • 1. Re: ClientSessionfactoryImpl: inconsistent delay for scheduleWithFixedDelay
          ataylor

          What do you mean when you say the runnables weren't called in time? Were they called late, and if so by how long, or were they never called?

           

          The reason we don't call cancelScheduledTasks() when reconnectAttempts is 0 is that we don't need to pause the tasks while waiting to reconnect.

           

          Regarding using a TimerTask: we use our own executors so we can control thread usage.

           

          Can you investigate why the ping runnables aren't called on time?

          • 2. Re: ClientSessionfactoryImpl: inconsistent delay for scheduleWithFixedDelay
            michael10

            Hi.

            The pings are sent, but too late. The delay seems to be a multiple of the clientFailureCheckPeriod. We reproduced it with a simple test program:

             

            import java.sql.Timestamp;
            import java.util.concurrent.Executors;
            import java.util.concurrent.ScheduledExecutorService;
            import java.util.concurrent.TimeUnit;

            public class PingTest
            {
                private static ScheduledExecutorService scheduledExec = Executors.newScheduledThreadPool(5);

                public static void main(String[] args)
                {
                    scheduledExec.scheduleWithFixedDelay(new Runnable()
                    {
                        @Override
                        public void run()
                        {
                            System.out.println("This should be called every 5 seconds. It was called at: "
                                    + new Timestamp(System.currentTimeMillis()));
                        }
                    }, 0, 5000, TimeUnit.MILLISECONDS);
                }
            }

             

            and the output was:

            This should be called every 5 seconds. It was called at: 2013-01-25 19:06:20.947
            This should be called every 5 seconds. It was called at: 2013-01-25 19:06:57.791
            This should be called every 5 seconds. It was called at: 2013-01-25 19:07:34.838
            This should be called every 5 seconds. It was called at: 2013-01-25 19:08:12.275

             

            So we have a delay of 37 sec!
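            One thing worth checking: ScheduledThreadPoolExecutor computes its delays from System.nanoTime(), so a misbehaving high-resolution clock source on the affected machines could produce exactly this pattern. A diagnostic sketch (our own test idea, not HornetQ code) that samples both clocks per run — if the nanoTime deltas match the configured delay while the wall-clock deltas drift, the machine's timer source is the likely culprit:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Record (nanoTime, currentTimeMillis) on each scheduled run to see
// which clock, if either, drifts relative to the configured delay.
public class ClockProbe {
    public static List<long[]> probe(int runs, long delayMillis) {
        List<long[]> samples = Collections.synchronizedList(new ArrayList<long[]>());
        CountDownLatch done = new CountDownLatch(runs);
        ScheduledExecutorService exec = Executors.newScheduledThreadPool(1);
        exec.scheduleWithFixedDelay(() -> {
            if (done.getCount() > 0) {
                samples.add(new long[] { System.nanoTime(), System.currentTimeMillis() });
                done.countDown();
            }
        }, 0, delayMillis, TimeUnit.MILLISECONDS);
        try {
            done.await();
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
        }
        exec.shutdownNow();
        return samples;
    }
}
```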

            • 3. Re: ClientSessionfactoryImpl: inconsistent delay for scheduleWithFixedDelay
              ataylor

              I can only assume it's environmental. Have you tried it on different physical machines? Also, are you using a virtual machine at all? I've heard there are timing/clock issues when using them.

               

              I don't have a Windows machine, so I can't really investigate.

              • 4. Re: ClientSessionfactoryImpl: inconsistent delay for scheduleWithFixedDelay
                michael10

                Currently we have problems with:

                Physical: Windows XP Prof 5.1.2600 SP3 x86

                and

                VM: Windows Server 2003 5.2.3790 SP2 x86

                • 5. Re: ClientSessionfactoryImpl: inconsistent delay for scheduleWithFixedDelay
                  jbertram

                  I took your code and ran it on my machine yesterday for several hours.  All the timings were almost exactly 5 seconds every time.  I'm running on Linux with kernel 3.2.0-36-generic-pae.

                   

                  At this point I'd say the problem is in either your JVM or your OS.

                  • 6. Re: ClientSessionfactoryImpl: inconsistent delay for scheduleWithFixedDelay
                    alex75

                    On my notebook, with Windows 7 installed, the simple test program runs exactly on time every run (Michael Woldrich is my colleague at our company).

                    So we opened an issue with Oracle, and we'll see whether it's an Oracle JDK/VM bug or a hardware/operating-system problem.

                    We'll let you know when we find out more.

                    We wanted to inform you, so you're warned about strange problems (with scheduled Runnables) that may show up in other reported discussions/issues.