36 Replies Latest reply on Jun 1, 2009 9:23 AM by jmesnil
      • 15. Re: Messages are lost on Queue?
        clebert.suconic

        This is actually affecting clustering, so I'm commenting out callListeners on RemotingConnectionImpl.

        Test org.jboss.messaging.tests.integration.remoting.DestroyConsumerTest will fail until this is properly fixed. (Well, I could also remove the test, but that would only hide the problem ;-) )

        • 16. Re: Messages are lost on Queue?
          timfox

          Can you explain why you would want failure listeners to be called on destroy, and why that is causing the problem you are experiencing?

          • 17. Re: Messages are lost on Queue?
            timfox

            The examples don't even use ExceptionListeners, so I don't see what difference it would make if they were called or not.

            • 18. Re: Messages are lost on Queue?
              clebert.suconic

              The problem is that a CTRL-C on the client closes the socket.

              When the socket is closed, RemotingServiceImpl::connectionDestroyed is called and the connection is removed.

              At that point, RemotingConnectionImpl::destroy is called (Tim added this call yesterday), and nothing is cleaned up on the server side. As a result, a dead consumer stays listed on QueueImpl and nothing will ever clear it. (The connection no longer exists on RemotingServiceImpl, so the pinger will forget about that connection.)

              To fix this, we need to clean up server-side objects after RemotingServiceImpl::destroy or make sure Ping would catch this. But I'm a bit confused about how this should work (it seems a bit dodgy now, as British people would say: http://www.merriam-webster.com/dictionary/Dodgy).

              • 19. Re: Messages are lost on Queue?
                timfox

                I commented out this test, since it's causing a set of cascading failures in other integration tests.

                • 20. Re: Messages are lost on Queue?
                  jmesnil

                  An update on this issue (using trunk r5910).

                  1. ant runServer
                  2. ant perfListener -Ddrain.queue=false
                  3. Ctrl+C the perfListener task

                  => on the server side, a RemotingServiceImpl.connectionDestroyed() is triggered
                  * the connection is removed from RemotingServiceImpl connections collection
                  * connection.destroy() is called => this won't call the listeners on the connection (no clean up)

                  Since the connection has been removed from the connections collection, the FailedConnectionTask will never check whether it has expired. It therefore never calls fail() on the connection, which is where the associated resources are cleaned up.

                  In short, the server never cleans up when the client is killed with Ctrl+C.


                  A fix would be to keep the connection in the connections collection when RemotingServiceImpl.connectionDestroyed() is called and give the FailedConnectionTask a chance to clean up after that.

                  One thing I don't understand: why is a connection detected as failed by the ConnectionFailedTask not removed from the connections?
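The gap described in this post can be illustrated with a toy model. This is only a sketch under stated assumptions, not the JBoss Messaging code: `ToyConnection`, `ToyRemotingService`, `runFailedConnectionCheck` and the `keepUntilCheck` switch are hypothetical names, and the periodic check is simplified to treat every connection still registered as expired.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the cleanup gap; every name here is illustrative, not the real API. */
final class CleanupGapSketch {

    /** Stand-in for a remoting connection that carries failure listeners. */
    static final class ToyConnection {
        final List<Runnable> failureListeners = new ArrayList<>();

        /** Mirrors the reported behavior: destroying does not call the listeners. */
        void destroy() {
            // intentionally empty: no listener is notified, so nothing is cleaned up
        }

        /** Notifies the listeners so they can release server-side resources. */
        void fail() {
            for (Runnable listener : failureListeners) {
                listener.run();
            }
        }
    }

    /** Stand-in for the remoting service and its connections collection. */
    static final class ToyRemotingService {
        final Map<String, ToyConnection> connections = new HashMap<>();
        final boolean keepUntilCheck;

        ToyRemotingService(boolean keepUntilCheck) {
            this.keepUntilCheck = keepUntilCheck;
        }

        void connectionDestroyed(String id) {
            if (keepUntilCheck) {
                return; // proposed fix: leave the connection for the periodic check
            }
            ToyConnection connection = connections.remove(id);
            if (connection != null) {
                connection.destroy();
            }
        }

        /** Simplified periodic check: it can only fail connections that are still registered. */
        void runFailedConnectionCheck() {
            for (ToyConnection connection : connections.values()) {
                connection.fail();
            }
            connections.clear();
        }
    }

    /** Returns how many consumers remain on the toy queue after a simulated Ctrl+C. */
    static int consumersLeftAfterCtrlC(boolean keepUntilCheck) {
        List<String> queueConsumers = new ArrayList<>();
        queueConsumers.add("consumer-1");

        ToyRemotingService service = new ToyRemotingService(keepUntilCheck);
        ToyConnection connection = new ToyConnection();
        connection.failureListeners.add(queueConsumers::clear);
        service.connections.put("conn-1", connection);

        service.connectionDestroyed("conn-1"); // socket closed by the Ctrl+C
        service.runFailedConnectionCheck();    // periodic task runs afterwards
        return queueConsumers.size();
    }

    public static void main(String[] args) {
        System.out.println("current behavior, consumers left: " + consumersLeftAfterCtrlC(false));
        System.out.println("proposed fix, consumers left:     " + consumersLeftAfterCtrlC(true));
    }
}
```

In this model the dead consumer survives when the connection is removed before the check runs, and is released when the connection is kept until the check.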

                  • 21. Re: Messages are lost on Queue?
                    timfox

                     

                    "jmesnil" wrote:

                    A fix would be to keep the connection in the connections collection when RemotingServiceImpl.connectionDestroyed() is called and give the FailedConnectionTask a chance to clean up after that.


                    +1.

                    How come ClientCrashTest passes though?


                    One thing I don't understand: why is a connection detected as failed by the ConnectionFailedTask not removed from the connections?


                    What is ConnectionFailedTask?

                    • 22. Re: Messages are lost on Queue?
                      timfox

                      If you mean FailedConnectionsTask, then when fail is called on the connection, this should result in the connection removing itself from that list anyway. You can debug it to verify that is the case.

                      • 23. Re: Messages are lost on Queue?
                        clebert.suconic

                         

                        "timfox" wrote:


                        +1.

                        How come ClientCrashTest passes though?



                        In ClientCrashTest the client is killed without closing the socket, so we miss a ping-pong and identify the crash that way.

                        However when you CTRL-C, the socket is being closed (maybe by Shutdown Hooks on the VM???), and we (Netty, Mina?) are treating that event in a different way.

                        • 24. Re: Messages are lost on Queue?
                          timfox

                          Ok, so client crash test needs to be updated to also test the CTRL-C situation

                          • 25. Re: Messages are lost on Queue?
                            jmesnil

                             

                            "timfox" wrote:
                            Ok, so client crash test needs to be updated to also test the CTRL-C situation


                            ClientCrashTest only checks that the remoting connection is removed when the client crashes. It does not check whether the associated server resources (server session, server consumer, etc.) are also properly cleaned up.

                            I've modified the ClientCrashTest to check the number of active sessions and it reports that there is still a session after the client crashes.


                            • 26. Re: Messages are lost on Queue?
                              jmesnil

                              A debugging session using ClientCrashTest makes things clearer.
                              This is the code path on the server when the client crashes:


                              NettyAcceptor$MessagingServerChannelHandler.channelDisconnected() is triggered
                              -> RemotingServiceImpl.connectionDestroyed() is called
                              * the connection is removed from RemotingServiceImpl connections map
                              -> RemotingConnectionImpl.destroy() is called
                              -> RemotingConnectionImpl.internalClose() is called
                              -> NettyConnection.close() is called
                              -> NettyAcceptor$Listener.connectionDestroyed() is called
                              -> RemotingServiceImpl.connectionDestroyed() is called *again*
                              * the connection is no longer in the connections map, the code path stops here


                              Nowhere in this code path are RemotingConnectionImpl's failureListeners notified so that they can clean up the server-side resources associated with the remoting connection (e.g. ServerSession is a FailureListener).

                              The callListeners() call in RemotingConnectionImpl.destroy() has been commented out by Tim (r5467) because it affects clustering.
                              However, we still need to clean up server resources when the remoting connection impl is destroyed.

                              As an aside, BridgeImpl is also a FailureListener. If the other node crashes, the Bridge won't be notified of the failure and won't have the opportunity to clean up its resources either.
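The code path traced above can be replayed in a small model to show where it stops. This is a rough sketch, not the real classes: `DisconnectTraceSketch`, `Conn` and the trace strings are hypothetical, and the listener counter only stands in for something like a ServerSession releasing its resources.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Replays the disconnect call chain in a toy model; names are illustrative only. */
final class DisconnectTraceSketch {
    final List<String> trace = new ArrayList<>();
    final Map<String, Conn> connections = new HashMap<>();
    int listenersNotified = 0;

    /** Stand-in for the remoting connection and its underlying transport connection. */
    final class Conn {
        final String id;
        final List<Runnable> failureListeners = new ArrayList<>();

        Conn(String id) {
            this.id = id;
        }

        void destroy() {
            trace.add("connection.destroy");
            // The listener notification is deliberately absent here, as in the thread.
            internalClose();
        }

        void internalClose() {
            trace.add("connection.internalClose");
            transportClose();
        }

        void transportClose() {
            trace.add("transport.close");
            // Closing the transport reports the destruction to the service a second time.
            connectionDestroyed(id);
        }
    }

    /** Stand-in for the service callback triggered by a channel disconnect. */
    void connectionDestroyed(String id) {
        trace.add("service.connectionDestroyed");
        Conn connection = connections.remove(id);
        if (connection == null) {
            trace.add("not in map, stop");
            return;
        }
        connection.destroy();
    }

    /** Registers one connection with one listener, then simulates the disconnect. */
    static DisconnectTraceSketch simulateChannelDisconnected() {
        DisconnectTraceSketch sketch = new DisconnectTraceSketch();
        Conn connection = sketch.new Conn("c1");
        connection.failureListeners.add(() -> sketch.listenersNotified++);
        sketch.connections.put("c1", connection);
        sketch.connectionDestroyed("c1");
        return sketch;
    }

    public static void main(String[] args) {
        DisconnectTraceSketch sketch = simulateChannelDisconnected();
        for (String step : sketch.trace) {
            System.out.println(step);
        }
        System.out.println("listeners notified: " + sketch.listenersNotified);
    }
}
```

The second `service.connectionDestroyed` entry is a no-op because the map entry is already gone, and the registered listener is never run anywhere along the chain.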



                              • 27. Re: Messages are lost on Queue?
                                jmesnil

                                OK, let's recap what the problem is.
                                We want to ensure that resources are properly cleaned up on the server no matter how the client goes away (clean close, crash, network failure).

                                1/ normal case:
                                - the client closes the session
                                - the client exits.

                                => the server resources are properly cleaned up when the CLOSE packet is handled by the server session.
                                * there is no resource cleanup when the connection is closed (as triggered by Netty channelDisconnected event)

                                2/ unclean case (doing a Ctrl+C, a kill -9 or a System.exit leads to the same event on the server side):
                                - the client exits without closing the session

                                => the server will be informed by a channelDisconnected() event (for Netty)
                                * we wait for the connection TTL before cleaning up the server resources associated to the connection

                                3/ socket exception
                                - a problem occurs on the network

                                => the server will be informed by a Netty exceptionCaught event.
                                * we know for sure the connection is not working, the resources are cleaned up immediately

                                4/ missing heartbeat (e.g the network is unresponsive)

                                - the client does not reply with a pong to a ping from the server

                                => in that case, we wait for the connection TTL (starting when the pong was missed) and clean up resources after the connection TTL is hit

                                cases #1, #3 & #4 work as expected (TODO: list the corresponding test cases)

                                case #2 is not working as expected:
                                - From RemotingServiceImpl's point of view, both case #1 and #2 result in a connectionDestroyed event. We need to know more about it to decide if:
                                1. we destroy the connection immediately (clean case)
                                2. we keep it until the connection TTL is hit (crash case)
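The four cases can be written down as a small lookup, which makes the ambiguity in case #2 easier to see. This is a sketch of the behavior as described in this post only; the enum names (`CloseCause`, `ServerEvent`, `Cleanup`) are hypothetical and do not exist in the code base.

```java
/** Encodes the four cases from the recap; all names are illustrative. */
final class CleanupPolicySketch {

    enum CloseCause { SESSION_CLOSED, UNCLEAN_EXIT, SOCKET_EXCEPTION, MISSED_HEARTBEAT }

    /** What the server observes for each cause. */
    enum ServerEvent { CHANNEL_DISCONNECTED, EXCEPTION_CAUGHT, MISSED_PONG }

    /** When the server resources are meant to be released. */
    enum Cleanup { ON_CLOSE_PACKET, IMMEDIATE, AFTER_CONNECTION_TTL }

    static ServerEvent eventFor(CloseCause cause) {
        switch (cause) {
            case SESSION_CLOSED:   return ServerEvent.CHANNEL_DISCONNECTED;
            case UNCLEAN_EXIT:     return ServerEvent.CHANNEL_DISCONNECTED;
            case SOCKET_EXCEPTION: return ServerEvent.EXCEPTION_CAUGHT;
            case MISSED_HEARTBEAT: return ServerEvent.MISSED_PONG;
            default: throw new IllegalArgumentException("unknown cause: " + cause);
        }
    }

    static Cleanup intendedCleanup(CloseCause cause) {
        switch (cause) {
            case SESSION_CLOSED:   return Cleanup.ON_CLOSE_PACKET;
            case UNCLEAN_EXIT:     return Cleanup.AFTER_CONNECTION_TTL;
            case SOCKET_EXCEPTION: return Cleanup.IMMEDIATE;
            case MISSED_HEARTBEAT: return Cleanup.AFTER_CONNECTION_TTL;
            default: throw new IllegalArgumentException("unknown cause: " + cause);
        }
    }

    public static void main(String[] args) {
        for (CloseCause cause : CloseCause.values()) {
            System.out.println(cause + " -> " + eventFor(cause) + " -> " + intendedCleanup(cause));
        }
    }
}
```

Cases #1 and #2 map to the same server event but to different cleanup policies, so the event alone cannot select the policy; some extra piece of state is needed.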

                                Trustin once talked about the SMTP protocol iirc which sends a "last" request before closing a connection.
                                We could do the same: just before closing the connection, we send a LAST request, so that we can know when the channelDisconnected event is triggered if the connection was closed properly (we received the LAST request) or not.
                                We can then pass the information to the RemotingServiceImpl which can then distinguish between case #1 and #2.

                                Sending this LAST request would occur in Channel.close().
                                We would intercept this LAST request in MessagingServerPacketHandler and flag its RemotingConnection as closed.

                                what do you think?

                                • 28. Re: Messages are lost on Queue?
                                  timfox

                                  2) and 3) should be handled the same on the server.

                                  In other words, the only case where we immediately remove the ServerSession is when we get an explicit SESSION_CLOSE message.

                                  Regarding the LAST request: we already have such a request; it's the SESSION_CLOSE request.

                                  • 29. Re: Messages are lost on Queue?
                                    jmesnil

                                     

                                    "timfox" wrote:
                                    2) and 3) should be handled the same on the server.

                                    In other words, the only case where we immediately remove the ServerSession is when we get an explicit SESSION_CLOSE message.


                                    This is not the case today: #2 results in a connectionDestroyed (with no clean-up) and #3 results in a connectionException (with immediate clean-up).

                                    #3 needs to be fixed so that the server session is removed only when the connection TTL is hit.

                                    "timfox" wrote:

                                    Regarding the LAST request: we already have such a request; it's the SESSION_CLOSE request.


                                    When I receive the SESSION_CLOSE on the server, I flag the remoting connection as "ready to be closed" once the server session has handled the packet.
                                    In the remoting service, when I receive a connectionDestroyed event, I check if the remoting connection is ready to be closed.
                                    If it is (normal case), I remove the connection from the remoting service and destroy it.
                                    Otherwise, I do nothing; the connection and its resources will be cleaned up when the connection TTL is hit.
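The approach in this post could look roughly like the following. It is a minimal sketch under assumptions, not the actual change: `ReadyToCloseSketch`, `Conn`, `onSessionClose`, `checkExpired` and the explicit `now` timestamps are hypothetical, and the TTL check is reduced to a single timestamp comparison.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Toy remoting service using a "ready to be closed" flag; names are illustrative. */
final class ReadyToCloseSketch {

    /** Stand-in for a remoting connection and its server-side resources. */
    static final class Conn {
        boolean readyToBeClosed;   // set once the session close has been handled
        boolean resourcesReleased; // stands in for session and consumer cleanup
        final long lastSeenMillis; // hypothetical: last time the client was heard from

        Conn(long now) {
            this.lastSeenMillis = now;
        }
    }

    private final Map<String, Conn> connections = new HashMap<>();
    private final long connectionTtlMillis;

    ReadyToCloseSketch(long connectionTtlMillis) {
        this.connectionTtlMillis = connectionTtlMillis;
    }

    Conn register(String id, long now) {
        Conn connection = new Conn(now);
        connections.put(id, connection);
        return connection;
    }

    /** Clean path: the session releases its resources, then the connection is flagged. */
    void onSessionClose(String id) {
        Conn connection = connections.get(id);
        if (connection != null) {
            connection.resourcesReleased = true;
            connection.readyToBeClosed = true;
        }
    }

    /** Transport-level disconnect: remove only connections that were closed cleanly. */
    void connectionDestroyed(String id) {
        Conn connection = connections.get(id);
        if (connection != null && connection.readyToBeClosed) {
            connections.remove(id);
        }
        // otherwise do nothing and leave the connection for the TTL check
    }

    /** Periodic check: release and remove connections whose TTL has elapsed. */
    void checkExpired(long now) {
        Iterator<Conn> it = connections.values().iterator();
        while (it.hasNext()) {
            Conn connection = it.next();
            if (now - connection.lastSeenMillis >= connectionTtlMillis) {
                connection.resourcesReleased = true;
                it.remove();
            }
        }
    }

    boolean isRegistered(String id) {
        return connections.containsKey(id);
    }
}
```

With a TTL of 1000 ms, a connection that sent its session close is removed on the disconnect event, while a crashed one stays registered until `checkExpired` is called with a time at least 1000 ms after it was last seen.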