11 Replies Latest reply on Jun 2, 2011 10:48 PM by roarke100

    HornetQ Server doesn't recover from Client Failure

    roarke100

      Hi,

       

      First of all this is a very specific situation that I have seen happen in the wild and can reproduce using a debugger.  It seems to be a fundamental flaw that there are any situations in which a server is unable to close timed out connections.  It appears that if for some reason the client application locks up in a way that holds the socket connection open the server is then unable to cleanup the connection after the client ttl expires.

       

      Not being entirely familiar with the code, it basically looks like the socket write call is being made while holding a lock that needs to be held in order to cleanup the connection on the server side.  To reproduce the problem I simply put a breakpoint in the client code on the SocketInputStream read method and the PingRunner to prevent the connection from being closed from the client (after a message is sent).  The connection on the server will then remain open beyond the client ttl.

       

      I can reproduce this in trunk code using the NettyConnection.

       

      Shouldn't the server fail client connections beyond it's ttl regardless of client problems?  This seems like malicious client code could take advantage of this problem.  Especially since future writes to that connection also sit in while loop consuming cpu cycles waiting for the connection to free up.

       

      --Roarke

        • 1. Re: HornetQ Server doesn't recover from Client Failure
          clebert.suconic

          >> "It appears that if for some reason the client application locks up in a way that holds the socket connection open the server is then unable to cleanup the connection after the client ttl expires."

           

          What that means?

           

          I'm not sure you can make anything to hold a lock on the connection, unless you hack it with a debugger.

           

          all that should happen is the connection expiring after not receiving pings or any other msg for more than TTL seconds.

           

           

          Anything other than that (if you are reporting a bug), I would need a testcase showing the issue.

          • 2. Re: HornetQ Server doesn't recover from Client Failure
            roarke100

            You say you are not sure that this can happen outside of a debugger, but we have gotten into this situation multiple times now (5 times in 4 weeks JBoss6 Hornetq 2.1.2), but because we have 90+ connections it is a large group of clients to sort through to see how it happened.  We have the JBoss thread dump and JVM cpu monitor as proof of the situation, we just don't know what is causing it to happen on the client.  We are working on it and if we find an issue in hornetq I will let you know.  One speculation is that a client is nearly running out of memory seizing up the JVM with garbage collection.  I started this discussion to raise a broader point though.

             

            Wouldn't it be safer to err on the side that allows the server to close connections in all situations (even if the client is holding a connection open)? 

            • 3. Re: HornetQ Server doesn't recover from Client Failure
              clebert.suconic

              ok, a thread dump is good information.

               

              Also: Can you upgrade to 2.2.2?

              • 4. Re: HornetQ Server doesn't recover from Client Failure
                clebert.suconic

                Please attach both your client and server's dumps. Or any other information you have.

                • 5. Re: HornetQ Server doesn't recover from Client Failure
                  roarke100

                  We can, but not immediately.  We need to upgrade a number of servers simultaneously because of the client incompatibilities.  I can upgrade our test environment, but the situation hasn't arisen without unnatural (i.e. a debugger) intervention in our test environment.

                   

                  Attached is the server thread dump.  Note the NettyConnection references.  When the problem arises it seems to last around 30 min though we were only able to observer the actual time once when we happen have a jvm monitor on jboss at the time of the problem.  During that time the the repeated thread dumps were consistent in regards to the NettyConnection class (specifically it appeared all the load was coming from the NettyConnection write method where it was looping in the while loop waiting for the socket to free up).  During that 30min messaging was slow on all topics.  I don't know which topic was having the connection trouble.  After the 30min everything goes back to normal no messages lost.  I don't have client thread dumps because the are distributed across our users desktops.  We are searching there logs now if we find anything interesting we will pass it along.

                   

                  Obviously there are a number of things that could be going wrong. It could be a server configuration issue, a networking issue, a hornetq issue, etc...  I don't think you can diagnose the issue we are running into.  Though any help you can provide is greatly appreciated.  It just seems to me the server should be able to timeout the connection, and you can see in the thread dump it is trying to but can't because it is blocked by the socket write.

                  • 6. Re: HornetQ Server doesn't recover from Client Failure
                    clebert.suconic

                    This dump is from  your real environment. or with a debugger attached?

                     

                     

                    BTW: Just if you have it.. I prefer the output from the JDK better (kill -3) or the command line to get a dump. (I'm not sure if it's easier to understand or I'm just used to that)

                    • 7. Re: HornetQ Server doesn't recover from Client Failure
                      roarke100

                      This is from the real environment.  This is all I have for a thread dump from the live environment.  I can produce your prefered output from our test environment using the debugger.  Give me a few minutes. 

                      • 8. Re: HornetQ Server doesn't recover from Client Failure
                        clebert.suconic

                        I can produce your prefered output from our test environment using the debugger.  Give me a few minutes.

                         

                        Don't bother about that... Just if you had it.

                        • 9. Re: HornetQ Server doesn't recover from Client Failure
                          clebert.suconic

                          What I see in your dumps:

                           

                          In one of your threads, you have a SendBatched waiting the channel to unlock in order to send the packet to the client:

                           

                          hread: Thread-21 (group:HornetQ-server-threads392518037-1632393274) : priority:5, demon:false, threadId:144, threadState:RUNNABLE

                          java.lang.Thread.yield(Native Method)

                          org.hornetq.core.remoting.impl.netty.NettyConnection.write(NettyConnection.java:167)

                          org.hornetq.core.protocol.core.impl.ChannelImpl.send(ChannelImpl.java:199)

                          - locked <0x4779a7cc> (a java.lang.Object)

                          org.hornetq.core.protocol.core.impl.ChannelImpl.sendBatched(ChannelImpl.java:147)

                          org.hornetq.core.protocol.core.impl.CoreSessionCallback.sendMessage(CoreSessionCallback.java:75)

                          org.hornetq.core.server.impl.ServerConsumerImpl.deliverStandardMessage(ServerConsumerImpl.java:644)

                          org.hornetq.core.server.impl.ServerConsumerImpl.handle(ServerConsumerImpl.java:253)

                          - locked <0x27216d37> (a java.lang.Object)

                          org.hornetq.core.server.impl.QueueImpl.handle(QueueImpl.java:1384)

                          - locked <0x3f766bd> (a org.hornetq.core.server.impl.QueueImpl)

                          org.hornetq.core.server.impl.QueueImpl.deliver(QueueImpl.java:1175)

                          - locked <0x3f766bd> (a org.hornetq.core.server.impl.QueueImpl)

                          org.hornetq.core.server.impl.QueueImpl.access$700(QueueImpl.java:67)

                          org.hornetq.core.server.impl.QueueImpl$DeliverRunner.run(QueueImpl.java:1601)

                          org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run(OrderedExecutorFactory.java:100)

                          java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

                          java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

                          java.lang.Thread.run(Thread.java:662)

                           

                           

                          However, the channel is holding here:

                           

                           

                          java.net.SocketOutputStream.socketWrite0(Native Method)

                          java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)

                          java.net.SocketOutputStream.write(SocketOutputStream.java:136)

                          org.jboss.netty.buffer.HeapChannelBuffer.getBytes(HeapChannelBuffer.java:116)

                          org.jboss.netty.buffer.DynamicChannelBuffer.getBytes(DynamicChannelBuffer.java:156)

                          org.jboss.netty.channel.socket.oio.OioWorker.write(OioWorker.java:119)

                          - locked <0x7b713328> (a java.net.SocketOutputStream)

                          org.jboss.netty.channel.socket.oio.OioServerSocketPipelineSink.handleAcceptedSocket(OioServerSocketPipelineSink.java:126)

                          org.jboss.netty.channel.socket.oio.OioServerSocketPipelineSink.eventSunk(OioServerSocketPipelineSink.java:64)

                          org.jboss.netty.channel.StaticChannelPipeline$StaticChannelHandlerContext.sendDownstream(StaticChannelPipeline.java:502)

                          org.jboss.netty.channel.SimpleChannelHandler.writeRequested(SimpleChannelHandler.java:304)

                          org.jboss.netty.channel.SimpleChannelHandler.handleDownstream(SimpleChannelHandler.java:266)

                          org.jboss.netty.channel.StaticChannelPipeline.sendDownstream(StaticChannelPipeline.java:385)

                          org.jboss.netty.channel.StaticChannelPipeline.sendDownstream(StaticChannelPipeline.java:380)

                          org.jboss.netty.channel.Channels.write(Channels.java:611)

                          org.jboss.netty.channel.Channels.write(Channels.java:578)

                          org.jboss.netty.channel.AbstractChannel.write(AbstractChannel.java:258)

                          org.hornetq.core.remoting.impl.netty.NettyConnection.write(NettyConnection.java:206)

                          org.hornetq.core.protocol.core.impl.ChannelImpl.send(ChannelImpl.java:199)

                          - locked <0x44d5c668> (a java.lang.Object)

                          org.hornetq.core.protocol.core.impl.ChannelImpl.sendBatched(ChannelImpl.java:147)

                          org.hornetq.core.protocol.core.impl.CoreSessionCallback.sendMessage(CoreSessionCallback.java:75)

                          org.hornetq.core.server.impl.ServerConsumerImpl.deliverStandardMessage(ServerConsumerImpl.java:644)

                          org.hornetq.core.server.impl.ServerConsumerImpl.handle(ServerConsumerImpl.java:253)

                          - locked <0x2ed4f926> (a java.lang.Object)

                          org.hornetq.core.server.impl.QueueImpl.handle(QueueImpl.java:1384)

                          - locked <0x203145cf> (a org.hornetq.core.server.impl.QueueImpl)

                          org.hornetq.core.server.impl.QueueImpl.deliver(QueueImpl.java:1175)

                          - locked <0x203145cf> (a org.hornetq.core.server.impl.QueueImpl)

                          org.hornetq.core.server.impl.QueueImpl.access$700(QueueImpl.java:67)

                          org.hornetq.core.server.impl.QueueImpl$DeliverRunner.run(QueueImpl.java:1601)

                          org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run(OrderedExecutorFactory.java:100)

                          java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

                          java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

                          java.lang.Thread.run(Thread.java:662)

                           

                           

                           

                          For some reason the socketWrite0 is not finishing here.

                           

                           

                          You have Netty configured to run with OIO. If you had NIO=true on the NettyAcceptor, (Chapter 16.3 on the 2.2.2 doc) the socketWrite wouldn't affect the lock since the system would give up writing in 10 second.

                           

                           

                           

                          Can you try NIO at your system?

                           

                           

                          You should probably tune up your Client. Or you can give a try to NIO on netty also.

                          • 10. Re: HornetQ Server doesn't recover from Client Failure
                            clebert.suconic

                            I'm pinging Trustin and asking for help about the socketwrite0 not unblocking in case of a failure.

                            • 11. Re: HornetQ Server doesn't recover from Client Failure
                              roarke100

                              Ahhh... RTFM     Turning on NIO seems to fix our issue. 

                               

                              In regards to the socket being held open it seems to be caused when one of our clients runs out of memory.  Though I haven't been able to reproduce it there is evidence in our client logs which indicates as much.  Once in this state and using OIO the server cannot cleanup the client connection once ttl is reached. The thread holding the server socket connection is holding a lock that is required to do the cleanup. As seen in this part of the attached thread dump.

                               

                              Thread: hornetq-failure-check-thread : priority:5, demon:false, threadId:54, threadState:BLOCKED

                              - waiting on <0x27216d37> (a java.lang.Object)

                              org.hornetq.core.server.impl.ServerConsumerImpl.setStarted(ServerConsumerImpl.java:398)

                              org.hornetq.core.server.impl.ServerSessionImpl.doRollback(ServerSessionImpl.java:1145)

                              org.hornetq.core.server.impl.ServerSessionImpl.rollback(ServerSessionImpl.java:575)

                              org.hornetq.core.server.impl.ServerSessionImpl.doClose(ServerSessionImpl.java:250)

                              - locked <0x7bfe9a1> (a org.hornetq.core.server.impl.ServerSessionImpl)

                              org.hornetq.core.server.impl.ServerSessionImpl.access$100(ServerSessionImpl.java:76)

                              org.hornetq.core.server.impl.ServerSessionImpl$1.done(ServerSessionImpl.java:931)

                              org.hornetq.core.persistence.impl.journal.OperationContextImpl.executeOnCompletion(OperationContextImpl.java:158)

                              org.hornetq.core.persistence.impl.journal.JournalStorageManager.afterCompleteOperations(JournalStorageManager.java:394)

                              org.hornetq.core.server.impl.ServerSessionImpl.close(ServerSessionImpl.java:921)

                              org.hornetq.core.server.impl.ServerSessionImpl.connectionFailed(ServerSessionImpl.java:1079)

                              org.hornetq.core.protocol.core.impl.RemotingConnectionImpl.callFailureListeners(RemotingConnectionImpl.java:482)

                              org.hornetq.core.protocol.core.impl.RemotingConnectionImpl.fail(RemotingConnectionImpl.java:254)

                              org.hornetq.core.remoting.server.impl.RemotingServiceImpl$FailureCheckAndFlushThread.run(RemotingServiceImpl.java:541)

                               

                               

                              Thank you for your help with this!