-
1. Re: HornetQ Server doesn't recover from Client Failure
clebert.suconic Jun 2, 2011 2:06 PM (in response to roarke100)>> "It appears that if for some reason the client application locks up in a way that holds the socket connection open the server is then unable to cleanup the connection after the client ttl expires."
What that means?
I'm not sure you can make anything to hold a lock on the connection, unless you hack it with a debugger.
all that should happen is the connection expiring after not receiving pings or any other msg for more than TTL seconds.
Anything other than that (if you are reporting a bug), I would need a testcase showing the issue.
-
2. Re: HornetQ Server doesn't recover from Client Failure
roarke100 Jun 2, 2011 2:28 PM (in response to clebert.suconic)You say you are not sure that this can happen outside of a debugger, but we have gotten into this situation multiple times now (5 times in 4 weeks JBoss6 Hornetq 2.1.2), but because we have 90+ connections it is a large group of clients to sort through to see how it happened. We have the JBoss thread dump and JVM cpu monitor as proof of the situation, we just don't know what is causing it to happen on the client. We are working on it and if we find an issue in hornetq I will let you know. One speculation is that a client is nearly running out of memory seizing up the JVM with garbage collection. I started this discussion to raise a broader point though.
Wouldn't it be safer to err on the side that allows the server to close connections in all situations (even if the client is holding a connection open)?
-
3. Re: HornetQ Server doesn't recover from Client Failure
clebert.suconic Jun 2, 2011 4:08 PM (in response to roarke100)ok, a thread dump is good information.
Also: Can you upgrade to 2.2.2?
-
4. Re: HornetQ Server doesn't recover from Client Failure
clebert.suconic Jun 2, 2011 4:19 PM (in response to roarke100)Please attach both your client and server's dumps. Or any other information you have.
-
5. Re: HornetQ Server doesn't recover from Client Failure
roarke100 Jun 2, 2011 4:37 PM (in response to clebert.suconic)We can, but not immediately. We need to upgrade a number of servers simultaneously because of the client incompatibilities. I can upgrade our test environment, but the situation hasn't arisen without unnatural (i.e. a debugger) intervention in our test environment.
Attached is the server thread dump. Note the NettyConnection references. When the problem arises it seems to last around 30 min though we were only able to observer the actual time once when we happen have a jvm monitor on jboss at the time of the problem. During that time the the repeated thread dumps were consistent in regards to the NettyConnection class (specifically it appeared all the load was coming from the NettyConnection write method where it was looping in the while loop waiting for the socket to free up). During that 30min messaging was slow on all topics. I don't know which topic was having the connection trouble. After the 30min everything goes back to normal no messages lost. I don't have client thread dumps because the are distributed across our users desktops. We are searching there logs now if we find anything interesting we will pass it along.
Obviously there are a number of things that could be going wrong. It could be a server configuration issue, a networking issue, a hornetq issue, etc... I don't think you can diagnose the issue we are running into. Though any help you can provide is greatly appreciated. It just seems to me the server should be able to timeout the connection, and you can see in the thread dump it is trying to but can't because it is blocked by the socket write.
-
jboss_thread_dump.txt.zip 10.6 KB
-
-
6. Re: HornetQ Server doesn't recover from Client Failure
clebert.suconic Jun 2, 2011 4:57 PM (in response to roarke100)This dump is from your real environment. or with a debugger attached?
BTW: Just if you have it.. I prefer the output from the JDK better (kill -3) or the command line to get a dump. (I'm not sure if it's easier to understand or I'm just used to that)
-
7. Re: HornetQ Server doesn't recover from Client Failure
roarke100 Jun 2, 2011 5:14 PM (in response to clebert.suconic)This is from the real environment. This is all I have for a thread dump from the live environment. I can produce your prefered output from our test environment using the debugger. Give me a few minutes.
-
8. Re: HornetQ Server doesn't recover from Client Failure
clebert.suconic Jun 2, 2011 5:21 PM (in response to roarke100)I can produce your prefered output from our test environment using the debugger. Give me a few minutes.
Don't bother about that... Just if you had it.
-
9. Re: HornetQ Server doesn't recover from Client Failure
clebert.suconic Jun 2, 2011 5:27 PM (in response to clebert.suconic)What I see in your dumps:
In one of your threads, you have a SendBatched waiting the channel to unlock in order to send the packet to the client:
hread: Thread-21 (group:HornetQ-server-threads392518037-1632393274) : priority:5, demon:false, threadId:144, threadState:RUNNABLE
java.lang.Thread.yield(Native Method)
org.hornetq.core.remoting.impl.netty.NettyConnection.write(NettyConnection.java:167)
org.hornetq.core.protocol.core.impl.ChannelImpl.send(ChannelImpl.java:199)
- locked <0x4779a7cc> (a java.lang.Object)
org.hornetq.core.protocol.core.impl.ChannelImpl.sendBatched(ChannelImpl.java:147)
org.hornetq.core.protocol.core.impl.CoreSessionCallback.sendMessage(CoreSessionCallback.java:75)
org.hornetq.core.server.impl.ServerConsumerImpl.deliverStandardMessage(ServerConsumerImpl.java:644)
org.hornetq.core.server.impl.ServerConsumerImpl.handle(ServerConsumerImpl.java:253)
- locked <0x27216d37> (a java.lang.Object)
org.hornetq.core.server.impl.QueueImpl.handle(QueueImpl.java:1384)
- locked <0x3f766bd> (a org.hornetq.core.server.impl.QueueImpl)
org.hornetq.core.server.impl.QueueImpl.deliver(QueueImpl.java:1175)
- locked <0x3f766bd> (a org.hornetq.core.server.impl.QueueImpl)
org.hornetq.core.server.impl.QueueImpl.access$700(QueueImpl.java:67)
org.hornetq.core.server.impl.QueueImpl$DeliverRunner.run(QueueImpl.java:1601)
org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run(OrderedExecutorFactory.java:100)
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
java.lang.Thread.run(Thread.java:662)
However, the channel is holding here:
java.net.SocketOutputStream.socketWrite0(Native Method)
java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
java.net.SocketOutputStream.write(SocketOutputStream.java:136)
org.jboss.netty.buffer.HeapChannelBuffer.getBytes(HeapChannelBuffer.java:116)
org.jboss.netty.buffer.DynamicChannelBuffer.getBytes(DynamicChannelBuffer.java:156)
org.jboss.netty.channel.socket.oio.OioWorker.write(OioWorker.java:119)
- locked <0x7b713328> (a java.net.SocketOutputStream)
org.jboss.netty.channel.socket.oio.OioServerSocketPipelineSink.handleAcceptedSocket(OioServerSocketPipelineSink.java:126)
org.jboss.netty.channel.socket.oio.OioServerSocketPipelineSink.eventSunk(OioServerSocketPipelineSink.java:64)
org.jboss.netty.channel.StaticChannelPipeline$StaticChannelHandlerContext.sendDownstream(StaticChannelPipeline.java:502)
org.jboss.netty.channel.SimpleChannelHandler.writeRequested(SimpleChannelHandler.java:304)
org.jboss.netty.channel.SimpleChannelHandler.handleDownstream(SimpleChannelHandler.java:266)
org.jboss.netty.channel.StaticChannelPipeline.sendDownstream(StaticChannelPipeline.java:385)
org.jboss.netty.channel.StaticChannelPipeline.sendDownstream(StaticChannelPipeline.java:380)
org.jboss.netty.channel.Channels.write(Channels.java:611)
org.jboss.netty.channel.Channels.write(Channels.java:578)
org.jboss.netty.channel.AbstractChannel.write(AbstractChannel.java:258)
org.hornetq.core.remoting.impl.netty.NettyConnection.write(NettyConnection.java:206)
org.hornetq.core.protocol.core.impl.ChannelImpl.send(ChannelImpl.java:199)
- locked <0x44d5c668> (a java.lang.Object)
org.hornetq.core.protocol.core.impl.ChannelImpl.sendBatched(ChannelImpl.java:147)
org.hornetq.core.protocol.core.impl.CoreSessionCallback.sendMessage(CoreSessionCallback.java:75)
org.hornetq.core.server.impl.ServerConsumerImpl.deliverStandardMessage(ServerConsumerImpl.java:644)
org.hornetq.core.server.impl.ServerConsumerImpl.handle(ServerConsumerImpl.java:253)
- locked <0x2ed4f926> (a java.lang.Object)
org.hornetq.core.server.impl.QueueImpl.handle(QueueImpl.java:1384)
- locked <0x203145cf> (a org.hornetq.core.server.impl.QueueImpl)
org.hornetq.core.server.impl.QueueImpl.deliver(QueueImpl.java:1175)
- locked <0x203145cf> (a org.hornetq.core.server.impl.QueueImpl)
org.hornetq.core.server.impl.QueueImpl.access$700(QueueImpl.java:67)
org.hornetq.core.server.impl.QueueImpl$DeliverRunner.run(QueueImpl.java:1601)
org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run(OrderedExecutorFactory.java:100)
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
java.lang.Thread.run(Thread.java:662)
For some reason the socketWrite0 is not finishing here.
You have Netty configured to run with OIO. If you had NIO=true on the NettyAcceptor, (Chapter 16.3 on the 2.2.2 doc) the socketWrite wouldn't affect the lock since the system would give up writing in 10 second.
Can you try NIO at your system?
You should probably tune up your Client. Or you can give a try to NIO on netty also.
-
10. Re: HornetQ Server doesn't recover from Client Failure
clebert.suconic Jun 2, 2011 5:30 PM (in response to clebert.suconic)I'm pinging Trustin and asking for help about the socketwrite0 not unblocking in case of a failure.
-
11. Re: HornetQ Server doesn't recover from Client Failure
roarke100 Jun 2, 2011 10:48 PM (in response to clebert.suconic)Ahhh... RTFM Turning on NIO seems to fix our issue.
In regards to the socket being held open it seems to be caused when one of our clients runs out of memory. Though I haven't been able to reproduce it there is evidence in our client logs which indicates as much. Once in this state and using OIO the server cannot cleanup the client connection once ttl is reached. The thread holding the server socket connection is holding a lock that is required to do the cleanup. As seen in this part of the attached thread dump.
Thread: hornetq-failure-check-thread : priority:5, demon:false, threadId:54, threadState:BLOCKED
- waiting on <0x27216d37> (a java.lang.Object)
org.hornetq.core.server.impl.ServerConsumerImpl.setStarted(ServerConsumerImpl.java:398)
org.hornetq.core.server.impl.ServerSessionImpl.doRollback(ServerSessionImpl.java:1145)
org.hornetq.core.server.impl.ServerSessionImpl.rollback(ServerSessionImpl.java:575)
org.hornetq.core.server.impl.ServerSessionImpl.doClose(ServerSessionImpl.java:250)
- locked <0x7bfe9a1> (a org.hornetq.core.server.impl.ServerSessionImpl)
org.hornetq.core.server.impl.ServerSessionImpl.access$100(ServerSessionImpl.java:76)
org.hornetq.core.server.impl.ServerSessionImpl$1.done(ServerSessionImpl.java:931)
org.hornetq.core.persistence.impl.journal.OperationContextImpl.executeOnCompletion(OperationContextImpl.java:158)
org.hornetq.core.persistence.impl.journal.JournalStorageManager.afterCompleteOperations(JournalStorageManager.java:394)
org.hornetq.core.server.impl.ServerSessionImpl.close(ServerSessionImpl.java:921)
org.hornetq.core.server.impl.ServerSessionImpl.connectionFailed(ServerSessionImpl.java:1079)
org.hornetq.core.protocol.core.impl.RemotingConnectionImpl.callFailureListeners(RemotingConnectionImpl.java:482)
org.hornetq.core.protocol.core.impl.RemotingConnectionImpl.fail(RemotingConnectionImpl.java:254)
org.hornetq.core.remoting.server.impl.RemotingServiceImpl$FailureCheckAndFlushThread.run(RemotingServiceImpl.java:541)
Thank you for your help with this!