43 Replies. Latest reply on Aug 11, 2011 6:07 PM by clebert.suconic

    Deadlock when using netty NIO acceptor

    carl.heymann

      Hi

       

      I've been trying to use netty acceptors with NIO, rather than blocking IO. HornetQ doesn't seem to work well in this configuration, freezing up after running for only a very short while.

       

      Setup

      2 cores, 32-bit Linux, AIO persistence.

       

      Acceptor configuration:

            <acceptor name="netty">

               <factory-class>org.hornetq.core.remoting.impl.netty.NettyAcceptorFactory</factory-class>

               <param key="host"  value="${hornetq.remoting.netty.host:localhost}"/>

               <param key="port"  value="${hornetq.remoting.netty.port:5445}"/>

               <param key="batch-delay" value="50"/>

                  <param key="use-nio" value="true" />

            </acceptor>

       

      Paging enabled:

               <max-size-bytes>104857600</max-size-bytes>

               <page-size-bytes>4096</page-size-bytes>

               <message-counter-history-day-limit>10</message-counter-history-day-limit>

               <address-full-policy>PAGE</address-full-policy>

       

      Results

      The server freezes after a very short time running (almost immediately). I have to kill it with kill -9. With a profiler, I see a deadlock:

       

      New I/O server worker #1-2 [BLOCKED; waiting to lock java.lang.Object@904497]

      org.hornetq.core.server.impl.ServerConsumerImpl.promptDelivery(ServerConsumerImpl.java:664)

      org.hornetq.core.server.impl.ServerConsumerImpl.readyForWriting(ServerConsumerImpl.java:642)

      org.hornetq.core.remoting.impl.netty.NettyConnection.fireReady(NettyConnection.java:264)

      org.hornetq.core.remoting.impl.netty.NettyAcceptor$Listener.connectionReadyForWrites(NettyAcceptor.java:695)

      org.hornetq.core.remoting.impl.netty.HornetQChannelHandler.channelInterestChanged(HornetQChannelHandler.java:65)

      org.jboss.netty.channel.SimpleChannelHandler.handleUpstream(SimpleChannelHandler.java:136)

      org.jboss.netty.channel.StaticChannelPipeline.sendUpstream(StaticChannelPipeline.java:362)

      org.jboss.netty.channel.StaticChannelPipeline$StaticChannelHandlerContext.sendUpstream(StaticChannelPipeline.java:514)

      org.jboss.netty.channel.SimpleChannelUpstreamHandler.channelInterestChanged(SimpleChannelUpstreamHandler.java:183)

      org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:116)

      org.jboss.netty.channel.StaticChannelPipeline.sendUpstream(StaticChannelPipeline.java:362)

      org.jboss.netty.channel.StaticChannelPipeline.sendUpstream(StaticChannelPipeline.java:357)

      org.jboss.netty.channel.Channels.fireChannelInterestChanged(Channels.java:335)

      org.jboss.netty.channel.socket.nio.NioSocketChannel$WriteRequestQueue.poll(NioSocketChannel.java:242)

      org.jboss.netty.channel.socket.nio.NioSocketChannel$WriteRequestQueue.poll(NioSocketChannel.java:197)

      org.jboss.netty.channel.socket.nio.NioWorker.write0(NioWorker.java:455)

      org.jboss.netty.channel.socket.nio.NioWorker.writeFromUserCode(NioWorker.java:388)

      org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink.handleAcceptedSocket(NioServerSocketPipelineSink.java:137)

      org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink.eventSunk(NioServerSocketPipelineSink.java:76)

      org.jboss.netty.channel.StaticChannelPipeline$StaticChannelHandlerContext.sendDownstream(StaticChannelPipeline.java:502)

      org.jboss.netty.channel.SimpleChannelHandler.writeRequested(SimpleChannelHandler.java:304)

      org.jboss.netty.channel.SimpleChannelHandler.handleDownstream(SimpleChannelHandler.java:266)

      org.jboss.netty.channel.StaticChannelPipeline.sendDownstream(StaticChannelPipeline.java:385)

      org.jboss.netty.channel.StaticChannelPipeline.sendDownstream(StaticChannelPipeline.java:380)

      org.jboss.netty.channel.Channels.write(Channels.java:611)

      org.jboss.netty.channel.Channels.write(Channels.java:578)

      org.jboss.netty.channel.AbstractChannel.write(AbstractChannel.java:259)

      org.hornetq.core.remoting.impl.netty.NettyConnection.write(NettyConnection.java:211)

      org.hornetq.core.protocol.core.impl.ChannelImpl.send(ChannelImpl.java:199)

      org.hornetq.core.protocol.core.impl.ChannelImpl.send(ChannelImpl.java:142)

      org.hornetq.core.protocol.core.impl.CoreSessionCallback.sendProducerCreditsMessage(CoreSessionCallback.java:87)

      org.hornetq.core.server.impl.ServerSessionImpl$2.run(ServerSessionImpl.java:1151)

      org.hornetq.core.paging.impl.PagingStoreImpl.executeRunnableWhenMemoryAvailable(PagingStoreImpl.java:741)

      org.hornetq.core.server.impl.ServerSessionImpl.requestProducerCredits(ServerSessionImpl.java:1147)

      org.hornetq.core.protocol.core.ServerSessionPacketHandler.handlePacket(ServerSessionPacketHandler.java:473)

      org.hornetq.core.protocol.core.impl.ChannelImpl.handlePacket(ChannelImpl.java:474)

      org.hornetq.core.protocol.core.impl.RemotingConnectionImpl.doBufferReceived(RemotingConnectionImpl.java:496)

      org.hornetq.core.protocol.core.impl.RemotingConnectionImpl.bufferReceived(RemotingConnectionImpl.java:457)

      org.hornetq.core.remoting.server.impl.RemotingServiceImpl$DelegatingBufferHandler.bufferReceived(RemotingServiceImpl.java:459)

      org.hornetq.core.remoting.impl.netty.HornetQChannelHandler.messageReceived(HornetQChannelHandler.java:73)

      org.jboss.netty.channel.SimpleChannelHandler.handleUpstream(SimpleChannelHandler.java:100)

      org.jboss.netty.channel.StaticChannelPipeline.sendUpstream(StaticChannelPipeline.java:362)

      org.jboss.netty.channel.StaticChannelPipeline$StaticChannelHandlerContext.sendUpstream(StaticChannelPipeline.java:514)

      org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:287)

      org.hornetq.core.remoting.impl.netty.HornetQFrameDecoder2.decode(HornetQFrameDecoder2.java:169)

      org.hornetq.core.remoting.impl.netty.HornetQFrameDecoder2.messageReceived(HornetQFrameDecoder2.java:134)

      org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:80)

      org.jboss.netty.channel.StaticChannelPipeline.sendUpstream(StaticChannelPipeline.java:362)

      org.jboss.netty.channel.StaticChannelPipeline.sendUpstream(StaticChannelPipeline.java:357)

      org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:274)

      org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:261)

      org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:350)

      org.jboss.netty.channel.socket.nio.NioWorker.processSelectedKeys(NioWorker.java:281)

      org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:201)

      org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)

      org.jboss.netty.util.internal.IoWorkerRunnable.run(IoWorkerRunnable.java:46)

      org.jboss.netty.util.VirtualExecutorService$ChildExecutorRunnable.run(VirtualExecutorService.java:181)

      java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)

      java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)

      java.lang.Thread.run(Thread.java:636)

       

      Thread-5 (group:HornetQ-server-threads12254719-18227730) [BLOCKED; waiting to lock java.lang.Object@c7d960]

      org.hornetq.core.protocol.core.impl.ChannelImpl.send(ChannelImpl.java:161)

      org.hornetq.core.protocol.core.impl.ChannelImpl.sendBatched(ChannelImpl.java:147)

      org.hornetq.core.protocol.core.impl.CoreSessionCallback.sendMessage(CoreSessionCallback.java:76)

      org.hornetq.core.server.impl.ServerConsumerImpl.deliverStandardMessage(ServerConsumerImpl.java:704)

      org.hornetq.core.server.impl.ServerConsumerImpl.handle(ServerConsumerImpl.java:291)

      org.hornetq.core.server.impl.QueueImpl.handle(QueueImpl.java:2017)

      org.hornetq.core.server.impl.QueueImpl.deliver(QueueImpl.java:1587)

      org.hornetq.core.server.impl.QueueImpl.doPoll(QueueImpl.java:1472)

      org.hornetq.core.server.impl.QueueImpl.access$1100(QueueImpl.java:72)

      org.hornetq.core.server.impl.QueueImpl$ConcurrentPoller.run(QueueImpl.java:2299)

      org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run(OrderedExecutorFactory.java:100)

      java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)

      java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)

      java.lang.Thread.run(Thread.java:636)

       

       

      It seems that both stack traces pass through ServerConsumerImpl and ChannelImpl, resulting in "lock" and "sendLock" being acquired in different orders, which causes the deadlock. I'm not sure what can be done to solve this, but I suspect HornetQChannelHandler.channelInterestChanged may have to defer execution.
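 

      I'm not certain what the right fix is, but purely as an illustration of what "deferring execution" could look like, here is a minimal sketch of a Netty 3 handler that hands the interest-changed callback to an executor instead of running it on the I/O worker thread. The class, field and callback names are my own placeholders, not the actual HornetQChannelHandler code.

       

      import java.util.concurrent.Executor;

      import org.jboss.netty.channel.ChannelHandlerContext;
      import org.jboss.netty.channel.ChannelStateEvent;
      import org.jboss.netty.channel.SimpleChannelHandler;

      // Hypothetical sketch only -- not the actual HornetQChannelHandler.
      public class DeferredInterestHandler extends SimpleChannelHandler
      {
         private final Executor executor;      // assumed: an ordered executor owned by the acceptor
         private final Runnable readyCallback; // assumed: wraps connectionReadyForWrites(...)

         public DeferredInterestHandler(final Executor executor, final Runnable readyCallback)
         {
            this.executor = executor;
            this.readyCallback = readyCallback;
         }

         @Override
         public void channelInterestChanged(final ChannelHandlerContext ctx, final ChannelStateEvent e) throws Exception
         {
            // Defer the HornetQ callback instead of invoking it synchronously on the Netty
            // worker thread, so it cannot re-enter HornetQ locks from inside the write path.
            executor.execute(readyCallback);
            super.channelInterestChanged(ctx, e);
         }
      }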

       

      Furthermore, the reaper thread is blocked at

       

      org.hornetq.core.server.impl.QueueImpl.expireReferences()

      org.hornetq.core.postoffice.impl.PostOfficeImpl$Reaper.run()

      java.lang.Thread.run()

       

      and the failure check thread is blocked at

       

      org.hornetq.core.protocol.core.impl.RemotingConnectionImpl.flush()

      org.hornetq.core.remoting.server.impl.RemotingServiceImpl$FailureCheckAndFlushThread.run()

       

      I'm not sure whether these are separate problems, but none of this happens with use-nio=false.

       

      Regards

      Carl

        • 1. Re: Deadlock when using netty NIO acceptor
          carl.heymann

          HornetQ version: 2.2.5.Final

           

          The test has a producer sending 3 KiB BytesMessages to queueA via connection 1. 10 consumers pull from queueA via connection 2 and send again to queueB. 10 more consumers consume from queueB via connection 3. Everything is done via the JMS API, in JMS transactions. I can post the full code if necessary.
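 

          In the meantime, the setup looks roughly like the sketch below. This is illustrative only: queue names, loop counts and the way the ConnectionFactory and queues are obtained are assumptions, not the actual test code.

           

          import javax.jms.BytesMessage;
          import javax.jms.Connection;
          import javax.jms.ConnectionFactory;
          import javax.jms.Message;
          import javax.jms.MessageConsumer;
          import javax.jms.MessageProducer;
          import javax.jms.Queue;
          import javax.jms.Session;

          // Sketch of the scenario: 1 producer -> queueA, 10 relays queueA -> queueB, 10 consumers on queueB.
          public class RelayLoadSketch
          {
             public static void run(ConnectionFactory cf, Queue queueA, Queue queueB) throws Exception
             {
                Connection conn1 = cf.createConnection(); // producer connection
                Connection conn2 = cf.createConnection(); // 10 relay consumers, queueA -> queueB
                Connection conn3 = cf.createConnection(); // 10 consumers draining queueB
                conn1.start();
                conn2.start();
                conn3.start();

                for (int i = 0; i < 10; i++)
                {
                   // Relay: receive from queueA and re-send to queueB in one JMS transaction.
                   final Session relaySession = conn2.createSession(true, Session.SESSION_TRANSACTED);
                   final MessageConsumer fromA = relaySession.createConsumer(queueA);
                   final MessageProducer toB = relaySession.createProducer(queueB);
                   new Thread(new Runnable()
                   {
                      public void run()
                      {
                         try
                         {
                            while (true)
                            {
                               Message m = fromA.receive(1000);
                               if (m != null)
                               {
                                  toB.send(m);
                                  relaySession.commit();
                               }
                            }
                         }
                         catch (Exception ignored)
                         {
                         }
                      }
                   }).start();

                   // Final consumer: drain queueB transactionally.
                   final Session drainSession = conn3.createSession(true, Session.SESSION_TRANSACTED);
                   final MessageConsumer fromB = drainSession.createConsumer(queueB);
                   new Thread(new Runnable()
                   {
                      public void run()
                      {
                         try
                         {
                            while (true)
                            {
                               if (fromB.receive(1000) != null)
                               {
                                  drainSession.commit();
                               }
                            }
                         }
                         catch (Exception ignored)
                         {
                         }
                      }
                   }).start();
                }

                // Producer: transacted session sending 3 KiB BytesMessages to queueA.
                Session prodSession = conn1.createSession(true, Session.SESSION_TRANSACTED);
                MessageProducer producer = prodSession.createProducer(queueA);
                byte[] payload = new byte[3 * 1024];
                for (int i = 0; i < 9000; i++)
                {
                   BytesMessage msg = prodSession.createBytesMessage();
                   msg.writeBytes(payload);
                   producer.send(msg);
                   prodSession.commit();
                }
             }
          }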

          • 2. Re: Deadlock when using netty NIO acceptor
            clebert.suconic

            We had some issues in the past when doing high-performance tests. The issue was a network-related bug in the kernel (as described in the bugzilla).

             

            Try updating your kernel. If it is still happening, we will need more information.

            • 3. Re: Deadlock when using netty NIO acceptor
              carl.heymann

              Are you referring to https://issues.jboss.org/browse/HORNETQ-435? This is definitely a different issue. I have batch-delay on, and direct-deliver is at its default (true).

               

              It only happens when using NIO acceptors, and I can detect the Java-level deadlock using YourKit, leading to the results I posted above. I highlighted the lines in the stack traces where the locks are acquired. It looks like a plain old lock-ordering issue.

               

              Netty triggers "Channels.fireChannelInterestChanged" while "ChannelImpl.send" is executing. I would have expected Netty events to be fired asynchronously, but in this case they are not.

               

              I just ran the test again, and got a slight variation of the same problem, involving the NettyConnection writeLock and again the ServerConsumerImpl lock.

               

              Thread-4 (group:HornetQ-server-threads27921979-22629000) [RUNNABLE] CPU time: 2:55

              java.util.concurrent.atomic.AtomicBoolean.compareAndSet(boolean, boolean)

              org.hornetq.core.remoting.impl.netty.NettyConnection.write(HornetQBuffer, boolean, boolean)

              org.hornetq.core.protocol.core.impl.ChannelImpl.send(Packet, boolean, boolean)

              org.hornetq.core.protocol.core.impl.ChannelImpl.sendBatched(Packet)

              org.hornetq.core.protocol.core.impl.CoreSessionCallback.sendMessage(ServerMessage, long, int)

              org.hornetq.core.server.impl.ServerConsumerImpl.deliverStandardMessage(MessageReference, ServerMessage)

              org.hornetq.core.server.impl.ServerConsumerImpl.handle(MessageReference)

              org.hornetq.core.server.impl.QueueImpl.handle(MessageReference, Consumer)

              org.hornetq.core.server.impl.QueueImpl.deliver()

              org.hornetq.core.server.impl.QueueImpl.doPoll()

              org.hornetq.core.server.impl.QueueImpl.access$1100(QueueImpl)

              org.hornetq.core.server.impl.QueueImpl$ConcurrentPoller.run()

              org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run()

              java.lang.Thread.run()

               

              New I/O server worker #1-2 [BLOCKED] CPU time: 0:08

              org.hornetq.core.server.impl.ServerConsumerImpl.promptDelivery()

              org.hornetq.core.server.impl.ServerConsumerImpl.readyForWriting(boolean)

              org.hornetq.core.remoting.impl.netty.NettyConnection.fireReady(boolean)

              org.hornetq.core.remoting.impl.netty.NettyAcceptor$Listener.connectionReadyForWrites(Object, boolean)

              org.hornetq.core.remoting.impl.netty.HornetQChannelHandler.channelInterestChanged(ChannelHandlerContext, ChannelStateEvent)

              org.jboss.netty.channel.AbstractChannel.write(Object)

              org.hornetq.core.remoting.impl.netty.NettyConnection.write(HornetQBuffer, boolean, boolean)

              org.hornetq.core.protocol.core.impl.ChannelImpl.send(Packet, boolean, boolean)

              org.hornetq.core.protocol.core.impl.ChannelImpl.send(Packet)

              org.hornetq.core.protocol.core.impl.CoreSessionCallback.sendProducerCreditsMessage(int, SimpleString)

              org.hornetq.core.server.impl.ServerSessionImpl$2.run()

              org.hornetq.core.paging.impl.PagingStoreImpl.executeRunnableWhenMemoryAvailable(Runnable)

              org.hornetq.core.server.impl.ServerSessionImpl.requestProducerCredits(SimpleString, int)

              org.hornetq.core.protocol.core.ServerSessionPacketHandler.handlePacket(Packet)

              org.hornetq.core.protocol.core.impl.ChannelImpl.handlePacket(Packet)

              org.hornetq.core.protocol.core.impl.RemotingConnectionImpl.doBufferReceived(Packet)

              org.hornetq.core.protocol.core.impl.RemotingConnectionImpl.bufferReceived(Object, HornetQBuffer)

              org.hornetq.core.remoting.server.impl.RemotingServiceImpl$DelegatingBufferHandler.bufferReceived(Object, HornetQBuffer)

              org.hornetq.core.remoting.impl.netty.HornetQChannelHandler.messageReceived(ChannelHandlerContext, MessageEvent)

              org.jboss.netty.channel.Channels.fireMessageReceived(ChannelHandlerContext, Object)

              org.hornetq.core.remoting.impl.netty.HornetQFrameDecoder2.decode(ChannelHandlerContext, ChannelBuffer)

              org.hornetq.core.remoting.impl.netty.HornetQFrameDecoder2.messageReceived(ChannelHandlerContext, MessageEvent)

              java.lang.Thread.run()

              • 4. Re: Deadlock when using netty NIO acceptor
                carl.heymann

                From the code comments, the lock in ServerConsumerImpl.promptDelivery() is there to protect largeMessageDeliverer. How about using a different lock for this purpose? It seems to me that this would solve the deadlock.

                • 5. Re: Deadlock when using netty NIO acceptor
                  carl.heymann

                  I added an additional lock for largeMessageDeliverer. This fixes the deadlock.

                   

                  I obviously don't know if this introduces other problems, or if the lock in ServerConsumerImpl.promptDelivery() was protecting more than largeMessageDeliverer.

                   

                  I attach a patch file with the change, applicable to http://anonsvn.jboss.org/repos/hornetq/branches/Branch_2_2_EAP revision 10964.

                  • 6. Re: Deadlock when using netty NIO acceptor
                    clebert.suconic

                    I see no indication of large messages in any previous thread dumps you have posted. How come this fixed your issue?

                    • 7. Re: Deadlock when using netty NIO acceptor
                      carl.heymann

                      The deadlock doesn't happen because of large messages. It happens because two different locks are acquired in different orders by two separate threads.

                       

                      From the first pair of YourKit deadlock stack traces I posted:

                      • Thread A
                        • passes through ServerConsumerImpl.handle(..), acquiring the "lock" lock object.
                        • Just a few calls later, it also calls ChannelImpl.send(..), where it tries to acquire the "sendLock" lock object (in a synchronized block).
                      • Thread B (running concurrently)
                        • invokes CoreSessionCallback.sendProducerCreditsMessage(..)
                        • which leads to a call to ChannelImpl.send(..), where it acquires the "sendLock".
                        • This leads to a Netty channel write request, which in this case is processed immediately, so the request passes downstream through the Netty pipeline (synchronously!).
                        • After writing, a netty event (channelInterestChanged) is sent back upstream, and handled by HornetQChannelHandler.
                        • This leads to a call to ServerConsumerImpl.promptDelivery(), which tries to synchronize on the "lock" object, which is already being held by thread A.
                        • Thread A, in turn, is still waiting on "sendLock", already held by thread B, leading to the deadlock. (A minimal illustration of this lock ordering follows this list.)
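 

                      In miniature, the same ordering problem looks like the plain-Java sketch below (nothing HornetQ-specific; "consumerLock" and "sendLock" just mirror the roles of ServerConsumerImpl's "lock" and ChannelImpl's "sendLock"):

                       

                      // Minimal, self-contained illustration of the lock-ordering deadlock described above.
                      public class LockOrderingDeadlock
                      {
                         private static final Object consumerLock = new Object(); // plays the role of ServerConsumerImpl's "lock"
                         private static final Object sendLock = new Object();     // plays the role of ChannelImpl's "sendLock"

                         public static void main(String[] args)
                         {
                            // Thread A: delivery path -- takes consumerLock first, then needs sendLock.
                            Thread a = new Thread(new Runnable()
                            {
                               public void run()
                               {
                                  synchronized (consumerLock)
                                  {
                                     pause(100); // give thread B time to grab sendLock
                                     synchronized (sendLock)
                                     {
                                        System.out.println("A delivered");
                                     }
                                  }
                               }
                            });

                            // Thread B: credit-request path -- takes sendLock first, then the synchronous
                            // interest-changed event tries to take consumerLock.
                            Thread b = new Thread(new Runnable()
                            {
                               public void run()
                               {
                                  synchronized (sendLock)
                                  {
                                     pause(100); // give thread A time to grab consumerLock
                                     synchronized (consumerLock)
                                     {
                                        System.out.println("B prompted delivery");
                                     }
                                  }
                               }
                            });

                            a.start();
                            b.start(); // with the pauses in place, this deadlocks on essentially every run
                         }

                         private static void pause(long ms)
                         {
                            try
                            {
                               Thread.sleep(ms);
                            }
                            catch (InterruptedException ignored)
                            {
                            }
                         }
                      }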

                       

                      My train of thought was then:

                      • In promptDelivery(), there is a synchronized block on the "lock" object. If this lock could be eliminated, then thread B would acquire only one of the two locks, and the deadlock is avoided.
                      • In promptDelivery(), the code comments suggest that the reason for the synchronized block is to protect the state integrity of largeMessageDeliverer.
                      • So, I thought that it might just work to use a different lock object, dedicated to protecting largeMessageDeliverer. This way, promptDelivery() doesn't have to synchronize on the "lock" object, so there's no deadlock (roughly as sketched after this list).
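 

                      The attached patch has the actual change; the sketch below only shows the shape of the idea, with field names that are my own guesses rather than the real ServerConsumerImpl code:

                       

                      // Illustrative shape of the change only -- not the attached patch.
                      public class ConsumerSketch
                      {
                         private final Object lock = new Object();             // main consumer lock, unchanged
                         private final Object largeMessageLock = new Object(); // new, dedicated guard (hypothetical name)

                         private Object largeMessageDeliverer;                 // stand-in for the real deliverer field

                         public void promptDelivery()
                         {
                            // Only largeMessageDeliverer is guarded here, so the Netty worker thread
                            // no longer blocks on "lock" while another thread is holding sendLock.
                            synchronized (largeMessageLock)
                            {
                               if (largeMessageDeliverer != null)
                               {
                                  // resume large-message delivery...
                               }
                            }
                            // ...then schedule normal delivery without taking "lock" on this path.
                         }
                      }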

                       

                      I didn't test the effects on large messages, because I don't know how to do that properly. I ran the build script and swapped out hornetq-core.jar in our test system's hornetq/lib directory. Where HornetQ was locking up within 20 seconds or so with use-nio enabled before, it now runs smoothly in the same configuration. It didn't lock up at all today.

                      • 8. Re: Deadlock when using netty NIO acceptor
                        zohar_melamed

                        Hi

                         

                        We get the same deadlock problem, and it does indeed go away when we switch off NIO:

                         

                         

                                    Map<String, Object> connectionParams = new HashMap<String, Object>();

                                    connectionParams.put(TransportConstants.PORT_PROP_NAME, messagingServiceConfig.getMessagingPort());

                                    connectionParams.put(TransportConstants.HOST_PROP_NAME, messagingServiceConfig.getHost().value);

                                    connectionParams.put(TransportConstants.USE_NIO_PROP_NAME, false);

                                    TransportConfiguration transpConf = new TransportConfiguration(NettyAcceptorFactory.class.getName(), connectionParams);

                        • 9. Re: Deadlock when using netty NIO acceptor
                          clebert.suconic

                          Can you provide a test to replicate it?

                          • 10. Re: Deadlock when using netty NIO acceptor
                            zohar_melamed

                            Sadly, this happens in our UAT environment with a large number of consumers (250+), so I don't have a test at this point. We have some additional issues at the moment, even after switching NIO off, with the broker and one of the processing boxes running at 100% CPU but the system not processing any messages. We will need to investigate further.

                            • 11. Re: Deadlock when using netty NIO acceptor
                              carl.heymann

                              This is a test I hacked together. It extends ServiceTestBase, like many of your own tests, so you can probably get it running relatively quickly. It may be necessary to set the max server thread pool size to 2 before the test will run:

                                   server.getConfiguration().setThreadPoolMaxSize(2);

                              (in the attached file it is 1, which doesn't seem to work on all machines).

                               

                              If you don't see it immediately, run it a few times; it's not guaranteed to create a deadlock, but it happens about 4 out of 5 times on my laptop. I verified that the deadlock is the one I posted above by attaching YourKit to the running test case when it deadlocks.

                               

                              Unfortunately, deadlock problems are not easy to test for; they're better addressed with code review.

                               

                              The deadlock manifests as the transaction counter (printed each second) getting stuck, e.g.:

                              Not done yet.. 59; 139

                              Not done yet.. 58; 227

                              Not done yet.. 57; 424

                              Not done yet.. 56; 609

                              ...

                              Not done yet.. 34; 7451

                              Not done yet.. 33; 7861

                              Not done yet.. 32; 8298

                              Not done yet.. 31; 8405

                              Not done yet.. 30; 8405

                              Not done yet.. 29; 8405

                              Not done yet.. 28; 8405

                              Not done yet.. 27; 8405

                              ...

                              Not done yet.. 2; 8405

                              Not done yet.. 1; 8405

                              Not done yet.. 0; 8405

                              Done..8405, expected 9000


                              If you apply the patch I posted above, the deadlock never happens.

                              • 12. Re: Deadlock when using netty NIO acceptor
                                carl.heymann

                                What is the next step? Should I create a JIRA?

                                • 13. Re: Deadlock when using netty NIO acceptor
                                  clebert.suconic

                                  Yes please, open a JIRA and I will take a look.

                                  • 14. Re: Deadlock when using netty NIO acceptor
                                    postmaxin

                                    I have run into this deadlock issue as well, but only during stress testing, not in production. I have found two things that greatly reduce the problem (probably because they each reduce contention), but I still haven't found something that eliminates it. There's definitely a deadlock bug to be fixed in there.

                                     

                                       1. Don't allow the hundreds of producers to run in tight loops, even over the network. A few Thread.sleep(1)'s here and there work wonders.

                                     

                                      2. Make sure you have a confirmationWindowSize set. The problem seems to be exacerbated if you reconnect rather than reattach. (See the sketch after this list.)
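 

                                    For what it's worth, both tweaks together look roughly like the sketch below, using the HornetQ core client API. The connector, address, window size and loop count are placeholder values, not a recommended configuration.

                                     

                                    import org.hornetq.api.core.TransportConfiguration;
                                    import org.hornetq.api.core.client.ClientMessage;
                                    import org.hornetq.api.core.client.ClientProducer;
                                    import org.hornetq.api.core.client.ClientSession;
                                    import org.hornetq.api.core.client.ClientSessionFactory;
                                    import org.hornetq.api.core.client.HornetQClient;
                                    import org.hornetq.api.core.client.ServerLocator;
                                    import org.hornetq.core.remoting.impl.netty.NettyConnectorFactory;

                                    // Sketch of a throttled producer with a confirmation window configured.
                                    public class ThrottledProducerSketch
                                    {
                                       public static void main(String[] args) throws Exception
                                       {
                                          ServerLocator locator = HornetQClient.createServerLocatorWithoutHA(
                                             new TransportConfiguration(NettyConnectorFactory.class.getName()));

                                          // Tip 2: configure a confirmation window so sessions can reattach
                                          // rather than reconnect. 1 MiB is an illustrative value.
                                          locator.setConfirmationWindowSize(1024 * 1024);

                                          ClientSessionFactory factory = locator.createSessionFactory();
                                          ClientSession session = factory.createSession(false, false, false); // transacted
                                          ClientProducer producer = session.createProducer("jms.queue.testQueue"); // assumed address

                                          byte[] payload = new byte[3 * 1024];
                                          for (int i = 0; i < 10000; i++)
                                          {
                                             ClientMessage message = session.createMessage(true); // durable
                                             message.getBodyBuffer().writeBytes(payload);
                                             producer.send(message);
                                             session.commit();

                                             Thread.sleep(1); // tip 1: don't hammer the server in a tight loop
                                          }

                                          session.close();
                                          factory.close();
                                          locator.close();
                                       }
                                    }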

                                     

                                    Also, the problem appears to be worse when there are a lot (hundreds of gigabytes' worth) of messages paged/journalled. In my stress-testing simulations, I have stripped away HornetQ's producer rate-limiting, and I always simulate at least 50-80 queues being produced to, with only 45-75 of them being consumed. This exercises the server pretty well. When I'm producing and consuming at a fairly even rate, a single server processes roughly 4,000+ messages per second with full persistence and journalling. When I start consuming a queue that has accumulated hundreds of millions of messages, the rate of production immediately drops, but the rate of consumption jumps to 10,000+, which is to be expected, I guess, because consuming involves fewer disk writes than producing.

                                     

                                    One other thing that makes deadlock far more likely is iterating over the JMX counters while embedded (using java.lang.management.ManagementFactory.getPlatformMBeanServer() and recursively iterating over everything in queryMBeans(null, null)). I narrowed it down to something possibly involving AtomicIntegers and synchronized methods in QueueImpl.java, but when I put a Thread.sleep(1) between each JMX value fetch, my problems went away, presumably because other threads were given a chance to release their locks.
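 

                                    Simplified, the polling pattern is sketched below (the MBean walk itself is generic JMX; the sleep between fetches is the part that mattered):

                                     

                                    import java.lang.management.ManagementFactory;

                                    import javax.management.MBeanAttributeInfo;
                                    import javax.management.MBeanServer;
                                    import javax.management.ObjectInstance;

                                    // Walk every registered MBean attribute, yielding briefly between reads.
                                    public class JmxPollSketch
                                    {
                                       public static void pollAll() throws Exception
                                       {
                                          MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();

                                          for (ObjectInstance instance : mbs.queryMBeans(null, null))
                                          {
                                             for (MBeanAttributeInfo attr : mbs.getMBeanInfo(instance.getObjectName()).getAttributes())
                                             {
                                                try
                                                {
                                                   mbs.getAttribute(instance.getObjectName(), attr.getName());
                                                }
                                                catch (Exception ignored)
                                                {
                                                   // some attributes aren't readable; skip them
                                                }
                                                Thread.sleep(1); // give other threads a chance to release their locks
                                             }
                                          }
                                       }
                                    }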

                                     

                                    Yes, I know I'm not presenting a use case here... I'm trying to beat HornetQ up as hard as I can and see what, beyond hardware failure, the worst-case scenario is. Unfortunately, it's daemons locking up and having to be 'kill -9'ed, and the occasional, extremely rare database corruption. So I'm curious how replication is going to affect these numbers and reliability once 2.3 comes around.
