8 Replies Latest reply on May 15, 2009 4:17 AM by trustin

    Potential Netty problem with synchronously closing connections

    timfox

      We have a JBM test which quickly creates and closes connections in a loop which eventually fails with:

      New I/O server boss #1 (channelId: 28338721, /127.0.0.1:5445) 11:41:29,851 WARN [NioServerSocketPipelineSink] Failed to accept a connection.
      java.io.IOException: Too many open files
       at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
       at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:145)
       at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink$Boss.run(NioServerSocketPipelineSink.java:205)
       at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
       at java.lang.Thread.run(Thread.java:595)
      


      Creating a JBM JMS connection and closing it involves creating a Netty bootstrap, creating a channel, closing the channel, and closing the bootstrap.

      This is done in quick succession.

      Debugging this a bit, it seems that the Netty channel-disconnected event can arrive at the server significantly later than when the channel is closed on the client side.

      So, over time, the rate at which channels are created exceeds the rate at which channel-disconnected events arrive at the server, and the server soon runs out of file handles.

      I would have thought disconnection would be synchronous; otherwise a client could easily mount a DoS attack.
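
      For reference, the client side boils down to a loop like this rough sketch (Netty 3 client API; the port is taken from the trace above, the loop count is arbitrary, and the real JBM wiring is omitted):

      import java.net.InetSocketAddress;
      import java.util.concurrent.Executors;

      import org.jboss.netty.bootstrap.ClientBootstrap;
      import org.jboss.netty.channel.Channel;
      import org.jboss.netty.channel.ChannelFactory;
      import org.jboss.netty.channel.SimpleChannelUpstreamHandler;
      import org.jboss.netty.channel.socket.nio.NioClientSocketChannelFactory;

      public class ConnectCloseLoop {

          public static void main(String[] args) throws Exception {
              ChannelFactory cf = new NioClientSocketChannelFactory(
                      Executors.newCachedThreadPool(),
                      Executors.newCachedThreadPool());
              ClientBootstrap bootstrap = new ClientBootstrap(cf);
              bootstrap.getPipeline().addLast("handler", new SimpleChannelUpstreamHandler());

              for (int i = 0; i < 10000; i ++) {
                  // Connect and wait until the connect attempt completes.
                  Channel channel = bootstrap.connect(new InetSocketAddress("localhost", 5445))
                          .awaitUninterruptibly().getChannel();

                  // close() completes as soon as the local side is closed; the
                  // server may notice the disconnect significantly later.
                  channel.close().awaitUninterruptibly();
              }

              bootstrap.releaseExternalResources();
          }
      }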

      Any ideas?

        • 1. Re: Potential Netty problem with synchronously closing conne
          timfox

          As a note, if I add a small sleep (100 ms) between each connection creation and close, the test runs ad infinitum with no problems.

          The test is org.jboss.test.messaging.jms.ConnectionTest::testConnectionListenerBug

          • 2. Re: Potential Netty problem with synchronously closing conne
            trustin

             I ran the mentioned test with an increased loop count (20000) and could not reproduce the problem. I'm running it Integer.MAX_VALUE times now. How long did it take for you to run out of file handles? Which OS are you on?

            • 3. Re: Potential Netty problem with synchronously closing conne
              trustin

              Interestingly, the test hangs at around 59000 with the following message:

              May 11, 2009 5:12:18 PM org.jboss.messaging.core.logging.Logger warn
              WARNING: Connection failure has been detected Did not receive ping on connection. It is likely a client has exited or crashed without closing its connection, or the network between the server and client has failed. The connection will now be closed.:3


               I'm not sure whether this is a Netty problem or not, but I still can't get a 'too many open files' error.

              • 4. Re: Potential Netty problem with synchronously closing conne
                trustin

                 I was finally able to reproduce the issue using the following simple test code:

                 import java.net.InetAddress;
                 import java.net.InetSocketAddress;
                 import java.net.Socket;
                 import java.util.concurrent.Executors;
                 import java.util.concurrent.atomic.AtomicInteger;

                 import org.jboss.netty.bootstrap.ServerBootstrap;
                 import org.jboss.netty.channel.ChannelFactory;
                 import org.jboss.netty.channel.ChannelHandlerContext;
                 import org.jboss.netty.channel.ChannelPipelineCoverage;
                 import org.jboss.netty.channel.ChannelStateEvent;
                 import org.jboss.netty.channel.ExceptionEvent;
                 import org.jboss.netty.channel.MessageEvent;
                 import org.jboss.netty.channel.SimpleChannelUpstreamHandler;
                 import org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory;

                 public class Test {

                     public static void main(String[] args) throws Throwable {
                         // Start a server with a single I/O worker thread.
                         ChannelFactory sf = new NioServerSocketChannelFactory(
                                 Executors.newCachedThreadPool(),
                                 Executors.newCachedThreadPool(), 1);
                         ServerBootstrap sb = new ServerBootstrap(sf);
                         sb.getPipeline().addLast("handler", new ServerHandler());
                         sb.setOption("backlog", 1024);
                         sb.bind(new InetSocketAddress(8080));

                         // Client: connect and close immediately, as fast as possible.
                         InetAddress dstAddr = InetAddress.getByName("localhost");
                         long startTime = System.currentTimeMillis();
                         for (int i = 0; i < 1048576; i ++) {
                             Socket s = new Socket(dstAddr, 8080);
                             s.close();
                             if ((i + 1) % 1000 == 0) {
                                 long endTime = System.currentTimeMillis();
                                 System.err.println(i + 1 + ": TOOK " + (endTime - startTime) + " MS");
                                 startTime = endTime;
                             }
                         }

                         sb.releaseExternalResources();
                     }

                     @ChannelPipelineCoverage("all")
                     static class ServerHandler extends SimpleChannelUpstreamHandler {

                         private static final int THRESHOLD = 4;
                         private final AtomicInteger cnt = new AtomicInteger();
                         private final AtomicInteger lateCnt = new AtomicInteger();

                         @Override
                         public void channelOpen(ChannelHandlerContext ctx, ChannelStateEvent e)
                                 throws Exception {
                             // The client has only one connection open at a time, so any
                             // connection above THRESHOLD simultaneously-open connections is a
                             // 'late closure': the client already closed it, but the server has
                             // not processed the close yet.
                             if (cnt.incrementAndGet() > THRESHOLD) {
                                 lateCnt.incrementAndGet();
                             }
                         }

                         @Override
                         public void messageReceived(ChannelHandlerContext ctx, MessageEvent e) {
                             // Wait until the client closes the connection.
                         }

                         @Override
                         public void channelClosed(ChannelHandlerContext ctx, ChannelStateEvent e)
                                 throws Exception {
                             if (cnt.decrementAndGet() == THRESHOLD) {
                                 System.err.println("LATE CLOSURES: " + lateCnt.getAndSet(0));
                             }
                         }

                         @Override
                         public void exceptionCaught(ChannelHandlerContext ctx, ExceptionEvent e) {
                             System.err.println(e);
                             e.getCause().printStackTrace();
                             e.getChannel().close();
                         }
                     }
                 }


                Sometimes, 'LATE CLOSURES' goes up to around 2000, which can lead to 'too many open files' for a normal user. I set the ulimit value of my local account to 10240, so I was not getting the error.

                Let me try to optimize Netty so that it doesn't go up too much, but there's inevitable indeterminism because it's non-blocking I/O.

                • 5. Re: Potential Netty problem with synchronously closing conne
                  trustin


                  "trustin" wrote:
                  Let me try to optimize Netty so that it doesn't go up too much, but there's inevitable indeterminism because it's non-blocking I/O.


                   ... which means the ulimit value for the normal user in the production environment must be increased anyway. It's simple to bump it up: http://gleamynode.net/articles/1557/

                  • 6. Re: Potential Netty problem with synchronously closing conne
                    trustin

                     I have made a proof-of-concept modification to Netty that suspends ServerSocket.accept() when the number of connections reaches a certain threshold, and according to Tim it seems to make ConnectionTest.testManyConnections() pass.

                     I'm currently designing a more generic interface that allows you to:

                    * suspend / resume ServerSocket.accept()
                    * close the accepted connection immediately
                    * close the accepted connection immediately with a good bye (e.g. 'server busy', 'go away') message

                     The point of this modification is to prune malicious connections as early as possible, so that the server does not run out of file descriptors, and to temporarily suspend ServerSocket.accept() if necessary. Therefore, time-consuming tasks cannot be executed in this context.
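
                     To make the idea concrete, here is a rough sketch of what such a hook might look like. This is purely illustrative: the interface and method names are hypothetical, not actual Netty API.

                     import org.jboss.netty.channel.Channel;

                     // Hypothetical callback invoked from the boss thread for each accepted
                     // connection; the names below are illustrative only, not real Netty API.
                     public interface AcceptPolicy {

                         enum Action { ACCEPT, SUSPEND_ACCEPT, CLOSE, CLOSE_WITH_GOODBYE }

                         // Must return quickly: it runs in the accept loop, so a time-consuming
                         // task here would stall all incoming connections.
                         Action connectionAccepted(Channel acceptedChannel);

                         // The 'server busy' / 'go away' message written before closing when
                         // CLOSE_WITH_GOODBYE is returned.
                         Object goodbyeMessage(Channel acceptedChannel);
                     }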

                    Please let me know if you like it or not. If not, let me know what features are missing.

                    Thanks in advance!

                    • 8. Re: Potential Netty problem with synchronously closing conne
                      trustin

                       I've just modified Netty so that calling Channel.close() in the channelConnected handler method closes the connection immediately, without causing the 'too many open files' error, and I've started to wonder whether you really want to suspend accepting incoming connections instead of closing them quickly.

                       Even if you suspend the accept operation, incoming connections are still accepted by the operating system and held in the server socket's backlog. If the backlog is full, further connection attempts from clients will fail. When you resume the accept operation, the oldest connections will be served, while the latest ones will already have failed. I'm not sure that is the expected behavior. I would rather close the connection so that the client can retry later and all connections eventually get served, instead of timing out because the backlog is full. WDYT?
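
                       For illustration, the "close quickly" approach boils down to a handler along these lines (a minimal sketch; the connection limit and class name are made up, and a real version would probably also write a 'go away' message before closing):

                       import java.util.concurrent.atomic.AtomicInteger;

                       import org.jboss.netty.channel.ChannelHandlerContext;
                       import org.jboss.netty.channel.ChannelPipelineCoverage;
                       import org.jboss.netty.channel.ChannelStateEvent;
                       import org.jboss.netty.channel.SimpleChannelUpstreamHandler;

                       // Shared handler that rejects connections beyond a limit as soon as they
                       // are connected, so the file descriptor is released immediately and the
                       // client can retry, instead of piling up in the server socket backlog.
                       @ChannelPipelineCoverage("all")
                       public class ConnectionLimitHandler extends SimpleChannelUpstreamHandler {

                           private static final int MAX_CONNECTIONS = 1000; // illustrative limit
                           private final AtomicInteger connections = new AtomicInteger();

                           @Override
                           public void channelConnected(ChannelHandlerContext ctx, ChannelStateEvent e)
                                   throws Exception {
                               if (connections.incrementAndGet() > MAX_CONNECTIONS) {
                                   e.getChannel().close();
                               } else {
                                   ctx.sendUpstream(e);
                               }
                           }

                           @Override
                           public void channelClosed(ChannelHandlerContext ctx, ChannelStateEvent e)
                                   throws Exception {
                               connections.decrementAndGet();
                               ctx.sendUpstream(e);
                           }
                       }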