9 Replies Latest reply on Aug 9, 2012 12:04 PM by clebert.suconic

    Hornetq stops responding

    bhuvangupta

      We are using JMS at the backend for a mobile application that currently serves about 5,000 concurrent connections.

      We have a pool of 100 queues; all of these apps connect via message selectors to pull their messages.

       

      Since mobile networks are unreliable, the apps disconnect and reconnect quite often. Whenever an app client disconnects, we close the corresponding

      JMS QueueReceiver to stop processing messages for that client.

       

      Because sessions and consumers are created and closed so frequently during these reconnections, the JMS server stops responding every few hours.

      When this happens we see the following exceptions in the logs:

       

       

       

      * [Old I/O server worker (parentId: 207182797, [id: 0x0c595bcd, /10.142.211.140:5445])] 7-Aug 15:58:19,28 WARNING [QueueImpl]  Timed out on waiting for MessageCount

      java.lang.IllegalStateException: Timed out on waiting for MessageCount

              at org.hornetq.core.server.impl.QueueImpl.getMessageCount(QueueImpl.java:828)

              at org.hornetq.core.server.impl.ServerSessionImpl.executeQueueQuery(ServerSessionImpl.java:526)

              at org.hornetq.core.protocol.core.ServerSessionPacketHandler.handlePacket(ServerSessionPacketHandler.java:222)

              at org.hornetq.core.protocol.core.impl.ChannelImpl.handlePacket(ChannelImpl.java:508)

              at org.hornetq.core.protocol.core.impl.RemotingConnectionImpl.doBufferReceived(RemotingConnectionImpl.java:556)

              at org.hornetq.core.protocol.core.impl.RemotingConnectionImpl.bufferReceived(RemotingConnectionImpl.java:517)

              at org.hornetq.core.remoting.server.impl.RemotingServiceImpl$DelegatingBufferHandler.bufferReceived(RemotingServiceImpl.java:533)

              at org.hornetq.core.remoting.impl.netty.HornetQChannelHandler.messageReceived(HornetQChannelHandler.java:73)

              at org.jboss.netty.channel.SimpleChannelHandler.handleUpstream(SimpleChannelHandler.java:100)

              at org.jboss.netty.channel.StaticChannelPipeline.sendUpstream(StaticChannelPipeline.java:372)

              at org.jboss.netty.channel.StaticChannelPipeline$StaticChannelHandlerContext.sendUpstream(StaticChannelPipeline.java:534)

              at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:287)

              at org.hornetq.core.remoting.impl.netty.HornetQFrameDecoder2.decode(HornetQFrameDecoder2.java:169)

              at org.hornetq.core.remoting.impl.netty.HornetQFrameDecoder2.messageReceived(HornetQFrameDecoder2.java:134)

              at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:80)

              at org.jboss.netty.channel.StaticChannelPipeline.sendUpstream(StaticChannelPipeline.java:372)

              at org.jboss.netty.channel.StaticChannelPipeline.sendUpstream(StaticChannelPipeline.java:367)

              at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:274)

              at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:261)

              at org.jboss.netty.channel.socket.oio.OioWorker.run(OioWorker.java:90)

              at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)

              at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:44)

              at org.jboss.netty.util.VirtualExecutorService$ChildExecutorRunnable.run(VirtualExecutorService.java:181)

              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)

              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)

              at java.lang.Thread.run(Thread.java:679)

       

       

      When I take a thread dump of the server, a lot of threads are blocked on queue-level operations:

       

      "Old I/O server worker (parentId: 424711235, [id: 0x19509443, /10.142.211.140:5445])" prio=10 tid=0x0000000002082800 nid=0x7d13 waiting for monitor entry [0x00007fa1d3c0c000]

         java.lang.Thread.State: BLOCKED (on object monitor)

              at org.hornetq.core.server.impl.QueueImpl.removeConsumer(QueueImpl.java:576)

              - waiting to lock <0x000000050099fe88> (a org.hornetq.core.server.impl.QueueImpl)

              at org.hornetq.core.server.impl.ServerConsumerImpl.close(ServerConsumerImpl.java:342)

              at org.hornetq.core.server.impl.ServerSessionImpl.closeConsumer(ServerSessionImpl.java:1074)

              at org.hornetq.core.protocol.core.ServerSessionPacketHandler.handlePacket(ServerSessionPacketHandler.java:427)

              at org.hornetq.core.protocol.core.impl.ChannelImpl.handlePacket(ChannelImpl.java:508)

              at org.hornetq.core.protocol.core.impl.RemotingConnectionImpl.doBufferReceived(RemotingConnectionImpl.java:556)

              - locked <0x00000007a3791488> (a java.lang.Object)

       

       

      "Old I/O server worker (parentId: 424711235, [id: 0x19509443, /10.142.211.140:5445])" prio=10 tid=0x0000000001f9f000 nid=0x7cf2 waiting for monitor entry [0x00007fa1d9464000]

         java.lang.Thread.State: BLOCKED (on object monitor)

              at org.hornetq.core.server.impl.QueueImpl.addConsumer(QueueImpl.java:557)

              - waiting to lock <0x000000050099fe88> (a org.hornetq.core.server.impl.QueueImpl)

              at org.hornetq.core.server.impl.ServerConsumerImpl.<init>(ServerConsumerImpl.java:178)

              at org.hornetq.core.server.impl.ServerSessionImpl.createConsumer(ServerSessionImpl.java:346)

              at org.hornetq.core.protocol.core.ServerSessionPacketHandler.handlePacket(ServerSessionPacketHandler.java:214)

              at org.hornetq.core.protocol.core.impl.ChannelImpl.handlePacket(ChannelImpl.java:508)

              at org.hornetq.core.protocol.core.impl.RemotingConnectionImpl.doBufferReceived(RemotingConnectionImpl.java:556)

              - locked <0x00000007a380ee08> (a java.lang.Object)

       

       

      "hornetq-failure-check-thread" prio=10 tid=0x00007fa1e44f5800 nid=0x6364 waiting for monitor entry [0x00007fa1dd6a6000]

         java.lang.Thread.State: BLOCKED (on object monitor)

              at org.hornetq.core.server.impl.QueueImpl.removeConsumer(QueueImpl.java:576)

              - waiting to lock <0x000000050099fe88> (a org.hornetq.core.server.impl.QueueImpl)

              at org.hornetq.core.server.impl.ServerConsumerImpl.close(ServerConsumerImpl.java:342)

              at org.hornetq.core.server.impl.ServerSessionImpl.doClose(ServerSessionImpl.java:306)

              - locked <0x0000000511a284e8> (a org.hornetq.core.server.impl.ServerSessionImpl)

              at org.hornetq.core.server.impl.ServerSessionImpl.access$100(ServerSessionImpl.java:86)

              at org.hornetq.core.server.impl.ServerSessionImpl$1.done(ServerSessionImpl.java:1053)

       

       

       

      The version of HornetQ being used is 2.2.14 (hornetq-2.2.14.Final).

       

      Any clue as to what could be wrong?

        • 1. Re: Hornetq stops responding
          clebert.suconic

          You could try using NIO on Netty.

          • 2. Re: Hornetq stops responding
            clebert.suconic

            There's a chapter explaining how to set that up.

            • 3. Re: Hornetq stops responding
              clebert.suconic

              Can I see your entire thread dump?

              • 4. Re: Hornetq stops responding
                bhuvangupta

                Here's the thread dump http://pastebin.com/KGwaVKvb

                 

                Also attached.

                • 5. Re: Hornetq stops responding
                  bhuvangupta

                  We are already using NIO on Netty. The server is running on an EC2 Ubuntu VM with a heap size of 8 GB.

                   

                   

                  Here are the relevant configurations:

                   

                      <id-cache-size>10000</id-cache-size>

                      <connection-ttl-override>-1</connection-ttl-override>

                      <persistence-enabled>true</persistence-enabled>

                      <security-enabled>false</security-enabled>

                      <paging-directory>${data.dir:../data}/paging</paging-directory>

                      <scheduled-thread-pool-max-size>50</scheduled-thread-pool-max-size>

                      <thread-pool-max-size>-1</thread-pool-max-size>

                      <bindings-directory>${data.dir:../data}/bindings</bindings-directory>

                      <journal-directory>${data.dir:../data}/journal</journal-directory>

                      <journal-file-size>20971520</journal-file-size>

                      <journal-min-files>25</journal-min-files>

                      <journal-type>NIO</journal-type>

                      <journal-compact-min-files>1000</journal-compact-min-files>

                      <journal-sync-non-transactional>false</journal-sync-non-transactional>

                      <journal-sync-transactional>false</journal-sync-transactional>

                      <large-messages-directory>${data.dir:../data}/large-messages</large-messages-directory>

                      <journal-buffer-timeout>100000000</journal-buffer-timeout>

                      <journal-max-io>1</journal-max-io>

                      <jmx-management-enabled>true</jmx-management-enabled>

                   

                   

                      <connectors>

                          <connector name="netty">

                              <factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory</factory-class>

                              <param key="host" value="${hornetq.remoting.netty.host:127.0.0.1}"/>

                              <param key="port" value="${hornetq.remoting.netty.port:5445}"/>

                              <param key="batch-delay" value="50"/>

                              <param key="use-nio" value="true"/>

                              <param key="tcp-no-delay" value="true"/>

                   

                          </connector>

                   

                   

                   

                      </connectors>

                   

                   

                      <acceptors>

                          <acceptor name="netty">

                              <factory-class>org.hornetq.core.remoting.impl.netty.NettyAcceptorFactory</factory-class>

                              <param key="host" value="${hornetq.remoting.netty.host:127.0.0.1}"/>

                              <param key="port" value="${hornetq.remoting.netty.port:5445}"/>

                          </acceptor>

                      </acceptors>

                   

                  .....

                  ....

                  <address-settings>

                          <!--default for catch all-->

                          <address-setting match="#">

                              <dead-letter-address>jms.queue.DLQ</dead-letter-address>

                              <expiry-address>jms.queue.ExpiryQueue</expiry-address>

                              <redelivery-delay>2000</redelivery-delay>

                              <max-size-bytes>-1</max-size-bytes>

                              <message-counter-history-day-limit>100</message-counter-history-day-limit>

                              <page-size-bytes>1073741824</page-size-bytes>

                              <address-full-policy>PAGE</address-full-policy>

                          </address-setting>

                      </address-settings>

                   

                   

                  </configuration>

                  • 6. Re: Hornetq stops responding
                    clebert.suconic

                    You've confused journal NIO with Netty NIO, which are totally different things... Look at chapter 16.4.1 of the user manual.

                     

                     

                    In this case, you should use:

                     

                    Add this to the acceptor:

                     

                                   <acceptor name="netty">

                                            <factory-class>org.hornetq.core.remoting.impl.netty.NettyAcceptorFactory</factory-class>

                                            <param key="host"  value="${hornetq.remoting.netty.host:localhost}"/>

                                            <param key="port"  value="${hornetq.remoting.netty.port:5445}"/>

                                            <param key="use-nio"  value="true"/>

                                    </acceptor>

                     

                     

                     

                    Also: I can see from your thread dump that you create and delete consumers a lot. If you are doing that, you are basically using a messaging system in a synchronous way, which is an anti-pattern for any messaging system.

                     

                    I don't see the system deadlocked in the thread dump. I just see threads waiting to lock the queue while the queue is delivering messages. I don't see anything permanently stuck, but I may have missed it.

                     

                     

                    Take a look at use-nio=true on the acceptor, but be careful about creating and deleting so many consumers. Also, check whether you are leaking consumers... I mean, whether you are properly closing them in all your code paths.
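One way to make "properly closing them in all your code paths" mechanical is try-with-resources. The sketch below uses a hypothetical TrackedConsumer wrapper (not a HornetQ class) that counts open instances so a leak shows up as a nonzero count; in real code the wrapped object would be the JMS MessageConsumer and close() would delegate to it:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch, not HornetQ API: an AutoCloseable wrapper around a
// consumer so try-with-resources guarantees close() runs on every code path
// (normal return, early return, or exception), plus a counter to spot leaks.
class TrackedConsumer implements AutoCloseable {
    private static final AtomicInteger OPEN = new AtomicInteger();

    TrackedConsumer() {
        OPEN.incrementAndGet(); // in real code: create the MessageConsumer here
    }

    @Override
    public void close() {
        OPEN.decrementAndGet(); // in real code: consumer.close()
    }

    static int openCount() {
        return OPEN.get();
    }
}
```

Usage: wrap each receive/process cycle in `try (TrackedConsumer c = new TrackedConsumer()) { ... }` and assert periodically that openCount() matches the number of connected clients; a steadily growing count means some path skips close().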

                    • 7. Re: Hornetq stops responding
                      clebert.suconic
                      • 8. Re: Hornetq stops responding
                        bhuvangupta

                        Thanks Clebert. I didn't realize I had missed that; thanks for the catch. Will try this.

                         

                        Right, the VM is not deadlocked but hangs temporarily while performing some operations, which then time out.

                         

                        Yes, at the moment we create a consumer as soon as a mobile app client connects and close it as soon as the app disconnects.

                        Since mobile networks are unstable, the apps disconnect quite often, so the corresponding consumers are created and closed frequently.

                         

                        We are trying to implement a "pause"/"resume" feature in the consumer by adding a wait/notify hook in onMessage.

                        If you have a better suggestion, please let me know.
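The wait/notify hook described above could look something like this minimal gate (a hypothetical helper class, not part of HornetQ or JMS): the consumer's onMessage would call awaitResumed() before processing, so delivery blocks while the mobile client is offline instead of the consumer being closed and recreated:

```java
// PauseGate: a minimal wait/notify gate for pausing message processing
// without closing the consumer. One instance per client; pause() on
// disconnect, resume() on reconnect, awaitResumed() at the top of onMessage.
class PauseGate {
    private boolean paused = false;

    synchronized void pause() {
        paused = true;
    }

    synchronized void resume() {
        paused = false;
        notifyAll(); // wake any delivery thread blocked in awaitResumed()
    }

    // Called from onMessage(); blocks while the client is disconnected.
    // The while loop guards against spurious wakeups.
    synchronized void awaitResumed() throws InterruptedException {
        while (paused) {
            wait();
        }
    }
}
```

One caveat with this approach: blocking inside onMessage ties up a session/delivery thread for each paused client, so it trades consumer churn for thread usage; with thousands of mostly-offline clients that can itself become a bottleneck.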

                        • 9. Re: Hornetq stops responding
                          clebert.suconic

                          NIO on Netty should help with this, as nothing will block on the queue; it will just delegate to Netty and forget.