31 Replies. Latest reply on Oct 6, 2010 9:08 AM by awelynant
      • 15. Re: Severe message loss using Stomp with "direct-deliver" enabled
        timfox

        BTW if you're talking about STOMP (it's not clear), I'm not an expert on the STOMP implementation, but looking at the code I can see the consumer window size has been hardcoded to -1. This would mean an unbounded buffer.

         

        It should really be matched to the TCP buffer size.

         

        Jeff - can you advise why it was coded this way?

        • 16. Re: Severe message loss using Stomp with "direct-deliver" enabled
          david.taylor

          Sigh. Firstly, no messages are "lost". They are in the process of delivery, and will happily get requeued if they are not acked and the session closes.

           

          We are seeing message loss with the configuration I have described. Specifically, the following sequence of events results in lost messages that are not requeued even after expiration of the TTL. Perhaps there is something wrong with our HQ or Stomp client configuration, but that is not at all obvious at this point. Of course, the reason for starting this thread in the first place was to seek advice on configuration and problem solving.

           

          1) Delete HQ journal and binding files from disk

          2) Startup HQ

          3) Start a STOMP consumer and then terminate the process (i.e. unclean shutdown)

          4) Start a new STOMP consumer

          5) Start a producer application that submits several thousand messages to the queue

           

          Observations:

           

          - Approximately every other message is received by the consumer started in step 4)

          - The missing messages are never received by the active consumer no matter how long it is left running

           

          If you read the chapter on flow control, you will see that each consumer maintains a window. This is the total size of messages that can be sent to the consumer without the consumer requesting more credits. It's completely configurable and its default value is 1 MiB. This determines how many messages can be queued to be sent to a consumer.

          Fair enough, I had not read that part of the documentation in detail. The question is how does flow control work when using a STOMP protocol client? I have read everything I could find regarding the HornetQ STOMP implementation in the docs and on Google.
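To make the quoted description of the consumer window concrete, here is a minimal sketch of credit-based flow control. It is an illustration only: the class `CreditWindow` and its methods are hypothetical names, not HornetQ code, and HornetQ's real accounting may differ in detail. It also does not model the special value -1, which earlier in this thread is described as meaning an unbounded buffer.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration of a credit-based consumer window; not HornetQ code. */
final class CreditWindow {
    // Remaining credits in bytes; may dip below zero after a large message.
    private final AtomicInteger credits;

    CreditWindow(int windowSizeBytes) {
        this.credits = new AtomicInteger(windowSizeBytes);
    }

    /** Sender side: deliver only while some credit remains; a delivery consumes credit. */
    boolean tryDeliver(int messageSizeBytes) {
        while (true) {
            int current = credits.get();
            if (current <= 0) {
                return false; // window exhausted, hold the message on the queue
            }
            if (credits.compareAndSet(current, current - messageSizeBytes)) {
                return true;
            }
        }
    }

    /** Receiver side: hand credits back once messages have been consumed. */
    void addCredits(int bytes) {
        credits.addAndGet(bytes);
    }

    int available() {
        return credits.get();
    }
}
```

With a window like this, the sender stops pushing messages once the credits run out and resumes only when the receiver returns some, which bounds how many messages can be in flight to a single consumer.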

           

          Secondly, TCP send buffer size. Default value for this is 32kiB. On faster networks it is recommended to set this to a higher figure e.g. 1 MiB.

           

          Understood. Not all messaging use-cases are concerned with high performance so the default buffer may well be appropriate.

           

          If you really understand TCP as well as you say you do, you will understand that, in the absence of consumer flow control, it's TCP flow control that determines how many messages are "lost in the ether". With a 1 MiB TCP send buffer that's more than "one or two messages", unless of course your messages are very large.

           

          Again, this is fully configurable.

           

          Yes, I do understand TCP very well since I am an engineer that has worked in that field for 20+ years. What is not at all clear is the inner workings of HornetQ and how it makes use of the TCP/IP stack. In my experience, applications actively transferring data over TCP detect session failures rather quickly and respond appropriately. HornetQ seems to layer session handling and buffering semantics on top of the TCP transport, which gives it a wholly different feel. I was expecting a fast-fail sort of behavior, but am seeing something entirely different by default.

           

          Before you start making ranting claims about HornetQ reliability I suggest you fully read the relevant chapters in the user manual, and preferably a book on TCP.

           

          How about Douglas E. Comer, Third Edition?  On a more serious note, I have read most of the HQ documentation. Problem is that HornetQ is a somewhat complex product with many configuration options. The documentation is quite good in many respects, but is no substitute for experience working with the product or having intimate knowledge of its inner workings.

           

          My position regarding reliability is simply based on relevant real-world observations, not unrealistic expectations on how a queuing system should work. Please provide some more constructive input that addresses our issues directly and I promise to be more complimentary. Simply stating that something is working as designed is not very helpful.

           

          As I said before, HornetQ behaviour is completely configurable in this regard, and any buffering due to the way TCP works is unavoidable and would be the same with any other messaging system.

           

          I agree with you that the fundamentals of TCP are the same for any messaging product. That said, I believe configuration has played a major role in creating the current situation. The out-of-the-box configuration in combination with STOMP seems to yield an undesirable combination that compares unfavorably with other queuing solutions we have used. A similar simple point-to-point queuing application implemented with MSMQ, for example, does not exhibit the same behaviors we have observed with HQ.

           

          Perhaps a small example of how to implement point-to-point reliable queuing using STOMP that handles consumer crashes gracefully would be appropriate. Please take this as a positive suggestion since that is my intention.

           

          Regards,

          David

          • 17. Re: Severe message loss using Stomp with "direct-deliver" enabled
            timfox

            I'm not an expert on the STOMP implementation (didn't write it), but it *should* have the same or similar semantics with respect to connection-ttl etc as the core protocol, which is what 95% of our users use.

             

            Looking at the STOMP code now, I can see a couple of oddities, e.g. consumer window size is hardcoded to -1, and connection-ttl to zero.

             

            I will ask Jeff to comment on this, since he is the author and perhaps there is a reason for this.

            • 18. Re: Severe message loss using Stomp with "direct-deliver" enabled
              jmesnil

              David Taylor wrote:

               

              We are seeing message loss with the configuration I have described. Specifically, the following sequence of events results in lost messages that are not requeued even after expiration of the TTL. Perhaps there is something wrong with our HQ or Stomp client configuration, but that is not at all obvious at this point.

              The bug is definitely in HornetQ's STOMP implementation: https://jira.jboss.org/browse/HORNETQ-526

              • 19. Re: Severe message loss using Stomp with "direct-deliver" enabled
                timfox

                Jeff Mesnil wrote:

                 

                David Taylor wrote:

                 

                We are seeing message loss with the configuration I have described. Specifically, the following sequence of events results in lost messages that are not requeued even after expiration of the TTL. Perhaps there is something wrong with our HQ or Stomp client configuration, but that is not at all obvious at this point.

                The bug is definitely in HornetQ's STOMP implementation: https://jira.jboss.org/browse/HORNETQ-526

                It's actually more complex than that. connection-ttl being zero is one problem, but the other problem is the STOMP implementation is setting consumer-window-size to -1, which disables consumer flow control. This means that even if the TCP connection send buffer is full, the STOMP implementation will continue to send messages to Netty, eventually resulting in OOM since the Netty write queue is unbounded.

                 

                Since there is no flow control built into the STOMP protocol, you need to use TCP flow control to enable and disable the consumer. Off the top of my head, something like the following:

                 

                /*
                * Copyright 2010 Red Hat, Inc.
                * Red Hat licenses this file to you under the Apache License, version
                * 2.0 (the "License"); you may not use this file except in compliance
                * with the License.  You may obtain a copy of the License at
                *    http://www.apache.org/licenses/LICENSE-2.0
                * Unless required by applicable law or agreed to in writing, software
                * distributed under the License is distributed on an "AS IS" BASIS,
                * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
                * implied.  See the License for the specific language governing
                * permissions and limitations under the License.
                */

                 

                package org.hornetq.tests.local;

                 

                import java.util.Queue;
                import java.util.concurrent.ConcurrentLinkedQueue;
                import java.util.concurrent.atomic.AtomicInteger;

                 

                import org.jboss.netty.buffer.ChannelBuffer;
                import org.jboss.netty.channel.Channel;
                import org.jboss.netty.channel.ChannelFuture;
                import org.jboss.netty.channel.ChannelFutureListener;

                 

                /**
                * A NettyFlowControl
                *
                * @author tim
                *
                *
                */
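                public class NettyFlowControl implements ChannelFutureListener
                {
                   // Total bytes handed to Netty whose writes have not yet completed
                   private final AtomicInteger queueSize = new AtomicInteger(0);

                   private final Channel channel;

                   // Sizes of the in-flight writes, in the order they were issued
                   private final Queue<Integer> queue = new ConcurrentLinkedQueue<Integer>();

                   // Disable the consumer once this many bytes are in flight
                   private final int maxDeliveringBytes;

                   // Re-enable the consumer once the in-flight bytes drop below this
                   private final int reenableBytes;

                   public NettyFlowControl(Channel channel, int maxDeliveringBytes, int reenableBytes)
                   {
                      this.channel = channel;
                      this.maxDeliveringBytes = maxDeliveringBytes;
                      this.reenableBytes = reenableBytes;
                   }

                   // Called by Netty when a write issued from write() has completed
                   public void operationComplete(ChannelFuture future) throws Exception
                   {
                      Integer bytes = queue.poll();

                      if (bytes == null)
                      {
                         // No matching write was recorded; nothing to account for
                         return;
                      }

                      int bytesAfter = queueSize.addAndGet(-bytes);

                      if (bytesAfter < reenableBytes)
                      {
                         //TODO re-enable consumer - send it some credits
                      }
                   }

                   public void write(ChannelBuffer buffer)
                   {
                      // readableBytes() is the amount of data that will actually be written
                      int bytes = buffer.readableBytes();

                      queue.add(bytes);

                      int bytesAfter = queueSize.addAndGet(bytes);

                      ChannelFuture future = channel.write(buffer);

                      future.addListener(this);

                      if (bytesAfter >= maxDeliveringBytes)
                      {
                         //TODO disable consumer - remove credits
                      }
                   }
                }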

                 

                I suggest also talking to Trustin in case Netty already does something like this.
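The byte accounting in the class above can also be tried out without Netty. The following is a minimal sketch under that assumption: `WatermarkGate` and its method names are hypothetical, and the two boolean return values stand in for the two TODOs (disable the consumer, re-enable the consumer). It reports only the moment a threshold is crossed, so the caller should treat pause and resume as idempotent.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical, Netty-free version of the high/low watermark accounting. */
final class WatermarkGate {
    private final AtomicInteger pendingBytes = new AtomicInteger(0);
    private final int highWatermark;
    private final int lowWatermark;

    WatermarkGate(int highWatermark, int lowWatermark) {
        if (lowWatermark < 0 || lowWatermark >= highWatermark) {
            throw new IllegalArgumentException("need 0 <= low < high");
        }
        this.highWatermark = highWatermark;
        this.lowWatermark = lowWatermark;
    }

    /** Call when a write is queued. Returns true if this write crossed the high watermark. */
    boolean onWriteQueued(int bytes) {
        int after = pendingBytes.addAndGet(bytes);
        int before = after - bytes;
        return before < highWatermark && after >= highWatermark;
    }

    /** Call when a write completes. Returns true if pending bytes just fell below the low watermark. */
    boolean onWriteCompleted(int bytes) {
        int after = pendingBytes.addAndGet(-bytes);
        int before = after + bytes;
        return before >= lowWatermark && after < lowWatermark;
    }

    int pendingBytes() {
        return pendingBytes.get();
    }
}
```

Keeping the re-enable threshold below the disable threshold avoids toggling the consumer on and off for every single write when the pending byte count hovers around one value.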

                • 20. Re: Severe message loss using Stomp with "direct-deliver" enabled
                  awelynant

                  I'm seeing problems using the PHP Stomp extension (http://php.net/manual/en/book.stomp.php) due to the handling of connectionTTL and connectionWindowSize.  Using the basic example code from the PHP site, we get a "ghost" consumer.  This seems to happen when there is more than 1 message in the queue and the PHP client connects.  All of the messages are sent (I see this in wireshark) but the client only handles a single message and closes.  The server still lists a consumer (seen in jConsole) and all of the messages (besides the initial one that was ack-ed) are in the deliveryCount list.  Because connectionTTL isn't honoured, this consumer is never cleared and the messages are never re-queued.  The messages are never received by any additional clients that connect later.

                   

                  <?php

                  $queue  = 'jms.queue.ExampleQueue';
                  $msg    = 'bar';

                  /* connection */
                  try {
                      $stomp = new Stomp('tcp://192.168.61.6:61613','guest','guest');
                  } catch(StompException $e) {
                      die('Connection failed: ' . $e->getMessage());
                  }

                  /* send a message to the queue 'foo' */
                  $stomp->send($queue, $msg);

                  /* subscribe to messages from the queue 'foo' */
                  $stomp->subscribe($queue);

                  /* read a frame */
                  $frame = $stomp->readFrame();
                  var_dump($frame);
                  $stomp->ack($frame);

                  /* remove the subscription */
                  $stomp->unsubscribe($queue);

                  /* close connection */
                  unset($stomp);

                  ?>

                  <?php
                  $queue  = 'jms.queue.ExampleQueue';
                  $msg    = 'bar';
                  /* connection */
                  try {
                      $stomp = new Stomp('tcp://192.168.61.6:61613','guest','guest');
                  } catch(StompException $e) {
                      die('Connection failed: ' . $e->getMessage());
                  }
                  /* send a message to the queue 'foo' */
                  $stomp->send($queue, $msg);
                  /* subscribe to messages from the queue 'foo' */
                  $stomp->subscribe($queue);
                  /* read a frame */
                  $frame = $stomp->readFrame();
                  var_dump($frame);
                  //if ($frame->body === $msg) {
                      /* acknowledge that the frame was received */
                  //}
                  /* remove the subscription */
                  $stomp->unsubscribe($queue);
                     
                  $stomp->ack($frame);
                  /* close connection */
                  print $stomp->error();
                  unset($stomp);
                  ?>

                  • 21. Re: Severe message loss using Stomp with "direct-deliver" enabled
                    timfox

                    I am going to default connection-ttl for STOMP to 1 minute, which is the same as in the core protocol. This can be overridden on the server side using connection-ttl-override.

                     

                    Please note that the STOMP protocol does not contain any heartbeat frames (unlike the core protocol), therefore it is the user's responsibility to ensure data is sent before connection-ttl kicks in or the server will assume the connection is dead and clear up its resources.

                     

                    A well-behaved client will always send a DISCONNECT frame before closing (your test program does not do this, but I think that is deliberate to test the cleanup behaviour). This causes any server-side resources to be cleaned up synchronously without having to wait for connection-ttl.
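For clients that have to build frames by hand, here is a minimal sketch of what a DISCONNECT frame looks like on the wire, assuming STOMP 1.0 framing: the command line, zero or more `name:value` header lines, a blank line, the body, and a terminating NUL byte. The class `StompFrames` and its methods are hypothetical helpers, not part of any client library mentioned in this thread.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Map;

/** Hypothetical helper that encodes STOMP 1.0 frames as text. */
final class StompFrames {
    private static final char NUL = '\0';

    /** Command, one "name:value" line per header, a blank line, the body, then NUL. */
    static String encode(String command, Map<String, String> headers, String body) {
        StringBuilder sb = new StringBuilder(command).append('\n');
        for (Map.Entry<String, String> header : headers.entrySet()) {
            sb.append(header.getKey()).append(':').append(header.getValue()).append('\n');
        }
        sb.append('\n').append(body).append(NUL);
        return sb.toString();
    }

    /** Bytes to write to the socket just before closing it. */
    static byte[] disconnect() {
        Map<String, String> noHeaders = Collections.emptyMap();
        return encode("DISCONNECT", noHeaders, "").getBytes(StandardCharsets.UTF_8);
    }
}
```

Writing the result of `StompFrames.disconnect()` to the socket and flushing before closing it gives the server the chance to clean up immediately instead of waiting for connection-ttl.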

                    • 22. Re: Severe message loss using Stomp with "direct-deliver" enabled
                      timfox

                      The changes to connection-ttl are now in TRUNK. Can you please take them for a spin and give feedback? Thanks.

                      • 23. Re: Severe message loss using Stomp with "direct-deliver" enabled
                        timfox

                        Also, can you please check whether you're using NIO or OIO on the server side for the STOMP acceptor.

                         

                        In TRUNK OIO would be used by default. But if you're using NIO this may lead to OOM when no flow control is enabled.

                         

                        Set the param "use-nio" to false on the acceptor:

                         

                        <param key="use-nio" value="false"/>

                        • 24. Re: Severe message loss using Stomp with "direct-deliver" enabled
                          awelynant

                          Seeing really odd behaviour with 3 PHP stomp clients now.  They all seem to behave "wrong" when they subscribe to a queue and receive a stream of messages.  These clients don't handle local buffering and will sometimes return two "MESSAGE" frames as a single message.  If the first is ACK-ed, that's when the ghost consumer problem happens.  Hornet reports that there is a consumer on the queue but that process has actually closed and finished.  Sometimes the DISCONNECT frame is sent properly, sometimes not.  I haven't really nailed down what causes this issue, though it seems to happen on all three, so maybe it is somehow related to PHP's socket handling?  I'm not a PHP dev, so I have no idea.  For sure, the connectionTTL is still not working correctly.  I've set the connection-ttl-override to 30000 (so 30s) and after a minute when I check the consumer count (using jconsole) there is still a consumer.  We apparently (before I started working on this) had a similar problem when using stompconnect and the solution was to patch it to only send a single message at a time.  If that's the case, then it seems the "solution" for this problem with hornet may also be to get the connection-window-size to work with STOMP.  Anyone else have problems like this?  Am I looking in the wrong place?
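On the point about two MESSAGE frames coming back as a single message: a single socket read can contain several frames, or only part of one, so a client has to buffer bytes and split on the frame terminator itself. Below is a minimal sketch of that buffering. `StompFrameSplitter` is a hypothetical name, and the sketch assumes text frames terminated by a NUL byte with no `content-length` header (so the body contains no NUL bytes).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical splitter that turns raw socket reads into whole STOMP frames. */
final class StompFrameSplitter {
    // Bytes of the frame currently being assembled, kept across reads.
    private final ByteArrayOutputStream pending = new ByteArrayOutputStream();

    /** Feed one chunk exactly as read from the socket; returns every frame it completes. */
    List<String> feed(byte[] chunk) {
        List<String> frames = new ArrayList<String>();
        for (byte b : chunk) {
            if (b == 0) {
                // NUL terminates a frame: emit what has been collected so far.
                if (pending.size() > 0) {
                    frames.add(new String(pending.toByteArray(), StandardCharsets.UTF_8));
                    pending.reset();
                }
            } else if (b == '\n' && pending.size() == 0) {
                // Skip newlines that some brokers send between frames.
                continue;
            } else {
                pending.write(b);
            }
        }
        return frames;
    }
}
```

A client built this way can still process one message at a time: it reads from the socket, keeps the returned frames in a local list, and ACKs each one as it is handled rather than dropping the extras.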

                          • 25. Re: Severe message loss using Stomp with "direct-deliver" enabled
                            david.taylor

                            Tim & Jeff,

                             

                            Nice progress on the Stomp issue. I will take a look at the changes in TRUNK and rerun our tests.

                             

                            Some questions related to TTL and the lack of a heartbeat in the Stomp protocol:

                             

                            1) What is the best approach to keeping the session alive in the absence of actual message flow? Is there some simple NOP type of message that can be sent periodically? The lack of a keep-alive mechanism seems to be a known issue with STOMP since the "Ideas" for v1.1 mention a "ping" feature.

                             

                            2) How can a STOMP client determine if its session has expired and been cleaned up at the broker? The client may need to detect this failure to deal with any unacknowledged message(s) it may have. Perhaps there are some semantics related to handling of the STOMP SUBSCRIBE id header I have missed.

                             

                            3) What are the session recovery/expiration semantics when a Stomp client is using explicit message acknowledgements? I am particularly interested in the case where a connection failure occurs and is subsequently reestablished after the configured broker TTL has expired.

                             

                            Regards,

                            David

                            • 26. Re: Severe message loss using Stomp with "direct-deliver" enabled
                              david.taylor

                              I just noticed that the ActiveMQ folks have implemented some extensions to STOMP to deal with JMS semantics. Perhaps some of this may be relevant to the current discussion.

                               

                              http://activemq.apache.org/stomp.html

                               

                              Regards,

                              David

                              • 27. Re: Severe message loss using Stomp with "direct-deliver" enabled
                                timfox

                                Craig - did you attach your test program as requested on IRC? I can't see it here or on the JIRA.

                                • 28. Re: Severe message loss using Stomp with "direct-deliver" enabled
                                  timfox

                                  Craig - I am unable to replicate your issue from TRUNK.

                                   

                                  I am using the test code you pasted:

                                   

                                  <?php

                                  $queue  = 'jms.queue.ExampleQueue';
                                  $msg    = 'bar';

                                  /* connection */
                                  try {
                                      $stomp = new Stomp('tcp://192.168.61.6:61613','guest','guest');
                                  } catch(StompException $e) {
                                      die('Connection failed: ' . $e->getMessage());
                                  }

                                  /* send a message to the queue 'foo' */
                                  $stomp->send($queue, $msg);

                                  /* subscribe to messages from the queue 'foo' */
                                  $stomp->subscribe($queue);

                                  /* read a frame */
                                  $frame = $stomp->readFrame();
                                  var_dump($frame);
                                  $stomp->ack($frame);

                                  /* remove the subscription */
                                  $stomp->unsubscribe($queue);

                                  /* close connection */
                                  unset($stomp);

                                   

                                   

                                  I have added an extra sleep(10) before the call to unset($stomp)

                                   

                                  I run the server, I run the client, wait until it reaches the sleep, and kill -9 the client process. The session is cleaned up fine.

                                   

                                  If you can give *exact* instructions to replicate that would be useful.

                                  • 29. Re: Severe message loss using Stomp with "direct-deliver" enabled
                                    awelynant

                                    I realize that what I posted was a simplified version of what we are doing.  I've noticed that the php clients and hornetq seem to behave correctly in the following scenarios:

                                      * There are 0 messages in the queue.  The client connects, subscribes, times out in the readFrame method, unsubscribes and disconnects.  JMX reports that there are 0 consumers on the queue.

                                      * There is 1 message in the queue.  The client connects, subscribes, receives the message, ACKs the message, unsubscribes and disconnects.  JMX reports that there are 0 consumers on the queue.

                                      * There are many messages in the queue.  The client connects, subscribes, the messages are sent (confirmed via wireshark) but NO messages are ACK-ed, the client unsubscribes and disconnects.  In this case, JMX reports 0 consumers and any new consumer receives those messages again (which is expected given that none were ACK-ed).

                                     

                                    The following is the scenario that seems to cause this "ghost consumer":

                                      * There are many messages in the queue.  The client connects, subscribes, the messages are sent (again confirmed via wireshark) but only the first message is ACK-ed, the client unsubscribes and the client "disconnects".  Our usage is a "slow consumer" (I think that's the term) and we often only want to deal with 1 message at a time.  In this case, JMX continues to report 1 consumer on the queue.  All the messages are listed in the deliveryCount bucket.  I have also noticed that none of the PHP clients seem to correctly send the DISCONNECT frame in this scenario.  I don't know why.  Some of the clients send it as part of the destructor and there is no way to explicitly send it.  Others do have a "disconnect" method, but that doesn't seem to work in this situation either (confirmed via wireshark).  After waiting for more than the connectionTTL (default or override), JMX still reports a consumer and any new messages appear to be "sent" to that consumer.  Also, using JMX to close what appears to be the connection with the org.hornetq.Server.Core.closeConnectionsForAddress operation doesn't seem to change the consumer count.

                                     

                                     

                                    All of the clients I've tried have a readFrame method that should read a single message.  Since we are "slow consumers", we really only want to deal with a small number of messages (usually 1) at a time.  I realize this may not be a traditional use-case, but that's what we have.  We are planning to move away from any consumption of messages by PHP (only produce) but that's not ready yet.

                                     

                                    I will include a tar that has an additional PHP client (from http://stomp.fusesource.org/documentation/php/book.html) and a testme.php script.  Fix the script to connect to the right server (and credentials).  If you can, please try the four scenarios (0 msgs, 1 msg, N msgs no-ACK, N msgs ACK) and see what you get.  I really appreciate the effort to look at this.

                                     

                                    BTW, do you want me to continue to respond here?  Or in the JIRA ticket?  Thanks again.