12 Replies Latest reply on Feb 18, 2010 11:26 AM by timfox

    UTF-8 encoding in Stomp frames

    mjustin

      Sending and receiving of messages with Unicode characters (UTF-8 encoded) gives some unexpected results.

      (Later I will try a Java JMS client to consume the message, to find out whether the problem is in the decoding or in the encoding on the server side.)

       

      For five of the 12 languages, the received content does not match the sent content, for example:


      expected: <ગુજરાતી> but was: <ગ�?જરાતી>

       

       

      The test data uses these language names:

       

        am = '&#x12A0;&#x121B;&#x122D;&#x129B;';
        ar = '&#x0627;&#x0644;&#x0639;&#x0631;&#x0628;&#x064A;&#x0629;';
        gu = '&#x0A97;&#x0AC1;&#x0A9C;&#x0AB0;&#x0ABE;&#x0AA4;&#x0AC0;';  // fails
        he = '&#x05E2;&#x05D1;&#x05E8;&#x05D9;&#x05EA;';
        hi = '&#x0939;&#x093F;&#x0928;&#x094D;&#x0926;&#x0940;';  // fails
        kn = '&#x0C95;&#x0CA8;&#x0CCD;&#x0CA8;&#x0CA1;';  // fails
        mr = '&#x092E;&#x0930;&#x093E;&#x0920;&#x0940;';
        ja = '&#x65E5;&#x672C;&#x8A9E;';
        ta = '&#x0BA4;&#x0BAE;&#x0BBF;&#x0BB4;&#x0BCD;';  // fails
        te = '&#x0C24;&#x0C46;&#x0C32;&#x0C41;&#x0C17;&#x0C41;';  // fails
        ur = '&#x0627;&#x0631;&#x062F;&#x0648;';
        zh = '&#x4E2D;&#x6587;';

        • 1. Re: UTF-8 encoding in Stomp frames
          jmesnil

          You're right, it was a bug: if the frame body is a String, it was not properly encoded as UTF-8.

          I've just fixed it in the trunk (r8882).

           

          thanks for the heads up

          • 2. Re: UTF-8 encoding in Stomp frames
            timfox

            Actually the fix is not right.

             

            If a core message is of type text, then the data won't necessarily be encoded as UTF-8.

             

            The encoding is defined in ChannelBufferWrapper::readStringInternal; depending on its length, the string is encoded in different ways for optimal performance.

             

            I think the problem here is that you are trying to mix and match STOMP and core messages without defining any proper mapping, and assuming that a text message sent by core will be available as a STOMP text message.

             

            Also, I can't see anywhere in the STOMP protocol definition where it says the text is encoded as UTF-8 on the wire; for all we know it might just be ASCII, or some other encoding.

            • 3. Re: UTF-8 encoding in Stomp frames
              jmesnil

              timfox wrote:

               

               

              Also, I can't see anywhere in the STOMP protocol definition where it says the text is encoded as UTF-8 on the wire; for all we know it might just be ASCII, or some other encoding.

              It's implied that UTF-8 is the default encoding: http://activemq.apache.org/stomp/stomp10/additional.html#character_encoding

              • 4. Re: UTF-8 encoding in Stomp frames
                timfox

                jmesnil wrote:

                 

                timfox wrote:

                 

                 

                Also, I can't see anywhere in the STOMP protocol definition where it says the text is encoded as UTF-8 on the wire; for all we know it might just be ASCII, or some other encoding.

                It's implied that UTF-8 is the default encoding: http://activemq.apache.org/stomp/stomp10/additional.html#character_encoding

                OK, so that's what the ActiveMQ guys have assumed, as it's omitted from the spec, so let's go with that.

                 

                However, using TEXT_TYPE is incorrect: this will fail if you try to consume a message that has been sent by a core client with a size < 9 or > 0xfff bytes.

                 

                You need to define your own types for STOMP messages, not hijack the core types.

                 

                Also, this seems very convoluted:

                 

                byte[] content = frame.getContent();
                if (type == Message.TEXT_TYPE)
                {
                   message.getBodyBuffer().writeNullableSimpleString(SimpleString.toSimpleString(new String(content)));
                }

                 

                Why not write the content directly in the buffer?

                 

                message.getBodyBuffer().writeBytes(content) ?
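
A side note on the snippet quoted above: `new String(content)` decodes with the JVM's default charset, which is not UTF-8 on every platform. The following is a minimal, self-contained sketch of that pitfall and not HornetQ code. The class name `Utf8Decoding` and the helper `decode` are hypothetical, and the sample text is the `gu` value from the test data.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class, not part of HornetQ.
class Utf8Decoding {
    // Decode frame content with an explicit charset instead of the platform default.
    static String decode(byte[] content, Charset charset) {
        return new String(content, charset);
    }

    public static void main(String[] args) {
        // The "gu" test string, written with Unicode escapes.
        String sent = "\u0A97\u0AC1\u0A9C\u0AB0\u0ABE\u0AA4\u0AC0";
        byte[] content = sent.getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset used for encoding restores the text.
        System.out.println(sent.equals(decode(content, StandardCharsets.UTF_8)));      // true
        // Decoding the same bytes with a single-byte charset does not.
        System.out.println(sent.equals(decode(content, StandardCharsets.ISO_8859_1))); // false
    }
}
```

Writing the bytes straight into the buffer, as suggested above, avoids the decode step altogether.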

                • 5. Re: UTF-8 encoding in Stomp frames
                  jmesnil

                  I don't understand how it is different from JMS HornetQTextMessage "hijacking" the core TEXT_TYPE.

                   

                  The idea was to provide interoperability between Stomp messages and our Core/JMS messages:

                  - if the Stomp message has no content-length, treat its body as a String => convert it to a TEXT_TYPE core message so that we can consume it as a JMS TextMessage

                  - else treat it as a BYTES_TYPE, so we can consume it as a JMS BytesMessage

                   

                  Am I missing a more obvious way to do this?
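
The mapping rule described in this post can be sketched as follows. This is only an illustration of the decision, not the HornetQ implementation: the class `StompBodyMapping`, the enum `BodyKind`, and the method `classify` are hypothetical names, and the headers are modelled as a plain `Map`.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class, not part of HornetQ.
class StompBodyMapping {
    enum BodyKind { TEXT, BYTES }

    // No content-length header: treat the body as text; otherwise treat it as bytes.
    static BodyKind classify(Map<String, String> headers) {
        return headers.containsKey("content-length") ? BodyKind.BYTES : BodyKind.TEXT;
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        headers.put("destination", "/queue/test"); // hypothetical destination name
        System.out.println(classify(headers));     // TEXT

        headers.put("content-length", "21");
        System.out.println(classify(headers));     // BYTES
    }
}
```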

                  • 6. Re: UTF-8 encoding in Stomp frames
                    mjustin

                    Hello Jeff,

                     

                    many thanks for the information, the last change causes problems with other languages (for example expected: <አማርኛ> but was: <አማáˆáŠ›> for the code sequence am = '&#x12A0;&#x121B;&#x122D;&#x129B;'). It is not a high priority for me at the moment and I see it is work in progress.

                     

                    Regards,

                    Michael

                    • 7. Re: UTF-8 encoding in Stomp frames
                      jmesnil

                      timfox wrote:

                      Also, this seems very convoluted:

                       

                      byte[] content = frame.getContent();
                      if (type == Message.TEXT_TYPE)
                      {
                         message.getBodyBuffer().writeNullableSimpleString(SimpleString.toSimpleString(new String(content)));
                      }

                       

                      Why not write the content directly in the buffer?

                       

                      message.getBodyBuffer().writeBytes(content) ?

                      This is for interoperability with JMS.

                       

                      If I wrote the content directly, the message would not be readable as a JMS TextMessage (which expects a nullable string from its body buffer).

                      Either we keep this, or we remove all this code and tell our users that they must use only JMS BytesMessage if they want to interact with messages sent/consumed by Stomp (not very friendly for a text-orientated protocol).

                       

                      I'd prefer to be able to use JMS TextMessage by default to interoperate with Stomp messages.

                      wdyt?

                      • 8. Re: UTF-8 encoding in Stomp frames
                        timfox

                        jmesnil wrote:

                         

                        I don't understand how it is different from JMS HornetQTextMessage "hijacking" the core TEXT_TYPE.

                         

                        The idea was to provide interoperability between Stomp messages and our Core/JMS messages:

                        - if the Stomp message has no content-length, treat its body as a String => convert it to a TEXT_TYPE core message so that we can consume it as a JMS TextMessage

                        - else treat it as a BYTES_TYPE, so we can consume it as a JMS BytesMessage

                         

                        Am I missing a more obvious way to do this?

                        Even if we were to provide some automatic transformation between JMS text messages and STOMP messages, which is not required to implement the STOMP protocol, the way you have done it wouldn't work anyway. Strings are encoded in core messages in a more complex way than a NullableSimpleString, as I mentioned in a previous post.

                         

                        Let's get the basic STOMP protocol implemented first; we can think about implementing "extras" like mappings between STOMP and JMS later.

                        • 9. Re: UTF-8 encoding in Stomp frames
                          mjustin

                          Hi Tim,

                           

                          I agree here. A Stomp-JMS mapping is nice to have, but clients can also set a user-defined property like 'content-type' to detect text or binary messages. In this case, the Stomp frame could always contain the content-length header, which would then no longer indicate the content type. Such a basic protocol implementation would be fine for most use cases. So far I am very impressed by the Stomp transport and the ease of use of HornetQ.
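
A minimal sketch of that idea, assuming the STOMP 1.0 frame layout (command line, header lines, a blank line, the body, then a NUL byte): the frame always carries content-length, and a user-defined content-type header says whether the body is text. The class `StompSendFrame`, the method `build`, the header value and the destination name are hypothetical choices for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class, not part of any client library.
class StompSendFrame {
    // Build a SEND frame: command, headers, blank line, body, terminating NUL.
    static byte[] build(String destination, String contentType, byte[] body) {
        String head = "SEND\n"
                + "destination:" + destination + "\n"
                + "content-type:" + contentType + "\n"
                // content-length counts bytes of the body, not characters.
                + "content-length:" + body.length + "\n"
                + "\n";
        byte[] headBytes = head.getBytes(StandardCharsets.UTF_8);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(headBytes, 0, headBytes.length);
        out.write(body, 0, body.length);
        out.write(0); // frame terminator
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // The "zh" test string: two characters, six bytes in UTF-8.
        byte[] body = "\u4E2D\u6587".getBytes(StandardCharsets.UTF_8);
        byte[] frame = build("/queue/test", "text/plain;charset=utf-8", body);
        System.out.println(frame.length);
    }
}
```

The receiving side would read exactly content-length bytes and then decide how to interpret them from the content-type value.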

                           

                          btw I would like to post a short announcement for my (commercial) Delphi and Free Pascal client library for HornetQ. Would this be allowed in this forum?

                           

                          Regards,

                          Michael

                          • 10. Re: UTF-8 encoding in Stomp frames
                            timfox

                            mjustin wrote:


                             

                            btw I would like to post a short announcement for my (commercial) Delphi and Free Pascal client library for HornetQ. Would this be allowed in this forum?

                             

                            Regards,

                            Michael

                            Sure, I don't mind

                            • 11. Re: UTF-8 encoding in Stomp frames
                              jmesnil

                              OK, I'll remove all the code I added to provide JMS interop and add a JIRA issue for JMS/Stomp interop in a later release.

                               

                              I'll also rewrite the StompTest tests (they were mixing JMS and Stomp messages).

                              Once this and the frame decoder code is done, the task should be finished.

                              • 12. Re: UTF-8 encoding in Stomp frames
                                timfox
                                You can keep a test in there that validates that a STOMP message can be received as a JMS BytesMessage containing the UTF-8 encoded bytes, and that a JMS BytesMessage can be received as a STOMP message containing those bytes. That would be the default behaviour in the absence of any more complex STOMP<->JMS mapping.