12 Replies Latest reply on Feb 18, 2010 11:26 AM by timfox

UTF-8 encoding in Stomp frames

mjustin Feb 16, 2010 10:04 AM

Sending and receiving of messages with Unicode characters (UTF-8 encoded) gives some unexpected results.

(Later I will try a Java JMS client to consume the message, to find out whether the problem is in the decoding or in the encoding on the server side.)

For five of the 12 languages, the received content does not match the sent content, for example:

expected: <ગુજરાતી> but was: <ગ�?જરાતી>

The test data uses these language names:

am = 'አማርኛ';
ar = 'العربية';
gu = 'ગુજરાતી'; // fails
he = 'עברית';
hi = 'हिन्दी'; // fails
kn = 'ಕನ್ನಡ'; // fails
mr = 'मराठी';
ja = '日本語';
ta = 'தமிழ்'; // fails
te = 'తెలుగు'; // fails
ur = 'اردو';
zh = '中文';

1. Re: UTF-8 encoding in Stomp frames

jmesnil Feb 16, 2010 10:33 AM (in response to mjustin)

You're right, it was a bug: if the frame body was a String, it was not properly encoded as UTF-8.
I've just fixed it in the trunk (r8882).

thanks for the heads up
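
To illustrate the kind of fix described above, here is a minimal sketch of encoding and decoding a text body with an explicit UTF-8 charset rather than the platform default. The class and method names (Utf8RoundTrip, encodeBody, decodeBody) are hypothetical and are not taken from the HornetQ source; only the standard JDK calls are real.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not HornetQ code: shows an explicit UTF-8 round trip.
class Utf8RoundTrip {
    // Encode the text body with an explicit charset, never the platform default.
    static byte[] encodeBody(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decode the wire bytes with the same explicit charset.
    static String decodeBody(byte[] body) {
        return new String(body, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The 'gu' test string from the original post, written with escapes
        // so the example does not depend on the source file encoding.
        String gu = "\u0A97\u0AC1\u0A9C\u0AB0\u0ABE\u0AA4\u0AC0";
        byte[] wire = encodeBody(gu);
        System.out.println("bytes on the wire: " + wire.length);
        System.out.println("round trip ok: " + decodeBody(wire).equals(gu));
    }
}
```

Both sides have to agree on the charset; the sketch only shows that the round trip is lossless when they do.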
2. Re: UTF-8 encoding in Stomp frames

timfox Feb 16, 2010 10:46 AM (in response to jmesnil)

Actually the fix is not right.

If a core message is of type text, then the data won't necessarily be encoded as UTF-8.

The encoding is defined in ChannelBufferWrapper::readStringInternal; depending on the string's length, it is encoded in different ways for optimal performance.

I think the problem here is you are trying to mix and match STOMP and core messages without defining any proper mapping, and assuming a text message sent by core will be available as a STOMP text message.

Also I can't see anywhere in the STOMP protocol definition where it says the text is encoded as UTF-8 on the wire; for all we know it might just be ASCII, or some other encoding.
3. Re: UTF-8 encoding in Stomp frames

jmesnil Feb 16, 2010 10:49 AM (in response to timfox)

timfox wrote:

Also I can't see anywhere in the STOMP protocol definition where it says the text is encoded as UTF-8 on the wire; for all we know it might just be ASCII, or some other encoding.
It's implied that UTF-8 is the default encoding: http://activemq.apache.org/stomp/stomp10/additional.html#character_encoding
4. Re: UTF-8 encoding in Stomp frames

timfox Feb 16, 2010 10:55 AM (in response to jmesnil)

jmesnil wrote:

timfox wrote:

Also I can't see anywhere in the STOMP protocol definition where it says the text is encoded as UTF-8 on the wire; for all we know it might just be ASCII, or some other encoding.
It's implied that UTF-8 is the default encoding: http://activemq.apache.org/stomp/stomp10/additional.html#character_encoding
OK, so that's what the ActiveMQ guys have assumed, as it's omitted from the spec, so let's go with that.

However, using TEXT_TYPE is incorrect - this will fail if you try to consume a message that has been sent by a core client with a size < 9 or > 0xfff bytes.

You need to define your own types for STOMP messages, not hijack the core types.

Also, this seems very convoluted:

byte[] content = frame.getContent();
if (type == Message.TEXT_TYPE)
{
   message.getBodyBuffer().writeNullableSimpleString(SimpleString.toSimpleString(new String(content)));
}

Why not write the content directly in the buffer?

message.getBodyBuffer().writeBytes(content) ?
5. Re: UTF-8 encoding in Stomp frames

jmesnil Feb 17, 2010 8:57 AM (in response to timfox)

I don't understand how it is different from JMS HornetQTextMessage "hijacking" the core TEXT_TYPE.

The idea was to provide interoperability between Stomp messages and our Core/JMS messages:
- if the Stomp message has no content-length, treat its body as a String => convert it to a TEXT_TYPE core message so that we can consume it as a JMS TextMessage
- else treat it as a BYTES_TYPE, so we can consume it as a JMS BytesMessage

Am I missing a more obvious way to do this?
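
The rule in the two bullets above can be sketched as a small decision function. This is only an illustration of the proposed mapping, not the HornetQ implementation: the class StompBodyMapping, the enum BodyKind and the method classify are hypothetical names, and the frame headers are assumed to be available as a plain Map.

```java
import java.util.Map;

// Hypothetical sketch of the proposed Stomp -> core/JMS mapping rule.
class StompBodyMapping {
    enum BodyKind { TEXT, BYTES }

    // No content-length header: treat the body as text (JMS TextMessage side).
    // content-length present: treat the body as raw bytes (JMS BytesMessage side).
    static BodyKind classify(Map<String, String> headers) {
        return headers.containsKey("content-length") ? BodyKind.BYTES : BodyKind.TEXT;
    }

    public static void main(String[] args) {
        System.out.println(classify(Map.of("destination", "queue.a")));
        System.out.println(classify(Map.of("destination", "queue.a", "content-length", "5")));
    }
}
```

Note that under this rule the content-length header carries two meanings at once (body size and body type), which is part of what the later replies discuss.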
6. Re: UTF-8 encoding in Stomp frames

mjustin Feb 17, 2010 10:24 AM (in response to jmesnil)

Hello Jeff,

Many thanks for the information. The last change causes problems with other languages (for example, expected: <አማርኛ> but was: <áŠ áˆ›áˆáŠ›> for the test string am = 'አማርኛ'). It is not a high priority for me at the moment, and I see it is work in progress.

Regards,
Michael
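
The garbled output reported above has the typical shape of UTF-8 bytes being decoded with a single-byte charset: each 3-byte character turns into three unrelated characters, the first of which is 'á' (byte 0xE1). The sketch below reproduces that effect with ISO-8859-1. Which charset was actually involved on the server or client is not stated in the thread (the reported output looks closer to windows-1252), so treat the choice of ISO-8859-1 as an assumption; MojibakeDemo and decodeAs are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: what happens when UTF-8 bytes are decoded with the wrong charset.
class MojibakeDemo {
    static String decodeAs(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // The 'am' test string from the thread, as escapes.
        String am = "\u12A0\u121B\u122D\u129B";
        byte[] utf8 = am.getBytes(StandardCharsets.UTF_8);

        String right = decodeAs(utf8, StandardCharsets.UTF_8);
        String wrong = decodeAs(utf8, StandardCharsets.ISO_8859_1); // assumed mismatch

        System.out.println("utf-8 decode matches:   " + right.equals(am));
        System.out.println("latin-1 decode matches: " + wrong.equals(am));
        System.out.println("chars after wrong decode: " + wrong.length());
    }
}
```

A check like this can help tell an encoding bug (wrong bytes on the wire) from a decoding bug (right bytes, wrong charset on read).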
7. Re: UTF-8 encoding in Stomp frames

jmesnil Feb 18, 2010 9:56 AM (in response to timfox)

Also, this seems very convoluted:

byte[] content = frame.getContent();
if (type == Message.TEXT_TYPE)
{
   message.getBodyBuffer().writeNullableSimpleString(SimpleString.toSimpleString(new String(content)));
}

Why not write the content directly in the buffer?

message.getBodyBuffer().writeBytes(content) ?
This is for interoperability with JMS.

If I wrote the content directly, the message would not be readable as a JMS TextMessage (which expects a nullable string from its body buffer).
Either we keep this, or we remove all this code and tell our users that they must use only JMS BytesMessage if they want to interact with messages sent/consumed by Stomp (not very friendly for a text-oriented protocol).

I'd prefer to be able to use by default JMS TextMessage to interoperate with Stomp messages.
wdyt?
8. Re: UTF-8 encoding in Stomp frames

timfox Feb 18, 2010 10:30 AM (in response to jmesnil)

jmesnil wrote:

I don't understand how it is different from JMS HornetQTextMessage "hijacking" the core TEXT_TYPE.

The idea was to provide interoperability between Stomp messages and our Core/JMS messages:
- if the Stomp message has no content-length, treat its body as a String => convert it to a TEXT_TYPE core message so that we can consume it as a JMS TextMessage
- else treat it as a BYTES_TYPE, so we can consume it as a JMS BytesMessage

Am I missing a more obvious way to do this?
Even if we were to provide some automatic transformation between JMS text messages and STOMP messages, which is not required to implement the STOMP protocol, the way you have done it wouldn't work anyway. Strings are encoded in core messages in a more complex way than a NullableSimpleString, as I mentioned in a previous post.

Let's get the basic STOMP protocol implemented first and we can think about implementing "extras" like mappings between stomp and jms later.
9. Re: UTF-8 encoding in Stomp frames

mjustin Feb 18, 2010 11:00 AM (in response to timfox)

Hi Tim,

I agree here. A Stomp-JMS mapping is nice to have, but clients can also set a user-defined property like 'content-type' to detect text or binary messages. In this case, the Stomp frame could always contain the content-length header, which would then no longer indicate the content type. Such a basic protocol implementation would be fine for most use cases. So far I am very impressed by the Stomp transport and the ease of use of HornetQ.

btw I would like to post a short announcement for my (commercial) Delphi and Free Pascal client library for HornetQ, would this be allowed in this forum?

Regards,
Michael
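
The 'content-type' suggestion above could look roughly like the following on the client side: every SEND frame carries a content-length (counted in bytes, not characters) and a user-defined content-type header, and the text body is always sent as UTF-8 bytes. This is a sketch under assumptions: StompSendFrame and build are hypothetical names, and the destination and content-type values are made-up examples, not values defined by HornetQ or by the STOMP specification.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical client-side sketch: SEND frame with explicit length and type headers.
class StompSendFrame {
    static byte[] build(String destination, String contentType, byte[] body) {
        String head = "SEND\n"
                + "destination:" + destination + "\n"
                + "content-type:" + contentType + "\n"      // user-defined hint for consumers
                + "content-length:" + body.length + "\n"    // length in bytes
                + "\n";                                      // blank line ends the headers
        byte[] headBytes = head.getBytes(StandardCharsets.UTF_8);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(headBytes, 0, headBytes.length);
        out.write(body, 0, body.length);
        out.write(0); // NUL byte terminates the frame
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] body = "\u4E2D\u6587".getBytes(StandardCharsets.UTF_8); // 'zh' test string
        byte[] frame = build("example.queue", "text/plain;charset=utf-8", body);
        System.out.println("frame size in bytes: " + frame.length);
    }
}
```

With this layout a consumer reads exactly content-length bytes and uses content-type to decide whether to decode them as UTF-8 text.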
10. Re: UTF-8 encoding in Stomp frames

timfox Feb 18, 2010 11:16 AM (in response to mjustin)

mjustin wrote:

btw I would like to post a short announcement for my (commercial) Delphi and Free Pascal client library for HornetQ, would this be allowed in this forum?

Regards,
Michael
Sure, I don't mind
11. Re: UTF-8 encoding in Stomp frames

jmesnil Feb 18, 2010 11:19 AM (in response to mjustin)

OK, I'll remove all the code I added to provide JMS interop and add a JIRA issue for JMS/Stomp interop in a later release.

I'll also rewrite the StompTest tests (they were using a mix of JMS and Stomp messages).
Once this and the frame decoder code is done, the task should be finished.
12. Re: UTF-8 encoding in Stomp frames

timfox Feb 18, 2010 11:26 AM (in response to jmesnil)

You can keep a test in there that validates that a STOMP message can be received as a JMS BytesMessage containing the UTF-8 encoded bytes, and that a JMS BytesMessage can be received as a STOMP message containing those bytes, which would be the default behaviour in the absence of any more complex stomp<->jms mapping.
