3 Replies Latest reply on May 8, 2008 9:29 AM by dmlloyd

To UTF-8 or not to UTF-8 that is the question

timfox May 8, 2008 9:03 AM

JBM 1.4 uses UTF-8 encoding for all strings sent in messages, e.g. properties, text message bodies etc.

This provides good compression if you use higher Unicode characters a lot (e.g. Chinese), however the Java UTF-8 encoding is *really slow*.

For JBM 2.0 we're currently using the SimpleString class I wrote (which doesn't copy itself at the drop of a hat like String) and we marshal it as a simple sequence of bytes.

In my tests this is about 40 times faster than UTF-8 encoding the same string. :)
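To illustrate what "a simple sequence of bytes" could look like, here is a minimal sketch that writes each Java char as two bytes with no per-character branching. RawCharMarshal and its methods are hypothetical names for illustration only; this is not the actual SimpleString code, and it makes no claim about the 40x figure.

```java
import java.nio.ByteBuffer;

// Hypothetical helper for illustration; not the real SimpleString implementation.
final class RawCharMarshal {
    // Write every char as two bytes (big-endian, ByteBuffer's default order).
    static byte[] encode(String s) {
        ByteBuffer buf = ByteBuffer.allocate(s.length() * 2);
        for (int i = 0; i < s.length(); i++) {
            buf.putChar(s.charAt(i));
        }
        return buf.array();
    }

    // Read the bytes back two at a time into chars.
    static String decode(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        char[] chars = new char[data.length / 2];
        for (int i = 0; i < chars.length; i++) {
            chars[i] = buf.getChar();
        }
        return new String(chars);
    }
}
```

The fixed two-bytes-per-char layout is what makes this cheap: there is no decision to take per character, unlike UTF-8, where the byte count depends on the value being encoded.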

The problem is that SimpleString currently only stores each character as two bytes, which is fine for the vast majority of Unicode characters but won't encode the far reaches of Unicode, which require 4 bytes.

I can change SimpleString to use 4 bytes per character, but this is going to make the marshalled form big, especially for standard Latin or European characters: about 4 times the size of the UTF-8 encoded form!
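The size trade-off can be checked with a little arithmetic. The sketch below compares three layouts for the same string: UTF-8, two bytes per char, and four bytes per code point. EncodedSizes and its method names are made up for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical size calculator for illustration.
final class EncodedSizes {
    // Bytes needed when the string is encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes needed when every Java char is written as two bytes.
    static int twoBytesPerChar(String s) {
        return s.length() * 2;
    }

    // Bytes needed when every Unicode code point is written as four bytes.
    static int fourBytesPerCodePoint(String s) {
        return s.codePointCount(0, s.length()) * 4;
    }

    public static void main(String[] args) {
        String latin = "hello";
        String chinese = "\u4e2d\u6587"; // two CJK characters
        for (String s : new String[] {latin, chinese}) {
            System.out.println(utf8Bytes(s) + " / " + twoBytesPerChar(s)
                    + " / " + fourBytesPerCodePoint(s));
        }
        // prints "5 / 10 / 20" for latin and "6 / 4 / 8" for chinese
    }
}
```

For plain Latin text the four-byte layout is 4 times the UTF-8 size and twice the two-byte size. For the CJK sample, UTF-8 needs three bytes per character, so the two-byte layout is actually the smallest of the three.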

How do you think we should deal with this?

One possibility is we write our own UTF-like encoding implementation...

1. Re: To UTF-8 or not to UTF-8 that is the question

ataylor May 8, 2008 9:24 AM (in response to timfox)

Can we switch between using 2 or 4 bytes depending on the characters?
2. Re: To UTF-8 or not to UTF-8 that is the question

timfox May 8, 2008 9:26 AM (in response to timfox)

"ataylor" wrote:
can we switch between using 2 or 4 bytes depending on the characters

Yes, that is what I meant by my comment that we could write our own UTF-like encoding.

The issue is getting that fast.
3. Re: To UTF-8 or not to UTF-8 that is the question

dmlloyd May 8, 2008 9:29 AM (in response to timfox)

Just use UTF-16 encoding. You basically get this for free if you are just writing out chars.
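A small sketch of why this comes for free: a Java String is already a sequence of UTF-16 code units, so a character outside the two-byte range is held as a surrogate pair of two chars, and writing the chars out as-is yields UTF-16. Utf16Demo and rawChars are hypothetical names, and U+1D11E (the musical G clef) is just a convenient sample character.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class for illustration.
final class Utf16Demo {
    // Write each char as two big-endian bytes, with no special casing.
    static byte[] rawChars(String s) {
        ByteBuffer buf = ByteBuffer.allocate(s.length() * 2);
        for (int i = 0; i < s.length(); i++) {
            buf.putChar(s.charAt(i));
        }
        return buf.array();
    }

    public static void main(String[] args) {
        // U+1D11E lies beyond the range a single 16-bit char can hold.
        String clef = new String(Character.toChars(0x1D11E));

        System.out.println(clef.length());                          // prints 2
        System.out.println(clef.codePointCount(0, clef.length()));  // prints 1

        // For a well-formed string, the raw char bytes match Java's UTF-16BE encoding.
        byte[] viaCharset = clef.getBytes(StandardCharsets.UTF_16BE);
        System.out.println(Arrays.equals(rawChars(clef), viaCharset)); // prints true
    }
}
```

So the "2 or 4 bytes depending on the character" behaviour is already built into the char sequence: most characters take one char (2 bytes), and the rare ones take two chars (4 bytes), without a custom encoding step. One caveat: the comparison holds for well-formed strings; a string containing an unpaired surrogate would be written verbatim by the raw loop but replaced by the charset encoder.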