3 Replies Latest reply on May 8, 2008 9:29 AM by dmlloyd

    To UTF-8 or not to UTF-8 that is the question

    timfox

      JBM 1.4 uses UTF-8 encoding for all strings sent in messages, e.g. properties, text message bodies etc.

      This is compact when the text is mostly ASCII (higher Unicode characters, e.g. Chinese, take three bytes each in UTF-8), however the Java UTF-8 encoding is *really slow*.

      For JBM 2.0 we're currently using the SimpleString class I wrote (which doesn't copy itself at the drop of a hat like String) and we marshal it as a simple sequence of bytes.

      In my tests this is about 40 times faster than UTF-8 encoding the same string. :)
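      To make the comparison concrete, here is a minimal sketch of marshalling characters as a plain sequence of bytes, two per char. It is not the actual SimpleString code; the class name RawCharMarshalSketch and its methods are made up for illustration, and the high-byte-first order is an assumption.

```java
// Hypothetical sketch, not the JBM SimpleString implementation.
class RawCharMarshalSketch {
    // Write each 16-bit char as two bytes, high byte first (assumed order).
    static byte[] marshalRaw(String s) {
        byte[] out = new byte[s.length() * 2];
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            out[2 * i] = (byte) (c >> 8);
            out[2 * i + 1] = (byte) c;
        }
        return out;
    }

    // Rebuild the chars from consecutive byte pairs.
    static String unmarshalRaw(byte[] data) {
        char[] chars = new char[data.length / 2];
        for (int i = 0; i < chars.length; i++) {
            int hi = data[2 * i] & 0xFF;
            int lo = data[2 * i + 1] & 0xFF;
            chars[i] = (char) ((hi << 8) | lo);
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        byte[] wire = marshalRaw("hello");
        System.out.println(wire.length);          // prints 10
        System.out.println(unmarshalRaw(wire));   // prints hello
    }
}
```

      There is no per-character branching here, just shifts and array stores, which is the kind of work that tends to be cheaper than a variable-length encoder.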

      The problem is SimpleString currently only stores each character as two bytes, which is fine for the vast majority of Unicode characters but won't encode the far reaches of Unicode which require 4 bytes.

      I can change SimpleString to use 4 bytes per character but this is going to make the marshalled form big - especially in the case of standard Latin or European characters - about 4 times the size of the UTF-8 encoded form!
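      The size trade-off can be checked with a few lines of arithmetic. This is only an illustration; EncodedSizeSketch and its method names are hypothetical, and the fixed-width sizes are computed rather than produced by any real marshaller.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper comparing marshalled sizes of the three layouts.
class EncodedSizeSketch {
    // Size when encoded with the JDK's UTF-8 encoder.
    static int utf8Size(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Size with a fixed two bytes per Java char.
    static int twoBytesPerCharSize(String s) {
        return 2 * s.length();
    }

    // Size with a fixed four bytes per Unicode code point.
    static int fourBytesPerCodePointSize(String s) {
        return 4 * s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String latin = "hello";
        String chinese = "\u4e2d\u6587"; // two Chinese characters
        System.out.println(utf8Size(latin));                  // prints 5
        System.out.println(twoBytesPerCharSize(latin));       // prints 10
        System.out.println(fourBytesPerCodePointSize(latin)); // prints 20
        System.out.println(utf8Size(chinese));                // prints 6
        System.out.println(twoBytesPerCharSize(chinese));     // prints 4
        System.out.println(fourBytesPerCodePointSize(chinese)); // prints 8
    }
}
```

      For plain Latin text the 4-byte form is four times the UTF-8 size, while for Chinese text the 2-byte form is actually smaller than UTF-8.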

      How do you think we should deal with this?

      One possibility is we write our own UTF-like encoding implementation...
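      Just to show the shape such a thing might take, here is one possible sketch: a variable-length encoding that stores each code point in 7-bit groups with a continuation bit. This is an assumption for discussion, not something JBM does, and VarintStringSketch is a made-up name. ASCII takes one byte and every Unicode code point fits in at most three.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical UTF-like encoding: 7 data bits per byte, top bit set
// when more bytes follow for the same code point (low group first).
class VarintStringSketch {
    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            while (cp >= 0x80) {
                out.write((cp & 0x7F) | 0x80); // more groups follow
                cp >>>= 7;
            }
            out.write(cp); // final group, top bit clear
        }
        return out.toByteArray();
    }

    // Assumes well-formed input produced by encode(); malformed data
    // may make appendCodePoint throw IllegalArgumentException.
    static String decode(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length);
        int cp = 0;
        int shift = 0;
        for (byte b : data) {
            cp |= (b & 0x7F) << shift;
            if ((b & 0x80) != 0) {
                shift += 7;
            } else {
                sb.appendCodePoint(cp);
                cp = 0;
                shift = 0;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("hello").length);        // prints 5
        System.out.println(encode("\u4e2d\u6587").length); // prints 6
        System.out.println(decode(encode("hello")));       // prints hello
    }
}
```

      Whether something like this would actually beat the JDK's UTF-8 encoder would need measuring; the sketch only shows that a compact form covering all of Unicode is possible without 4 bytes per character.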