3 Replies Latest reply on Jan 18, 2007 1:50 AM by flarosa

    Unicode character issue - happens only on Linux

    flarosa

      Hi,

      I have a customer who regularly cuts text from Word documents and pastes it into forms I created for him on his web site. The text often contains non-ASCII characters such as U+2019 for curly single quotes or U+201C for curly double quotes. We were having some problems storing these characters in our database, so I added a filter that replaces them with the plain quotes from the ASCII set.

      I tested my work by deploying to a local copy of JBoss on my workstation, which is a Windows XP computer, and it worked fine. I did the conversion using the String.replace function, for example:

      s = s.replace('\u201C', '"');
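
A slightly fuller sketch of such a filter, covering the four curly quotes Word commonly produces. The class and method names (QuoteFilter, normalizeQuotes) are made up for illustration and are not from the original code:

```java
final class QuoteFilter {
    // Hypothetical helper: maps Word's curly quotes to plain ASCII quotes.
    static String normalizeQuotes(String s) {
        return s.replace('\u2018', '\'')   // left single quote
                .replace('\u2019', '\'')   // right single quote / apostrophe
                .replace('\u201C', '"')    // left double quote
                .replace('\u201D', '"');   // right double quote
    }

    public static void main(String[] args) {
        // Prints the sentence with plain ASCII quotes only.
        System.out.println(normalizeQuotes("\u201CIt\u2019s fine\u201D"));
    }
}
```

Note that a filter like this only helps if the string really contains those characters by the time it runs, that is, if the request bytes were decoded with the right encoding first.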

      However, when I deployed this to my production environment - which has the same version of Java and the same version of JBoss, but runs Linux - it failed. To see what was going on, I tried logging all the characters of the input string using s.codePointAt(). It turns out that instead of getting characters U+201C and U+2019, I'm getting U+FFFD (the Unicode replacement character) in both cases.
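
The symptom can be reproduced outside the server with a small test. This is only a sketch under an assumption the thread does not confirm: that the browser submitted the text as windows-1252 bytes and the server decoded them as UTF-8. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ReplacementCharDemo {
    // Formats every code point of s as U+XXXX, separated by spaces.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    // Encodes text with one charset, then decodes the bytes with another,
    // simulating a sender and receiver that disagree about the encoding.
    static String decodeAs(String text, Charset sent, Charset assumed) {
        return new String(text.getBytes(sent), assumed);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        String typed = "\u201C\u2019";
        // Mismatch: the single bytes 0x93 and 0x92 are malformed as UTF-8,
        // so each one is decoded to the replacement character U+FFFD.
        System.out.println(codePoints(decodeAs(typed, cp1252, StandardCharsets.UTF_8)));
        // Match: the original code points U+201C and U+2019 come back.
        System.out.println(codePoints(decodeAs(typed, cp1252, cp1252)));
    }
}
```

If that assumption holds, the replace() call never sees U+201C at all, because the information is already lost at decoding time.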

      Does anyone understand why this is happening? I have been working with Java for almost 7 years, and I have never encountered an inconsistency between its behavior on Linux and Windows before.

      Thanks,
      Frank

        • 1. Re: Unicode character issue - happens only on Linux
          flarosa

          I should add that the text is coming from a multipart form and I'm using the commons-fileupload library to parse the form.

          • 2. Re: Unicode character issue - happens only on Linux
            genman

            You might need to set the platform character set for Java to UTF-8 when you start the JVM (for example with -Dfile.encoding=UTF-8).

            • 3. Re: Unicode character issue - happens only on Linux
              flarosa

              Thanks - this was almost the solution. In fact, UTF-8 is the default encoding on my Linux server, and something called "Cp1252" - Java's name for the windows-1252 code page - is the default on Windows.

              Since the commons fileupload class has a method for decoding with a specified character encoding, I solved the problem by explicitly passing Cp1252 as the encoding method and now I'm getting the behavior I want on either system.
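
For anyone else hitting this, the idea boils down to never relying on the platform default when turning request bytes into a String. In commons-fileupload that is FileItem.getString(String encoding); the sketch below shows the same idea with plain JDK classes, where decodeField is a hypothetical stand-in and not part of any library:

```java
import java.nio.charset.Charset;

final class ExplicitDecoding {
    // Hypothetical stand-in for reading a form field with a named encoding:
    // the result does not depend on the JVM's default charset.
    static String decodeField(byte[] raw, String encoding) {
        return new String(raw, Charset.forName(encoding));
    }

    public static void main(String[] args) {
        // 0x93 and 0x94 are the curly double quotes in Cp1252.
        byte[] raw = { (byte) 0x93, 'h', 'i', (byte) 0x94 };
        String field = decodeField(raw, "Cp1252");
        System.out.println(field.codePointAt(0) == 0x201C); // true on any platform
        System.out.println("default charset here: " + Charset.defaultCharset());
    }
}
```

This only works when the encoding named really is the one the browser used, which in my case happens to be Cp1252.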

              I still haven't figured out how to do the same thing for regular (non-multipart) forms. There's a function on the Request object called setCharacterEncoding, but it had no effect when I tried it.

              Frank