1 Reply Latest reply on Nov 5, 2012 10:32 PM by bondchan921

    Character encoding changes in JBoss 5.1: UTF-8 vs ISO-8859-1 - how to handle?

    mkantor

      After upgrading from JBoss 4.* to 5.1, I notice that some pages containing characters outside the regular ASCII range break: the HTTP response from JBoss gets truncated mid-page. (The characters in question are read from a SQL database.)

       

       

      I believe this has to do with character encoding. I have found references to the following two settings, which do not fix the problem:

       

      1.  In <JBOSS-ROOT>server\default\deploy\jbossweb.sar\server.xml, setting URIEncoding="UTF-8" in <Connector> elements.

       

      2. In run.bat (or equivalent startup file), setting file encoding:   set "JAVA_OPTS=-Dfile.encoding=utf-8 %JAVA_OPTS%"
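      To check that the -Dfile.encoding flag from item 2 actually reached the JVM, a small check like the one below can be run with the same JAVA_OPTS. This is only a sketch; the class name DefaultCharsetCheck is made up for illustration. Note that it reports the JVM-wide default charset only, not the charset a servlet container picks for a particular response, so a correct value here does not by itself say anything about the Content-Type header.

```java
import java.nio.charset.Charset;

/** Hypothetical helper: reports the charset settings the JVM is running with. */
public class DefaultCharsetCheck {

    /** Returns the file.encoding system property and the JVM default charset. */
    public static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // Run with the same JAVA_OPTS as the server to compare the values.
        System.out.println(describe());
    }
}
```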

       

      After both these changes, HTTP responses continue to include the header Content-Type: text/html;charset=ISO-8859-1, and continue to be truncated when characters outside the ASCII range are included.

       

       

       

      I have the following solution which DOES solve the immediate problem, but I don't fully understand why, and am not confident in its correctness:

       

      I wrote a servlet filter that ensures the output of text/html content pages is UTF-8 encoded:

       

       

       

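      import java.io.IOException;
      import javax.servlet.Filter;
      import javax.servlet.FilterChain;
      import javax.servlet.ServletException;
      import javax.servlet.ServletRequest;
      import javax.servlet.ServletResponse;
      import javax.servlet.http.HttpServletResponse;

      public class EncodingFilter implements Filter {
         ...
      
          /**
          * Set the character encoding on the request and response, capture the response
          * in a wrapper, then copy the captured content to the real response on the return trip.
          */
          public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain) throws IOException, ServletException {
              
              // Must be set before request parameters are read and before resp.getWriter() is called
              req.setCharacterEncoding(encoding);
              resp.setCharacterEncoding(encoding);
              
              // Create a wrapper around the response, so we can intercept it and change it later
              HttpServletResponse httpResp = (HttpServletResponse) resp;
              CharResponseWrapper wrapper = new CharResponseWrapper(httpResp);
              
              // Now let the request go through other filters and the servlet
              chain.doFilter(req, wrapper);
              
              // getContentType() is null if nothing downstream set a content type,
              // so guard against that instead of calling substring(0,9) blindly
              String contentType = wrapper.getContentType();
              boolean isHtml = contentType != null && contentType.startsWith("text/html");
              
              // if content type is text/html and not already UTF-8, set character encoding
              if (isHtml && !contentType.contains("UTF-8")) {
                  // Set headers before obtaining the writer
                  httpResp.setHeader("Content-Type", "text/html;charset=" + encoding);
                  httpResp.setHeader("X-Wrapper-Encoding", "text/html;charset=" + wrapper.getCharacterEncoding());
                  
                  // Now transfer the content from the wrapper to the response
                  resp.getWriter().write(wrapper.toString());
              } else {
                  // Not text/html (or already UTF-8), so we just want the plain character stream
                  wrapper.writeToStream(resp.getWriter());
              }
          }
      }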
      

       

      The variable 'encoding' is read from web.xml and has the value "UTF-8". The CharResponseWrapper uses a CharArrayWriter to capture the response, and exposes writeToStream and toString methods.
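      For readers who want to see the capture mechanism in isolation, here is a minimal sketch of the buffering idea using only JDK classes. It is not the actual CharResponseWrapper: the class name CaptureBuffer is hypothetical, and the servlet-specific part (extending HttpServletResponseWrapper and overriding getWriter) is deliberately left out so the example runs without a container.

```java
import java.io.CharArrayWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.io.Writer;

/** Hypothetical stand-in for the buffering part of a response wrapper. */
public class CaptureBuffer {
    // Everything written through 'writer' is held in memory as characters, not bytes.
    private final CharArrayWriter buffer = new CharArrayWriter();
    private final PrintWriter writer = new PrintWriter(buffer);

    /** The writer a wrapped response would hand to the servlet. */
    public PrintWriter getWriter() {
        return writer;
    }

    /** Copies the captured characters to the real output. */
    public void writeToStream(Writer out) throws IOException {
        writer.flush();
        buffer.writeTo(out);
    }

    /** Returns the captured characters as a String. */
    @Override
    public String toString() {
        writer.flush();
        return buffer.toString();
    }

    public static void main(String[] args) throws IOException {
        CaptureBuffer capture = new CaptureBuffer();
        capture.getWriter().print("<p>caf\u00e9</p>");

        // StringWriter plays the role of the real response writer here.
        StringWriter real = new StringWriter();
        capture.writeToStream(real);
        System.out.println(real.toString().equals(capture.toString()));
    }
}
```

      Because the buffer holds characters rather than bytes, no charset is applied until the content is copied to the real response writer, which is where the encoding set on the response comes into play.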

       

      As I said, this solves the immediate problem. The responses now say content type is "text/html;charset=UTF-8", and the pages are not truncated. Randomly sampled characters in the 128-255 range (beyond ASCII, which ends at 127) appear correctly in the browser.  What I don't understand is:

       

      1. What could be choking on the non-ASCII characters?

      2. Why does changing the encoding help?

      3. What changed about JBoss between 4.* and 5.1 to cause this problem?

      4. Is the filter a correct solution to the problem?

      5.  Is there a better solution?
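      As background for question 2, a small experiment outside the container shows how the two charsets differ. This is a sketch with a made-up class name (CharsetComparison) and made-up sample strings; it does not show where the truncation happens inside JBoss. It only illustrates two facts: the same text has a different byte length in each charset, and characters that ISO-8859-1 cannot represent are replaced with '?' when encoded with String.getBytes.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical demo: the same text encoded as ISO-8859-1 and as UTF-8. */
public class CharsetComparison {

    /** Encodes with ISO-8859-1; unmappable characters become '?'. */
    public static byte[] asLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Encodes with UTF-8, which can represent any Unicode character. */
    public static byte[] asUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String accented = "caf\u00e9"; // e-acute exists in ISO-8859-1
        String euro = "10 \u20ac";     // the euro sign does not

        // One byte per character in ISO-8859-1, two bytes for e-acute in UTF-8.
        System.out.println(asLatin1(accented).length); // 4
        System.out.println(asUtf8(accented).length);   // 5

        // The euro sign is lost in ISO-8859-1 but survives in UTF-8 (three bytes).
        System.out.println(new String(asLatin1(euro), StandardCharsets.ISO_8859_1)); // 10 ?
        System.out.println(asUtf8(euro).length);       // 6
    }
}
```

      Any component that computes a byte count under one charset while the body is written under another will disagree about the response length. Whether that is what happens here is an open question, not something this example proves.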

       

      Any help is much appreciated.