2 Replies Latest reply on Feb 12, 2013 9:44 AM by kattaw

Escape Characters Being Unescaped in UTF-8 to ISO-8859-1 conversion

kattaw Feb 7, 2013 4:26 PM

In short, some XML special characters are being unescaped when some CXF code alters the encoding from UTF-8 to ISO-8859-1 in the process of returning a SOAP response. We need to maintain a UTF-8 encoding.

We’re seeing this issue in the context of a JAX-WS web service, and we’re using JBoss 6. The CXF version we are using is 2.5.1, and it is being pulled in as a Maven dependency. Unfortunately, we do not have the option of moving to JBoss 7, so we’d really like to find a solution with our current version.

We are attempting to send an XML document in a SOAP response as an MTOM attachment, and it is a business requirement that the UTF-8 encoding of the document be maintained through the process and returned unchanged. The desired encoding is set in the XML header (<?xml version="1.0" encoding="UTF-8"?>), and we have attempted to add the specified charset to the @XmlMimeType annotation on the field in question, with no change in behavior.

While debugging through the CXF code, we noticed that a second message response variable is maintained, separate from the one we constructed. This second variable contains exactly our response, with one change: its encoding is set to ISO-8859-1. At some point before we actually return the response, our originally constructed message (UTF-8) is replaced with this newly created and wrongly encoded message (ISO-8859-1). In our web.xml file, however, we have created an encodingFilter (of class org.springframework.web.filter.CharacterEncodingFilter) which sets the encoding to UTF-8. Our response passes through this filter before being returned, so while we do technically return something UTF-8 encoded, the message has already been altered during the UTF-8 to ISO-8859-1 conversion. During that conversion, some of the XML special characters become unescaped.

Specifically, & gt; becomes >, & quot; becomes ", and & apos; becomes '. The other two special XML characters (& lt; and & amp;) remain in their original form, which we consider correct.
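As a quick sanity check on where the change could come from, the sketch below transcodes a sample string from UTF-8 to ISO-8859-1 using only the JDK. The names TranscodeCheck and utf8ToLatin1 are made up for illustration and are not part of CXF. Every character of an escape sequence is plain ASCII, so a pure charset conversion leaves those bytes identical. That suggests, but does not prove, that the unescaping happens in a parse-and-reserialize step rather than in the transcoding itself.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of CXF or JBossWS.
final class TranscodeCheck {

    /** Decodes UTF-8 bytes and re-encodes the same text as ISO-8859-1. */
    static byte[] utf8ToLatin1(byte[] utf8) {
        String text = new String(utf8, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Escape sequences written without the spaces used in the post.
        String xml = "<root> hello &gt; &quot;world&quot; &apos;x&apos; </root>";
        byte[] utf8 = xml.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = utf8ToLatin1(utf8);

        // ASCII-only content has the same byte representation in both charsets.
        System.out.println(Arrays.equals(utf8, latin1)); // prints true
    }
}
```

If this holds in your environment as well, the place to look is whatever parses the attachment content and writes it back out, not the charset setting alone.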

We also tried using the cxf-api jar in JBoss's common/lib folder instead of getting it from Maven, but the result was unchanged.

In short, we are looking for some way to entirely avoid the UTF-8 to ISO-8859-1 conversion that occurs (potentially) in the InvokerJSE.invoke method. Is there perhaps a configuration we’re missing? All of our searches for one thus far have been fruitless. Any input would be very helpful.

Example:

Original Message: <root> hello & gt; world </root>

Expected Output: <root> hello & gt; world </root>

Actual Output: <root> hello > world </root>
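The example above is consistent with what any XML parser does to character data, which may or may not be what happens inside CXF here. The following minimal sketch uses only the JDK's DOM parser; the class ParseLosesEscapes and the method textOf are hypothetical names. It shows that once the document has been parsed, the escaped and the literal greater-than sign are indistinguishable, so a later serializer has no way to reproduce the original spelling.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; it only illustrates generic XML parser behavior.
final class ParseLosesEscapes {

    /** Parses the XML string and returns the text content of the root element. */
    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        // Escape sequence written without the spaces used in the post.
        String fromEscaped = textOf("<root> hello &gt; world </root>");
        String fromLiteral = textOf("<root> hello > world </root>");

        // Both inputs yield the same in-memory text, so the original form is lost.
        System.out.println(fromEscaped.equals(fromLiteral)); // prints true
    }
}
```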

PLEASE NOTE: Throughout this post, I have spread out the escape sequences to prevent them from being unescaped when this question is posted. Ordinarily, all characters of the escape sequence are together without spaces.

1. Re: Escape Characters Being Unescaped in UTF-8 to ISO-8859-1 conversion

asoldano Feb 12, 2013 3:36 AM (in response to kattaw)

Did you perhaps see https://community.jboss.org/wiki/AS6JBossWSCXFAndMTOM ? Is that the same scenario as yours (or similar to yours)?
2. Re: Escape Characters Being Unescaped in UTF-8 to ISO-8859-1 conversion

kattaw Feb 12, 2013 9:44 AM (in response to asoldano)

I hadn't seen this post, so thank you for your response! Unfortunately, the author of that post appears to be experiencing the same problem we are, only it's alright for his purposes. In his example, the tag
"<some> > stuff < </some>"
becomes
"<some> & gt; stuff & lt; </some>"
in the SOAP response, and that is precisely what we're trying to avoid. We need the output to be exactly identical to the input, regardless of whether the input is literally a greater-than sign or the escaped version. For example, if the input is an ampersand, the output needs to be an ampersand. If the input is "& amp;", the output needs to be "& amp;".
Thanks again, though.
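One direction that would meet the "exactly identical" requirement in principle is to carry the document as opaque bytes, so that nothing between sender and receiver interprets it as XML. This has not been confirmed in this thread to work with CXF 2.5.1 on JBoss 6. The sketch below is an assumption-laden illustration using only the JDK: OpaquePayload, wrap, and unwrap are hypothetical names, and Base64 stands in for whatever binary transport the service ends up using.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper; shows a byte-for-byte round trip, not a CXF configuration.
final class OpaquePayload {

    /** Encodes the document's UTF-8 bytes as Base64 text. */
    static String wrap(String xml) {
        return Base64.getEncoder().encodeToString(xml.getBytes(StandardCharsets.UTF_8));
    }

    /** Decodes Base64 text back into the original document string. */
    static String unwrap(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Escape sequences written without the spaces used in the post.
        String original = "<some> &gt; stuff > &amp; &quot;x&quot; </some>";
        String restored = unwrap(wrap(original));

        // Escaped and literal characters both survive because nothing parsed them.
        System.out.println(original.equals(restored)); // prints true
    }
}
```

The trade-off is that the receiver gets bytes it must parse itself, and the payload is no longer visible as XML inside the SOAP message.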