-
1. Re: Problem in retrieving WSDL from remote endpoint
vitaliylu Mar 24, 2010 4:16 AM (in response to vitaliylu)
Hi all!
I would like to know whether my solution is right.
In the HttpGatewayServlet class:
protected void service(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
    // if it's a wsdl request, serve up the contract then short-circuit
    if ("wsdl".equalsIgnoreCase(req.getQueryString())) {
        resp.setCharacterEncoding("UTF-8");
        Charset charset = Charset.forName("UTF-8"); // added
        CharsetEncoder chr = charset.newEncoder(); // added
        String mimeType = (contract != null ? contract.getMimeType() : "text/xml");
        resp.setContentType(mimeType);
        String data = (contract != null ? contract.getData() : "<definitions/>");
        ByteBuffer bbuf = chr.encode(CharBuffer.wrap(data)); // added
        // This is the problem spot: I changed the length to the encoded byte count.
        // remaining() is used because capacity() can exceed the bytes actually written.
        resp.setContentLength(bbuf.remaining());
        Writer writer = new BufferedWriter(resp.getWriter());
        writer.write(data);
        writer.flush();
        return;
    }
-
2. Re: Problem in retrieving WSDL from remote endpoint
dward Apr 13, 2010 5:07 PM (in response to vitaliylu)
Vitaliy,
- Thank you for bringing this to our attention.
- In the future, please use the JBoss ESB user forum, not the developer forum.
- I have opened up a Jira item to get this fixed: JBESB-3279.
Warm regards,
David
-
3. Re: Problem in retrieving WSDL from remote endpoint
dward Apr 27, 2010 5:12 PM (in response to dward)
The problem is more pervasive than I realized, but I do have a fix. There are a couple issues that need to be discussed, however, so in the end it is a good thing that this is in the developer forum.
First, the fix:
- StreamUtils.readStreamString(stream, charset):String was incorrectly implemented. It just read in the bytes and created a new String object with the bytes and the specified charset. This will not convert the bytes into the specified charset. So, if the bytes you are reading in are character encoded with KOI8-R instead of UTF-8, for example, then when you output that String, the characters will be garbled. (For example, in Firefox, they will become question marks.) Unfortunately, there is nothing in the JDK that can determine what character encoding a set of bytes or stream is, and this is paramount to creating a String that can be outputted correctly. There is InputStreamReader, but you have to know the encoding of the stream first. Luckily, there is a utility out there we can use: ICU4J. What I did to fix the method was call upon a helper method in CharsetDetector.
- Next, the HttpGatewayServlet (which is sometimes used for exposing WSDL via <http-gateway/>, for example, for SOAPProxy) had to be changed to get the now UTF-8 encoded bytes, and output that, making sure to set the Content-Length response header to the length of the byte array, not the length of the String: http://mark.koli.ch/2009/09/remember-kids-an-http-content-length-is-the-number-of-bytes-not-the-number-of-characters.html
- Finally, the contract.jsp page (which is sometimes used for exposing WSDL via <jbr-provider/>) had to be changed to declare <%@page contentType="text/xml; charset=UTF-8"%> at the top of the page, so that <%=contractData%> would be output correctly. See notes on the JSP page directive and its effect here: http://www.w3.org/International/O-HTTP-charset
After making the changes above, I tested successfully using both JDK 5 + ESB in AS 4, and JDK 6 + ESB in AS 5. I also ran a clean integration build.
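The Content-Length point in the fix above can be illustrated with a minimal sketch. It assumes the WSDL is sent as UTF-8; the class and method names are hypothetical and are not taken from the ESB code base.

```java
import java.nio.charset.StandardCharsets;

class ContentLengthSketch {
    /** Encodes the body as UTF-8; the array length is the value Content-Length must carry. */
    static byte[] toUtf8(String body) {
        return body.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Cyrillic text written with escapes so the source file encoding does not matter.
        String body = "<doc>\u041f\u0440\u0438\u0432\u0435\u0442</doc>";
        byte[] bytes = toUtf8(body);
        // Each Cyrillic letter takes two bytes in UTF-8, so the counts differ.
        System.out.println("chars=" + body.length() + " bytes=" + bytes.length); // chars=17 bytes=23
    }
}
```

In a servlet, the same byte array would presumably be passed to resp.setContentLength(bytes.length) and then written through resp.getOutputStream(), so that the declared length and the bytes on the wire always agree.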
Next, the two issues that have to be discussed:
- Are we allowed to add icu4j-4_4.jar and icu4j-charsets-4_4.jar as project dependencies, and ship them with both our community ESB and supported SOA Platform? I can't seem to find out what the ICU4J license is...
- With the implementation of StreamUtils.readStreamString(stream, charset):String corrected, any code that calls that method will now be guaranteed correct character encoding. However, any code which calls the underlying StreamUtils.readStream(stream):byte[] method, and decides to create its own String from the returned byte array (instead of just using readStreamString), could have a misinterpreted character encoding problem. This is all over the place in our code, and I'm a bit hesitant to go change all of it. Maybe for the purpose of this bug, we just stick to fixing the WSDL encoding problem? Luckily, this problem only rears its head if the stream being read contains an extended character set and not UTF-8 (like KOI8-R or cp1251, for example).
Feedback appreciated. Thanks!
-
4. Re: Problem in retrieving WSDL from remote endpoint
dward Apr 27, 2010 5:34 PM (in response to dward)
Okay, duh. The license information was staring right at me on their homepage:
"ICU is released under a nonrestrictive open source license that is suitable for use with both commercial software and with other open source or free software"
Are we okay with including it in our stuff?
-
5. Re: Problem in retrieving WSDL from remote endpoint
kcbabo Apr 28, 2010 9:48 AM (in response to dward)
Couple of dumb questions:
I looked at the example WSDL and the encoding declaration in the XML is UTF-8, not KOI8-R. I'm not an expert on KOI8-R, but a little googling leads me to believe that it's not UTF-8 compatible (natch, based on your findings). Is there a reason why KOI8-R is not declared as the encoding for the document? You can use any XML parser to suck up the content in that case.
When retrieving the WSDL via HTTP, is the Content-Type field providing a charset parameter? If so, you can use that to decode appropriately.
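A minimal sketch of reading the charset parameter from a Content-Type value, assuming the header string has already been obtained from the HTTP response. The class and method names are hypothetical, and the parsing is deliberately simple rather than a full implementation of the header grammar.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class ContentTypeCharset {
    /**
     * Returns the charset named in a Content-Type value such as
     * "text/xml; charset=ISO-8859-1", or the fallback when none is usable.
     */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        // Parameters follow the media type, separated by semicolons.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim();
                // The value may be quoted: charset="utf-8"
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset cs = charsetOf("text/xml; charset=ISO-8859-1", StandardCharsets.UTF_8);
        System.out.println(cs.name()); // ISO-8859-1
    }
}
```

This only helps when the header is present and truthful; a server that sends the wrong charset parameter will still produce garbled text.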
cheers,
`k.
-
6. Re: Problem in retrieving WSDL from remote endpoint
mageshbk Apr 28, 2010 9:56 AM (in response to kcbabo)
> StreamUtils.readStreamString(stream, charset):String was incorrectly implemented.
Well it is correctly implemented, it is meant to read bytes to String using the passed encoding and not for conversion.
> Unfortunately, there is nothing in the JDK that can determine what character encoding a set of bytes or stream is, and this is paramount to creating a String that can be outputted correctly.
Yes, that is why most applications do not try to guess the encoding themselves. Guessing or intelligently identifying the encoding could pose problems, e.g.,
"The user application may be expecting only the specified encoding, whereas due to intelligent scanning we might read all encoded streams"
Will this lead to a security issue? I do not know at the moment.
> Maybe for the purpose of this bug, we just stick to fixing the WSDL encoding problem?
I would do this. A setting in the SOAPProxy action like this would be sufficient for downloading and converting the file.
<action name="proxy" ..> <property name="wsdl-encoding" value="ISO-8859-1" /> </action>
But the major setback is this[1] as the BasicProfile mandates the WSDL and SOAP be only UTF-8 or UTF-16.
One more thing is that the attached ItemService.wsdl is actually an "ISO-8859-1" encoded file that has merely been declared as "UTF-8".
This type of WSDL should not be allowed according to BP!
[1] http://www.ws-i.org/Profiles/BasicProfile-1.0-2004-04-16.html#refinement16527096
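One way to catch a document that declares UTF-8 but is really in another encoding is to decode it strictly and reject malformed input. The following is a minimal sketch using only the JDK; the class and method names are hypothetical, and it is not how the ESB currently behaves.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {
    /** Returns true only if the bytes form well-formed UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes the decoder throw instead of substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "cafe" with an acute accent on the e
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(isValidUtf8(latin1)); // false
        System.out.println(isValidUtf8(utf8));   // true
    }
}
```

The check is one-directional: a pure ASCII file passes regardless of what it was saved as, and some byte sequences from other encodings can happen to be valid UTF-8, so a "true" result is not proof that the declaration is right.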
-
7. Re: Problem in retrieving WSDL from remote endpoint
dward Apr 28, 2010 10:02 AM (in response to kcbabo)
Keith,
Yes, the original example WSDL declares UTF-8, incorrectly. However, my mention of KOI8-R above is not directly related to the original example WSDL. A fellow Red-Hatter sent me various other Russian WSDL examples, with encodings in CP-1251, KOI8-R and UTF-8, and that's what I've been testing with and got working correctly by reading the byte stream into a String using ICU4J. My change also handles the situation of the original example WSDL, incorrect declaration and all.
There is shared code for retrieving the WSDL, so I don't want to do something different for http:// (looking for the Content-Type field) than I do for reading in from file:// or classpath://. Besides, the Content-Type field could be wrong, just as we saw the declaration was wrong in the original example WSDL.
David
-
8. Re: Problem in retrieving WSDL from remote endpoint
dward Apr 28, 2010 10:18 AM (in response to mageshbk)
Magesh,
I disagree that StreamUtils.readStreamString(stream, charset):String is correctly implemented. It creates a String from the passed bytes and the passed encoding; however, the passed bytes contain character data which might not be in the passed encoding! Using ICU4J can fix this.
Specifically, change this:
public static String readStreamString(InputStream stream, String charset) throws UnsupportedEncodingException {
    return new String(StreamUtils.readStream(stream), charset);
}
to this:
public static String readStreamString(InputStream stream, String charset) {
    return new CharsetDetector().getString(StreamUtils.readStream(stream), charset);
}
Regarding the statement: "The user application may be expecting only the specified encoding, whereas due to intelligent scanning we might read all encoded streams", I do worry about this a bit from a performance perspective (although it has not been substantiated yet), however I do not worry about it as a security issue. Either way we're building up a String. I should also note that I didn't change any methods that just read an InputStream into a byte[] array. I only changed the methods readStreamString and getResourceAsString - the two utility methods that want those bytes as a String, and not just as a data-bucket.
I would be okay with just sticking to fixing the WSDL encoding problem, which would mean removing my change to StreamUtils and converting the characters "closer" to the WSDL handling code, however:
- I wouldn't add a property called "wsdl-encoding" to any Action. Again, we would be depending on the user to specify it correctly, which we've already seen here they don't do.
- What do we do about all the other areas of our code that need to do String handling/outputting that are assuming the bytes contain characters encoded as UTF-8, but they're not???
I think the thing that intrigues me the most about your reply is that non-UTF-8 WSDL should not be allowed according to Basic Profile. This could change everything. I have to admit I was ignorant of this. Now, should we:
- Say "all WSDL must adhere to Basic Profile so we are rejecting this jira." - Seems a bit harsh, but might be acceptable.
- Understand that many people (myself included up until now) don't know about this, or don't want to adhere to it, so we should handle the possibility of documents encoded in something other than UTF-8 or UTF-16, just in case?
Thanks for the feedback,
David
-
9. Re: Problem in retrieving WSDL from remote endpoint
kconner Apr 28, 2010 10:24 AM (in response to dward)
As Magesh points out, the readStreamString method is implemented correctly. It is not intended to transform a stream into a named character set but to read in a stream already in that character set. How we determine that character set seems to be the issue.
I don't know enough about internationalised character sets to know whether 'guessing' is guaranteed to work. My guess would be that it isn't and that any utility is a 'best attempt' rather than a guarantee, but we should definitely check that out.
If we cannot rely on a Content-Type header or equivalent, either because the value is incorrect or cannot be determined from the XML, then I think it would be safer for us to provide a fallback mechanism and allow the type to be specified/overridden (as Magesh suggested) rather than trying to guess it.
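A fallback mechanism of this kind could be sketched as a simple precedence chain. The order used here (configured override first, then the Content-Type charset, then the XML declaration, then UTF-8) is an assumption for illustration, as are the class and method names; the thread does not settle on a precedence.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetResolver {
    /**
     * Picks the first candidate name that is a charset this JVM supports.
     * Falls back to UTF-8 when no candidate is usable.
     */
    static Charset resolve(String... candidateNames) {
        for (String name : candidateNames) {
            if (name == null || name.trim().isEmpty()) {
                continue;
            }
            try {
                return Charset.forName(name.trim());
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported name: try the next candidate.
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        String override = null;                // hypothetical user-configured value, unset here
        String fromContentType = "ISO-8859-1"; // hypothetical charset taken from the HTTP header
        String fromXmlDecl = "UTF-8";          // hypothetical encoding from the XML declaration
        System.out.println(resolve(override, fromContentType, fromXmlDecl)); // ISO-8859-1
    }
}
```

Nothing is guessed from the bytes in this sketch: every candidate comes from something the user or the server stated explicitly.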
Kev
-
10. Re: Problem in retrieving WSDL from remote endpoint
kconner Apr 28, 2010 10:27 AM (in response to dward)
That method creates a String with the encoding that it has been told is present. The fact that we are giving it the *wrong* encoding, for whatever reason, does not make that implementation incorrect.
I would be interested to find out more about how ICU4J determines the character set, especially if we can see that it is guaranteed to get it right every time.
Kev
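The behaviour described here, decoding with whatever encoding the caller names, can be seen in a minimal sketch. The decode helper is hypothetical and only mirrors the shape of the original one-liner; it takes a Charset rather than a charset name to avoid the checked exception.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeAsTold {
    /** Decodes the bytes with the charset the caller names; no detection, no conversion. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String original = "\u041f\u0440\u0438\u0432\u0435\u0442"; // six Cyrillic letters
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        String right = decode(utf8Bytes, StandardCharsets.UTF_8);
        String wrong = decode(utf8Bytes, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(original)); // true
        System.out.println(wrong.equals(original)); // false
        System.out.println(wrong.length());         // 12, one char per byte
    }
}
```

The method does exactly what it is told in both calls; only the caller's knowledge of the real encoding separates the readable result from the garbled one.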
-
11. Re: Problem in retrieving WSDL from remote endpoint
kconner Apr 28, 2010 10:42 AM (in response to kconner)
The invocation of that method looks like more of an issue: both locations hard-code the charset as "UTF-8".
Kev
-
12. Re: Problem in retrieving WSDL from remote endpoint
dward Apr 28, 2010 10:47 AM (in response to kconner)
Well, if you look at everywhere that calls that method, our code consistently hardcodes "UTF-8". So maybe we should question why we are passing in a charset at all?
Regarding determining the charset, I have done a lot of reading on this, and you can't always depend on magic bytes to tell you the encoding, so some amount of the bytes have to be reviewed and matched with conversion tables to find a match. On ICU4J's homepage, "ICU's conversion tables are based on charset data collected by IBM over the course of many decades, and is the most complete available anywhere." I am inclined to believe this, as I had read elsewhere that much of the I18N code in JDK >=1.1 came from them and was actually fed directly into the Java language via partnership. "IBM and the ICU team played a key role in providing globalization technology into Sun's Java".
-
13. Re: Problem in retrieving WSDL from remote endpoint
dward Apr 28, 2010 10:53 AM (in response to kconner)
Please refer to my response above as far as determining the charset is concerned. If you look inside icu4j-charsets-4_4.jar, they have 176 conversion tables!
I'm not so sure I like the fallback mechanism, but if the team believes that is the best way to go, I will bend. Of course, that path opens up another couple possibilities:
- Do we have to change contract.jsp and HttpGatewayServlet to stream the wsdl declaring that charset instead, or...
- Do we still use something like ICU4J to convert that data to UTF-8 before we stream it, that way at least we are adhering to Basic Profile when we are outputting our WSDL?
I tend to prefer option 2.
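Option 2 amounts to decoding the bytes with the charset they are really in and re-encoding them as UTF-8 before streaming. A minimal sketch, assuming the actual charset is already known (whether from detection or from an override); the names are hypothetical and no ICU4J call is shown.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ToUtf8 {
    /** Decodes bytes using the charset they are actually in, then re-encodes them as UTF-8. */
    static byte[] transcode(byte[] input, Charset actual) {
        String text = new String(input, actual);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xE9 is "e with acute" in ISO-8859-1; in UTF-8 it becomes the two bytes C3 A9.
        byte[] latin1 = {(byte) 0xE9};
        byte[] utf8 = transcode(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(utf8.length); // 2
    }
}
```

Transcoding the bytes alone is not enough for XML: if the document's own declaration names the old encoding, that attribute would also need to be rewritten to UTF-8 so the declaration matches what is streamed.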
-
14. Re: Problem in retrieving WSDL from remote endpoint
kconner Apr 28, 2010 10:58 AM (in response to dward)
Sorry, but they can still have 176 conversion tables and still not be able to guarantee conversion. Hence my question.
And if ICU4J cannot guarantee the detection then it will always be suspect.
Kev