12 Replies Latest reply on Apr 13, 2005 12:28 PM by anil.saldhana

    JBossWS Streaming Implementation Proposal

    jason.greene

      Hello everyone,

      After much thought, I was able to narrow everything down to one design, which I think is the best solution. It works off a similar concept to the XML fragment design, though it does introduce a lot of changes to the existing code base.

      First, some background on StAX. StAX consists of two APIs: cursor and event. The cursor API has two primary interfaces, XMLStreamReader and XMLStreamWriter. It is forward only, and all functionality is accessed through those interfaces. As the cursor advances, it returns an event code for the token the parser just encountered (e.g. START_ELEMENT, CHARACTERS, COMMENT). The consumer then calls the accessor methods that are valid for that event.

      The event API operates similarly to the cursor API, except that it allocates and returns an event object whose type in the XMLEvent hierarchy depends on the event. The event object can be held indefinitely, which makes it ideal for pipelining. A consumer interacts with the event API through two main interfaces, XMLEventReader and XMLEventWriter.
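
      A minimal sketch of the two styles, using only the standard javax.xml.stream API. The class and method names (StaxApiSketch, cursorElementNames, bufferEvents) are made up for illustration and are not part of the proposal.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.events.XMLEvent;

// Hypothetical demo class; only the javax.xml.stream calls are real API.
final class StaxApiSketch {

    // Cursor API: next() returns an int event code; accessors such as
    // getLocalName() are only valid while the cursor sits on that event.
    static List<String> cursorElementNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
        return names;
    }

    // Event API: nextEvent() allocates an XMLEvent object that stays valid
    // after the reader has moved on, so it can be queued or handed elsewhere.
    static List<XMLEvent> bufferEvents(String xml) {
        List<XMLEvent> events = new ArrayList<>();
        try {
            XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
            while (reader.hasNext()) {
                events.add(reader.nextEvent());
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
        return events;
    }
}
```

      The cursor version keeps nothing but the names it copied out, while the event version keeps one object per token, which is the trade-off the rest of this proposal turns on.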

      I will only describe the process from an unmarshalling perspective, since marshalling is the same process in reverse.

      For unmarshalling, a front message parser would use the StAX cursor API (XMLStreamReader) to pull from the incoming message stream and examine each element in the order it occurs. Based on the type mapping registry, a deserializer would be passed the XMLStreamReader at a START_ELEMENT event. The deserializer would then construct the appropriate object by lazily pulling from the parser until it hits the corresponding END_ELEMENT. The front message parser would then continue to the next START_ELEMENT that needs to be delegated. The JAXB spec already provides this concept through the Unmarshaller obtained from a JAXBContext. (When passed an XMLStreamReader, it expects the reader to be positioned at a START_ELEMENT, and it consumes input through the corresponding END_ELEMENT.)
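
      The delegation described above could look roughly like the sketch below. Everything named here (FrontParserSketch, ElementDeserializer, the map used as a registry, the sample message) is an assumption for illustration, not existing JBossWS code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical front message parser; none of these names exist in JBossWS.
final class FrontParserSketch {

    // Assumed contract: invoked with the reader on START_ELEMENT, must
    // return with the reader on the matching END_ELEMENT.
    interface ElementDeserializer {
        Object deserialize(XMLStreamReader reader) throws XMLStreamException;
    }

    // Stand-in for the type mapping registry, keyed by element QName.
    private final Map<QName, ElementDeserializer> registry = new HashMap<>();

    void register(QName element, ElementDeserializer deserializer) {
        registry.put(element, deserializer);
    }

    // One forward pass: registered elements are delegated, everything else
    // (Envelope, Body, wrappers) is simply walked through.
    List<Object> parse(String xml) {
        List<Object> results = new ArrayList<>();
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                ElementDeserializer deserializer = registry.get(reader.getName());
                if (deserializer != null) {
                    results.add(deserializer.deserialize(reader));
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("could not parse message", e);
        }
        return results;
    }

    // Usage: getElementText() reads the text and stops on END_ELEMENT,
    // which satisfies the contract above for simple types.
    static List<Object> demo() {
        FrontParserSketch parser = new FrontParserSketch();
        parser.register(new QName("id"), r -> Integer.valueOf(r.getElementText().trim()));
        parser.register(new QName("item"), r -> r.getElementText());
        return parser.parse(
            "<Envelope><Body><order><id>7</id><item>book</item></order></Body></Envelope>");
    }
}
```

      The front parser never looks inside a delegated element. It relies on the deserializer leaving the reader on the matching END_ELEMENT so the loop can carry on from there.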

      Now, I know what you are thinking: what about SAAJ? We know in advance whether a handler is registered. If there is one, we unavoidably have to convert the incoming stream into a DOM tree. If there isn't one and there are attachments, we just MIME decode the stream on the fly and process the XML portion, ignoring the attachments. When there is a handler, the message is fed into our unmarshalling component as described above once the handler has finished with it (whether or not it modified anything). We take a hit here by reparsing a message we just processed, but IMO this is far better than the alternative of maintaining 2 code paths.

      The main problem with a streaming parser implementation is that stream parsing and SAAJ are mutually exclusive. That is why I also propose a proprietary enhancement to the protocol handler's SOAPMessageContext that would let a handler obtain an XMLEventReader/XMLEventWriter (or XMLStreamReader/XMLStreamWriter) pair. Either way, the interfaces would be emulated so that a dispatch component could pipeline XMLEvent objects into and out of each handler in the chain, the last handler being the front message parser itself. Each push and pull operation on the reader/writer would pipe chunks to the next handler. A handler would only push when it is able to pipeline. If, for example, a handler needed to process the entire message before modifying it, it would just queue events and hold off pushing until the end. If a handler still wanted to use SAAJ, we would lazily construct the SOAPMessage when the handler called getSOAPMessage(). I emailed this idea to the JAX-RPC comments address and got a response saying that the expert group would look into it, so it could potentially become part of the standard.
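
      To make the pipelining idea concrete, here is a rough sketch. The EventHandler interface, the two sample handlers and the chain helper are all hypothetical. They are not the JAX-RPC handler API and not the proposed SOAPMessageContext extension itself, only one way the push side could behave.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Consumer;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical pipeline; not the JAX-RPC handler API and not JBossWS code.
final class HandlerPipelineSketch {

    // Assumed contract: receive one event, push zero or more downstream.
    interface EventHandler {
        void handle(XMLEvent event, Consumer<XMLEvent> downstream);
    }

    // Can pipeline: rewrites text events and forwards each event at once.
    static final class UpperCaseText implements EventHandler {
        private final XMLEventFactory factory = XMLEventFactory.newInstance();

        @Override
        public void handle(XMLEvent event, Consumer<XMLEvent> downstream) {
            if (event.isCharacters()) {
                String text = event.asCharacters().getData().toUpperCase(Locale.ROOT);
                downstream.accept(factory.createCharacters(text));
            } else {
                downstream.accept(event);
            }
        }
    }

    // Cannot pipeline: queues the whole message, pushes at END_DOCUMENT.
    static final class BufferWholeMessage implements EventHandler {
        private final List<XMLEvent> queue = new ArrayList<>();

        @Override
        public void handle(XMLEvent event, Consumer<XMLEvent> downstream) {
            queue.add(event);
            if (event.isEndDocument()) {
                queue.forEach(downstream);
                queue.clear();
            }
        }
    }

    // Wires the chain back to front so each handler pushes into the next;
    // the sink plays the role of the front message parser.
    static Consumer<XMLEvent> chain(List<EventHandler> handlers, Consumer<XMLEvent> sink) {
        Consumer<XMLEvent> next = sink;
        for (int i = handlers.size() - 1; i >= 0; i--) {
            EventHandler handler = handlers.get(i);
            Consumer<XMLEvent> downstream = next;
            next = event -> handler.handle(event, downstream);
        }
        return next;
    }

    // Pulls events from the input once and returns the text seen by the sink.
    static String run(String xml, List<EventHandler> handlers) {
        StringBuilder seen = new StringBuilder();
        Consumer<XMLEvent> head = chain(handlers, event -> {
            if (event.isCharacters()) {
                seen.append(event.asCharacters().getData());
            }
        });
        try {
            XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
            while (reader.hasNext()) {
                head.accept(reader.nextEvent());
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("could not parse message", e);
        }
        return seen.toString();
    }

    static String demo() {
        return run("<m><t>hello</t></m>",
            Arrays.asList(new UpperCaseText(), new BufferWholeMessage()));
    }
}
```

      Only the buffering handler costs memory proportional to the message. A chain made entirely of handlers like UpperCaseText would hold one event at a time.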

      Adding all of these pieces together, you end up with the ability to parse the SOAP message from a stream with many handlers only once, and with no large copies.

      -Jason

        • 1. Re: JBossWS Streaming Implementation Proposal
          aloubyansky

          Are you talking about JAXB2 here? Where can I find the info about JAXBContext using XMLStreamReader?
          Thanks.

          • 2. Re: JBossWS Streaming Implementation Proposal
            jason.greene

             

            "alex.loubyansky@jboss.com" wrote:
            Are you talking about JAXB2 here? Where can I find the info about JAXBContext using XMLStreamReader?
            Thanks.


            Yes, JAXB2. If you look at the EDR javadoc, the Marshaller and Unmarshaller interfaces, which are obtained from the JAXBContext, have overloaded marshal and unmarshal methods that support both XMLStreamReader/XMLStreamWriter and XMLEventReader/XMLEventWriter.

            Here is an unmarshalling example using the stream reader:
            // newInstance() is a static factory method on XMLInputFactory
            XMLStreamReader xmlStreamReader =
             XMLInputFactory.newInstance().createXMLStreamReader( ... );
            JAXBContext jc = JAXBContext.newInstance( "com.acme.foo" );
            Unmarshaller u = jc.createUnmarshaller();
            Object o = u.unmarshal( xmlStreamReader );
            


            And here is a marshalling example:
            ComplexObject obj = ...;
            XMLStreamWriter xmlStreamWriter =
             XMLOutputFactory.newInstance().createXMLStreamWriter( ... );
            JAXBContext jc = JAXBContext.newInstance( "com.acme.foo" );
            Marshaller m = jc.createMarshaller();
            m.marshal( obj, xmlStreamWriter );
            


            Thanks,
            -Jason





            • 3. Re: JBossWS Streaming Implementation Proposal
              jason.greene

              Also, they clarify the pulling/pushing behavior in the javadoc, which corresponds with my design proposal.

              From the javadoc:

              unmarshal
              
              java.lang.Object unmarshal(javax.xml.stream.XMLStreamReader reader)
               throws JAXBException
              
               Unmarshal XML data from the specified pull parser and return the resulting content tree.
              
               This method assumes that the parser is at a start element event, and the unmarshalling will be done from this start element to the corresponding end element. If this method returns successfully, the reader will be pointing at the token right after the end element.
              


              Thanks,
              -Jason

              • 4. Re: JBossWS Streaming Implementation Proposal
                thomas.diesler

                Hi Jason,

                What you describe here is the transition from

                InputStream -> StAX -> JAXRPC Deserializer -> Java Object
                


                Let me rehearse what we've currently got:

                We fragment the incoming XML message (done in MessageFactory.createMessage()) and create a flat SAAJ view:

                Document Style
                --------------

                SOAPEnvelope
                  SOAPHeader
                    SOAPHeaderElement (SOAPContentElement that holds a fragment)
                  SOAPBody
                    SOAPBodyElement (SOAPContentElement that holds a fragment)

                RPC Style
                ---------

                SOAPEnvelope
                  SOAPHeader
                    SOAPHeaderElement (SOAPContentElement that holds a fragment)
                  SOAPBody
                    SOAPBodyElement (RPC element)
                      SOAPElement (SOAPContentElement that holds a fragment)


                SAAJ at this level is perfect for describing the structure of the SOAP message. It becomes less suitable the deeper you go down the tree, where it forces the DOM view onto the user.


                The important bit is to allow going back and forth between the XML and Java object representations at the SOAPContentElement level, using the JAXRPC serializers/deserializers.

                Whether to hold the XML representation as String or StAX events is an (important) implementation detail.

                So how about this:

                MessageFactory.createMessage builds the SAAJ view as above with SOAPContentElement(s).
                Upon END_ELEMENT the SOAPContentElement passes the event stream to the deserializer, which constructs the Java object and assigns it to the SOAPContentElement. If we don't have handlers, the Java object passes unmodified to the endpoint.

                In case we have handlers that modify the content, we can always go back to the DOM view of the content using the corresponding serializer.
                That should account for the majority of use cases and we don't hold a second (xmlFragment) copy of the content.
                As you point out correctly, we take the hit if the handler (or endpoint) is stupid enough to require a DOM view of the content.


                • 5. Re: JBossWS Streaming Implementation Proposal
                  thomas.diesler

                  Jason, this all makes sense. I think you should go ahead and replace the signature of

                  public abstract class DeserializerBase implements Deserializer
                  {
                   public abstract Object deserialize(QName xmlName, QName xmlType, String xmlFragment, SerializationContextImpl serContext);
                  }
                  


                  with

                  public abstract class DeserializerBase implements Deserializer
                  {
                   public abstract Object deserialize(QName xmlName, QName xmlType, XMLStreamReader reader, SerializationContextImpl serContext);
                  }
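
                  For illustration only, a deserializer written against the stream-based signature might pull from the reader like this. It is a simplified stand-in: xmlType and SerializationContextImpl are left out so the sketch compiles on its own, and the class name and sample XML are made up.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical stand-in; the real DeserializerBase also receives xmlType
// and a SerializationContextImpl, which are left out here.
final class StreamDeserializerSketch {

    // Reads a flat struct into name -> text pairs. Expects the reader on the
    // START_ELEMENT of xmlName and returns on its matching END_ELEMENT.
    static Map<String, String> deserialize(QName xmlName, XMLStreamReader reader)
            throws XMLStreamException {
        // Namespace is passed as null (unchecked) to keep the sketch short.
        reader.require(XMLStreamConstants.START_ELEMENT, null, xmlName.getLocalPart());
        Map<String, String> fields = new LinkedHashMap<>();
        // nextTag() skips whitespace and stops on the next start or end tag.
        while (reader.nextTag() == XMLStreamConstants.START_ELEMENT) {
            String field = reader.getLocalName();
            fields.put(field, reader.getElementText());
        }
        reader.require(XMLStreamConstants.END_ELEMENT, null, xmlName.getLocalPart());
        return fields;
    }

    static Map<String, String> demo() {
        String xml = "<order><id>7</id><item>book</item></order>";
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            reader.nextTag(); // move from START_DOCUMENT to <order>
            return deserialize(new QName("order"), reader);
        } catch (XMLStreamException e) {
            throw new IllegalStateException("sketch input should be well-formed", e);
        }
    }
}
```

                  Unlike the String xmlFragment version, nothing here needs the element's XML to exist as a separate copy before deserialization starts.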
                  





                  • 6. Re: JBossWS Streaming Implementation Proposal
                    jason.greene

                     

                    "thomas.diesler@jboss.com" wrote:
                    Hi Jason,

                    The important bit is to allow going back and forth between the XML and Java object representations at the SOAPContentElement level, using the JAXRPC serializers/deserializers.

                    Whether to hold the XML representation as String or StAX events is an (important) implementation detail.
                    So how about this:

                    MessageFactory.createMessage builds the SAAJ view as above with SOAPContentElement(s).
                    Upon END_ELEMENT the SOAPContentElement passes the event stream to the deserializer, which constructs the Java object and assigns it to the SOAPContentElement. If we don't have handlers, the Java object passes unmodified to the endpoint.


                    I had thought about adapting SAAJ in some way that would internally use StAX, but there are a couple of problems.
                    1) you have to scan through the whole message to build the outer SAAJ tree
                    2) Since the parser is forward only, you can't backtrack if you need to, which makes lazy loading difficult.

                    Now, it's possible to cache the event objects for a block, which is more efficient than a DOM tree, but it still requires you to effectively allocate a list of objects that is the size of the entire message. XML fragments have the same problem because you are still allocating a block of memory that is the size of the message.

                    This prompted me to consider an alternative to using SAAJ, though I was not sure if there was a problem with this, since so much of the current design is built around SAAJ. I assumed that the reason for using it was just to simplify passing elements to handlers.

                    I was also working from the idea of a solution that allows parsing an indefinitely large message with as few copies as possible. By not maintaining that SAAJ tree, we no longer need to hold on to any of the XML after we process it. So a SOAP message could be 1 TB for all we care, and we would only allocate a few K. This refers only to the space allocated for processing. I realize that if every single element in the SOAP message is deserialized into an object of equivalent size, the memory usage would be the same as, if not larger than, the SOAP message. If, however, only a few elements are mapped to objects, the memory allocation would be quite small.

                    So I guess the questions I have are:
                    Is trying to optimize for a tiny memory footprint unnecessary or unrealistic?
                    Are there other reasons to have an SAAJ tree that I didn't think of?

                    -Jason

                    • 7. Re: JBossWS Streaming Implementation Proposal
                      thomas.diesler

                       


                      1) you have to scan through the whole message to build the outer SAAJ tree


                      I don't think this is true. If we did not have SAAJ, we would still need to represent the SOAP tree with a structure like this

                       Envelope
                         Header ?
                           HeaderElement *
                         Body
                           BodyElement
                             ParamElement *
                      


                      SAAJ is ok for that, it also has a sufficient API for SwA. Not using SAAJ would mean to build some API that in the end is very close to it. Also, we have no choice but to support it because it is a public user API.
                      A user can exercise a WS communication solely by using SAAJ. I don't see a compelling reason for JBossWS not to use it internally at the level shown above. The DOM aspect of SAAJ is of course debatable, especially the deeper you go down the tree, but that's a different story.

                      There is no need to hold a copy of the incoming message if we can use the serializer/deserializer at the SOAPContentElement level to translate from XML to Java and vice versa. Therefore, even a large incoming message can be (eagerly) streamed to its Java object representation. On demand we can go back to XML using the associated serializer and make the DOM view available if necessary.


                      Is trying to optimize for a tiny memory footprint unnecessary or unrealistic?


                      Optimization is important, but not my major concern at this stage of the JBossWS lifecycle. We should aim for a functional implementation that passes most of the CTS by Mar 2005, ready for JBossWorld.


                      • 8. Re: JBossWS Streaming Implementation Proposal
                        jason.greene

                         

                        "thomas.diesler@jboss.com" wrote:

                        1) you have to scan through the whole message to build the outer SAAJ tree


                        I don't think this is true. If we did not have SAAJ, we would still need to represent the SOAP tree with a structure like this

                         Envelope
                           Header ?
                             HeaderElement *
                           Body
                             BodyElement
                               ParamElement *
                        


                        SAAJ is ok for that, it also has a sufficient API for SwA. Not using SAAJ would mean to build some API that in the end is very close to it.


                        What I am suggesting is that processing is done in the natural order of the message instead of in the order of the OperationDesc. This should avoid the need of building that outer tree structure for general message processing (with the exception of handlers).

                        "thomas.diesler@jboss.com" wrote:

                        Also, we have no choice but to support it because it is a public user API.
                        A user can exercise a WS communication solely by using SAAJ.


                        Yes, I worded my question wrong, I realize the API itself must be supported.

                        "thomas.diesler@jboss.com" wrote:

                        I don't see a compelling reason for JBossWS not to use it internally at the level shown above. The DOM aspect of SAAJ is of course debatable, especially the deeper you go down the tree, but that's a different story.


                        If the SAAJ tree is used internally, and it is constructed in advance, then the parser is either forced to process the entire message in multiple passes (which requires a copy since the source is a network stream), or it must build a tree containing the entirety of the message.

                        If we deserialize the elements of the message in the order they come in, and without building a tree, we don't have to copy, and we can do everything in one pass (with the exception of handlers).
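
                        As a rough sketch of processing in message order: parameters are deserialized as they arrive and only the finished values are arranged into the declared order afterwards. The names here are hypothetical, and declaredOrder is a plain list standing in for whatever OperationDesc provides.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch; declaredOrder stands in for what OperationDesc knows.
final class NaturalOrderSketch {

    // Deserializes parameters in the order they arrive on the stream, then
    // arranges the finished values into the operation's declared order.
    static Object[] bindParameters(String xml, List<String> declaredOrder) {
        Map<String, Object> arrived = new HashMap<>();
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && declaredOrder.contains(reader.getLocalName())) {
                    String name = reader.getLocalName();
                    // Only the deserialized value is kept; no XML is retained.
                    arrived.put(name, reader.getElementText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("could not parse message", e);
        }
        Object[] args = new Object[declaredOrder.size()];
        for (int i = 0; i < args.length; i++) {
            args[i] = arrived.get(declaredOrder.get(i));
        }
        return args;
    }

    // The message carries b before a; the operation declares a before b.
    static Object[] demo() {
        return bindParameters("<op><b>2</b><a>1</a></op>", Arrays.asList("a", "b"));
    }
}
```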

                        "thomas.diesler@jboss.com" wrote:

                        There is no need to hold a copy of the incoming message if we can use the serializer/deserializer at the SOAPContentElement level to translate from XML to Java and vice versa. Therefore, even a large incoming message can be (eagerly) streamed to its Java object representation. On demand we can go back to XML using the associated serializer and make the DOM view available if necessary.


                        This is a good idea, though it wouldn't work if a message contained body or header elements that weren't bound to objects, yet were required by a handler. Also, the initial copy to support the internal SAAJ tree must still be performed, even though it could be abandoned later.

                        -Jason

                        • 9. Re: JBossWS Streaming Implementation Proposal
                          thomas.diesler

                          For all intents and purposes, the current design that uses a flat SAAJ tree to model the incoming message is fine.

                          The way the MessageFactoryImpl fragments the incoming message is highly inefficient and would greatly benefit from the use of SAX or StAX, but that is an optimization issue.

                          Currently, the xmlFragments associated with SOAPContentElements are lazily deserialized, requiring a second parse for each fragment. This could also be optimized through StAX if the jboss binding framework could eagerly create the corresponding java objects during the first parse.

                          As far as StAX is concerned, we also need to check whether the licenses of the available implementations allow us to include one in the JBoss stack. Which implementation did you have in mind?




                          • 10. Re: JBossWS Streaming Implementation Proposal
                            jason.greene

                             

                            "thomas.diesler@jboss.com" wrote:

                            Currently, the xmlFragments associated with SOAPContentElements are lazily deserialized, requiring a second parse for each fragment. This could also be optimized through StAX if the jboss binding framework could eagerly create the corresponding java objects during the first parse.

                            I think the only way StAX is worth using is if parsing and deserialization are done at the same time. IMO the SOAPContentElement should be constructed with the actual Java object and no content data (strings or XMLEvent objects). If we have to pull the whole message into memory as a chunk of XMLEvent objects, then things become less efficient than DOM.

                            I researched the performance of a SOAP message containing a 10,000 element array whose items have 6 members each (not an impractical use case). The file size was around 2 MB. The memory size and timings follow:
                            DOM: 12MB .512 seconds
                            List of XMLEvent StAX objects: 20MB .700 seconds.
                            


                            As you can see DOM is actually more efficient when used in this manner.
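
                            For anyone who wants to repeat this kind of comparison, a rough harness could look like the sketch below. It is not the test that produced the figures above: the message shape, the class name and the crude heap readings are all assumptions, and the numbers it reports will vary by JVM and parser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical harness; the message shape is an assumption, not the file
// used for the numbers quoted above.
final class BufferingComparisonSketch {

    // Synthetic array message: `rows` items with 6 members each.
    static String syntheticMessage(int rows) {
        StringBuilder sb = new StringBuilder("<array>");
        for (int i = 0; i < rows; i++) {
            sb.append("<item>");
            for (int m = 0; m < 6; m++) {
                sb.append("<m").append(m).append('>').append(i)
                  .append("</m").append(m).append('>');
            }
            sb.append("</item>");
        }
        return sb.append("</array>").toString();
    }

    // Holds the whole message as a DOM tree.
    static Document asDom(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not build DOM", e);
        }
    }

    // Holds the whole message as a list of XMLEvent objects.
    static List<XMLEvent> asEventList(String xml) {
        List<XMLEvent> events = new ArrayList<>();
        try {
            XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
            while (reader.hasNext()) {
                events.add(reader.nextEvent());
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("could not buffer events", e);
        }
        return events;
    }

    static int countStartElements(List<XMLEvent> events) {
        int count = 0;
        for (XMLEvent event : events) {
            if (event.isStartElement()) {
                count++;
            }
        }
        return count;
    }

    // Rough heap reading; System.gc() is only a hint, so this is approximate.
    static long usedHeap() {
        System.gc();
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    // Times both approaches once and reports heap growth while each result
    // is still reachable. A serious measurement would warm up and repeat.
    static String report(int rows) {
        String msg = syntheticMessage(rows);
        long heap0 = usedHeap();
        long t0 = System.nanoTime();
        Document dom = asDom(msg);
        long domNanos = System.nanoTime() - t0;
        long heap1 = usedHeap();
        long t1 = System.nanoTime();
        List<XMLEvent> events = asEventList(msg);
        long eventNanos = System.nanoTime() - t1;
        long heap2 = usedHeap();
        return String.format("DOM: %d elements, %d bytes, %d ms%nEvents: %d objects, %d bytes, %d ms",
            dom.getElementsByTagName("*").getLength(), heap1 - heap0, domNanos / 1_000_000,
            events.size(), heap2 - heap1, eventNanos / 1_000_000);
    }
}
```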

                            "thomas.diesler@jboss.com" wrote:

                            As far as StAX is concerned, we also need to check whether the licenses of the available implementations allow us to include one in the JBoss stack. Which implementation did you have in mind?


                            The 2 implementations I have been looking at are Sun's and the RI. Sun's implementation (sjsxp) is included in the JWSDP which allows free redistribution. The reference implementation is under the JCP license.

                            -Jason


                            • 11. Re: JBossWS Streaming Implementation Proposal
                              thomas.diesler

                              Exactly, the SOAPContentElements should be constructed with the Java object during the first parse.

                              You will have to work with Alex to see how this integrates with JBossXB. I am all for it.

                              • 12. Re: JBossWS Streaming Implementation Proposal
                                anil.saldhana

                                Jason Greene has brought a StAX implementation into JBoss HEAD, as shown by the following JIRA task:
                                http://jira.jboss.com/jira/browse/JBWS-158