6 Replies Latest reply on Jan 21, 2014 1:21 AM by dward

    SwitchYard performance pitfalls

    synclpz

      I've performed some load tests on an SY-based infrastructure and the results appear to be very bad.

       

      I tried to monitor the running AS with SY via JVisualVM and found potential bottlenecks, but I have no idea WHY some of them execute so slowly.

       

      The infrastructure looks like this:

       

      1. Customized bpm-service quickstart (added a custom WorkItemHandler to the BPM process and exposed it via an SCA service reference)

      2. RemoteInvoker from the remote-invoker quickstart's "test" part (made multithreaded and further customized while experimenting, see below)

       

      Both projects are attached in dev.zip.

       

      The machine is a Core i5-2400 workstation with 8 GB of memory, running Windows 7 and JDK 7u25 x86_64, with JBoss EAP 6.1 Final (-Xmx4096M added to JAVA_OPTS).

       

      First of all, I'd like to say that the default HttpInvoker used in SY core and SY remoting is awful in its performance. Due to its use of HttpURLConnection, it creates up to 2048 connections to the sy-remote-servlet, at which point Tomcat gives up creating new threads to process requests and I get socket errors. This logic is also rather CPU-heavy (because of the sheer number of threads) and is not suitable for a production environment. I wrote my own implementation of HttpInvoker (see the customized RemoteInvoker class) based on Apache HttpComponents HttpClient 4.2.5. It allows real multithreading and blocks when the maximum number of connections is reached. Testing against a "clean" JBoss AS (a simple servlet) resulted in nearly 5000 TPS with ~150 Tomcat threads.
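      The bounded-connection idea behind that custom invoker can be sketched with the JDK alone (no Apache HttpClient): a Semaphore caps in-flight requests, so callers block instead of piling up thousands of sockets. This is an illustrative stand-in, not the actual RemoteInvoker code; the class name, endpoint, and sizes are made up.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedHttpClientDemo {

    // Fire `requests` GET requests from many threads, but never more than
    // `maxInFlight` at once: callers block on the semaphore instead of
    // opening an unbounded number of sockets. Returns the success count.
    public static int run(int requests, int maxInFlight) throws Exception {
        // In-process test server standing in for the remote HTTP endpoint.
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/invoke", exchange -> {
            byte[] body = "OK".getBytes("UTF-8");
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.setExecutor(Executors.newFixedThreadPool(8));
        server.start();
        final int port = server.getAddress().getPort();

        final Semaphore permits = new Semaphore(maxInFlight);
        final AtomicInteger ok = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(64);
        for (int i = 0; i < requests; i++) {
            pool.submit(() -> {
                try {
                    permits.acquire();               // block at the connection cap
                    try {
                        URL url = new URL("http://127.0.0.1:" + port + "/invoke");
                        HttpURLConnection c = (HttpURLConnection) url.openConnection();
                        try (InputStream in = c.getInputStream()) {
                            while (in.read() != -1) { /* drain the response */ }
                        }
                        if (c.getResponseCode() == 200) ok.incrementAndGet();
                    } finally {
                        permits.release();
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(60, TimeUnit.SECONDS);
        server.stop(0);
        return ok.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("completed=" + run(200, 32));
    }
}
```

      The same cap-and-block behavior is what PoolingClientConnectionManager gives you in HttpClient 4.2.x via its max-total and max-per-route settings.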

       

      Then I pointed my custom invoker at the SY bpm-service (exposed via an SCA binding). Performance dropped dramatically to ~1200 TPS, and some very strange things started happening:

       

      1. Some deserialization exceptions randomly appear while receiving the response

      2. Some processing (SwitchYard) exceptions appear while processing the request in SY

      3. As time goes by, performance drops in two ways:

           a) After ~150,000 requests processed, performance degrades by roughly half within a minute

           b) Then it falls off roughly logarithmically

       

      After sampling with JVisualVM, the first strange thing I saw was a bottleneck in the classloader (through SwitchYard's Classes.forName() down to org.jboss.modules.JarFileResourceLoader.getClassSpec()), but I can't understand why classes need to be loaded constantly. I thought this should happen once, at the first execution of the service (see the first 6 snapshots in the attached archive sampling.zip).

       

      The second strange thing is that upon the performance drop (at ~150,000 requests processed), Jackson's InternCache starts to slow things down dramatically (see snapshots 6-10)! It is invoked during JSON DEserialization on both the server and client side, and both have the same issue... I can't explain why it slows performance down, since it's a simple cache with 192 elements inside (look at the source of org.codehaus.jackson.util.InternCache.intern()). My only thought is that it's a synchronized method, but the number of threads invoking it does not change...
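      The contention pattern described here can be illustrated with a simplified stand-in for Jackson's InternCache: a small bounded map whose intern() is synchronized, so every deserializing thread funnels through a single lock. This is only a sketch under that assumption; the real Jackson implementation differs in detail.

```java
import java.util.LinkedHashMap;

// Simplified, illustrative stand-in for org.codehaus.jackson.util.InternCache:
// a small bounded map with a synchronized intern() method. Under many
// deserializing threads, every call serializes on this one lock, which is
// one plausible explanation for the hot spot seen in the profiler.
public class TinyInternCache {
    private static final int MAX_ENTRIES = 192;
    private final LinkedHashMap<String, String> cache = new LinkedHashMap<>();

    public synchronized String intern(String input) {
        String result = cache.get(input);
        if (result == null) {
            if (cache.size() >= MAX_ENTRIES) {
                cache.clear();   // old Jackson flushed the whole cache when full
            }
            result = input.intern();
            cache.put(result, result);
        }
        return result;
    }

    public synchronized int size() {
        return cache.size();
    }

    public static void main(String[] args) {
        TinyInternCache c = new TinyInternCache();
        // Two distinct String objects with equal content map to one instance.
        String a = c.intern(new String("fieldName"));
        String b = c.intern(new String("fieldName"));
        System.out.println(a == b);
        // The cache stays bounded no matter how many keys pass through.
        for (int i = 0; i < 500; i++) c.intern("k" + i);
        System.out.println(c.size() <= MAX_ENTRIES);
    }
}
```

      Note that the slowdown would come from lock contention, not cache size: with N threads deserializing, N-1 of them wait at the monitor on every field-name lookup.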

       

      After a continuous load test from 19.07 to 22.07 (I tried to process 50,000,000 requests), I found only 15,000,000 processed and the system performing at ~120 TPS while using all of the CPU. The hot spot is still InternCache.intern() (see the last snapshots in sampling.zip).

       

      Conclusion:

       

      1. HttpInvoker is not good enough in terms of performance and multithreading

      2. The Jackson JSON deserializer is possibly the main performance issue in SY processing

      3. Spontaneous errors occur while processing requests (exceptions returned instead of results)

       

      I'll try to perform this test on Linux today also.

       

      Questions:

       

      1. Is there any possibility of switching the internal invocation interface to a binary/faster one? Maybe I should use HornetQ or another bus to integrate services in SY instead of SCA? What about load balancing then?

      2. Could one of the devs help investigate the spontaneous errors occurring under load (point 3 from the conclusion)?

      3. Any recommendations for achieving a high-performance, high-throughput solution?

       

      Thanks in advance,

      V.

        • 1. Re: SwitchYard performance pitfalls
          kcbabo

          Hey Viktor,

           

          The in-depth analysis and feedback are appreciated.  Replies inline ...

           

          I've performed some load tests on an SY-based infrastructure and the results appear to be very bad.

           

          We'll get into this in more detail in a bit, but it's important to recognize a few things here:

           

          1) Performance is highly correlated to the components used and implementation of your application.  Certain components (implementations and bindings) are faster than others.

          2) What are your requirements from a functional standpoint - what capabilities does your target application require and what are the performance requirements?

          3) Results were very bad compared to what?

           

          First of all, I'd like to say that the default HttpInvoker used in SY core and SY remoting is awful in its performance. Due to its use of HttpURLConnection, it creates up to 2048 connections to the sy-remote-servlet, at which point Tomcat gives up creating new threads to process requests and I get socket errors. This logic is also rather CPU-heavy (because of the sheer number of threads) and is not suitable for a production environment. I wrote my own implementation of HttpInvoker (see the customized RemoteInvoker class) based on Apache HttpComponents HttpClient 4.2.5. It allows real multithreading and blocks when the maximum number of connections is reached. Testing against a "clean" JBoss AS (a simple servlet) resulted in nearly 5000 TPS with ~150 Tomcat threads.

           

          Use of HttpURLConnection was deliberate for simplicity and we recognized that performance would be suboptimal at first.  RemoteInvoker is used by remote submission clients and as our internal clustering transport.  No gateways use RemoteInvoker, so the issue is limited to remote clients and clustering.  Definitely something we are going to explore in 1.1, but wanted to give you an idea of the scope.  In fact, contributions are welcome if you would like to help with an optimized implementation. :-)

           

          Then I pointed my custom invoker at the SY bpm-service (exposed via an SCA binding). Performance dropped dramatically to ~1200 TPS, and some very strange things started happening:

           

          I certainly would expect performance to dip if you go from a "simple servlet" to an application which involves BPM workflow.  What is your requirement for throughput on a BPM process?  Have you tested this with jBPM alone to see what performance you get there?

           

          1. Some deserialization exceptions randomly appear while receiving the response

          2. Some processing (SwitchYard) exceptions appear while processing the request in SY

          3. As time goes by, performance drops in two ways:

               a) After ~150,000 requests processed, performance degrades by roughly half within a minute

               b) Then it falls off roughly logarithmically

           

          After sampling with JVisualVM, the first strange thing I saw was a bottleneck in the classloader (through SwitchYard's Classes.forName() down to org.jboss.modules.JarFileResourceLoader.getClassSpec()), but I can't understand why classes need to be loaded constantly. I thought this should happen once, at the first execution of the service (see the first 6 snapshots in the attached archive sampling.zip).

           

          Classes should not be loaded per request for this type of application.  I would need to see the complete stack to get an idea of where this is actually coming from though.

           

           

          The second strange thing is that upon the performance drop (at ~150,000 requests processed), Jackson's InternCache starts to slow things down dramatically (see snapshots 6-10)! It is invoked during JSON DEserialization on both the server and client side, and both have the same issue... I can't explain why it slows performance down, since it's a simple cache with 192 elements inside (look at the source of org.codehaus.jackson.util.InternCache.intern()). My only thought is that it's a synchronized method, but the number of threads invoking it does not change...

           

          After a continuous load test from 19.07 to 22.07 (I tried to process 50,000,000 requests), I found only 15,000,000 processed and the system performing at ~120 TPS while using all of the CPU. The hot spot is still InternCache.intern() (see the last snapshots in sampling.zip).

           

          Similar to the comment on HttpURLConnection, JSON serialization is another area that we deliberately left unoptimized early on in order to let the remoting implementation take shape. There are quite a few optimized encodings for JSON data which can be enabled through various libraries. Early versions of our remoting support included this, but we backed off and went with the default serialization in Jackson as an initial approach. Again, JSON serialization is used exclusively by the RemoteInvoker in SwitchYard; no other gateway bindings use this directly.

           

           

          Conclusion:

           

          1. HttpInvoker is not good enough in terms of performance and multithreading

          2. The Jackson JSON deserializer is possibly the main performance issue in SY processing

          3. Spontaneous errors occur while processing requests (exceptions returned instead of results)

           

          1. HttpInvoker can and will be improved from a performance standpoint.

          2. It certainly seems like there's an issue there that requires further investigation.

          3. Tougher to say without seeing these, but I agree there certainly should not be spontaneous errors.

           

          Questions:

           

          1. Is there any possibility of switching the internal invocation interface to a binary/faster one? Maybe I should use HornetQ or another bus to integrate services in SY instead of SCA? What about load balancing then?

          2. Could one of the devs help investigate the spontaneous errors occurring under load (point 3 from the conclusion)?

          3. Any recommendations for achieving a high-performance, high-throughput solution?

           

          1. There are a number of options here.  We could make it pluggable for custom extensions.  Another area to explore would be using the Netty TCP binding with binary data to see what kind of performance you get there.

          2. We can certainly try and reproduce.  There are a number of things I want to change about your app - for example, the SCA binding for the bpm service is using a WSDL interface when it should be promoted with the same Java interface used by the process service.

          3. Recommendations depend *a lot* on what your requirements are.  Can you describe what this application is expected to do?  Is there a current system in place, and if so, what performance do you see there? 

          • 2. Re: SwitchYard performance pitfalls
            synclpz

            Thanks very much for your attention! I didn't intend to sound rude, if it came across that way.

             

            Keith Babo wrote:

             

            We'll get into this in more detail in a bit, but it's important to recognize a few things here:

             

            1) Performance is highly correlated to the components used and implementation of your application.  Certain components (implementations and bindings) are faster than others.

            2) What are your requirements from a functional standpoint - what capabilities does your target application require and what are the performance requirements?

            3) Results were very bad compared to what?

            I'd like to have a workflow-based application which performs certain actions upon incoming events by interacting with target systems via SY services, so the components used are: service bindings (HTTP, SOAP, other vendor-specific ones), beans for pre-processing requests, BPM for executing logic, and reference bindings for interacting with target systems. By "bad results" I mean lower performance compared to the existing proprietary system I'm trying to migrate from. The legacy system is based on numerous old technologies and is hard to maintain and reuse (it dates from the times of JBoss 4). But its service logic execution engine is able to handle up to 4000 TPS with "empty" logic, i.e. receive a request (via RMI, usually), execute a logic workflow of 2-4 nodes (computational tests and logging) and give a response, in a multithreaded manner of course. The hardware was nearly the same.

             

            Use of HttpURLConnection was deliberate for simplicity and we recognized that performance would be suboptimal at first.  RemoteInvoker is used by remote submission clients and as our internal clustering transport.  No gateways use RemoteInvoker, so the issue is limited to remote clients and clustering.  Definitely something we are going to explore in 1.1, but wanted to give you an idea of the scope.  In fact, contributions are welcome if you would like to help with an optimized implementation. :-)

             

            I get it, but I must build a clustered application (for load balancing and HA as well!), so RemoteInvoker performance may be an issue. I'd like to contribute a more optimized implementation of the HTTP client invoker (it is already done and attached to the previous post), but I think it would be better to have a pluggable remoting implementation for the internal interface: one HTTP/JSON for interoperability, and one fast enough, based on some messaging service, for example.

             

            What is your requirement for throughput on a BPM process?  Have you tested this with jBPM alone to see what performance you get there?

             

            Around 2000-3000 TPS per cluster node. Standalone jBPM has not been tested yet (I'm working on it, but it requires setting up a "custom" runtime and service environment).

             

            Classes should not be loaded per request for this type of application.  I would need to see the complete stack to get an idea of where this is actually coming from though.

            You can look through the first VisualVM snapshot from my previous post and take a backtrace from the first hot spot. Screenshot attached:

             

            2013-07-23 19_20_07-Java VisualVM.png

             

            It seems like it is the SOAP serializer when returning the response?

             

             

            Similar to the comment on HttpURLConnection, JSON serialization is another area that we deliberately left unoptimized early on in order to let the remoting implementation take shape. There are quite a few optimized encodings for JSON data which can be enabled through various libraries. Early versions of our remoting support included this, but we backed off and went with the default serialization in Jackson as an initial approach. Again, JSON serialization is used exclusively by the RemoteInvoker in SwitchYard; no other gateway bindings use this directly.

            Jackson itself is quite fast, even compared to default Java serialization, but I think the issue is either internal to Jackson (I found some posts on the net concerning InternCache) or maybe some wrong configuration/architecture of serialization in SwitchYard... I'll try to investigate deeper.

             

             

             

            Conclusion:

             

            1. HttpInvoker is not good enough in terms of performance and multithreading

            2. The Jackson JSON deserializer is possibly the main performance issue in SY processing

            3. Spontaneous errors occur while processing requests (exceptions returned instead of results)

             

            1. HttpInvoker can and will be improved from a performance standpoint.

            2. It certainly seems like there's an issue there that requires further investigation.

            3. Tougher to say without seeing these, but I agree there certainly should not be spontaneous errors.

             

            1. Nice; maybe there should also be a pluggable mechanism for cluster-wide communications

            2. Yes, sure, please look at the backtraces if you have time.

            3. I attached a file generated from the output of my test program (every 1000 processed requests it outputs the counter, the last received result and the current TPS; it also outputs the result in case of an exception). There are exceptions:

             

            a) java.lang.IllegalArgumentException: Can not set boolean field org.switchyard.remote.RemoteMessage._fault to org.switchyard.internal.ContextProperty

            b) java.lang.IllegalArgumentException: Can not set java.util.Set field org.switchyard.internal.ContextProperty._labels to java.lang.Long

            c) java.lang.IllegalArgumentException: Can not set boolean field org.switchyard.remote.RemoteMessage._fault to java.lang.Long

            d) java.lang.IllegalArgumentException: Can not set boolean field org.switchyard.remote.RemoteMessage._fault to java.util.LinkedList

            e) java.lang.IllegalArgumentException: Can not set boolean field org.switchyard.remote.RemoteMessage._fault to java.lang.String

            f) java.lang.IllegalArgumentException: Can not set java.util.Map field org.switchyard.internal.CompositeContext._contexts to org.switchyard.internal.CompositeContext

            g) java.lang.IllegalArgumentException: Can not set java.util.Set field org.switchyard.internal.ContextProperty._labels to java.util.LinkedHashMap

            h) java.lang.IllegalArgumentException: Can not set boolean field org.switchyard.remote.RemoteMessage._fault to java.util.Date

             

            and so on, also:

             

            java.lang.StackOverflowError:

                      at java.util.Random.nextInt(Random.java:239)

                      at sun.misc.Hashing.randomHashSeed(Hashing.java:254)

                      at java.util.HashMap.<init>(HashMap.java:255)

                      at java.util.HashMap.<init>(HashMap.java:305)

                      at java.util.LinkedHashMap.<init>(LinkedHashMap.java:198)

                      at org.switchyard.serial.graph.node.MapNode.decompose(MapNode.java:78)

                      at org.switchyard.serial.graph.Graph.decomposeReference(Graph.java:145)

                      at org.switchyard.serial.graph.node.MapNode.decompose(MapNode.java:81)

                      at org.switchyard.serial.graph.Graph.decomposeReference(Graph.java:145)

                      at org.switchyard.serial.graph.node.MapNode.decompose(MapNode.java:81)

             

            .... and the last 2 lines repeat 1000 times in a row :-)
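            For reference, the "Can not set ... field ... to ..." messages above are the JVM's standard reflection behavior when a value of the wrong type is written into a field, which suggests the deserializer is wiring the wrong objects into the graph under load. A minimal stand-alone reproduction of the error shape (this is illustrative, not SwitchYard code; the class and field names are hypothetical):

```java
import java.lang.reflect.Field;

// Illustrative reproduction of the error shape seen above: reflectively
// writing an object of the wrong type into a primitive boolean field
// (standing in for RemoteMessage._fault) throws IllegalArgumentException
// with a "Can not set boolean field ... to ..." message.
public class WrongFieldTypeDemo {
    static class RemoteMessageLike {
        boolean fault;   // stand-in for RemoteMessage._fault
    }

    // Returns "ok" if the write succeeds, otherwise the exception message.
    public static String tryWrite(Object value) {
        try {
            Field f = RemoteMessageLike.class.getDeclaredField("fault");
            f.setAccessible(true);
            f.set(new RemoteMessageLike(), value);
            return "ok";
        } catch (IllegalArgumentException e) {
            return e.getMessage();   // e.g. "Can not set boolean field ..."
        } catch (ReflectiveOperationException e) {
            return e.toString();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryWrite("not a boolean"));  // fails: wrong type
        System.out.println(tryWrite(Boolean.TRUE));     // succeeds: unboxes
    }
}
```

            Since the field write itself is deterministic, the randomness in which type shows up points at shared mutable state in the serializer's graph decoding rather than at reflection.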

             

             

             

            2. We can certainly try and reproduce.  There are a number of things I want to change about your app - for example, the SCA binding for the bpm service is using a WSDL interface when it should be promoted with the same Java interface used by the process service.

            3. Recommendations depend *a lot* on what your requirements are.  Can you describe what this application is expected to do?  Is there a current system in place, and if so, what performance do you see there? 

            2. Is this an issue for SY? It was not prohibited, and I tried it to add complexity to the example :-)

            3. Requirements:

             

            a) Automation workflow system with integration bus capabilities

            b) High-throughput (e.g. 1000s of TPS per node)

            c) Interactions with existing infrastructure services, databases (SQL and NoSQL), external Internet sites and so on

            d) Decision and workflow engine and tooling for rapid development

            e) Clustering and high availability, with support for persistence and restoring from a stopped state

             

            BR,

            Viktor

            • 3. Re: SwitchYard performance pitfalls
              dward

              Regarding serialization performance, there are definitely some things we have the capability to enable that we haven't tried yet: for example, numeric JSON, or even other formats entirely.  We can also enable compression/decompression of the serialized data in a streaming fashion as it goes out to and comes in from the wire.  These have always been things deliberately on the roadmap to investigate and test.

               

              Regarding the serialization errors you encountered, specifically the IllegalArgumentExceptions and the StackOverflowError, those are disconcerting.  It looks like there is some nondeterministic behavior there that rears its head only under load / in multi-threaded scenarios.  I greatly appreciate you bringing it to our attention and attaching the load test/data for us to investigate with.  I will personally look into it.

              • 4. Re: SwitchYard performance pitfalls
                synclpz

                For serialization, the Kryo framework seems to be very fast, though I'm not sure about the licensing (it's BSD-licensed). Jackson should also be fast enough; I think there may simply be some issue with the implementation. Compression should be optional, because it may require additional CPU resources, but that needs testing...
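                The compression trade-off is easy to measure with the JDK's built-in gzip support: repetitive JSON-like payloads shrink substantially, at the cost of extra CPU per message. The payload shape below is invented purely for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Sketch of the size side of the compression trade-off: gzip a repetitive
// JSON-like payload (standing in for a serialized remote message) and
// compare sizes. The CPU cost per message is the other half of the
// trade-off and would need to be benchmarked under load.
public class GzipTradeoffDemo {

    public static byte[] gzip(byte[] raw) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Build a repetitive JSON-ish array, as message payloads often are.
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < 1000; i++) {
            sb.append("{\"name\":\"order\",\"value\":").append(i).append("},");
        }
        sb.append("]");
        byte[] raw = sb.toString().getBytes(StandardCharsets.UTF_8);
        byte[] packed = gzip(raw);
        System.out.println("raw=" + raw.length + " gzip=" + packed.length);
    }
}
```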

                 

                Concerning the errors, it's strange, but in the JBoss log I got only ONE exception (for that test run), and no others:

                 

                05:51:47,807 ERROR [org.apache.catalina.core.ContainerBase.[jboss.web].[default-host].[/switchyard-remote].[SwitchYardRemotingServlet]] (http-/0.0.0.0:8080-86) JBWEB000236: Servlet.service() for servlet SwitchYardRemotingServlet threw exception: java.lang.IllegalArgumentException: Can not set boolean field org.switchyard.remote.RemoteMessage._fault to org.switchyard.internal.DefaultContext

                        at sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:146) [rt.jar:1.6.0_41]

                        at sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:150) [rt.jar:1.6.0_41]

                        at sun.reflect.UnsafeBooleanFieldAccessorImpl.set(UnsafeBooleanFieldAccessorImpl.java:68) [rt.jar:1.6.0_41]

                        at java.lang.reflect.Field.set(Field.java:657) [rt.jar:1.6.0_41]

                        at org.switchyard.common.type.reflect.FieldAccess.write(FieldAccess.java:124) [switchyard-common-1.0.0.Final.jar:1.0.0.Final]

                        at org.switchyard.serial.graph.node.AccessNode$1.run(AccessNode.java:155) [switchyard-serial-1.0.0.Final.jar:1.0.0.Final]

                        at org.switchyard.serial.graph.Graph.decomposeRoot(Graph.java:131) [switchyard-serial-1.0.0.Final.jar:1.0.0.Final]

                        at org.switchyard.serial.graph.GraphSerializer.deserialize(GraphSerializer.java:61) [switchyard-serial-1.0.0.Final.jar:1.0.0.Final]

                        at org.switchyard.component.sca.SwitchYardRemotingServlet.doPost(SwitchYardRemotingServlet.java:67) [switchyard-component-sca-1.0.0.Final.jar:1.0.0.Final]

                        at javax.servlet.http.HttpServlet.service(HttpServlet.java:754) [jboss-servlet-api_3.0_spec-1.0.2.Final-redhat-1.jar:1.0.2.Final-redhat-1]

                        at javax.servlet.http.HttpServlet.service(HttpServlet.java:847) [jboss-servlet-api_3.0_spec-1.0.2.Final-redhat-1.jar:1.0.2.Final-redhat-1]

                        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:295)

                        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:214)

                        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230)

                        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:149)

                        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:145)

                        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:97)

                        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:102)

                        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:336)

                        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:856)

                        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:653)

                        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:920)

                        at java.lang.Thread.run(Thread.java:662) [rt.jar:1.6.0_41]


                 

                Thanks for the attention; I'm going to put my SY into production soon.

                 

                It seems all the bugs are in the SY serialization framework, so maybe it would be better to integrate services through, for example, HornetQ? What do you think?

                 

                Message was edited by: synclpz

                • 5. Re: SwitchYard performance pitfalls
                  dward

                  A jira for this issue has been opened here; feel free to "Watch" it: SWITCHYARD-1620

                  • 6. Re: SwitchYard performance pitfalls
                    dward

                    I have a fix (see linked pull request) on the jira now for the serialization errors.