6 Replies Latest reply on Jul 29, 2003 12:21 PM by jonlee

    Parallel Java thread performance under NPTL

    jonlee

      We side-tracked slightly while looking at JBoss performance. We saw noticeable variations in response times, and changes in throughput significant enough to make us question our assumptions about thread performance in a multiprocessor environment.

      The results of our investigation of Java threads on an 8-way Intel multiprocessor Linux system with NPTL (Native POSIX Thread Library) are interesting and not what we expected. We compared the performance characteristics of the IBM and Sun Java implementations with respect to true parallel throughput.

      We discovered what appears to be an exponential increase in completion time as the number of threads increases for one implementation, while the other shows a more linear trend. We do caution that other effects may temper the impact of the findings. You can find the document at http://www.amitysolutions.com.au/documents/NPTL_Java_threads.pdf. We hope it is of use to those pursuing performance issues with applications and JBoss.

        • 1. Re: Parallel Java thread performance under NPTL

          Have you tried measuring the times externally?

          People are generally more interested in the response time
          to the client.
          It would eliminate any difference caused by the
          jre's implementation of System.err or Date.

          Also, try using a known implementation,
          something other than HashMap,
          so you are not dependent on how good the jre collection
          implementation is.

          Regards,
          Adrian

          • 2. Re: Parallel Java thread performance under NPTL
            jonlee

            Good points and well worth addressing.

            One of the reasons I used a HashMap was to have a sufficiently complex operation that would not be trivialised by the JITC or HotSpot. As it is used to pad out the work and provide a constant timed load, in theory it should only shift the polynomial function relating average duration to the number of threads up or down the y axis. I'm not too worried if we see linear trends and the points only move up or down. I am worried if we see a discernible non-linear and accelerating expansion.

            A replacement with the following pad out:

            Object thing;
            for (int i = length; i-- > 0; )
            {
                for (int j = 100; j-- > 0; )
                    thing = new Object();
            }

            The observed results (length = 100000):
            Sun JDK 1.4.1_03
            1 thread - 795 ms
            8 threads - min 219466 ms, max 234725 ms

            IBM 1.4.1 (NPTL)
            1 thread - 33 ms
            8 threads - min 9 ms, max 33 ms

            Regardless of the different starting point, the expansion of the duration is very noticeable, even to an external observer. However, I believe the IBM system is also performing some optimizations with a simple example like this. Also, the creation of an object seemed to have a relatively high cost for the Sun implementation.

            OK. What if I take the object creation out of the HashMap put?

            Object thing = new Object();
            HashMap map = new HashMap();
            for (int i = length; i-- > 0; )
            {
                for (int j = 100; j-- > 0; )
                    map.put(key, thing);
            }

            The observed results (length = 100000):
            Sun JDK 1.4.1_03
            1 thread - 384 ms
            8 threads - min 4750 ms, max 12629 ms

            IBM 1.4.1 (NPTL)
            1 thread - 257 ms
            8 threads - min 58 ms, max 434 ms
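The harness behind these figures is not shown in the thread. A minimal sketch of how such min/max timings could be collected is given below, under several assumptions: every thread gets its own HashMap (sharing one unsynchronised map across threads would be unsafe), all threads are released together by a latch, and timing uses System.nanoTime from a modern JDK rather than whatever clock the original 1.4 test used. The class name ParallelPadBenchmark and the key object are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;

public class ParallelPadBenchmark {

    // Pad-out load modelled on the post: repeated puts of one value under one key.
    // The key is hypothetical; the post does not show how it was defined.
    public static int padLoad(int length) {
        Object key = new Object();
        Object thing = new Object();
        Map<Object, Object> map = new HashMap<>();
        for (int i = length; i-- > 0; ) {
            for (int j = 100; j-- > 0; ) {
                map.put(key, thing);
            }
        }
        return map.size(); // returned so the work has an observable result
    }

    // Runs the load on `threads` threads released together and
    // returns each thread's elapsed time in milliseconds.
    public static long[] run(int threads, int length) {
        long[] elapsedMs = new long[threads];
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                try {
                    start.await(); // wait until every worker has been created
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                long begin = System.nanoTime();
                padLoad(length);
                elapsedMs[id] = (System.nanoTime() - begin) / 1_000_000;
            });
            workers[t].start();
        }
        start.countDown();
        try {
            for (Thread w : workers) {
                w.join(); // join makes each worker's write to elapsedMs visible here
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return elapsedMs;
    }

    public static long min(long[] values) {
        long m = values[0];
        for (long v : values) m = Math.min(m, v);
        return m;
    }

    public static long max(long[] values) {
        long m = values[0];
        for (long v : values) m = Math.max(m, v);
        return m;
    }

    public static void main(String[] args) {
        int length = 100_000;
        for (int threads : new int[] {1, 8}) {
            long[] ms = run(threads, length);
            System.out.println(threads + " thread(s) - min " + min(ms)
                    + " ms, max " + max(ms) + " ms");
        }
    }
}
```

Because each thread performs the same fixed amount of work, ideal scaling on a machine with at least as many CPUs as threads would leave the per-thread duration roughly unchanged as threads are added.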

            I suspect simple operations would be trivialised by the JITC/HotSpot, so I've avoided such cases. Besides, since the JBoss code uses Collection classes a lot, it seemed in the spirit of testing.

            From the perspective of something like ECperf, the following runs give an idea of the frustrations in testing.

            Sun 1.4.2 - txRate 8, 720 s sample period
            JBoss run 1 - avg. res. 2.450s, Max r. 9.897
            JBoss run 2 - avg. res. 1.429s, Max r. 6.059

            Sun 1.4.2 - txRate 9, 720 s sample period
            JBoss run 1 - avg. res. 1.503s, Max r. 9.471
            JBoss run 2 - avg. res. 3.577s, Max r. 13.111

            As a separate test, running parallel threads doing JNDI lookups produced increasing response times (observed from the client thread) similar to those observed for concurrent operations on a single CPU. That was where I originally started questioning complex parallel tasks.

            Of course, I did put provisos in the paper that there are going to be other contributing factors that may cancel the effect observed. But my concern, and perhaps that is too strong a word, is the apparent accelerating rate of increase as indicated by the relative magnitudes.
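To make the "relative magnitudes" concrete, the sketch below divides the worst 8-thread duration by the 1-thread duration for each of the four result sets posted above. It assumes each thread runs the same fixed load, so ideal scaling on 8 CPUs would give a ratio near 1. The class and method names are hypothetical; the figures are copied from the results in this post.

```java
public class SlowdownRatio {

    // Ratio of the worst parallel duration to the single-thread duration.
    public static double ratio(long singleMs, long parallelMaxMs) {
        return (double) parallelMaxMs / singleMs;
    }

    public static void main(String[] args) {
        String[] labels = {
            "Sun, new Object() load",
            "IBM, new Object() load",
            "Sun, HashMap put load",
            "IBM, HashMap put load"
        };
        // {1-thread ms, 8-thread max ms}, as posted in this thread.
        long[][] data = {
            {795, 234725},
            {33, 33},
            {384, 12629},
            {257, 434}
        };
        for (int i = 0; i < labels.length; i++) {
            System.out.printf("%-24s %.1fx%n", labels[i], ratio(data[i][0], data[i][1]));
        }
    }
}
```

On the posted figures this comes to roughly 295x and 33x for the Sun runs against 1.0x and 1.7x for the IBM runs.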

            • 3. Re: Parallel Java thread performance under NPTL

              I would expect a second run to be faster than the first.
              The first run allows the JIT to do its work.

              That the second run is slower seems strange.
              It might point to a major point of contention
              that is not doing fair queuing?
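One way to separate the JIT warm-up effect from steady-state behaviour in a small test is to run the load a few times unmeasured before timing it. The sketch below illustrates that idea only; it is not the ECperf procedure, and the class name, the load, and the run counts are all assumptions chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class WarmUpTiming {

    // Stand-in load (hypothetical): repeated puts into a HashMap.
    public static int load(int length) {
        Map<Integer, Object> map = new HashMap<>();
        Object thing = new Object();
        for (int i = 0; i < length; i++) {
            map.put(i % 100, thing);
        }
        return map.size();
    }

    // Runs the load `warmUpRuns` times without timing, then returns
    // the elapsed milliseconds of each of `measuredRuns` timed runs.
    public static long[] timeRuns(int warmUpRuns, int measuredRuns, int length) {
        for (int i = 0; i < warmUpRuns; i++) {
            load(length); // gives the JIT a chance to compile the hot path
        }
        long[] elapsedMs = new long[measuredRuns];
        for (int i = 0; i < measuredRuns; i++) {
            long begin = System.nanoTime();
            load(length);
            elapsedMs[i] = (System.nanoTime() - begin) / 1_000_000;
        }
        return elapsedMs;
    }

    public static void main(String[] args) {
        long[] ms = timeRuns(5, 5, 1_000_000);
        for (int i = 0; i < ms.length; i++) {
            System.out.println("measured run " + (i + 1) + ": " + ms[i] + " ms");
        }
    }
}
```

If measured runs still get slower over time after warm-up, JIT compilation is unlikely to be the cause, which would leave contention as a candidate.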

              Do you have the details for each measurement?
              Do you have something that can show where waits
              are occurring? e.g. OptimizeIT

              Regards,
              Adrian

              • 4. Re: Parallel Java thread performance under NPTL
                jonlee

                These are actual clean start runs to ensure fairness and non-bias. Each run pumps through between 10,000 and 13,000 business operations so you would expect that you get some sort of code execution stabilisation early on and sufficient samples afterwards to smooth out major fluctuations - all probabilities being equal in the testing model. The ECperf details were recorded, but only the non-derived values.

                I haven't done much in terms of investigation within JBoss - I just wanted to get a broad understanding of the performance curve. Using the IBM SDK under the same conditions does give reasonable results in that they are in the same ball park - it just gives you greater confidence in the results.

                To be honest, my aim was to understand the noise level in the ECperf results and possible sources of noise. And that was only secondary to testing JBoss throughput. At these levels, the CMP 1.1 engine works fine and there are no major roll-off signs in the throughput. Just fluctuations in response and some changes in throughput that can be concerning enough to question the confidence in the reported values.

                I didn't really want to completely pull apart the architecture and isolate components too much - that might consume more time than I can afford to give at present. This is, after all, just preliminary investigative work before diving into the detail. Swimming, not drowning. :) However, if you introduce non-linear aspects to your system, then unless there are checks that govern free-run states, it is possible to increase system instabilities, and by that I mean that you lose a bit of predictability.

                There have been other discussions and we're moving to look at SpecJ for various reasons. But it was interesting to discover some non-linearity in parallel performance - simply because it is not something that I expected.

                • 5. Re: Parallel Java thread performance under NPTL
                  jonlee

                  We've updated the existing report with some supplemental information that shows a less aggressive curve for the results by removing the new Object() operation - based on some of Adrian's questions. However, an additional report http://www.amitysolutions.com.au/documents/Threads-technote.pdf shows that the initial results may be a reasonably normal characteristic. The new report looks at a more complex load consisting of searching for and separately calling an EJB. Although the code looks simple, the underlying JBoss connection and invocation mechanisms provide greater complexity than the HashMap load.

                  • 6. Re: Parallel Java thread performance under NPTL
                    jonlee

                    Finally, some results that make sense and provide a theory for the observed performance. We've updated the original document with some new findings, and they nicely tie up all the observed phenomena. There are still some things that may cause concern, but as we originally stated, this report was more about recording physical phenomena, commenting on them and, if possible, explaining the observations. The results should not be used to infer any performance in real situations. http://www.amitysolutions.com.au/documents/NPTL_Java_threads.pdf.