1 2 Previous Next 20 Replies Latest reply on Apr 19, 2018 10:32 AM by Andrew Dinn Branched to a new discussion.

    Server startup too slow because of potential logic issue in LoadCache.lookupClass

    Rajiv Shivane Newbie

      We noticed our server startup became very slow when we setup our Agent. Investigating further we notice that LoadCache.lookupClass is called for the class "com.sun.xml.internal.bind.v2.runtime.unmarshaller.Receiver" (which is in boot classpath) with the Tomcat Application Classloader as the parameter. However, unlike other classes, this method is called with the same loader thousands of times for this class.

       

      Looking at the LoadCache code: https://git.io/vxPJD it appears that if lookupClass is looking for a boot class in boot classloader it would work (cache the class in bootMap). However, if it searches for boot class in the hierarchy of application classloader, then in infact does not cache the boot class, which results in the search executing again and again.

       

      When we call LoadCache.lookupClass("com.sun.xml.internal.bind.v2.runtime.unmarshaller.Receiver", AppClassLoader), the method loops to search over the classloader parent hierarchy (LibClassLoader, SystemClassLoader, ExtClassloader) to finally check in boot classloader (null). At this point loaderMap is being set to bootMap (via line 114 loaderMap = bootMap). However, the loop does not execute the search for the class in bootclassloader (ie line #94 is not executed with inst.getInitiatedClasses(null);) because the do-while loop exits due to the condition while(loader!=null)

       

      It appears we could get around this by changing the do-while loop to:

      while(true)

      {

        Class[] classes = inst.getInitiatedClasses(loader);

         for (int i = 0; i < classes.length; i++) {

        clazz = classes[i];

         if (clazz.getName().equals(name)) {

         // ok, install this class in the base map and the map we found it in
         synchronized (loaderMap) {

        loaderMap.put(name, clazz);

         if (loader != baseLoader) {

         synchronized (baseLoaderMap) {

        baseLoaderMap.put(name, clazz);

        }

        }

         return clazz;

        }

        }

        }

       

         if (loader == null) {

         break;

        }

        loader = loader.getParent();

         if (loader == null) {

        loaderMap = bootMap;

        } else {

         synchronized (loaderMaps) {

        loaderMap = loaderMaps.get(loader);

         if (loaderMap == null) {

        loaderMap = new HashMap<String, Class>();

         loaderMaps.put(loader, loaderMap);

        }

        }

        }

      };

        • 1. Re: Server startup too slow because of potential logic issue in LoadCache.lookupClass
          Andrew Dinn Master

          Hi Rajiv,

           

          Oh, very nice catch! Thank you for spotting and reporting this very dumb mistake (doh!) and also for suggesting a fix. I have had a few reports of performance issues with this part of the agent code that were troubling me. However, I never had a clear enough pointer to where the difficulty lay to be able to work out where I had gone wrong. I am sure this fix will be a big help to many Byteman users so I am very grateful to have my error pin-pointed so accurately :-)

           

          I have raised BYTEMAN-363  for this bug. I have been planning a new release for the last 2 weeks but it has been delayed by important work on OpenJDK. Luckily, that means I should be able to get the fix for this bug into that release!

           

          Have you already tested the patch you suggested? Or do you want me to fix it and deploy a snapshot jar to the Maven Central repo for you to test?

           

          Thanks very much for your help.

           

          regards,

           

           

          Andrew Dinn

          • 2. Re: Server startup too slow because of potential logic issue in LoadCache.lookupClass
            Rajiv Shivane Newbie

            Hi Andrew,

             

            No, I did not get a chance to test the patch yet. So I don't know for sure how much performance gain this fix actually gives

             

            I need to figure out how to get my project's maven to pick up my locally modified byteman.jar instead of the one from central repo. Having a fix available as a snapshot on maven central would make it a lot more convenient for me to build and test my project.

             

            Thanks,

            Rajiv

            • 3. Re: Server startup too slow because of potential logic issue in LoadCache.lookupClass
              Andrew Dinn Master

              Hi Rajiv,

              rajiv.shivane  wrote:

               

              No, I did not get a chance to test the patch yet. So I don't know for sure how much performance gain this fix actually gives

               

              I need to figure out how to get my project's maven to pick up my locally modified byteman.jar instead of the one from central repo. Having a fix available as a snapshot on maven central would make it a lot more convenient for me to build and test my project.

               

              Well . . . Are you using BMUnit for testing? If so then it is easy to make it load the byteman agent from a jar in a local build tree. The magic is actually explained in Byteman tutorial number 2 where the relevant pom settings are cleared precisely to avoid picking up a local jar. If you scroll down through the example pom settings you will see that the surefire test configuration includes the following settings:

               

                ...
                <!-- ensure we don't inherit a byteman jar from any env settings -->
                <environmentVariables>
                  <BYTEMAN_HOME></BYTEMAN_HOME>
                </environmentVariables>
                <systemProperties>
                  <property>
                    <name>org.jboss.byteman.home</name>
                    <value></value>
                  <property>
                  ...
                <systemProperties>

               

              When the BMUnit runner tries to load the agent it looks for a local byteman.jar in the dir identfiied by environment variable $BYTEMAN_HOME (%BYTEMAN_HOME% on Windows) or, failing that, by system property org.jboss.byteman.home. It is only after both these lookups fail that BMUnit tries to locate byteman.jar (or byteman-x.y.x.jar) on the classpath. So, this pom is set up for people who have also insatleld a downlaoded version. It ensures that the pom always uses the jar from the classpath (which ought to match the BMUnit jar, assuming the pom dependencies are correct). If you reset BYTEMAN_HOME to point to a local install of the byetman code then BMUnit will use the jar from that install instead. The dir pointed to should look like what you get when you unzip the release zip files that get built as the download target i.e. the dir should have a subdir called lib which contains byteman.jar.

               

              Also, I don;t know if you have tried building your patched tree locally but if so then to build it form the master branch then you just run mvn install with a JDK9/10 java in your PATH. If that is a bit too leading edge then switch to git branch 3.X which you can build with a JDK6/7/8 java.

               

              regards,

               

               

              Andrew Dinn

              • 4. Re: Server startup too slow because of potential logic issue in LoadCache.lookupClass
                Andrew Dinn Master

                Hi Rajiv,

                 

                I have pushed a 4.0.2-SNAPSHOT release of Byteman to the Sonatype snapshots repo. I'd be grateful if you could test it with your app to see if it fixes the load cache insertion failure.

                 

                If it works then I will do a full 4.0.2 release ASAP.

                 

                regards,

                 

                 

                Andrew Dinn

                • 5. Re: Server startup too slow because of potential logic issue in LoadCache.lookupClass
                  Rajiv Shivane Newbie

                  Hi Andrew, I tried building my project with 4.0.2-SNAPSHOT, but it is failing some tests.

                  We are currently on 3.0.11, which works fine. I also tried with 4.0.1 and everything works well.

                  So it is some change in byteman between 4.0.1 and 4.0.2-SNAPSHOT that is causing the issue. I tried to lookup the changes between these two versions on github, but github does not have 4.0.1 tag.

                  Is there a way for me to see the list of changes between 4.0.1 and 4.0.2-SNAPSHOT?

                  Also is there release notes where I can see differences between 3.0.11 and 4.0.1?

                   

                  Thanks,

                  Rajiv

                  • 6. Re: Server startup too slow because of potential logic issue in LoadCache.lookupClass
                    Rajiv Shivane Newbie

                    Do you plan to ship the fix as part of 3.0.12? I do not see a snapshot for 3.0.12 on sonatype. Wondering if it would be a smaller change from 3.0.11 to 3.0.12 (instead of 3.0.11 to 4.0.2) to certify our project.    

                    • 7. Re: Server startup too slow because of potential logic issue in LoadCache.lookupClass
                      Rajiv Shivane Newbie

                      Looks like 4.0.2 has a new method ensureHelperClass which is throwing NPE in our case. I see that this change is in 3.X branch too, so trying with 3.0.12-SNAPSHOT will have the same issue.

                      Will investigate further.

                       

                       

                      java.lang.NullPointerException

                              at org.jboss.byteman.rule.Rule.ensureHelperClass(Rule.java:1007)

                              at org.jboss.byteman.rule.Rule.typeCheck(Rule.java:568)

                              at org.jboss.byteman.rule.Rule.ensureTypeCheckedCompiled(Rule.java:521)

                              at org.jboss.byteman.rule.Rule.execute(Rule.java:796)

                              at org.jboss.byteman.rule.Rule.execute(Rule.java:777)

                              at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java)

                      • 8. Re: Server startup too slow because of potential logic issue in LoadCache.lookupClass
                        Andrew Dinn Master

                        Hi Rajiv,

                         

                        There are several other fixes in the 4.0.2-SNAPSHOT release. I think the one you are looking at is BYTEMAN-362

                         

                          https://github.com/bytemanproject/byteman/commit/3e366bc979e37d40f639b06851dd995ec6dcff0b

                         

                        This adds an explicit check to see if the helper class is final and throws an exception if so. However, the error occurs on line 1007, before this check is executed. So, I don't see how this can be the cause of the problem.

                         

                        Since the NPE is thrown on this line it appears that it is helperLoader which is null. There is another related change for BYTEMAN-357 which, amongst other things, affects the intiialization of helperLoader:

                         

                          https://github.com/bytemanproject/byteman/commit/10a3326ba53afc3b74d246bd8d4066bad919ea5d

                         

                        The point of this change isi to defer the load of the helper form the point where the Rule is created (i.e. inject time) to the point where it is type checked (i.e first trigger-time). I cannot see how this could be the cause of the problem either. In both the old code and the new code the same call to loadClass occurs in the same circumstances. In the new code we have

                         

                        if (helperClass == null) {
                          try {
                            helperLoader.loadClass(helperToUse);
                            . . .
                        
                        

                         

                        Now, the 3 fields in play here, helperClass, helperToUse and helperLoader, are all set in the constructor for the rule. This does allow for a case where helperLoader is null but that only happens when helperClass is set to Helper.class i.e. when it is non-null:

                         

                        this.helperClass = null;
                        . . .
                        if (helperName != null || isCompileToBytecode() || imports.length > 0) {
                          String helperToUse = (helperName != null) ? helperName : Helper.class.getName();
                          this.helperLoader = getModuleSystem().createLoader(targetLoader, imports);
                          this.helperToUse = helperToUse;
                          typeGroup = new TypeGroup(helperLoader);
                        } else {
                          this.helperLoader = null;
                          this.helperToUse = Helper.class.getName();
                          helperClass = Helper.class;
                          typeGroup = new TypeGroup(targetLoader);
                        }

                         

                        If we take the false (else) branch setting helperLoader = null then helperClass will be assigned to Helper.class. So, line 1007 will not be executed. If we take the true (if) branch then helperClass will be left as null. However, helperLoader should always be non-null in that case. The call to getModuleSystem() returns an instance of NonModuleSystem or JBossModuleSystem. NonModuleSystem always returns a new ClassByteClassLoader, JBossModuleSystem also always returns a loader, even for error cases, either a ModuleUsingClassbyteClassLoader or a ClassbyteClassLoader. So, I really don't understand how the NPE is happening here.

                         

                        I would really like to get to the bottom of this error. Could you run either

                          i) provide me with instructions how to run your test

                          ii) provide me with a simple reproducer that manifests the problem or

                          iii) run this yourself in a debugger to check that the NPE is caused by helperLoader being null and, if possible, also step through the initialization code to see how it is being set

                         

                        Thanks for whatever further info you can provide.

                         

                         

                        Andrew Dinn

                         

                        1 of 1 people found this helpful
                        • 9. Re: Server startup too slow because of potential logic issue in LoadCache.lookupClass
                          Rajiv Shivane Newbie

                          Hi Andrew, Thanks for the details. I reached the same conclusion that helperLoader should always be non-null.

                           

                          Fortunately, the issue was replicable in debug mode too. I was able to connect using a debugger and identify the issue in our code.

                          We were holding a weak reference to a custom classloader in our ModuleSystem, that was getting garbage collected, causing the NPE.

                           

                          After fixing this issue, I can confirm that all our functionality is working fine with 4.0.2-SNAPSHOT!

                          However, I need to work with a customer to confirm the performance improvement. I will do that tomorrow.

                           

                          Thanks,

                          Rajiv

                          • 10. Re: Server startup too slow because of potential logic issue in LoadCache.lookupClass
                            Andrew Dinn Master

                            rajiv.shivane  wrote:

                             

                            Do you plan to ship the fix as part of 3.0.12? I do not see a snapshot for 3.0.12 on sonatype. Wondering if it would be a smaller change from 3.0.11 to 3.0.12 (instead of 3.0.11 to 4.0.2) to certify our project.    

                             

                            The master branch of Byteman needs JDK9 to build. The 3.X branch is being maintained in order to provide a version of Byteman that can be built on RHEL using JDK8 (this will continue until JDK11 becomes the default build JDK for RHEL). Changes made to the 4.X series in the master repo which are not specific to JDK9 are backported to 3.X in order to keep the two versions as close as possible in functionality. So, yes, I do plan to ship this fix in 3.0.12.

                             

                            There is supposed to be no noticeable difference in operation between a 3.X release and a 4.X release if you are running on JDK8 or lower (3.X will actually work for many cases on JDK9 or higher but you may see problems if you try to access non-public fields or invoke non-public methods of classes which are not exported by a Jigsaw module).

                             

                            The only internal differences between the releases are a certain amount of refactoring to allow for the presence of modules and a change to store rule variables as instance fields of a generated helper subclass rather than in a separate hash table. Otherwise, all patches applied to 4.X release that do not relate to JDK9-specific code have also been applied to 3.X. Neither of these internal differences is supposed to affect observable behaviour for Byteman client code but, of course, they might unintentionally do so.

                             

                            I have just pushed a 3.0.12-SNAPSHOT release to the Sonatype snapshots repo. Perhaps you could test with that version to see if the problem is specific to the 4.0.2-SNAPSHOT or also manifests with the 3.0.12-SNAPSHOT.

                             

                            regards,

                             

                             

                            Andrew Dinn

                            • 11. Re: Server startup too slow because of potential logic issue in LoadCache.lookupClass
                              Rajiv Shivane Newbie

                              adinn  wrote:

                               

                              There is supposed to be no noticeable difference in operation between a 3.X release and a 4.X release if you are running on JDK8 or lower (3.X will actually work for many cases on JDK9 or higher but you may see problems if you try to access non-public fields or invoke non-public methods of classes which are not exported by a Jigsaw module).

                               

                              The only internal differences between the releases are a certain amount of refactoring to allow for the presence of modules and a change to store rule variables as instance fields of a generated helper subclass rather than in a separate hash table. Otherwise, all patches applied to 4.X release that do not relate to JDK9-specific code have also been applied to 3.X. Neither of these internal differences is supposed to affect observable behaviour for Byteman client code but, of course, they might unintentionally do so.

                               

                              Thanks Andrew, this is very helpful background. We are not supporting JDK 9 yet, but sounds like it would be better for us to be on 4.X since 4.X seems to be run-time compatible with both JDK8 and JDK9

                               

                              adinn  wrote:

                               

                              I have just pushed a 3.0.12-SNAPSHOT release to the Sonatype snapshots repo. Perhaps you could test with that version to see if the problem is specific to the 4.0.2-SNAPSHOT or also manifests with the 3.0.12-SNAPSHOT.

                              Thanks for pushing 3.0.12-SNAPSHOT. I have confirmed that the fix on our side and 3.0.12-SNAPSHOT are fully functional. I will confirm the performance improvements tomorrow. But if you are pressed for time to make the release, I think you should go ahead since all functionality seems fine.

                               

                              Apologies for having you chase a red herring on this NullPointerException; and many thanks for all your support!

                               

                              Regards,

                              Rajiv

                              • 12. Re: Server startup too slow because of potential logic issue in LoadCache.lookupClass
                                Andrew Dinn Master

                                Hi Rajiv,

                                rajiv.shivane  wrote:

                                 

                                Hi Andrew, Thanks for the details. I reached the same conclusion that helperLoader should always be non-null.

                                 

                                Thanks for the confirmation. I was beginning to doubt my own sanity or, worse, my ability to read code correctly! (some of my best code has been both readable and, arguably, borderline insane :-)

                                 

                                rajiv.shivane  wrote:

                                 

                                Fortunately, the issue was replicable in debug mode too. I was able to connect using a debugger and identify the issue in our code.

                                We were holding a weak reference to a custom classloader in our ModuleSystem, that was getting garbage collected, causing the NPE.

                                 

                                After fixing this issue, I can confirm that all our functionality is working fine with 4.0.2-SNAPSHOT!

                                However, I need to work with a customer to confirm the performance improvement. I will do that tomorrow.

                                 

                                Wow, that's an interesting outcome. I had never considered that this code might be hostage to a weak reference.

                                 

                                Glad to hear it is now working. Let me know how much it helps performance once you are able to run a test. If it is really significant then I need to do a bit of explaining to one or two people who have mentioned a similar performance issue.

                                 

                                regards,

                                 

                                 

                                Andrew Dinn

                                • 13. Re: Server startup too slow because of potential logic issue in LoadCache.lookupClass
                                  Andrew Dinn Master

                                  rajiv.shivane  wrote:

                                   

                                  Thanks Andrew, this is very helpful background. We are not supporting JDK 9 yet, but sounds like it would be better for us to be on 4.X since 4.X seems to be run-time compatible with both JDK8 and JDK9

                                   

                                  4.X is the preferred release train not just because the 3.X series is a legacy RHEL artefact. The fix I mentioned which is not about JDK9 (using Helper instance fields in place of a hash map) was done as a performance improvement (I had been wanting to do it for about 5 years so I used JDK9 as the excuse to push it in :-).

                                   

                                  Storing rule bindings as helper instance fields is much more efficient than using an external hash map both for space (no need to create a HashMap instance for every rule triggering) and time (trading off 1 hash table initialize and N hash table gets/puts using the var name String as key vs an already present object init cost and N field gets/puts).

                                   

                                  The allocs of both helper and HashMap instances will usually be short-lived so ought to be pretty much free (perhaps a small increase in young-gen collect overhead, depending on how hard your app is driving the memory system). The field vs HashMap access trade off applies not just for rule local vars introduced in a BIND clause but also for special vars like $1, $2, $this, $! etc. So, improvement to rule execution time is probably going to be more significant. With only a few rules neither of these costs will not really matter. If you are using a lot of rules, say to trace execution of many classes, then it may make a noticeable saving.

                                   

                                  regards,

                                   

                                   

                                  Andrew Dinn

                                  • 14. Re: Server startup too slow because of potential logic issue in LoadCache.lookupClass
                                    Rajiv Shivane Newbie

                                    adinn  wrote:

                                    4.X is the preferred release train not just because the 3.X series is a legacy RHEL artefact. The fix I mentioned which is not about JDK9 (using Helper instance fields in place of a hash map) was done as a performance improvement (I had been wanting to do it for about 5 years so I used JDK9 as the excuse to push it in :-).

                                    Sweet!

                                    adinn  wrote:

                                    Storing rule bindings as helper instance fields is much more efficient than using an external hash map both for space (no need to create a HashMap instance for every rule triggering) and time (trading off 1 hash table initialize and N hash table gets/puts using the var name String as key vs an already present object init cost and N field gets/puts).

                                    Sounds great. We will definitely move to 4.X

                                     

                                    Now that you mention switching from a HashMap to field, it reminds me of a feature request. I will start a new thread about it, but here is roughly what I was thinking:

                                    In our usage, there are many times a RuleHelper wants to store some properties against an object that was instrumented. For example when a user does DataSource.getConnection(), we want to store the Datasource.name onto the connection, so that later when we enter/exit methods on the Connection object the RuleHelper has access to the DataSource name. Today we keep these properties against the object in a WeakHashMap. It would be a lot more convenient if we added a HashMap field to all the classes we do bytecode transformation and the RuleHelpers had access to this HashMap. That way it would be simpler, performant, less prone to GC errors etc

                                    1 2 Previous Next