8 Replies Latest reply on Feb 25, 2014 5:07 AM by Domonkos Tomcsanyi

    mod_cluster slow-down

    Domonkos Tomcsanyi Newbie

      Hi everyone,

       

      We switched from mod_jk to mod_cluster around 2 months ago. We currently use a single JBoss worker node, but plan to add 2-3 more to the cluster later. It performed without any problems, so we assumed that everything was all right. Today, however, instead of the usual 30-40 users we got a much larger load of 130 users, and that made the system really slow, pretty much unusable. Mainly our Apache instance slowed down to an unbearable level: my single test index.html took around 30 seconds to load. Finally something broke in Apache that made it stop logging (yes, there is a one-hour gap in our access.log). Meanwhile the virtual machine running it showed no real load: the Apache process used around 5-10% CPU, and there was plenty of disk space and free RAM. We restarted the Apache process, but that didn't solve the slowness, so we switched off mod_cluster and changed back to a simple ProxyPass directive. After that the system recovered immediately, served our 130 users easily, and started logging again. I think it is fairly obvious which part of the system is to blame, but I would like to fix this because I worked quite hard to set up mod_cluster and I think it is the right way to do load balancing with Apache and JBoss workers.

      So please help me figure out what might have happened, and tell me if you need log files (though I only have a limited amount of them) or anything else.

       

      About the system: two sites are enabled in Apache: one uses SSL (for our end users), and the other is plain HTTP for receiving MCMP messages. I use AJP to connect to the JBoss node, and I use PersistSlots On to keep all the node information across Apache restarts. On the /mod_cluster-manager page (which itself also took around 30 seconds to load) I could see that the worker node was registered fine.

       

      Thanks.

        • 1. Re: mod_cluster slow-down
          Michal Karm Babacek Apprentice

          Dear Domonkos, this is indeed serious. Do I get it right that there were only 2 contexts served and 1 worker node connected?

          We had this issue: MODCLUSTER-372 Number of registered contexts negatively affects mod_cluster performance

          As is clear from the title, the performance regression was linked to the number of registered contexts.

           

          Please, share your Apache HTTP Server configuration files (mod_cluster, httpd conf and anything you find relevant). I'm especially interested in the SSL settings with regard to the mod_cluster configuration. Furthermore, we would need to know the exact version of your Apache HTTP Server, mod_cluster modules and operating system.

          We will try to simulate the environment and reproduce the issue.

          • 2. Re: mod_cluster slow-down
            Domonkos Tomcsanyi Newbie

            Dear Michal,

             

            Thank you for answering so fast!

             

            There was only one context registered and one worker node connected.

             

            My mod_cluster conf looks like this:

              KeepAliveTimeout 60

              MaxKeepAliveRequests 0

              AdvertiseFrequency 5

              ServerAdvertise On http://10.30.3.3:80

              AdvertiseBindAddress 10.30.3.3

              AdvertiseGroup 224.0.0.40:23040

              PersistSlots On

             

            The IP address 10.30.3.3 is managed by Corosync, and its only purpose is to make mod_cluster advertise on the right network interface. Users come in on a different interface (10.30.2.3, also managed by Corosync), using SSL. The two enabled sites are the default Apache sites (default-ssl and default). "Default SSL" is bound to 10.30.2.3 and uses the snakeoil certificate; nothing has been changed in it. "Default" is bound to 10.30.3.3 and has EnableMCPMReceive in it so it can communicate with the worker node.

             

            Apache version is: 2.2.22-1ubuntu1.4

            OS: Ubuntu Server 12.04.4 x64

            mod_cluster version is 1.2.6 Final x64 from here:

            mod_cluster 1.2.6.Final bin Downloads - JBoss Community

             

            file says this about one of the downloaded binaries:

            /usr/lib/apache2/modules/mod_advertise.so: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, BuildID[sha1]=0x70a7a6492bb71a91daf7ab2d435740e334088575, not stripped

             

            One more interesting thing: we naturally did some testing earlier, and used JMeter to load-test our test environment, which uses the same configuration. JMeter was slow, so we turned off SSL and that made everything faster (possibly a problem with JMeter's SSL implementation, at least that's what we thought): 300 users were logged in simultaneously and used the system without any major hiccups. To still test everything, I used 'ab' to benchmark Apache's throughput (using SSL here!): 20 000 requests on 15 threads, all directed to our system's login page (so they went through mod_cluster). No problems were detected, the throughput was fine, and the load on the server wasn't high.

            • 3. Re: mod_cluster slow-down
              Domonkos Tomcsanyi Newbie

              I just started to wonder, and thought these 3 things should be mentioned here:

              1. I am using the default apache installation on Ubuntu, which is MPM-worker. I have found some threads saying that it is fine, but I still have a really small doubt in my heart whether mod_cluster is thread safe or not.

              2. There are 3 apache instances running on this very same VM, 2 of them using mod_cluster with different configs (so there is a /etc/apache2-test and there is a /etc/apache2-prod directory, containing different configs)

              3. I have this in my apache2.conf:

               

              <IfModule mpm_worker_module>

                  StartServers          2

                  MinSpareThreads      25

                  MaxSpareThreads      75

                  ThreadLimit          64

                  ThreadsPerChild      25

                  MaxClients          150

                  MaxRequestsPerChild   0

              </IfModule>
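To make the numbers in this block concrete, here is a rough sketch of the concurrency ceiling they imply. Under the worker MPM each open connection, including an idle keep-alive connection, occupies one thread, so the ceiling is MaxClients threads in total. The figure of 2 connections per user is a purely hypothetical assumption for illustration, not something measured on this system.

```python
import math

def worker_capacity(max_clients: int, threads_per_child: int) -> dict:
    """Child processes and total threads the worker MPM runs at full load."""
    processes = math.ceil(max_clients / threads_per_child)
    return {"processes": processes, "threads": processes * threads_per_child}

def threads_held(users: int, connections_per_user: int) -> int:
    """Threads occupied if every user keeps this many connections open."""
    return users * connections_per_user

# Values from the <IfModule mpm_worker_module> block above.
capacity = worker_capacity(max_clients=150, threads_per_child=25)
print(capacity)

# ASSUMPTION: 2 open connections per user (hypothetical, not measured).
for users in (40, 130):
    held = threads_held(users, connections_per_user=2)
    print(users, "users ->", held, "threads, exhausted:", held > capacity["threads"])
```

Under that assumption the usual 30-40 users stay below the limit while 130 users exceed it, which would fit the symptoms, but the real connection count per user would have to be checked on the live server.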

              • 4. Re: mod_cluster slow-down
                Jean-Frederic Clere Master

                Was the test index.html a proxied file or a local httpd one?

                If proxied, then the problem is probably in the back-end.

                • 5. Re: mod_cluster slow-down
                  Domonkos Tomcsanyi Newbie

                  It was a local file, in Apache's /var/www folder. The URL /mod_cluster-manager also took 30 seconds to load. I am really suspicious about the MaxClients setting in apache2.conf, because the same kind of slowness appeared later yesterday (when using ProxyPass) and I was able to solve it by increasing the setting from 150 to 400. I checked Apache's /server-status URL and saw that there were no idle workers left, which caused the slow-down.

                  • 6. Re: mod_cluster slow-down
                    Jean-Frederic Clere Master

                    OK, it isn't a mod_cluster problem then.

                    You probably have an issue in the JBoss nodes; probably one application is getting slow and blocks the whole system.

                    Use netstat -na to check how many connections are open between httpd and JBoss. That number is probably increasing when you see the slow-down. Also check the load on the JBoss nodes.
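As a sketch of the netstat check suggested above, the following counts TCP connections to the back-end by state. It assumes Linux-style `netstat -na` output and that JBoss listens for AJP on port 8009 (a common default, but an assumption here). The sample addresses, including 10.30.3.10 for the JBoss node, are made up for illustration.

```python
def count_backend_connections(netstat_output: str, backend_port: int = 8009) -> dict:
    """Count TCP rows whose remote address uses backend_port, grouped by state."""
    counts = {}
    for line in netstat_output.splitlines():
        fields = line.split()
        # Linux TCP rows: proto, recv-q, send-q, local address, remote address, state
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        remote, state = fields[4], fields[5]
        if remote.rsplit(":", 1)[-1] == str(backend_port):
            counts[state] = counts.get(state, 0) + 1
    return counts

# Hypothetical sample; in practice feed in the real output of `netstat -na`.
SAMPLE = """\
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN
tcp        0      0 10.30.3.3:51234         10.30.3.10:8009         ESTABLISHED
tcp        0      0 10.30.3.3:51235         10.30.3.10:8009         ESTABLISHED
tcp        0      0 10.30.3.3:51236         10.30.3.10:8009         TIME_WAIT
tcp        0      0 10.30.2.3:443           192.0.2.15:40022        ESTABLISHED
"""

print(count_backend_connections(SAMPLE))
```

Running this periodically and watching whether the ESTABLISHED count keeps growing during a slow-down would show whether requests are piling up on the JBoss side.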

                    • 7. Re: mod_cluster slow-down
                      Domonkos Tomcsanyi Newbie

                      I'm sorry, but I don't fully understand your point: if the problem is that the JBoss node is slow, why would serving a file from /var/www (which has nothing to do with JBoss or mod_cluster) be slow too? I'm almost sure that it was an Apache problem, maybe a mod_cluster problem (but again, since I found the MaxClients option I have a strong feeling that it was the cause of our problem).

                      The problem is that I can't switch back to mod_cluster to test, because the system is live in production with users constantly using it, and so far I haven't been able to reproduce the issue using JMeter. I will keep trying to reproduce it somehow.

                      • 8. Re: mod_cluster slow-down
                        Domonkos Tomcsanyi Newbie

                        Today we accidentally switched back to AJP and mod_cluster from ProxyPass and the problem occurred again. It is now obvious that the MaxClients directive has been the problem all along: if there are no idle workers left, the slow-down happens (not surprisingly). It is not a mod_cluster issue, it is a pure Apache issue.

                        So to fix such a slow-down, increase MaxClients, and if needed ServerLimit too, in apache2.conf/httpd.conf. To calculate the right value, look at how much memory one apache2 thread uses and multiply it by the number you want to set for MaxClients. Don't forget to leave some RAM for the OS too. Example: you have 2 GB of RAM and each Apache thread uses around 2 MB, so if you set MaxClients to 800, Apache will consume around 1.6 GB, leaving about 400 MB for other processes and the OS.
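The sizing rule above can be sketched as a small calculation. The figures are the ones from the example (2 GB treated as a round 2000 MB, 2 MB per thread, 400 MB reserved). The per-thread memory is only a rough estimate that has to be measured on the actual server. The second function reflects that under the worker MPM, MaxClients cannot exceed ServerLimit x ThreadsPerChild, and ServerLimit defaults to 16 for that MPM.

```python
import math

def max_clients_for_memory(total_mb: int, per_thread_mb: int, reserved_mb: int) -> int:
    """Largest MaxClients that fits in RAM after reserving memory for the OS."""
    return (total_mb - reserved_mb) // per_thread_mb

def server_limit_needed(max_clients: int, threads_per_child: int = 25) -> int:
    """Child processes (ServerLimit) required to run max_clients threads."""
    return math.ceil(max_clients / threads_per_child)

# Numbers from the example above; per_thread_mb is an estimate, not a measurement.
max_clients = max_clients_for_memory(total_mb=2000, per_thread_mb=2, reserved_mb=400)
print("MaxClients:", max_clients)
print("ServerLimit needed:", server_limit_needed(max_clients))
```

With ThreadsPerChild at 25, anything above 400 threads needs more than 16 processes, so a MaxClients of 800 would also require raising ServerLimit.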

                        You need to monitor 127.0.0.1/server-status on your Apache server to check whether you have any idle workers left. If you have fewer than 10 idle workers, it is time to increase MaxClients.

                         

                        Thank you for all your help!
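A minimal sketch of automating that check, assuming mod_status is enabled and reachable at http://127.0.0.1/server-status (access rules may differ on your system). It relies on the machine-readable `?auto` variant of the status page, which reports `BusyWorkers` and `IdleWorkers` lines. The threshold of 10 is the rule of thumb above, and the sample text is invented.

```python
from urllib.request import urlopen

def parse_status(text: str) -> dict:
    """Extract BusyWorkers and IdleWorkers from server-status?auto output."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in ("BusyWorkers", "IdleWorkers"):
            stats[key.strip()] = int(value.strip())
    return stats

def needs_more_workers(stats: dict, min_idle: int = 10) -> bool:
    """True when fewer than min_idle workers are idle (missing value counts as 0)."""
    return stats.get("IdleWorkers", 0) < min_idle

def fetch_status(url: str = "http://127.0.0.1/server-status?auto") -> str:
    """Fetch the status page; the URL is an assumption about the local setup."""
    with urlopen(url, timeout=5) as resp:
        return resp.read().decode("ascii", errors="replace")

# Invented sample output; on a live server use parse_status(fetch_status()).
SAMPLE = "Total Accesses: 1024\nBusyWorkers: 143\nIdleWorkers: 7\nScoreboard: WWW_\n"
stats = parse_status(SAMPLE)
print(stats, "increase MaxClients:", needs_more_workers(stats))
```

Run from cron or a monitoring agent, this would flag worker exhaustion before users notice it.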