8 Replies Latest reply on Feb 23, 2010 10:47 AM by akarl

    Ping timeout acting as worker timeout

    akarl

      I have a transaction which can take longer than 10 seconds to process and yes this is part of the design and yes to me this is ok.  Now what I'm seeing is that there is a potential problem with the http://www.jboss.org/mod_cluster/java/properties.html#ping setting and how it is applied.  When my transactions take longer than 10 seconds I begin to see the following when using 1.0.3.GA.

      {code}[Thu Feb 11 13:20:38 2010] [error] ajp_cping_cpong: apr_socket_recv failed

      [Thu Feb 11 13:23:11 2010] [error] proxy: dialog to 10.182.13.58:8009 (10.182.13.58) failed [Thu Feb 11 13:23:16 2010] [error] proxy: CLUSTER: (balancer://mycluster). All workers are in error state [Thu Feb 11 13:23:20 2010] [error] proxy: ajp: disabled connection for (10.182.13.58){code}

      on 1.0.0.GA the error is similar but there is one significant difference.

      {code}[Thu Feb 11 12:56:41 2010] [error] (70007)The timeout specified has expired: ajp_ilink_receive() can't receive header

      [Thu Feb 11 12:56:41 2010] [error] ajp_read_header: ajp_ilink_receive failed [Thu Feb 11 12:56:41 2010] [error] (120006)APR does not understand this error code: proxy: read response failed from 10.182.13.58:8009 (10.182.13.58)

      {code}

      Note that in the 1.0.0 error the node is NOT disabled.

       

      There is a lot going on here.  I think that the fundamental problem is that the Ping/Pong functionality seems to be tied to the length of the transaction with the node.  I believe that this is not inline with the design which specifies that the Ping/Pong happens prior to the transaction with the node and that is further backed up by the existence of a workerTimeout setting.

       

      Now if I change the ping setting to 30000 (by the way the documentation is misleading in that it claims a default of 10 which is not true, the default is in milliseconds and so should be 10000) my transactions can last up to 30 seconds.  I've tested this rather extensively using all sorts of different transaction lengths (I wrote a web service which lasts a variable length depending on a passed parameter) and everything points to the fact that ping timing is actually more of a transaction timeout.

       

      The really nasty piece here is that in the 1.0.3 log you can see that not only did my transaction fail but the node is disabled as well.  This means that a single transaction can bring down the entire node simply by taking too long.

       

      I didn't see this description anywhere in current issues or posts but I think this could easily be related to some of the other issues with 500 and 503 errors being tossed by mod_cluster.

        • 1. Re: Ping timeout acting as worker timeout
          jfclere

          Try to use nodeTimeout to set a timeout for the node transaction, workerTimeout has nothing to the do with nodeTimeout.

           

          For ping I am sorry but I have checked the code, the default is 10 seconds and the value is in seconds.

          • 2. Re: Ping timeout acting as worker timeout
            akarl
            I stand corrected on the ping setting as far as milliseconds vs seconds.  I think that looking at the values in the mod_cluster-manager threw me off there.  I see the value Ping: 10000000 when I have ping set to 10 seconds and Ping: 30000000 when I have it set to 30 seconds.  If I go to the DUMP output the values are in seconds.
            In any case, my testing still shows a correlation between ping time and transaction duration limits.  Here are some results from my tests.
            In the first two I tried to set a nodeTimeout as suggested by Jean-Frederic as well as the documentation.  This seems to have little effect and my transactions of 10 seconds or longer tend to fail out quickly while anything under 10 seconds is stable.  Note that they do not always fail which may indicate some specific timing of asynchronous tasks being involved.
            In the last tests I modify the settings to make the ping time longer and my 10 second transactions become stable.  If I increase the transaction length to the value of the ping time my transactions become unstable once again.  These results point at a correlation between ping time and transaction duration limitations.
            Test 1
                <property name="maxAttempts">1</property>
                <property name="workerTimeout">60</property>
                <property name="nodeTimeout">30</property>
                <property name="ping">10</property>
            1 Runner at 1 second length
            - 400 Successful Transactions
            - 0 Failures
            2 Runners at 5 second length
            - 100 Successful Transactions
            - 0 Failures
            Test 2
                <property name="maxAttempts">1</property>
                <property name="workerTimeout">60</property>
                <property name="nodeTimeout">30</property>
                <property name="ping">10</property>
            2 Runners at 10 second length
            - 3 Successful transactions
            Failed all runners
            ** Changed settings **
            Test 3
                <property name="maxAttempts">1</property>
                <property name="workerTimeout">60</property>
                <property name="nodeTimeout">30</property>
                <property name="ping">30</property>
            1 Runner at 1 second length
            - 200 Successful Transactions
            - 0 Failures
            2 Runners at 5 second length
            - 100 Successful Transactions
            - 0 Failures
            Test 4
                <property name="maxAttempts">1</property>
                <property name="workerTimeout">60</property>
                <property name="nodeTimeout">30</property>
                <property name="ping">30</property>
            2 Runners at 10 second length
            - 100 Successful transactions
            - 0 Failures
            ** Changed Settings **
            Test 5
                <property name="maxAttempts">1</property>
                <property name="workerTimeout">60</property>
                <property name="nodeTimeout">30</property>
                <property name="ping">20</property>
            4 Runners at 10 second length
            - 200 Successful transactions
            - 0 Failures
            Test 6
                <property name="maxAttempts">1</property>
                <property name="workerTimeout">60</property>
                <property name="nodeTimeout">30</property>
                <property name="ping">20</property>
            4 Runners at 20 second length
            - 5 Successful transactions
            Failed all runners
            • 3. Re: Ping timeout acting as worker timeout
              jfclere

              "I stand corrected on the ping setting as far as milliseconds vs seconds"

              "setting" well no, display... yes INFO and mod_cluster_manager/ displays milliseconds, may be it should  fixed.

               

              For your other tests, I am not convinced: to prove that ping affects nodeTimeoutyou should use the same test and reduce ping and show that when reducing ping you interrupt request/response.

              • 4. Re: Ping timeout acting as worker timeout
                akarl
                Test 2 vs Test 4 do exactly that.  I ran them with the same transaction duration and the only configuration difference was the ping time.  I can run it with the lower ping time more times if you'd like but I have run this test many times now and I get flakiness every time the transaction length meets or exceeds the ping time.  I'll run it 100 times but this will take awhile because when a failure occurs the node is disabled in mod_cluster so I need to wait ~30 seconds for it to become available again.  I'm guessing that is based on when the next STATUS message is sent from the app server but that value would be nice to be able to configure.
                • 5. Re: Ping timeout acting as worker timeout
                  jfclere

                  You should check how many "ajp_handle_cping_cpong: ajp_ilink_receive failed" you get compare to the total number of "ajp_ilink_receive() can't receive header".

                   

                  The failed cping/cpong are marking the node unavailable until a STATUS is received... Normally it is OK that it fails over to another node.

                  If your back-end or network is not able to answer in 10 seconds it because it is overloaded: use a bigger one or increase the ping value.

                   

                  To test the nodeTimeout / ping interaction I suggest you use something like Thread.currentThread().sleep(15000); in the Servlet you will see it works.

                  • 6. Re: Ping timeout acting as worker timeout
                    akarl

                    I'll see what I can do about getting a log from one of these tests.  My time is limited today so I may need to get that Monday. I want to say that there is a one to one ratio of those messages but I'll get a longer log to be sure.

                     

                    I think my results show that the network is not a problem.  I can run quick transactions all over the place without a problem and that would be generating far more network traffic than the longer running transactions.

                     

                    My service implementation does exactly what you suggest.  It puts the thread to sleep for a variable time (I pass the variable as a parameter to the service).  This prevents the CPU on the app server from being overloaded so we should be able to rule out any actual unavailability on the app server due to application activity.

                    • 7. Re: Ping timeout acting as worker timeout
                      jfclere
                      I have created a JIRA MODCLUSTER-132
                      • 8. Re: Ping timeout acting as worker timeout
                        akarl
                        Thanks.  I see some changes went in already.  I'll watch the issue and see if I can get some time to test the fix.