14 Replies Latest reply on May 1, 2013 3:53 PM by bdecoste

    Showcase on Openshift keeps failing

    bleathem

      Hello all,

       

      The OpenShift showcase is repeatedly failing.  We have a monitor set up (uptimerobot.com) that notifies us when it fails, and it is failing frequently.  Here is the outage report for the past 2 months:

       

       02/26/2013 06:21:12    Down    No Response From The Website.    
       02/19/2013 07:23:56    Up    Successful response received.    
       02/19/2013 05:49:20    Down    No Response From The Website.    
       02/16/2013 11:03:01    Up    Successful response received.    
       02/15/2013 08:59:45    Down    No Response From The Website.    
       02/14/2013 06:43:09    Up    Successful response received.    
       02/14/2013 01:52:51    Down    No Response From The Website.    
       02/13/2013 20:40:07    Down    No Response From The Website.    
       02/13/2013 06:04:43    Up    Successful response received.    
       02/13/2013 05:01:33    Down    No Response From The Website.    
       01/29/2013 13:07:04    Up    Successful response received.    
       01/29/2013 06:37:40    Down    No Response From The Website.    
       01/17/2013 09:10:34    Up    Successful response received.    
       01/17/2013 09:10:33    Up    Successful response received.    
       01/17/2013 08:58:14    Down    No Response From The Website.    
       01/16/2013 00:19:11    Up    Successful response received.    
       01/15/2013 07:26:23    Down    No Response From The Website.    
       01/10/2013 09:28:52    Up    Successful response received.    
       01/10/2013 08:19:59    Down    No Response From The Website.    
       01/01/2013 12:17:34    Up    Successful response received.    
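      To put numbers on a report like the one above, here is a minimal Python sketch that pairs each Down event with the next Up event and reports how long each outage lasted. It assumes the timestamps are MM/DD/YYYY in a single time zone, and it treats consecutive Down rows as one ongoing outage. The function names are made up for illustration, and only a few rows of the report are included.

```python
from datetime import datetime

# A few rows copied from the report above (newest first, as exported).
SAMPLE = """\
02/14/2013 06:43:09    Down-check    ignored-malformed
02/14/2013 06:43:09    Up    Successful response received.
02/14/2013 01:52:51    Down    No Response From The Website.
02/13/2013 20:40:07    Down    No Response From The Website.
02/13/2013 06:04:43    Up    Successful response received.
02/13/2013 05:01:33    Down    No Response From The Website.
"""


def parse_events(report):
    """Return (timestamp, status) tuples sorted oldest first."""
    events = []
    for line in report.splitlines():
        parts = line.split()
        # Skip blank lines and rows whose status is not Up/Down.
        if len(parts) < 3 or parts[2] not in ("Up", "Down"):
            continue
        when = datetime.strptime(parts[0] + " " + parts[1], "%m/%d/%Y %H:%M:%S")
        events.append((when, parts[2]))
    return sorted(events)


def outages(events):
    """Pair each first Down with the next Up; return (start, duration) tuples."""
    result = []
    down_since = None
    for when, status in events:
        if status == "Down" and down_since is None:
            down_since = when
        elif status == "Up" and down_since is not None:
            result.append((down_since, when - down_since))
            down_since = None
    return result


if __name__ == "__main__":
    for start, duration in outages(parse_events(SAMPLE)):
        print(start, "down for", duration)
```

      On these rows it reports two outages, lasting 1:03:10 and 10:03:02. An outage that has not yet recovered (a Down with no later Up) is simply left out.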
      

       

      The RichFaces dev and QE teams get notified when it fails, and whoever reacts first restarts it.  Until now, we haven't done anything to remedy the underlying problem - I'd like to change that.

       

      For starters we need to track the events themselves.  We need to know:

       

      1. when it failed
      2. why it failed (server.log? openshift infrastructure problem?)
      3. what we did to get it going again (rhc command? irc/e-mail discussion with the openshift team?)

       

      Then we can review the information when time allows, and look at implementing a fix or reporting a systemic problem to the OpenShift team.

       

      My question is where should we track these outages?

      • A wiki is too unstructured. 
      • Should we use jira? 
        • Do we do this as a single issue with a comment for each event? 
        • One issue for each event?
      • Any other SaaS tools people recommend for tracking production issues?

       

      Brian

        • 1. Re: Showcase on Openshift keeps failing
          lfryc

          For a start, we could switch to EAP6 cartridge and see how it helps with this issue.

          • 2. Re: Showcase on Openshift keeps failing
            bleathem

            Good point Lukas, that will certainly remedy some of the issues we've been noticing.  Let's go ahead and do that.  I've filed RFPL-2756 to track the task.

             

            However, that doesn't resolve the problem that we need to do a better job tracking the cause and remedy associated with an outage.  Any alternatives to suggest other than jira?

            • 3. Re: Showcase on Openshift keeps failing
              bdecoste

              Is this on OpenShift Online or Origin/on-premise/OSE? If you could provide the server.log file(s) I could help. Email is wdecoste@redhat.com. Applications are periodically restarted by our Online Ops team for upgrades and patches. We have seen times when significant applications do not deploy on a restart because of resource constraints. The current deploy timeout is 5 minutes - if the app hasn't deployed within those 5 minutes, the deployment is rolled back. The logs will show if this is the case. If so, it's easily correctable by increasing the timeout settings in standalone.xml.

              • 4. Re: Showcase on Openshift keeps failing
                lfryc

                Thanks for the offer of help, William!

                 

                I will send you fresh logs once I observe the problem again.

                 

                I wouldn't say we have a problem with deployment timeouts, though.

                • 5. Re: Showcase on Openshift keeps failing
                  bleathem

                  Thanks William,

                   

                  The point of this post was to collect information on our failures so that we have something meaningful to bring upstream.  For instance, I know a number of our failures are related to the disk filling up.  We just need to get our cron log rotation set up to resolve that.  Some of the failures are app errors, where a DNS resolution fails and the app doesn't fail gracefully (more so for the RF 3 showcase here).

                   

                  So let's go ahead and create a jira issue to track the failures (RFPL-2804), and we'll create a sub-issue for each individual failure with the stack trace and resolution in the comments.  Once we've collected a few of these we'll have a good understanding of what is causing our failures, and we can bring that to William and the OpenShift team.

                   

                  Brian

                  • 6. Re: Showcase on Openshift keeps failing
                    bleathem

                    I recorded the first failure report of the showcase in issue RFPL-2805.

                    • 7. Re: Showcase on Openshift keeps failing
                      bdecoste

                      Is this occurring on OpenShift Online, Origin, or OSE? Is this an instance of the AS7, EAP6, or EWS1/2 cartridges? Is the source for the application available publicly (e.g. github)?

                       

                      Thanks -Bill

                      • 8. Re: Showcase on Openshift keeps failing
                        lfryc

                        Hey Bill,

                         

                        this is currently hosted on:

                         

                         

                        The demo might get quite a lot of hits, so you probably won't be able to reproduce the issue without soak testing it.

                        • 9. Re: Showcase on Openshift keeps failing
                          lfryc

                          I think I was able to hit a problem in a case where the quota was exceeded.

                           

                          I had been cleaning just the logs/ folder, but the tmp/ folder was also full of logs.

                           

                           

                          By clearing logs/, we were able to get the number of full blocks from 1,042,608 down to 1,032,088.

                           

                          By cleaning tmp/, the full blocks went down from 1,032,088 to 99,260 (which was effectively ~917MB).

                           

                          A freshly started instance needs 126,052 blocks.
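
                          A cleanup like this could be scripted and run from cron. The following is only a sketch in Python: the function name, the "*.log*" pattern and the 7-day cutoff are assumptions for illustration, not something OpenShift provides, and the directories passed on the command line would need to be the gear's actual logs/ and tmp/ locations.

```python
import fnmatch
import os
import sys
import time


def prune_old_files(directory, max_age_days=7, pattern="*.log*"):
    """Delete files under `directory` that match `pattern` and are older
    than `max_age_days`. Returns the number of bytes removed."""
    cutoff = time.time() - max_age_days * 86400
    freed = 0
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if not fnmatch.fnmatch(name, pattern):
                continue  # leave anything that does not look like a log
            path = os.path.join(root, name)
            try:
                info = os.stat(path)
                if info.st_mtime < cutoff:
                    os.remove(path)
                    freed += info.st_size
            except OSError:
                # File vanished or is not removable; skip it.
                continue
    return freed


if __name__ == "__main__":
    # Hypothetical usage: python prune_logs.py <logs dir> <tmp dir>
    for target in sys.argv[1:]:
        print(target, "freed", prune_old_files(target), "bytes")
```

                          Matching on age and name keeps the script away from the live server.log. Note that removing a file the server still has open does not release the space until the process closes it.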

                          • 10. Re: Showcase on Openshift keeps failing
                            bleathem

                            A recent failure (RFPL-2856) was caused by a java.net.SocketException.

                             

                            @Bill - any idea why we would experience this SocketException on OpenShift?

                            • 11. Re: Showcase on Openshift keeps failing
                              bdecoste

                              Couple things - you can request more disk quota for your gear(s). It will only delay the quota exceeded problem as we don't have anything that automatically clears out logs/disk. That's currently the responsibility of the user. That said, we are looking at options for rolling logs handled by OpenShift.

                               

                              Is there any other info around the SocketException? This could be purely a resource issue, likely CPU limitations. If it was threads/memory you'd see something in the logs. If you can send me the whole server.log I can take a look.

                               

                              If you send me the app URL I can ask the ops guys to look at the corresponding Node.

                               

                              Thanks -Bill

                              • 12. Re: Showcase on Openshift keeps failing
                                bleathem

                                Thanks Bill,

                                 

                                the server log is attached to the corresponding jira issue: https://issues.jboss.org/browse/RFPL-2856

                                 

                                The url is: http://showcase-richfaces.rhcloud.com/

                                • 13. Re: Showcase on Openshift keeps failing
                                  bdecoste

                                  Thanks. I've asked the ops team to take a look at the corresponding Node. Is this a small or medium gear? Just a guess, but the socket exception is probably a timeout on the client side, probably due to the limited resources in the gear and a slow response.

                                  • 14. Re: Showcase on Openshift keeps failing
                                    bdecoste

                                    Looks like there may have been some problems with routing between the Node Apache and the app. Ops restarted both.

                                     

                                    [Wed May 01 15:45:09 2013] [error] (111)Connection refused: proxy: HTTP: attempt to connect to 127.10.118.1:8080 (*) failed
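
                                    A "Connection refused" from the proxy generally means nothing was accepting connections on that address and port at that moment. One quick way to check this from inside the gear is a plain TCP connect. The sketch below is a hypothetical helper, not an OpenShift tool, and the address is simply the one from the log line above.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Address and port taken from the proxy error above; adjust for your gear.
    print("app listening:", port_is_open("127.10.118.1", 8080))
```

                                    If this prints False while the Node Apache is up, the app server itself is not listening and server.log is the next place to look.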