14 Replies Latest reply on May 1, 2013 3:53 PM by bdecoste

Showcase on Openshift keeps failing

bleathem Feb 26, 2013 12:52 PM

Hello all,

The RichFaces showcase on OpenShift keeps failing. We have a monitor set up (uptimerobot.com) that notifies us whenever it goes down, and it goes down often. Here is the outage report for the past 2 months:

 02/26/2013 06:21:12    Down    No Response From The Website.    
 02/19/2013 07:23:56    Up    Successful response received.    
 02/19/2013 05:49:20    Down    No Response From The Website.    
 02/16/2013 11:03:01    Up    Successful response received.    
 02/15/2013 08:59:45    Down    No Response From The Website.    
 02/14/2013 06:43:09    Up    Successful response received.    
 02/14/2013 01:52:51    Down    No Response From The Website.    
 02/13/2013 20:40:07    Down    No Response From The Website.    
 02/13/2013 06:04:43    Up    Successful response received.    
 02/13/2013 05:01:33    Down    No Response From The Website.    
 01/29/2013 13:07:04    Up    Successful response received.    
 01/29/2013 06:37:40    Down    No Response From The Website.    
 01/17/2013 09:10:34    Up    Successful response received.    
 01/17/2013 09:10:33    Up    Successful response received.    
 01/17/2013 08:58:14    Down    No Response From The Website.    
 01/16/2013 00:19:11    Up    Successful response received.    
 01/15/2013 07:26:23    Down    No Response From The Website.    
 01/10/2013 09:28:52    Up    Successful response received.    
 01/10/2013 08:19:59    Down    No Response From The Website.    
 01/01/2013 12:17:34    Up    Successful response received.
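
One way to see how long each outage lasted is to pair every Down event with the next Up event. Below is a minimal Python sketch, assuming the report is available as text in the layout above (MM/DD/YYYY timestamps, newest first). The function names are made up for illustration, and consecutive Down or Up rows are treated as repeats of the same state.

```python
from datetime import datetime

def parse_events(report):
    """Return (timestamp, status) tuples in chronological order."""
    events = []
    for line in report.splitlines():
        parts = line.split()
        if len(parts) < 3 or parts[2] not in ("Up", "Down"):
            continue  # skip blank or unrecognised lines
        stamp = datetime.strptime(parts[0] + " " + parts[1], "%m/%d/%Y %H:%M:%S")
        events.append((stamp, parts[2]))
    return sorted(events, key=lambda event: event[0])

def outage_durations(events):
    """Pair the first Down of each outage with the next Up.

    Returns a list of (down_at, up_at, duration) tuples.
    """
    outages = []
    down_at = None
    for stamp, status in events:
        if status == "Down" and down_at is None:
            down_at = stamp
        elif status == "Up" and down_at is not None:
            outages.append((down_at, stamp, stamp - down_at))
            down_at = None
    return outages

# A few rows copied from the report above.
sample = """\
 02/14/2013 06:43:09    Up    Successful response received.
 02/14/2013 01:52:51    Down    No Response From The Website.
 02/13/2013 20:40:07    Down    No Response From The Website.
 02/13/2013 06:04:43    Up    Successful response received.
 02/13/2013 05:01:33    Down    No Response From The Website.
"""

for down_at, up_at, duration in outage_durations(parse_events(sample)):
    print(down_at, "->", up_at, "down for", duration)
```

An outage that is still open at the end of the report (a Down with no later Up) is simply left out of the result.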

The RichFaces dev and QE teams get notified when it fails, and whoever reacts first restarts it. Until now, we haven't done anything to remedy the underlying problem, and I'd like to change that.

For starters we need to track the events themselves. We need to know:

when it failed
why it failed (server.log? openshift infrastructure problem?)
what we did to get it going again (rhc command? irc/e-mail discussion with the openshift team?)

Then we can review the information when time allows, and look at implementing a fix or reporting a systemic problem to the OpenShift team.

My question is: where should we track these outages?

A wiki is too unstructured.
Should we use jira?
- Do we do this as a single issue with a comment for each event?
- One issue for each event?
Any other SaaS tools people recommend for tracking production issues?

Brian

1. Re: Showcase on Openshift keeps failing

lfryc Feb 26, 2013 5:16 PM (in response to bleathem)

For a start, we could switch to the EAP6 cartridge and see whether it helps with this issue.
2. Re: Showcase on Openshift keeps failing

bleathem Feb 26, 2013 5:30 PM (in response to lfryc)

Good point Lukas, that will certainly remedy some of the issues we've been noticing. Let's go ahead and do that. I've filed RFPL-2756 to track the task.

However, that doesn't address our need to do a better job of tracking the cause and remedy of each outage. Does anyone have alternatives to jira to suggest?
3. Re: Showcase on Openshift keeps failing

bdecoste Mar 8, 2013 10:12 AM (in response to bleathem)

Is this on OpenShift Online or Origin/on-premise/OSE? If you could provide the server.log file(s), I could help. Email is wdecoste@redhat.com. Applications are periodically restarted by our Online Ops team for upgrades and patches. We have seen times when significant applications do not deploy on a restart because of resource constraints. The current deploy timeout is 5 minutes: if the app hasn't deployed within that time, the deployment is rolled back. The logs will show if this is the case. If so, it's easily corrected by increasing the timeout settings in standalone.xml.
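
For reference, here is a hedged Python sketch for checking that timeout. It assumes the setting in question is the `deployment-timeout` attribute (in seconds) on the `deployment-scanner` element, which is where AS7 keeps it in standalone.xml. The sample XML and the helper name `deployment_timeouts` are illustrative and not taken from the showcase gear.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only; a real standalone.xml has many more subsystems.
SAMPLE = """<server xmlns="urn:jboss:domain:1.3">
  <profile>
    <subsystem xmlns="urn:jboss:domain:deployment-scanner:1.1">
      <deployment-scanner path="deployments"
                          relative-to="jboss.server.base.dir"
                          scan-interval="5000"
                          deployment-timeout="300"/>
    </subsystem>
  </profile>
</server>"""

def deployment_timeouts(xml_text):
    """Return the deployment-timeout (seconds) of every deployment-scanner."""
    root = ET.fromstring(xml_text)
    timeouts = []
    for element in root.iter():
        # Tags look like "{namespace}deployment-scanner"; ignore the namespace
        # so the check does not depend on the exact schema version.
        if element.tag.rsplit("}", 1)[-1] == "deployment-scanner":
            value = element.get("deployment-timeout")
            if value is not None:
                timeouts.append(int(value))
    return timeouts

print(deployment_timeouts(SAMPLE))
```

To check a real file, read its contents and pass the text to `deployment_timeouts`; raising the value itself is a manual edit of that attribute.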
4. Re: Showcase on Openshift keeps failing

lfryc Mar 12, 2013 9:46 AM (in response to bdecoste)

Thanks for the offer of help, William!

I will send you fresh logs once I observe the problem again.

I wouldn't say we have a problem with deployment timeouts, though.
5. Re: Showcase on Openshift keeps failing

bleathem Mar 20, 2013 6:21 PM (in response to lfryc)

Thanks William,

The point of this post was to collect information on our failures so that we have something meaningful to bring upstream. For instance, I know a number of our failures are related to the disk filling up. We just need to get our cron log rotation set up to resolve that. Some of the failures are app errors, where a DNS resolution fails and the app doesn't fail gracefully (more so for the RF 3 showcase here).

So let's go ahead and create a jira issue to track the failures (RFPL-2804), and we'll create a sub-issue for each individual failure with the stack trace and resolution in the comments. Once we've collected a few of these, we'll have a good understanding of what is causing our failures, and we can bring that to William and the OpenShift team.

Brian
6. Re: Showcase on Openshift keeps failing

bleathem Mar 20, 2013 6:30 PM (in response to bleathem)

I recorded the first failure report of the showcase in issue RFPL-2805.
7. Re: Showcase on Openshift keeps failing

bdecoste Mar 21, 2013 8:31 AM (in response to bleathem)

Is this occurring on OpenShift Online, Origin, or OSE? Is this an instance of the AS7, EAP6, or EWS1/2 cartridges? Is the source for the application available publicly (e.g. GitHub)?

Thanks -Bill
8. Re: Showcase on Openshift keeps failing

lfryc Apr 17, 2013 7:10 AM (in response to bdecoste)
Hey Bill,

this is currently hosted on:

OpenShift Online
AS7
the source is here: https://github.com/richfaces/showcase

The demo can get quite a lot of hits, so you probably won't be able to reproduce the issue without soak testing it.
9. Re: Showcase on Openshift keeps failing

lfryc Apr 17, 2013 7:51 AM (in response to lfryc)

I think I was able to pin down a problem that occurs when the disk quota is exceeded.

I had been cleaning just the logs/ folder, but the tmp/ folder was also full of logs.

Clearing logs/ brought the number of full blocks down from 1,042,608 to 1,032,088.

Cleaning tmp/ brought the full blocks down from 1,032,088 to 99,260 (which was effectively ~917MB).

A freshly started instance needs 126,052.
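
To put those block counts in perspective, here is a small Python sketch of the arithmetic. It assumes the quota tool reports 1 KiB blocks, which is common but not confirmed in this thread; with a different block size the megabyte figures change accordingly.

```python
BLOCK_SIZE_BYTES = 1024  # assumption: the quota is reported in 1 KiB blocks

def blocks_to_mib(blocks, block_size=BLOCK_SIZE_BYTES):
    """Convert a block count to mebibytes."""
    return blocks * block_size / (1024 * 1024)

before_cleanup = 1_042_608  # before touching logs/
after_logs = 1_032_088      # after clearing logs/
after_tmp = 99_260          # after clearing tmp/
fresh_instance = 126_052    # freshly started instance

freed_by_logs = before_cleanup - after_logs
freed_by_tmp = after_logs - after_tmp

print(f"logs/ freed {freed_by_logs} blocks (~{blocks_to_mib(freed_by_logs):.0f} MiB)")
print(f"tmp/ freed {freed_by_tmp} blocks (~{blocks_to_mib(freed_by_tmp):.0f} MiB)")
print(f"fresh instance uses ~{blocks_to_mib(fresh_instance):.0f} MiB")
```

Under the 1 KiB assumption, the tmp/ cleanup lands close to, though not exactly on, the ~917MB figure quoted above, and logs/ accounts for only a small fraction of the space.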
10. Re: Showcase on Openshift keeps failing

bleathem May 1, 2013 2:47 PM (in response to lfryc)

A recent failure (RFPL-2856) was caused by a java.net.SocketException.

@Bill - any idea why we would experience this SocketException on OpenShift?
11. Re: Showcase on Openshift keeps failing

bdecoste May 1, 2013 2:57 PM (in response to bleathem)

A couple of things: you can request more disk quota for your gear(s). That will only delay the quota-exceeded problem, as we don't have anything that automatically clears out logs/disk. That's currently the responsibility of the user. That said, we are looking at options for rolling logs handled by OpenShift.
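
Since clearing logs is left to the user, a scheduled cleanup is one option. Below is a minimal Python sketch, assuming that deleting files older than a fixed number of days is acceptable. The `prune_old_files` name, the 7-day cutoff, and the directory paths are illustrative placeholders, not OpenShift conventions.

```python
import time
from pathlib import Path

def prune_old_files(directory, max_age_days=7, dry_run=True):
    """Delete regular files under `directory` older than `max_age_days`.

    With dry_run=True nothing is deleted; the candidates are only returned.
    """
    cutoff = time.time() - max_age_days * 24 * 60 * 60
    removed = []
    for path in Path(directory).rglob("*"):
        if path.is_file() and path.stat().st_mtime < cutoff:
            if not dry_run:
                path.unlink()
            removed.append(path)
    return removed

if __name__ == "__main__":
    # Hypothetical locations; adjust to wherever the gear keeps its logs.
    for folder in ("logs", "tmp"):
        if Path(folder).is_dir():
            for old in prune_old_files(folder, max_age_days=7, dry_run=True):
                print("would remove", old)
```

The dry run is the default so the candidate list can be reviewed first; a cron job would call it with `dry_run=False` once the list looks right.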

Is there any other info around the SocketException? This could be purely a resource issue, likely CPU limitations. If it was threads/memory, you'd see something in the logs. If you can send me the whole server.log, I can take a look.

If you send me the app URL, I can ask the ops guys to look at the corresponding Node.

Thanks -Bill
12. Re: Showcase on Openshift keeps failing

bleathem May 1, 2013 3:25 PM (in response to bdecoste)

Thanks Bill,

the server log is attached to the corresponding jira issue: https://issues.jboss.org/browse/RFPL-2856

The url is: http://showcase-richfaces.rhcloud.com/
13. Re: Showcase on Openshift keeps failing

bdecoste May 1, 2013 3:35 PM (in response to bleathem)

Thanks. I've asked the ops team to take a look at the corresponding Node. Is this a small or medium gear? Just a guess, but the socket exception is probably a timeout on the client side, probably due to the limited resources in the gear and a slow response.
14. Re: Showcase on Openshift keeps failing

bdecoste May 1, 2013 3:53 PM (in response to bdecoste)

Looks like there may have been some problems with routing between the Node Apache and the app. Ops restarted both.

[Wed May 01 15:45:09 2013] [error] (111)Connection refused: proxy: HTTP: attempt to connect to 127.10.118.1:8080 (*) failed