Bug 1000764

Summary: App loses connectivity after maintenance events
Product: OpenShift Online
Component: Containers
Version: 2.x
Hardware: Unspecified
OS: Unspecified
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Target Milestone: ---
Target Release: ---
Reporter: bjudson <ben>
Assignee: Jhon Honce <jhonce>
QA Contact: libra bugs <libra-bugs>
Docs Contact:
CC: ben, bmeng, nduong, stauil
Keywords: SupportQuestion
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-09-19 16:48:07 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On:
Bug Blocks: 1004521

Description bjudson 2013-08-25 00:29:11 UTC
Description of problem:

I occasionally (usually about once a month, sometimes more) start getting 503 errors when accessing my app. A restart solves the problem. I noticed that the cause of the 503 is inability to access the database (I'm using Python and Postgres). I also noticed that the problem usually pops up right after a maintenance event.

Version-Release number of selected component (if applicable):

The app is located at:
http://wabistory-saharagray.rhcloud.com/

It uses cartridges:
Python 2.6
PostgreSQL 8.4

How reproducible:

Since I noticed the problem, I set up a cron job on another server to check the status of the app every 10 minutes. When I get connectivity issues, I notice it is usually following a maintenance event. Here are recent outage times (US/Chicago time):
19 Aug - 13:00
24 Jul - 05:40
19 Jul - 11:30
16 Jun - 16:40
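
For illustration, the 10-minute cron check described above could look something like the following. This is a hypothetical reconstruction, not the reporter's actual script; the status URL is taken from this report, and the `check_app` helper and its injectable `opener` parameter are assumptions made for the sketch.

```python
"""Minimal uptime probe, suitable for running from cron every 10 minutes.

Hypothetical sketch of the monitoring described in this report; the
alerting behaviour is an assumption, not the reporter's code.
"""
import urllib.error
import urllib.request

STATUS_URL = "https://wabistory-saharagray.rhcloud.com/1/status"


def check_app(url, opener=urllib.request.urlopen):
    """Return the HTTP status code for `url` (the error code on failure).

    `opener` is injectable so the probe can be tested without a network.
    """
    try:
        with opener(url, timeout=30) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 500/503 while the DB is unreachable
```

From cron, a one-liner could call `check_app(STATUS_URL)` and print a warning on any non-200 result; cron mails stdout to the owner by default, which matches the behaviour described here.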

The OpenShift status page doesn't show exact dates and times, but I believe you will find that these correspond to maintenance events.

Steps to Reproduce:

Wait for maintenance. Check this URL:
https://wabistory-saharagray.rhcloud.com/1/status

The method attempts a very simple database query.
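
The core of such a health-check endpoint can be sketched as below. This is not the reporter's actual Flask code; the stdlib sqlite3 module stands in for the app's Postgres, and the `health_check` helper is hypothetical.

```python
import json
import sqlite3  # stand-in for the app's Postgres connection


def health_check(conn):
    """Run a trivial query and return (status_code, json_body).

    `conn` is any DB-API connection. A dead connection raises at query
    time, which surfaces as a 500, mirroring the failure described in
    this report.
    """
    try:
        conn.execute("SELECT 1")
        return 200, json.dumps({"success": True})
    except Exception:
        return 500, json.dumps({"success": False})
```

In the failing state the query raises before any row is returned, so the framework falls back to its default 500 error page rather than the JSON body.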

Actual results:

When it is working, it returns a 200 status with JSON success: true in the body. When it fails, it returns a 500 status and the default error page.

Expected results:

It should always return 200 status with JSON success: true in the body, except *during* planned outages. 

Additional info:

Even when the above is failing, methods that do not attempt to connect to the database return a 200 status, e.g.
https://wabistory-saharagray.rhcloud.com/

Also discussed in this thread:
https://www.openshift.com/forums/openshift/app-requires-restart-after-maintenance

Comment 1 Nam Duong 2013-08-30 00:34:35 UTC
Posted by @bjudson on 8/29:
This happened again 27 Aug 22:00, and then 29 Aug 00:20 (I'm on US Central time). The first time I restarted almost immediately; the second happened after I had just gone to sleep, so the app was down for about 8 hours.

Updated the bug's severity to high.

Comment 2 Jhon Honce 2013-09-09 16:02:51 UTC
Several Questions:

#1 Are you using a connection pool to access the database? Or, does your connection configuration have a retry timer for when the database is unavailable?

#2 Is this a scalable application?

#3 What process do you use to restart the application?

Comment 3 Jhon Honce 2013-09-10 21:23:00 UTC
*** Bug 1004521 has been marked as a duplicate of this bug. ***

Comment 4 bjudson 2013-09-10 21:32:06 UTC
I don't know if this matters at this point, but the answers are:

1. I'm using a standard Flask-SQLAlchemy configuration, which if I'm not mistaken uses a connection pool by default.

2. It is not a scalable app.

3. rhc app restart <appname>

I can't access the bug you have marked this as a duplicate of, but am I correct in assuming the issue is resolved?
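
For context on question #1 above: a common mitigation for pooled connections going stale across a database restart is to retry the query once on a fresh connection (current SQLAlchemy releases expose `pool_recycle` and `pool_pre_ping` engine options for this). A minimal sketch of the retry idea, with stdlib sqlite3 standing in for Postgres/SQLAlchemy; the `RetryingDB` wrapper is hypothetical, not the app's code:

```python
import sqlite3  # stand-in for the app's Postgres connection


class RetryingDB:
    """Tiny wrapper that reopens its connection if a query fails once."""

    def __init__(self, connect):
        self._connect = connect   # zero-arg factory for a fresh connection
        self._conn = connect()

    def query(self, sql):
        try:
            return self._conn.execute(sql).fetchall()
        except Exception:
            # Connection likely died (e.g. the DB restarted during
            # maintenance): reopen once and retry. A second failure
            # propagates to the caller.
            self._conn = self._connect()
            return self._conn.execute(sql).fetchall()
```

With a wrapper like this, the first request after a maintenance event would transparently reconnect instead of returning a 500 until the app is restarted.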

Comment 5 Jhon Honce 2013-09-10 21:41:44 UTC
Fix will go out in next production release. If you experience issues after that, please reopen bug.

Sorry for any inconvenience.

Comment 6 openshift-github-bot 2013-09-11 00:11:39 UTC
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/59d0a4b73ab67585d7f69cdf5f37b13bea15d2f1
Bug 1000764 - Enforce cartridge start order

* Start secondary cartridges before primary cartridge
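
The rule the commit enforces can be illustrated with a small sketch. The helper and cartridge names below are hypothetical; origin-server's real implementation differs.

```python
def start_order(cartridges, primary):
    """Return cartridges ordered so the primary (web) cartridge starts last.

    Illustrative only: mirrors the commit's rule that secondary
    cartridges (e.g. a database) start before the primary, so the web
    process never comes up ahead of the database it depends on.
    """
    secondaries = [c for c in cartridges if c != primary]
    return secondaries + [primary]
```

Under this ordering, a restart brings PostgreSQL up before the Python cartridge, which is the behaviour verified in the next comment.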

Comment 7 Meng Bo 2013-09-11 10:44:00 UTC
Tested on devenv_3772, with jbosseap + postgresql 8.4 JDBC configured.

During app restart, no such error appears in the JBoss server log.

The start sequence can also be seen in the output:

[jbeap1-bmengdev.dev.rhcloud.com 523009b9c6aa501c16000001]> gear start
Starting gear...
Starting Postgres cartridge
server starting
Postgres started
Starting jbosseap cartridge


Moving bug to VERIFIED.