Bug 1000764 - App loses connectivity after maintenance events
Summary: App loses connectivity after maintenance events
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Online
Classification: Red Hat
Component: Containers
Version: 2.x
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
: ---
Assignee: Jhon Honce
QA Contact: libra bugs
URL:
Whiteboard:
Duplicates: 1004521 (view as bug list)
Depends On:
Blocks: 1004521
 
Reported: 2013-08-25 00:29 UTC by bjudson
Modified: 2015-05-14 23:27 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-09-19 16:48:07 UTC
Target Upstream Version:



Description bjudson 2013-08-25 00:29:11 UTC
Description of problem:

I occasionally (usually about once a month, sometimes more often) start getting 503 errors when accessing my app. A restart solves the problem. I noticed that the cause of the 503 is an inability to access the database (I'm using Python and Postgres), and that the problem usually pops up right after a maintenance event.

Version-Release number of selected component (if applicable):

The app is located at:
http://wabistory-saharagray.rhcloud.com/

It uses cartridges:
Python 2.6
PostgreSQL 8.4

How reproducible:

Since I noticed the problem, I set up a cron job on another server to check the status of the app every 10 minutes. When I get connectivity issues, I notice it is usually following a maintenance event. Here are recent outage times (US/Chicago time):
19 Aug - 13:00
24 Jul - 05:40
19 Jul - 11:30
16 Jun - 16:40

The OpenShift status page doesn't show exact dates and times, but I believe you will find these correspond to maintenance events.
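The external monitor described above could look something like the following sketch. The status URL is taken from this report; the script itself is illustrative, not the reporter's actual cron job.

```python
# Hypothetical uptime checker for the /1/status endpoint; only the URL
# comes from the report, the rest is an illustration.
import urllib.request
import urllib.error

STATUS_URL = "https://wabistory-saharagray.rhcloud.com/1/status"

def classify(code: int) -> str:
    """Map an HTTP status code to a short up/down label."""
    return "OK" if code == 200 else f"DOWN ({code})"

def check_app(url: str = STATUS_URL) -> str:
    """GET the status URL and classify the response."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return classify(resp.status)
    except urllib.error.HTTPError as exc:
        return classify(exc.code)

# A crontab entry running this every 10 minutes, as in the report, might be:
# */10 * * * * /usr/bin/python3 /path/to/check_app.py >> /var/log/wabistory.log
```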

Steps to Reproduce:

Wait for maintenance. Check this URL:
https://wabistory-saharagray.rhcloud.com/1/status

The method attempts a very simple database query.
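A minimal sketch of what such a status endpoint might look like, assuming Flask and SQLAlchemy as described in comment 4. The handler shown here is a reconstruction, not the reporter's code; the `OPENSHIFT_POSTGRESQL_DB_URL` environment variable is the one OpenShift v2 PostgreSQL cartridges exposed, with an in-memory SQLite fallback added purely for illustration.

```python
# Hypothetical reconstruction of the /1/status health-check endpoint.
import os

from flask import Flask, jsonify
from sqlalchemy import create_engine, text

app = Flask(__name__)
# OpenShift v2 Postgres cartridges set OPENSHIFT_POSTGRESQL_DB_URL;
# the SQLite fallback here is only so the sketch runs standalone.
engine = create_engine(os.environ.get("OPENSHIFT_POSTGRESQL_DB_URL", "sqlite://"))

@app.route("/1/status")
def status():
    # A very simple database query; if the database is unreachable,
    # the exception propagates and Flask serves its default 500 page.
    with engine.connect() as conn:
        conn.execute(text("SELECT 1"))
    return jsonify(success=True)
```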

Actual results:

When the app is working, this URL returns a 200 status with JSON success: true in the body. When it fails, it returns a 500 status and the default error page.

Expected results:

It should always return a 200 status with JSON success: true in the body, except *during* planned outages.

Additional info:

Even when the above is failing, methods that do not attempt to connect to the database return a 200 status. E.g.
https://wabistory-saharagray.rhcloud.com/

Also discussed in this thread:
https://www.openshift.com/forums/openshift/app-requires-restart-after-maintenance

Comment 1 Nam Duong 2013-08-30 00:34:35 UTC
Posted by @bjudson on 8/29:  
This happened again on 27 Aug at 22:00, and then 29 Aug at 00:20 (I'm on US Central time). The first time I restarted almost immediately; the second time I had just gone to sleep, so the app was down for about 8 hours.

Updated the bug's severity to high.

Comment 2 Jhon Honce 2013-09-09 16:02:51 UTC
Several questions:

#1 Are you using a connection pool to access the database? Or, does your connection configuration have a retry timer for when the database is unavailable?

#2 Is this a scalable application?

#3 What process do you use to restart the application?

Comment 3 Jhon Honce 2013-09-10 21:23:00 UTC
*** Bug 1004521 has been marked as a duplicate of this bug. ***

Comment 4 bjudson 2013-09-10 21:32:06 UTC
I don't know if this matters at this point, but the answers are:

1. I'm using a standard Flask-SQLAlchemy configuration, which if I'm not mistaken uses a connection pool by default.

2. It is not a scalable app.

3. rhc app restart <appname>

I can't access the bug you have marked this as a duplicate of, but am I correct in assuming the issue is resolved?
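As an aside on the connection-pool answer above: the platform fix for this bug was the cartridge start order (comment 6), but modern SQLAlchemy (1.2+, which post-dates this 2013 report) also offers an application-side mitigation for pooled connections that go stale during a database restart. A sketch, with a placeholder SQLite URL standing in for the app's PostgreSQL DSN:

```python
# Application-side mitigation (not the platform fix in this bug):
# pool_pre_ping tests each pooled connection before reuse, so a
# connection that went stale during a database restart is transparently
# replaced instead of raising mid-request.
from sqlalchemy import create_engine, text

engine = create_engine(
    "sqlite://",          # placeholder URL; the reporter's app used PostgreSQL
    pool_pre_ping=True,   # ping before handing out a pooled connection
    pool_recycle=1800,    # proactively retire connections older than 30 minutes
)

with engine.connect() as conn:
    assert conn.execute(text("SELECT 1")).scalar() == 1
```

In Flask-SQLAlchemy (2.4+) the equivalent would be setting these keys in the `SQLALCHEMY_ENGINE_OPTIONS` config dictionary.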

Comment 5 Jhon Honce 2013-09-10 21:41:44 UTC
Fix will go out in the next production release. If you experience issues after that, please reopen the bug.

Sorry for any inconvenience.

Comment 6 openshift-github-bot 2013-09-11 00:11:39 UTC
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/59d0a4b73ab67585d7f69cdf5f37b13bea15d2f1
Bug 1000764 - Enforce cartridge start order

* Start secondary cartridges before primary cartridge
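The ordering rule in the commit message can be sketched as follows. This is a hypothetical illustration of the idea (secondary cartridges such as the database start before the primary web cartridge), not the actual origin-server code; the cartridge records and field names are invented.

```python
# Hypothetical sketch of the start-order rule: secondaries (e.g. the
# database cartridge) start before the primary (web) cartridge, so the
# app never boots against a database that is not yet up.

def start_order(cartridges):
    """Return cartridges sorted so secondaries precede the primary."""
    # False sorts before True, so primary=False cartridges come first.
    return sorted(cartridges, key=lambda c: c["primary"])

gear = [
    {"name": "python-2.6", "primary": True},
    {"name": "postgresql-8.4", "primary": False},
]
print([c["name"] for c in start_order(gear)])
# -> ['postgresql-8.4', 'python-2.6']
```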

Comment 7 Meng Bo 2013-09-11 10:44:00 UTC
Tested on devenv_3772, with jbosseap + postgresql 8.4 (JDBC) configured.

During app restart, no such error appears in the JBoss server log.

The start sequence can also be seen in the output:

[jbeap1-bmengdev.dev.rhcloud.com 523009b9c6aa501c16000001]\> gear start
Starting gear...
Starting Postgres cartridge
server starting
Postgres started
Starting jbosseap cartridge


Moving bug to VERIFIED.

