Bug 836395

Summary: Apache 503 after running aeolus-upgrade - httpd restart required
Product: [Retired] CloudForms Cloud Engine
Reporter: James Laska <jlaska>
Component: aeolus-conductor
Assignee: Steve Linabery <slinaber>
Status: CLOSED WONTFIX
QA Contact: Rehana <aeolus-qa-list>
Severity: medium
Priority: unspecified
Version: 1.0.0
CC: cpelland, dmacpher, jeckersb, jturner, morazi, rlandy
Target Milestone: 1.0.1
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Doc Text:
Conductor traffic passes through httpd+mod_proxy before reaching conductor itself (conductor runs via thin). During an upgrade (or when running aeolus-restart-services), conductor is restarted. The conductor init script returns successfully before thin has fully loaded the conductor app, so accessing conductor during this brief window results in an HTTP/503 status code (Service Temporarily Unavailable). The default behavior of httpd/mod_proxy is to mark a backend worker as disabled for 60 seconds if the backend is determined to be unavailable; once the worker is marked as such, mod_proxy will not attempt to pass any traffic to the backend until the retry period expires. Therefore, if a user attempts to access conductor immediately after it starts, but before it has loaded completely, the user will receive HTTP/503 errors for approximately 60 seconds, until the mod_proxy retry interval expires and httpd starts forwarding requests to conductor again.
Last Closed: 2012-08-08 19:45:14 UTC
Type: Bug
Attachments:
Description          Flags
Upgrade output.txt   none
aeolus-debug.tgz     none

Description James Laska 2012-06-28 22:20:38 UTC
Created attachment 595152
Upgrade output.txt

Description of problem:

After upgrading to CloudForms-1.0.1 and running /usr/share/aeolus-conductor/script/upgrade, the conductor web-ui returns a 503 error. Restarting httpd resolves the problem and gets conductor working again.

Version-Release number of selected component (if applicable):
 * aeolus-conductor-0.8.33-1.el6cf.src.rpm
 * aeolus-configure-2.5.10-1.el6cf.src.rpm
 * deltacloud-core-0.5.0-10.el6_2.src.rpm
 * rubygem-aeolus-cli-0.3.3-2.el6_2.src.rpm
 * rubygem-aeolus-image-0.3.0-12.el6.src.rpm
 * rubygem-deltacloud-client-0.5.0-2.el6.src.rpm

How reproducible:


Steps to Reproduce:
1. Install CloudForms-1.0
2. Configure rhevm, vsphere and ec2 providers
3. Build/push and deploy images into each provider
4. Upgrade to CloudForms-1.0.1
5. Run /usr/share/aeolus-conductor/script/upgrade
6. Access the conductor web-ui

Actual results:

Step 6 fails with a 503 error.

Expected results:

The upgrade script should either:
 1) restart httpd in the proper order, or
 2) instruct the admin to restart httpd when complete.
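Option 1 could look roughly like the following: wait until thin answers conductor requests directly (bypassing httpd/mod_proxy), and only then restart httpd, so that any proxy worker put into the 60-second error state during the startup window is reset. This is a minimal bash sketch, not part of aeolus-conductor; the function names and the backend URL `http://localhost:3000/conductor` are hypothetical and must be adjusted to wherever thin actually listens.

```shell
#!/bin/bash
# Sketch only: function names and the backend URL below are hypothetical.

# Retry COMMAND until it succeeds or TIMEOUT seconds have elapsed.
# Usage: wait_until TIMEOUT INTERVAL COMMAND [ARGS...]
# Returns 0 once COMMAND succeeds, 1 if the deadline passes first.
wait_until() {
    local timeout=$1 interval=$2
    shift 2
    local deadline=$((SECONDS + timeout))
    until "$@"; do
        if [ "$SECONDS" -ge "$deadline" ]; then
            return 1
        fi
        sleep "$interval"
    done
    return 0
}

# Assumption: thin serves conductor at this URL; adjust to your thin config.
CONDUCTOR_BACKEND_URL=${CONDUCTOR_BACKEND_URL:-http://localhost:3000/conductor}

# Succeeds once thin itself answers with a status below 400. The request
# does not go through httpd, so polling never trips mod_proxy's retry timer.
conductor_backend_ready() {
    curl --silent --fail --output /dev/null "$CONDUCTOR_BACKEND_URL"
}

# Restart httpd only after the backend is up; otherwise ask the admin to.
restart_httpd_when_conductor_ready() {
    local timeout=${1:-120}
    if wait_until "$timeout" 2 conductor_backend_ready; then
        service httpd restart
    else
        echo "conductor not ready after ${timeout}s; restart httpd manually" >&2
        return 1
    fi
}
```

Called at the end of the upgrade script, e.g. `restart_httpd_when_conductor_ready 120`, this would restart httpd once conductor is serving, and fall back to option 2 (telling the admin to restart httpd) if it is not.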

Additional info:

 * See attached command-line output capturing the "Steps to reproduce"
 * See attached aeolus-debug.tgz
 * Restarting httpd resolved the problem

> 10.11.11.9 - - [28/Jun/2012:17:12:06 -0400] "GET /conductor/pools HTTP/1.1" 503 421
> 10.11.11.9 - - [28/Jun/2012:17:12:08 -0400] "GET /conductor/pools HTTP/1.1" 503 421

# service httpd restart

> 10.11.11.9 - - [28/Jun/2012:17:12:15 -0400] "GET /conductor/pools HTTP/1.1" 200 17252

Comment 1 James Laska 2012-06-28 22:21:25 UTC
Created attachment 595153
aeolus-debug.tgz

Comment 2 Ronelle Landy 2012-06-29 15:39:13 UTC
Followed through the above reproduction steps.

I do get the 'website unavailable' message and then the Apache 503 error if I try to access conductor right after running the update script. Refreshing the browser does not help right away.
But after about a minute, refreshing returned the conductor interface. According to eck, this is expected (copying chat comment):

<eck> yeah it's 60 seconds - http://httpd.apache.org/docs/2.2/mod/mod_proxy.html#proxypass

Running aeolus-restart-services and then accessing conductor right away shows the same behaviour:
unavailable -> 503 -> conductor shows up

rpms tested:

>> rpm -qa |grep aeolus
aeolus-conductor-0.8.33-1.el6cf.noarch
rubygem-aeolus-image-0.3.0-12.el6.noarch
aeolus-conductor-daemons-0.8.33-1.el6cf.noarch
aeolus-conductor-doc-0.8.33-1.el6cf.noarch
rubygem-aeolus-cli-0.3.3-2.el6_2.noarch
aeolus-configure-2.5.10-1.el6cf.noarch
aeolus-all-0.8.33-1.el6cf.noarch

Comment 3 John Eckersberg 2012-07-03 14:57:44 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Conductor traffic passes through httpd+mod_proxy before reaching conductor itself (conductor runs via thin). During an upgrade (or when running aeolus-restart-services), conductor is restarted. The conductor init script returns successfully before thin has fully loaded the conductor app, so accessing conductor during this brief window results in an HTTP/503 status code (Service Temporarily Unavailable). The default behavior of httpd/mod_proxy is to mark a backend worker as disabled for 60 seconds if the backend is determined to be unavailable; once the worker is marked as such, mod_proxy will not attempt to pass any traffic to the backend until the retry period expires. Therefore, if a user attempts to access conductor immediately after it starts, but before it has loaded completely, the user will receive HTTP/503 errors for approximately 60 seconds, until the mod_proxy retry interval expires and httpd starts forwarding requests to conductor again.

Comment 5 Mike Orazi 2012-08-03 17:22:32 UTC
Should this be closed as the release notes were added or are we trying to track something additional here?

Comment 6 James Laska 2012-08-03 18:57:26 UTC
(In reply to comment #5)
> Should this be closed as the release notes were added or are we trying to
> track something additional here?

Is there any interest in eliminating, or reducing, the 60-second window where conductor may return an HTTP/503?

If not, I believe the release note that Dan added documents this behavior. With that in place, we can close this as CLOSED NOTABUG.

Comment 7 Steve Linabery 2012-08-08 19:25:36 UTC
Since the backend downtime is unpredictable (less than 60 seconds, but how much less?), my vote is CLOSED NOTABUG on this.
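To put a rough number on "how much less", one could time how long thin takes to start answering after a restart, repeating the measurement to see how much it varies. A minimal bash sketch, assuming (as the Doc Text describes) that aeolus-restart-services returns before thin has finished loading; `seconds_until` and the backend URL are hypothetical.

```shell
#!/bin/bash
# Sketch only: function name and backend URL are hypothetical.

# Print roughly how many whole seconds pass before COMMAND first succeeds,
# polling once per second. Returns 1 (printing nothing) if COMMAND has not
# succeeded within LIMIT seconds.
# Usage: seconds_until LIMIT COMMAND [ARGS...]
seconds_until() {
    local limit=$1
    shift
    local start=$SECONDS
    until "$@"; do
        if [ $((SECONDS - start)) -ge "$limit" ]; then
            return 1
        fi
        sleep 1
    done
    echo $((SECONDS - start))
}

# Assumption: thin serves conductor here; adjust to your thin config.
backend_url=${CONDUCTOR_BACKEND_URL:-http://localhost:3000/conductor}

# Example measurement (restarts services, so it is left commented out):
#   aeolus-restart-services
#   seconds_until 90 curl --silent --fail --output /dev/null "$backend_url"
```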

Comment 8 James Laska 2012-08-08 19:45:14 UTC
By the power vested in me ... I close this bug and cast it out!

I changed my mind and went with CLOSED WONTFIX. I'm waffling between that and NOTABUG. Either way ... we aren't going to fix it.