Created attachment 595152 [details]
Upgrade output.txt

Description of problem:
After upgrading to CloudForms-1.0.1 and running /usr/share/aeolus-conductor/script/upgrade, the conductor web UI presents a 503 error. Restarting httpd resolves the problem and gets conductor working again.

Version-Release number of selected component (if applicable):
* aeolus-conductor-0.8.33-1.el6cf.src.rpm
* aeolus-configure-2.5.10-1.el6cf.src.rpm
* deltacloud-core-0.5.0-10.el6_2.src.rpm
* rubygem-aeolus-cli-0.3.3-2.el6_2.src.rpm
* rubygem-aeolus-image-0.3.0-12.el6.src.rpm
* rubygem-deltacloud-client-0.5.0-2.el6.src.rpm

How reproducible:

Steps to Reproduce:
1. Install CloudForms-1.0
2. Configure rhevm, vsphere, and ec2 providers
3. Build, push, and deploy images into each provider
4. Upgrade to CloudForms-1.0.1
5. Run /usr/share/aeolus-conductor/script/upgrade
6. Access the conductor web UI

Actual results:
Step 6 fails with a 503 error.

Expected results:
The upgrade script should either 1) restart httpd in the proper order, or 2) instruct the admin to restart httpd when complete.

Additional info:
* See the attached command-line output capturing the "Steps to Reproduce"
* See the attached aeolus-debug.tgz
* Restarting httpd resolved the problem:

> 10.11.11.9 - - [28/Jun/2012:17:12:06 -0400] "GET /conductor/pools HTTP/1.1" 503 421
> 10.11.11.9 - - [28/Jun/2012:17:12:08 -0400] "GET /conductor/pools HTTP/1.1" 503 421

# service httpd restart

> 10.11.11.9 - - [28/Jun/2012:17:12:15 -0400] "GET /conductor/pools HTTP/1.1" 200 17252
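Until the upgrade script handles the restart itself, the workaround could be scripted along these lines. This is only a sketch: the helper name, URL, poll interval, and timeout are assumptions, not part of the shipped tooling.

```shell
# wait_for_url URL TIMEOUT_SECONDS: poll URL until it answers with an HTTP
# success code, or give up after TIMEOUT_SECONDS. Returns 0 on success,
# 1 on timeout. (Hypothetical helper; not part of aeolus-conductor.)
wait_for_url() {
  url=$1
  timeout=${2:-90}
  elapsed=0
  until curl -sf -o /dev/null "$url"; do
    sleep 2
    elapsed=$((elapsed + 2))
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
  done
  return 0
}

# Possible usage after the upgrade (paths assumed from this report):
#   /usr/share/aeolus-conductor/script/upgrade
#   service httpd restart
#   wait_for_url http://localhost/conductor/pools 120 && echo "conductor is up"
```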
Created attachment 595153 [details] aeolus-debug.tgz
Followed the above reproduction steps. I do get the 'website unavailable' message and then the Apache 503 error if I try to access conductor right after running the upgrade script. Refreshing the browser does not help right away, but after about a minute a refresh returned the conductor interface. (According to eck, this is expected; copying chat comment: <eck> yeah it's 60 seconds - http://httpd.apache.org/docs/2.2/mod/mod_proxy.html#proxypass)

Running:
>> aeolus-restart-services
and then accessing conductor right away shows the same behaviour: unavailable -> 503 -> conductor shows up.

RPMs tested:
>> rpm -qa | grep aeolus
aeolus-conductor-0.8.33-1.el6cf.noarch
rubygem-aeolus-image-0.3.0-12.el6.noarch
aeolus-conductor-daemons-0.8.33-1.el6cf.noarch
aeolus-conductor-doc-0.8.33-1.el6cf.noarch
rubygem-aeolus-cli-0.3.3-2.el6_2.noarch
aeolus-configure-2.5.10-1.el6cf.noarch
aeolus-all-0.8.33-1.el6cf.noarch
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Conductor traffic passes through httpd+mod_proxy before reaching conductor itself (conductor runs via thin). During an upgrade (or when running aeolus-restart-services), conductor is restarted. The conductor init script returns successfully before thin has fully loaded the conductor application, so accessing conductor during this brief window results in an HTTP 503 (Service Unavailable) status code.

The default behavior of httpd/mod_proxy is to mark a backend worker as disabled for 60 seconds if the backend is determined to be unavailable; once the worker is marked as such, mod_proxy will not attempt to pass any traffic to the backend until the retry period expires. Therefore, if a user attempts to access conductor immediately after it starts, but before it has loaded completely, the user will receive an HTTP 503 error for approximately 60 seconds, until the mod_proxy retry interval expires and httpd starts forwarding requests to conductor again.
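For reference, the 60-second window described above corresponds to mod_proxy's "retry" worker parameter, which defaults to 60 seconds. A sketch of shrinking it follows; the backend address, port, and path here are assumptions, not the shipped CloudForms configuration.

```apache
# Hypothetical httpd proxy stanza; the conductor/thin backend address is
# assumed. retry=5 tells mod_proxy to re-try a worker marked "in error"
# after 5 seconds instead of the 60-second default, shortening the 503
# window at the cost of more probe traffic to a down backend.
ProxyPass        /conductor http://localhost:3000/conductor retry=5
ProxyPassReverse /conductor http://localhost:3000/conductor
```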
Should this be closed now that the release notes were added, or are we trying to track something additional here?
(In reply to comment #5)
> Should this be closed as the release notes were added or are we trying to
> track something additional here?

Is there any interest in eliminating, or reducing, the 60-second window where conductor may return an HTTP 503? If not, I believe the release note that Dan added documents this behavior. With that in place, we can close this NOTABUG.
Since the backend downtime is unpredictable (less than 60 seconds, but how much less?), my vote is CLOSED NOTABUG on this.
By the power vested in me ... I close this bug and cast it out! I changed my mind and went with CLOSED WONTFIX. I'm waffling between that and NOTABUG. Either way ... we aren't going to fix it.