Bug 969007 - Scaling has points when all gears are down
Product: OpenShift Online
Classification: Red Hat
Component: Containers
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Assigned To: Mrunal Patel
libra bugs
Status: Reopened
Depends On: 963490 968994
Reported: 2013-05-30 08:58 EDT by Matt Hicks
Modified: 2014-01-29 19:47 EST (History)
CC: 7 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 968994
Last Closed: 2014-01-29 19:47:48 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments: None
Description Matt Hicks 2013-05-30 08:58:38 EDT
Description of problem:
When scaling from 2 to 3 gears, there are points in time when all gears are out of rotation and cause downtime.

Version-Release number of selected component (if applicable):
Codebase as of 3/18/2013

How reproducible:

Steps to Reproduce:
1. Spin up a development instance (assuming dev.rhcloud.com below)

2. Create a scaled JBoss EAP application named scalerolling:
  rhc app-create scalerolling jbosseap -s

3. Scale the application to 3 gears
  rhc cartridge-scale jbosseap --app scalerolling --min 3 --max 6

4. Monitor the application page to make sure the HAProxy status page isn't shown - http://scalerolling-YOUR_DOMAIN.rhcloud.com
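The monitoring in step 4 can be sketched as a small shell loop (this is a hypothetical helper, not part of the platform; APP_URL and the "mentions haproxy" heuristic for detecting the status page are assumptions):

```shell
#!/bin/sh
# Hypothetical monitor for step 4: poll the app home page once per second
# and log any window where a non-200 response or the HAProxy status page
# is served. APP_URL is a placeholder for the real app URL.
APP_URL="${APP_URL:-http://scalerolling-YOUR_DOMAIN.rhcloud.com}"

# An "outage" is any non-200 code, or a body that looks like the HAProxy
# status page (heuristic: it mentions haproxy).
is_outage() {
    code="$1"; body="$2"
    [ "$code" != "200" ] && return 0
    printf '%s' "$body" | grep -qi 'haproxy' && return 0
    return 1
}

monitor() {
    while true; do
        # -w '%{http_code}' prints the status code; the body goes to a file
        code=$(curl -s -o /tmp/monitor_body -w '%{http_code}' "$APP_URL")
        if is_outage "$code" "$(cat /tmp/monitor_body)"; then
            echo "$(date +%T) outage (HTTP $code)"
        fi
        sleep 1
    done
}
```

Run `monitor` in one terminal while triggering the scale-up in another; each logged line marks a moment the app was effectively down.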

Actual results:
When the new gears are added to the HAProxy configuration, there are no gears available to serve traffic and the HAProxy status page is shown.

Expected results:
The HAProxy status page is never shown because all instances are never stopped at the same time.

Additional info:
It appears that the failures start when the new gears are added to HAProxy. I believe this is because the HAProxy configuration is then updated to set the weight of the only functioning gear to 0 before the other gears are ready to accept traffic. On the head gear, the java process continued to run and was accessible (curl returned content), but that gear is not in rotation. At the time of failure, my HAProxy configuration had just been updated to the following (notice the weight of 0 in the local-gear config):

listen express
    cookie GEAR insert indirect nocache
    option httpchk GET /
    balance leastconn
    server  filler backup
    server gear-e37012e8c92611e292b322000a98b42e-mhicksbugs1 check fall 2 rise 3 inter 2000 cookie e37012e8c92611e292b322000a98b42e-mhicksbugs1
    server gear-e3730836c92611e292b322000a98b42e-mhicksbugs1 check fall 2 rise 3 inter 2000 cookie e3730836c92611e292b322000a98b42e-mhicksbugs1
    server local-gear weight 0

After the other gears have started, the application is served again.
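For contrast, an illustrative configuration for the expected behavior would leave the local-gear at a nonzero weight until the new gears pass their health checks. This is a sketch of the desired end state, not the actual change from the pull request; server addresses are elided exactly as in the capture above:

    listen express
        cookie GEAR insert indirect nocache
        option httpchk GET /
        balance leastconn
        server  filler backup
        # new gears enter rotation once their "check" probes (fall 2 rise 3) pass
        server gear-e37012e8c92611e292b322000a98b42e-mhicksbugs1 check fall 2 rise 3 inter 2000 cookie e37012e8c92611e292b322000a98b42e-mhicksbugs1
        server gear-e3730836c92611e292b322000a98b42e-mhicksbugs1 check fall 2 rise 3 inter 2000 cookie e3730836c92611e292b322000a98b42e-mhicksbugs1
        # local-gear keeps a nonzero weight until the gears above are serving
        server local-gear weight 1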
Comment 1 Mrunal Patel 2013-05-30 22:57:58 EDT
Sent Pull Request to Stage - https://github.com/openshift/origin-server/pull/2704

Already merged into master.
Comment 2 Meng Bo 2013-05-31 05:38:58 EDT
Checked on devenv_3296,

Scaled the app's gears from 1 to 3 and from 2 to 3.
During gear-up, visiting the home page of the app does not redirect to the /haproxy-status page.
The local-gear goes down only once all the other gears are up.

Will check it again on devenv-stage_355.
Comment 3 Meng Bo 2013-06-02 23:19:45 EDT
Checked on the current Stage, which has the same build as devenv-stage_356.

The issue has been fixed. The result is the same as on devenv_3296.

The local-gear goes down only after the other 2 web gears are up, and the home page does not redirect to the haproxy-status page during scale-up.

Move bug to verified.
Comment 4 mzimen 2013-10-25 07:57:00 EDT
While verifying this bug, I found that when scaling up, the home page is briefly redirected to the haproxy-status page.
Tested against recent devenv_3945 (ami-052b746c).

The test runs in two separate threads: one performs the scale-up while the other loops GET requests against the app's home page.
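That two-thread setup can be sketched in shell (a hypothetical reconstruction of the test, not the actual harness; the app name and timings are assumptions):

```shell
#!/bin/sh
# Hypothetical reconstruction of the two-thread test: one process scales
# the app up while another loops GET requests against the home page and
# counts any request that lands on /haproxy-status or returns 503.

# Classify one probe result; "$1" is "effective-URL http-code" from curl.
is_failure() {
    case "$1" in
        *haproxy-status*|*" 503") return 0 ;;
    esac
    return 1
}

scale_and_probe() {
    app_url="http://scalerolling-YOUR_DOMAIN.rhcloud.com"  # placeholder
    # "Thread" 1: run the scale-up in the background.
    rhc cartridge-scale jbosseap --app scalerolling --min 3 --max 6 &
    scale_pid=$!
    # "Thread" 2: probe the home page until the scale-up finishes.
    failures=0
    while kill -0 "$scale_pid" 2>/dev/null; do
        probe=$(curl -s -L -o /dev/null \
                     -w '%{url_effective} %{http_code}' "$app_url")
        is_failure "$probe" && failures=$((failures + 1))
        sleep 1
    done
    wait "$scale_pid"
    echo "failed requests during scale-up: $failures"
}
```

A nonzero count at the end indicates the downtime window this comment describes.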
Comment 5 Andy Goldstein 2013-10-25 12:26:11 EDT
If you only have a single web proxy (haproxy) gear for your application, it is not possible to scale and have the application remain up 100% of the time. There will be a small window when haproxy is restarting so it can see the new gears, at which point you'll have some downtime. The only way to avoid downtime is to make the application HA so it has at least 2 proxy gears, and then put a load balancer in front of the proxy gears and direct your traffic to the load balancer.

Having said that, it may be taking longer than it should for a single proxy gear to restart (which would account for the 503s).
Comment 6 Mrunal Patel 2013-10-25 19:52:01 EDT
As Andy mentioned, there will be a short period while haproxy is reloading when a few requests may be lost and 503s will be seen. However, you shouldn't be seeing the haproxy status page, as we no longer keep it as a backup page.

(I did a few rounds of testing and saw a few 503s, but only for a short period).
Comment 7 Meng Bo 2013-10-28 02:36:58 EDT
Per comment #5, moving the bug to VERIFIED.
