Red Hat Bugzilla – Bug 969007
Scaling has points when all gears are down
Last modified: 2014-01-29 19:47:48 EST
Description of problem:
When scaling from 2 to 3 gears, there are points in time when all gears are out of rotation, which causes downtime.
Version-Release number of selected component (if applicable):
Codebase as of 3/18/2013
Steps to Reproduce:
1. Spin up a development instance (assuming dev.rhcloud.com below)
2. Create a scaled JBoss EAP application named scalerolling:
rhc app-create scalerolling jbosseap -s
3. Scale the application to 3 gears
rhc cartridge-scale jbosseap --app scalerolling --min 3 --max 6
4. Monitor the application page to make sure the HAProxy status page isn't shown - http://scalerolling-YOUR_DOMAIN.rhcloud.com
Actual results:
When the new gears are added to the HAProxy configuration, there are no gears available to serve traffic and the HAProxy status page is shown.
Expected results:
The HAProxy status page is never shown because all instances are never stopped at the same time.
It appears that the failures start to occur when the new gears are added to HAProxy. I believe this is because the HAProxy configuration is then updated, setting the weight of the only functioning gear to 0 before the other gears are ready to accept traffic. On the head gear, the java process continued to run and was accessible (curl 127.0.253.129:8080 returned content), but that gear was not in rotation. At the time of failure, my HAProxy configuration had just been updated to the following (notice the weight of 0 on the local-gear config):
listen express 127.0.253.130:8080
    cookie GEAR insert indirect nocache
    option httpchk GET /
    server filler 127.0.253.131:8080 backup
    server gear-e37012e8c92611e292b322000a98b42e-mhicksbugs1 10.152.180.46:35571 check fall 2 rise 3 inter 2000 cookie e37012e8c92611e292b322000a98b42e-mhicksbugs1
    server gear-e3730836c92611e292b322000a98b42e-mhicksbugs1 10.152.180.46:35576 check fall 2 rise 3 inter 2000 cookie e3730836c92611e292b322000a98b42e-mhicksbugs1
    server local-gear 127.0.253.129:8080 weight 0
After the other gears have started, the application starts showing again.
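The suspected failure mode can be sketched with a small model of the configuration above. This is only an illustration, not OpenShift or HAProxy code: the function names `parse_servers` and `serving` are hypothetical, and the model simplifies HAProxy's behaviour by assuming that a server takes new traffic only if it is not a backup, has a non-zero weight (HAProxy's default weight is 1), and is currently able to answer, and that backup servers are used only when no such server exists.

```python
# Config fragment copied from the report above.
CONFIG = """\
listen express 127.0.253.130:8080
cookie GEAR insert indirect nocache
option httpchk GET /
server filler 127.0.253.131:8080 backup
server gear-e37012e8c92611e292b322000a98b42e-mhicksbugs1 10.152.180.46:35571 check fall 2 rise 3 inter 2000 cookie e37012e8c92611e292b322000a98b42e-mhicksbugs1
server gear-e3730836c92611e292b322000a98b42e-mhicksbugs1 10.152.180.46:35576 check fall 2 rise 3 inter 2000 cookie e3730836c92611e292b322000a98b42e-mhicksbugs1
server local-gear 127.0.253.129:8080 weight 0
"""


def parse_servers(config_text):
    """Return one dict per `server` line in an HAProxy config fragment."""
    servers = []
    for line in config_text.splitlines():
        tokens = line.split()
        if len(tokens) < 3 or tokens[0] != "server":
            continue
        options = tokens[3:]
        weight = 1  # HAProxy's default when no weight is given
        if "weight" in options:
            weight = int(options[options.index("weight") + 1])
        servers.append({
            "name": tokens[1],
            "address": tokens[2],
            "weight": weight,
            "backup": "backup" in options,
        })
    return servers


def serving(servers, ready_addresses):
    """Names of the servers that would take new traffic (simplified model).

    `ready_addresses` is an assumption of this sketch: the set of
    addresses whose gears can currently answer requests.
    """
    active = [
        s["name"] for s in servers
        if not s["backup"] and s["weight"] > 0 and s["address"] in ready_addresses
    ]
    if active:
        return active
    # No usable active server: traffic falls through to the backups.
    return [s["name"] for s in servers if s["backup"]]


servers = parse_servers(CONFIG)
# Only the head gear is ready, but its weight is 0.
during_scale_up = serving(servers, {"127.0.253.129:8080"})
# The two new gears have come up.
after_scale_up = serving(servers, {
    "127.0.253.129:8080",
    "10.152.180.46:35571",
    "10.152.180.46:35576",
})
print(during_scale_up)
print(after_scale_up)
```

Under these assumptions, only the `filler` backup is left to answer while the new gears are still starting, which matches the status page appearing during scale-up. Once both new gears are ready, they take the traffic.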
Sent Pull Request to Stage - https://github.com/openshift/origin-server/pull/2704
Already merged into master.
Checked on devenv_3296.
Scaled the app from 1 to 3 gears and from 2 to 3 gears.
While the gears were coming up, visited the home page of the app. It did not redirect to the /haproxy-status page.
The local-gear goes down only after all the other gears are up.
Will check it again on devenv-stage_355.
Checked on current Stage, which has the same build as devenv-stage_356.
The issue has been fixed. The result is the same as on devenv_3296.
The local-gear goes down only after the other 2 web gears are up, and the home page does not redirect to the haproxy-status page during scale-up.
Moving the bug to VERIFIED.
While verifying this bug, I found that when scaling up, the home page is redirected to the haproxy-status page for a while.
Tested against recent devenv_3945 (ami-052b746c).
Testing was done in two separate threads: one runs the scale-up, and the other loops issuing GET requests to the app home page.
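The GET loop from this test could look roughly like the sketch below. It is not the script used for this comment. The helper names (`fetch`, `poll`, `longest_outage`) and the 0.5 second interval are assumptions. A response counts as good only if it is a 200 whose final URL does not contain `haproxy-status`, because `urllib` follows redirects and a redirect to the status page would otherwise look like a success.

```python
import time
import urllib.error
import urllib.request


def fetch(url, timeout=5.0):
    """GET `url`; return (status, final_url), or (None, url) if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status, response.geturl()
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses (such as a 503) are raised as exceptions.
        return error.code, url
    except OSError:
        # Covers URLError, refused connections and timeouts.
        return None, url


def is_good(result):
    """A response is good if it is a 200 that is not the HAProxy status page."""
    status, final_url = result
    return status == 200 and "haproxy-status" not in final_url


def poll(url, attempts, interval=0.5, fetch=fetch, sleep=time.sleep):
    """Issue `attempts` GETs, `interval` seconds apart, and collect the results."""
    results = []
    for _ in range(attempts):
        results.append(fetch(url))
        sleep(interval)
    return results


def longest_outage(results):
    """Length of the longest run of consecutive bad responses."""
    longest = current = 0
    for result in results:
        if is_good(result):
            current = 0
        else:
            current += 1
            longest = max(longest, current)
    return longest
```

Run `poll("http://scalerolling-YOUR_DOMAIN.rhcloud.com", attempts=600)` in one thread or terminal while the scale-up runs in another, then pass the results to `longest_outage`. With the 0.5 second interval, the returned count gives a rough outage length in half-seconds.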
If you only have a single web proxy (haproxy) gear for your application, it is not possible to scale and have the application remain up 100% of the time. There will be a small window when haproxy is restarting so it can see the new gears, at which point you'll have some downtime. The only way to avoid downtime is to make the application HA so it has at least 2 proxy gears, and then put a load balancer in front of the proxy gears and direct your traffic to the load balancer.
Having said that, it may be taking longer than it should for a single proxy gear to restart (which would account for the 503s).
As Andy mentioned, there should be a short period while haproxy is reloading when a few requests may be lost and 503s will be seen. However, you shouldn't be seeing the haproxy status page, as we don't keep it as a backup page anymore.
(I did a few rounds of testing and saw a few 503s, but only for a short period).
According to comment #5, moving the bug to VERIFIED.