969007 – Scaling has points when all gears are down

Summary: Scaling has points when all gears are down

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	OpenShift Online
Classification:	Red Hat
Component:	Containers
Sub Component:
Version:	1.x
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Mrunal Patel
QA Contact:	libra bugs
Docs Contact:
URL:
Whiteboard:
Depends On:	963490 968994
Blocks:

Reported:	2013-05-30 12:58 UTC by Matt Hicks
Modified:	2014-01-30 00:47 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:	968994
Environment:
Last Closed:	2014-01-30 00:47:48 UTC
Target Upstream Version:
Embargoed:

Description Matt Hicks 2013-05-30 12:58:38 UTC

Description of problem:
When scaling from 2 to 3 gears, there are moments when all gears are out of rotation, causing downtime.

Version-Release number of selected component (if applicable):
Codebase as of 3/18/2013

How reproducible:
Consistent

Steps to Reproduce:
1. Spin up a development instance (assuming dev.rhcloud.com below)

2. Create a scaled JBoss EAP application named scalerolling:
rhc app-create scalerolling jbosseap -s

3. Scale the application to 3 gears
rhc cartridge-scale jbosseap --app scalerolling --min 3 --max 6

4. Monitor the application page to make sure the HAProxy status page isn't shown: http://scalerolling-YOUR_DOMAIN.rhcloud.com
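Step 4 can be automated with a small polling loop instead of watching the page by hand. This is a minimal sketch, not part of rhc: the function names are hypothetical, it assumes that a redirect to /haproxy-status (the symptom reported in this bug) marks a failure, and it also counts connection errors and non-200 responses so that plain 503s are visible too.

```python
import time
import urllib.error
import urllib.request
from collections import Counter
from typing import Callable, Tuple

def fetch(url: str, timeout: float = 5.0) -> Tuple[int, str]:
    """Return (HTTP status, final URL after redirects); status 0 means no connection."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.geturl()
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses (e.g. a 503 from HAProxy) land here
        return err.code, url
    except OSError:
        # Connection refused, DNS failure, timeout (URLError is an OSError)
        return 0, url

def classify(status: int, final_url: str) -> str:
    """Label one probe result; a redirect to /haproxy-status counts as a failure."""
    if "/haproxy-status" in final_url:
        return "haproxy-status"
    if status == 200:
        return "ok"
    if status == 0:
        return "unreachable"
    return f"http-{status}"

def probe(url: str, attempts: int, interval: float = 0.5,
          fetcher: Callable[[str], Tuple[int, str]] = fetch,
          sleep: Callable[[float], None] = time.sleep) -> Counter:
    """Poll url `attempts` times and count each outcome label."""
    counts: Counter = Counter()
    for _ in range(attempts):
        status, final_url = fetcher(url)
        counts[classify(status, final_url)] += 1
        sleep(interval)
    return counts
```

For example, `probe("http://scalerolling-YOUR_DOMAIN.rhcloud.com/", attempts=120)` run while step 3 is in progress should report only `ok` once the bug is fixed; the `fetcher` and `sleep` parameters exist so the loop can be exercised without a live application.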

Actual results:
When the new gears are added to the HAProxy configuration, there are no gears available to serve traffic and the HAProxy status page is shown.

Expected results:
The HAProxy status page is never shown because all instances are never stopped at the same time.

Additional info:
It appears that the failures start when the new gears are added to HAProxy. I believe this is because the HAProxy configuration is then updated to set the weight of the only functioning gear to 0 before the other gears are ready to accept traffic. On the head gear, the java process continued to run and was accessible (curl 127.0.253.129:8080 returned content), but that gear was not in rotation. At the time of failure, my HAProxy configuration had just been updated to the following (note the weight of 0 on the local-gear entry):

listen express 127.0.253.130:8080
    cookie GEAR insert indirect nocache
    option httpchk GET /
    balance leastconn
    server filler 127.0.253.131:8080 backup
    server gear-e37012e8c92611e292b322000a98b42e-mhicksbugs1 10.152.180.46:35571 check fall 2 rise 3 inter 2000 cookie e37012e8c92611e292b322000a98b42e-mhicksbugs1
    server gear-e3730836c92611e292b322000a98b42e-mhicksbugs1 10.152.180.46:35576 check fall 2 rise 3 inter 2000 cookie e3730836c92611e292b322000a98b42e-mhicksbugs1
    server local-gear 127.0.253.129:8080 weight 0

After the other gears have started, the application is served again.
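The failure described above comes down to one ordering rule: local-gear should only get weight 0 once the new gears can actually take traffic. The sketch below renders a `listen express` block shaped like the one quoted above and applies that rule. It is illustrative only and is not the code from the pull request in Comment 1: the `Gear` type, its `ready` flag, and the function names are hypothetical, readiness is assumed to be known by the caller, and "weight 1" for an in-rotation local-gear is an assumption (1 is HAProxy's default server weight).

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Gear:
    gear_id: str   # e.g. "<uuid>-<namespace>", also used as the GEAR cookie value
    address: str   # "ip:port" the proxy connects to
    ready: bool    # hypothetical: True once the gear's app server answers requests

def local_gear_weight(remote: List[Gear]) -> int:
    """Keep the head gear in rotation until every remote gear is ready."""
    if remote and all(g.ready for g in remote):
        return 0   # the other gears can carry the traffic; take local-gear out
    return 1       # assumption: HAProxy's default server weight

def render_listen(bind: str, local_addr: str, filler_addr: str,
                  remote: List[Gear]) -> str:
    """Render a listen block; local-gear's weight follows local_gear_weight()."""
    lines = [
        f"listen express {bind}",
        "    cookie GEAR insert indirect nocache",
        "    option httpchk GET /",
        "    balance leastconn",
        f"    server filler {filler_addr} backup",
    ]
    for g in remote:
        lines.append(f"    server gear-{g.gear_id} {g.address} "
                     f"check fall 2 rise 3 inter 2000 cookie {g.gear_id}")
    lines.append(f"    server local-gear {local_addr} weight {local_gear_weight(remote)}")
    return "\n".join(lines)
```

With one of two new gears still starting, the rendered block keeps local-gear at weight 1; only when both report ready does it emit `weight 0`, which matches the behavior later verified in Comments 2 and 3.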

Comment 1 Mrunal Patel 2013-05-31 02:57:58 UTC

Sent Pull Request to Stage - https://github.com/openshift/origin-server/pull/2704

Already merged into master.

Comment 2 Meng Bo 2013-05-31 09:38:58 UTC

Checked on devenv_3296.

Scaled the app from 1 to 3 gears and from 2 to 3 gears.
During scale-up, visiting the home page of the app does not redirect to the /haproxy-status page.
The local-gear is taken down only after all the other gears are up.

Will check it again on devenv-stage_355.

Comment 3 Meng Bo 2013-06-03 03:19:45 UTC

Checked on the current Stage, which has the same build as devenv-stage_356.

The issue has been fixed. The result is the same as on devenv_3296.

The local-gear goes down only after the other 2 web gears are up, and the home page does not redirect to the haproxy-status page during scale-up.

Moving the bug to VERIFIED.

Comment 4 mzimen 2013-10-25 11:57:00 UTC

While verifying this bug, I found that when scaling up, the home page is briefly redirected to the haproxy-status page.
Tested against recent devenv_3945 (ami-052b746c).

The test runs in two separate threads: one performs the scale-up, and the other loops sending GET requests to the app home page.
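The two-thread test described above can be sketched as follows. `scale` and `check` are caller-supplied callables with hypothetical names: `scale` would run the scale-up (for example the `rhc cartridge-scale jbosseap --app scalerolling --min 3 --max 6` command from the reproduction steps, via `subprocess.run`), and `check` would issue one GET to the home page and return a label such as "ok" or "haproxy-status". This is an assumed harness, not the tester's actual script.

```python
import threading
from collections import Counter
from typing import Callable

def run_concurrently(scale: Callable[[], None], check: Callable[[], str],
                     interval: float = 0.2) -> Counter:
    """Run scale() in one thread while a second thread calls check() in a loop.

    Polling stops once scale() returns (or raises); returns a count of check() labels.
    """
    done = threading.Event()
    results: Counter = Counter()  # only touched by the poller thread

    def scaler() -> None:
        try:
            scale()
        finally:
            done.set()  # tell the poller to stop even if scaling failed

    def poller() -> None:
        while not done.is_set():
            results[check()] += 1
            done.wait(interval)  # sleep, but wake early when scaling finishes

    threads = [threading.Thread(target=scaler), threading.Thread(target=poller)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Using `Event.wait(interval)` as the sleep means the poller exits promptly once the scale-up finishes instead of oversleeping by up to one interval.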

Comment 5 Andy Goldstein 2013-10-25 16:26:11 UTC

If you only have a single web proxy (haproxy) gear for your application, it is not possible to scale and have the application remain up 100% of the time. There will be a small window when haproxy is restarting so it can see the new gears, at which point you'll have some downtime. The only way to avoid downtime is to make the application HA so it has at least 2 proxy gears, and then put a load balancer in front of the proxy gears and direct your traffic to the load balancer.
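The arithmetic behind this comment: behind an external load balancer, the application is completely unavailable only while every proxy gear is reloading at the same time, i.e. during the intersection of their reload windows. A minimal sketch, assuming each proxy gear reloads exactly once in a known (start, end) window and the load balancer routes around a reloading proxy immediately; the function name and window values are illustrative, not measurements.

```python
from typing import List, Tuple

Window = Tuple[float, float]  # (start, end) of one proxy gear's reload, in seconds

def full_outage(windows: List[Window]) -> float:
    """Seconds during which every proxy gear is reloading at once.

    Assumes each proxy reloads once and serves traffic outside its window.
    """
    if not windows:
        raise ValueError("need at least one proxy gear")
    start = max(s for s, _ in windows)  # latest reload start
    end = min(e for _, e in windows)    # earliest reload end
    return max(0.0, end - start)        # empty intersection means no full outage
```

With a single proxy gear, `full_outage([(10.0, 10.5)])` is the whole 0.5 s reload window; with two proxy gears reloaded one after the other, `full_outage([(10.0, 10.5), (12.0, 12.5)])` is 0.0, which is why the HA setup avoids downtime.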

Having said that, it may be taking longer than it should for a single proxy gear to restart (which would account for the 503s).

Comment 6 Mrunal Patel 2013-10-25 23:52:01 UTC

As Andy mentioned, there will be a short period while haproxy is reloading during which a few requests may be lost and 503s will be seen. However, you shouldn't see the haproxy status page, because we no longer keep it as a backup page.

(I did a few rounds of testing and saw a few 503s, but only for a short period).

Comment 7 Meng Bo 2013-10-28 06:36:58 UTC

According to comment #5, moving the bug to VERIFIED.
