Bug 1006085

Summary: haproxy_ctld keeps sending scale-up events in error conditions
Product: OpenShift Online
Component: Containers
Version: 2.x
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Reporter: Rajat Chopra <rchopra>
Assignee: Dan McPherson <dmcphers>
QA Contact: libra bugs <libra-bugs>
CC: bmeng, yadu
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2013-09-19 16:50:19 UTC

Description Rajat Chopra 2013-09-10 01:37:10 UTC
Description of problem:
The auto-scaler daemon in the haproxy cartridge is designed to send scale-up events when traffic rises. However, if the broker returns an error, the daemon does not back off; it keeps sending further scale-up requests.
It is not clear what the solution should be, but when the broker is stuck on the application's pending_op queue, the result is an endless stream of requests to the broker at 10-second intervals.
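For context, here is a rough Ruby sketch of the kind of polling loop described above. The names are illustrative only, not the actual haproxy_ctld source; 'add-gear' stands in for however the cartridge asks the broker to add a gear.

# Rough sketch of the flooding behaviour described above (hypothetical names).
UP_THRESHOLD = 90.0   # percent, matching up_thresh in the logs below

def current_capacity
  # Placeholder for reading the current session capacity from haproxy stats.
  rand(80.0..130.0)
end

loop do
  if current_capacity > UP_THRESHOLD
    system('add-gear')   # the result is ignored, so broker errors never slow it down
  end
  sleep 10               # the 10-second interval mentioned above
end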

Version-Release number of selected component (if applicable):


How reproducible:
Always, in a broken app with a lot of traffic.

Steps to Reproduce:
1. Create a scalable app and create a pending op that is broken (e.g. missing its type). Alternatively, limit the application's gears by putting a cap on the user's max_gear limit.
2. Send traffic to the application.
3. Check scale_events.log; the scale-up requests keep coming relentlessly even though they are failing.

Actual results:
haproxy_ctld keeps sending scale-up events every 10 seconds.

Expected results:
haproxy_ctld should recognize when the broker is returning errors on scale-up requests and stop flooding the already beleaguered broker.


Additional info:

Comment 1 Dan McPherson 2013-09-10 22:49:58 UTC
https://github.com/openshift/origin-server/pull/3611

It now waits 10 minutes before retrying a scale-up or scale-down if the last such request failed.
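Roughly, that amounts to remembering when the last scale attempt failed and suppressing retries for 10 minutes. A hedged sketch with hypothetical names (not the code from the pull request):

FAILURE_COOLDOWN = 600  # seconds; the 10-minute window mentioned above

def cooled_down?(last_failure_time, window = FAILURE_COOLDOWN)
  last_failure_time.nil? || (Time.now - last_failure_time) > window
end

last_failure = { up: nil, down: nil }

if cooled_down?(last_failure[:up])
  ok = system('add-gear')                 # hypothetical scale-up call
  last_failure[:up] = Time.now unless ok  # start the cool-down on failure
end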

Comment 3 Meng Bo 2013-09-11 12:23:18 UTC
Checked on devenv_3771; it still sends a GEAR_UP event roughly every minute.

I, [2013-09-11T08:17:42.471038 #5516]  INFO -- : GEAR_UP - capacity: 109.375% gear_count: 2 sessions: 35 up_thresh: 90.0%
I, [2013-09-11T08:18:17.831972 #5516]  INFO -- : GEAR_UP - add-gear: exit: 0  stdout: Already at the maximum number of gears allowed for either the app or your account.

I, [2013-09-11T08:18:22.918896 #5516]  INFO -- : GEAR_UP - capacity: 100.0% gear_count: 2 sessions: 32 up_thresh: 90.0%
I, [2013-09-11T08:18:38.093305 #5516]  INFO -- : GEAR_UP - add-gear: exit: 0  stdout: Already at the maximum number of gears allowed for either the app or your account.

I, [2013-09-11T08:19:09.461168 #5516]  INFO -- : GEAR_UP - capacity: 106.25% gear_count: 2 sessions: 34 up_thresh: 90.0%
I, [2013-09-11T08:19:27.447006 #5516]  INFO -- : GEAR_UP - add-gear: exit: 0  stdout: Already at the maximum number of gears allowed for either the app or your account.

I, [2013-09-11T08:20:09.131737 #5516]  INFO -- : GEAR_UP - capacity: 128.125% gear_count: 2 sessions: 41 up_thresh: 90.0%
I, [2013-09-11T08:20:26.942704 #5516]  INFO -- : GEAR_UP - add-gear: exit: 0  stdout: Already at the maximum number of gears allowed for either the app or your account.


Not sure why the failed add-gear reports exit code 0. Maybe that is what is causing this?
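If the cartridge shells out for add-gear, the usual way to surface the real exit status in Ruby is Open3. A minimal sketch under that assumption (not the actual fix in PR 3618; 'add-gear' stands in for however the scale-up is actually invoked):

require 'open3'
require 'logger'

log = Logger.new(STDOUT)

stdout, _stderr, status = Open3.capture3('add-gear')
log.info "GEAR_UP - add-gear: exit: #{status.exitstatus}  stdout: #{stdout}"
scale_up_failed = !status.success?   # only exit status 0 counts as success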

Comment 4 Dan McPherson 2013-09-11 15:49:14 UTC
https://github.com/openshift/origin-server/pull/3618

Comment 6 Meng Bo 2013-09-12 07:57:15 UTC
Tested again on devenv_3776; the issue has been fixed.
It now re-triggers the GEAR_UP event only every 10 minutes.

I, [2013-09-12T02:58:45.285264 #15431]  INFO -- : Starting haproxy_ctld
I, [2013-09-12T02:58:45.327543 #15431]  INFO -- : GEAR_UP - capacity: 93.75% gear_count: 2 sessions: 30 up_thresh: 90.0%
I, [2013-09-12T02:59:01.987247 #15627]  INFO -- : Starting haproxy_ctld
I, [2013-09-12T03:00:10.709793 #15627]  INFO -- : GEAR_UP - capacity: 109.375% gear_count: 2 sessions: 35 up_thresh: 90.0%
I, [2013-09-12T03:00:27.532465 #15627]  INFO -- : GEAR_UP - add-gear: exit: 1  stdout: Already at the maximum number of gears allowed for either the app or your account.

I, [2013-09-12T03:10:40.333767 #15627]  INFO -- : GEAR_UP - capacity: 93.75% gear_count: 2 sessions: 30 up_thresh: 90.0%
I, [2013-09-12T03:11:21.386182 #15627]  INFO -- : GEAR_UP - add-gear: exit: 1  stdout: Already at the maximum number of gears allowed for either the app or your account.

I, [2013-09-12T03:21:25.224495 #15627]  INFO -- : GEAR_UP - capacity: 93.75% gear_count: 2 sessions: 30 up_thresh: 90.0%
I, [2013-09-12T03:21:40.343013 #15627]  INFO -- : GEAR_UP - add-gear: exit: 1  stdout: Already at the maximum number of gears allowed for either the app or your account.

I, [2013-09-12T03:31:51.258314 #15627]  INFO -- : GEAR_UP - capacity: 106.25% gear_count: 2 sessions: 34 up_thresh: 90.0%
I, [2013-09-12T03:32:17.398757 #15627]  INFO -- : GEAR_UP - add-gear: exit: 1  stdout: Already at the maximum number of gears allowed for either the app or your account.

I, [2013-09-12T03:42:21.763609 #15627]  INFO -- : GEAR_UP - capacity: 153.125% gear_count: 2 sessions: 49 up_thresh: 90.0%
I, [2013-09-12T03:42:36.486227 #15627]  INFO -- : GEAR_UP - add-gear: exit: 1  stdout: Already at the maximum number of gears allowed for either the app or your account.