Bug 1036728 - Orange PoC Escalation with OpenShift auto scaling
Summary: Orange PoC Escalation with OpenShift auto scaling
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 1.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Luke Meyer
QA Contact: libra bugs
URL:
Whiteboard:
Depends On: 1056700 1057183
Blocks:
 
Reported: 2013-12-02 14:15 UTC by Shashin Shinde
Modified: 2023-09-14 01:54 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1056700 (view as bug list)
Environment:
Last Closed: 2014-03-21 20:02:18 UTC
Target Upstream Version:
Embargoed:


Attachments
Document containing their test results (527.93 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2013-12-02 14:15 UTC, Shashin Shinde

Description Shashin Shinde 2013-12-02 14:15:13 UTC
Created attachment 831603 [details]
Document containing their test results

Description of problem:

Hello All,

We had a call with Orange (doing a PoC evaluation) on Fri Nov 29, 2013, and here is the summary of it.

They are facing a problem with OpenShift entering a loop state where it keeps adding and removing a gear even though the load is constant.

Here is our understanding of the issue, with the following parameters and assumptions:
The application is hit with a constant 100 requests/second, unchanged over the test period.
The example gear threshold is 30 (i.e. the number of requests served per gear).

Given the above numbers:
OpenShift will keep adding gears to the scalable application until it reaches 4 gears.
3 gears can serve only 90 req/sec, which is not enough for the 100 req/sec load.
However, with 4 gears the total capacity (120 req/sec) is much higher than needed, so it scales back down to 3.

The result is that it constantly creates the fourth gear, removes it, adds it again, removes it again, and so on.

A simple logical idea for a fix might be: if a scale-down event would immediately trigger a scale-up event based on the current load, then do not scale down.
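
For illustration only, here is a minimal Python sketch of that idea; the names, load, and per-gear threshold are just the example numbers from above, not the actual haproxy_ctld logic:

# Hypothetical sketch of the proposed guard -- not the shipped autoscaler.
REQS_PER_GEAR = 30          # example per-gear threshold from above
LOAD = 100                  # constant requests/second

def should_scale_down(gears, load=LOAD, per_gear=REQS_PER_GEAR):
    """Only scale down if the remaining gears could still absorb the load."""
    if gears <= 1:
        return False
    remaining_capacity = (gears - 1) * per_gear
    # If dropping a gear would push the load back over capacity, a scale-up
    # would fire right away -- so skip the scale-down.
    return load <= remaining_capacity

print(should_scale_down(4))   # False: 3 * 30 = 90 < 100, keep the fourth gear
print(should_scale_down(5))   # True:  4 * 30 = 120 >= 100, safe to drop to 4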
Hope that helps to clarify the situation and what they are facing.

Please let me know how we would like to proceed here.
I have convinced them that this fix is not needed for Dec 11 event and they are fine with that.

But they still want and expect an answer from us on when we will be able to fix it.

Thanks,
Shashin.

Refer to the following bugzilla for how to test it.
https://bugzilla.redhat.com/show_bug.cgi?id=1035613

Comment 2 Brenton Leanhardt 2013-12-04 19:04:57 UTC
If traffic is expected to be sustained for a long period of time I think it would be best to set the min-scale for the web cartridge to what they want.  That way they can allow OpenShift to autoscale whenever needed (beyond what they expect).

Comment 3 Luke Meyer 2013-12-26 23:35:15 UTC
I'm not sure I'm getting the same results from looking at the math. Can you see something I'm doing wrong?

If the gear capacity is 16, here are the gear up/down thresholds at each number of web framework gears:

#      1    2    3    4    5    6    7 ...
up    15   29   44   58   72   87  101
down  na   14   28   40   54   67   81
gdt      .405 .585 .630 .675 .705 .726

(Gear up is when 90% of capacity is taken. Gear down threshold is more complicated, multiplier shown as gdt.) Notice that gear down threshold is always below the previous gear up threshold, meaning that if it's low enough to trigger the gear down, it's too low to trigger gear up, just like they expect.

With gear capacity at 30, the table looks different:
#      1    2    3    4    5    6    7 ...
up    27   54   81  108  135  162  189
down  na   24   52   75  101  126  152
gdt      .405 .585 .630 .675 .705 .726

... but the gear-down threshold at each gear count still sits below the previous gear-up threshold, so there should be no flapping.
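
To make that concrete, here is a quick cross-check using only the numbers from the two tables above (the script adds nothing beyond the comparison itself):

# Cross-check of the tables above: the gear-down threshold at n gears should
# sit below the gear-up threshold at n-1 gears, leaving a stable band between them.
tables = {
    16: {"up":   [15, 29, 44, 58, 72, 87, 101],
         "down": [None, 14, 28, 40, 54, 67, 81]},
    30: {"up":   [27, 54, 81, 108, 135, 162, 189],
         "down": [None, 24, 52, 75, 101, 126, 152]},
}

for capacity, t in tables.items():
    for n in range(2, 8):                  # n = number of web framework gears
        down_n = t["down"][n - 1]
        up_prev = t["up"][n - 2]
        ok = down_n < up_prev
        print(f"capacity={capacity} gears={n}: down={down_n} < up({n-1})={up_prev}"
              f" -> {'ok' if ok else 'FLAP RISK'}")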

To test what happens in reality, I'm going to set gear capacity to 30 and run this for a while:
ab -ki -c 100 -n 9999999 http://scaleme-demo.ose201.example.com/

I'll have a look at the logs tomorrow.

A couple of things to mention:
1. By setting the gear capacity to 30, they're modifying the algorithm. In general, if they're going to do that, they may need to fix the rest of the algorithm to match. It's not clear to me whether there's ever a capacity at which mathematically there are holes between the thresholds causing flap. But I don't think it's wise to just adjust one number without looking at all the effects.
2. The gears themselves decide how many clients to allow based on per-gear RAM. For those that leverage httpd (like PHP), a standard small gear actually sets MaxClients to 17 (have a look at php/conf/performance.conf in a PHP app). So gears will actually queue connections at 17, not serve 30 at once. This would be different for Node or JBoss of course.

Comment 4 Luke Meyer 2013-12-27 14:43:05 UTC
My logs showed that there was indeed a lot of flapping, and it wasn't even as orderly as their results suggested. Under constant traffic, it would sometimes scale up or down more than once in a row. It's clearly not just crossing back and forth over a threshold.

The reason appears to be that the statistic compared against the thresholds - current HTTP sessions on HAproxy - fluctuates wildly. There is another number reported in the HAproxy stats that appears to be the max connections over some period and is much more consistent, but I haven't been able to find how to configure the period length so that this can scale down in a reasonable interval. I also worry about using that number, as it would be pretty easy for an attacker to game it with small spurts of traffic. But I don't see a better number available.

The fluctuation could be something specific to my test system, but it's a troubling result. I'll bring it up with the HAproxy experts on the team to see what might be done.

Comment 5 Luke Meyer 2014-01-08 17:57:12 UTC
I just want to bring this up to date with some clarification from internal conversations.

As an overview, the path a request takes goes through several tiers, and each introduces potential complexities.

Tier 1 is the node host httpd proxy (for our purposes anyway - other front end proxies could be tested).
Tier 2 is the application LB gear HAproxy.
Tier 3 is the port proxy for the receiving gear (bypassed if the request is routed to the same gear).
Tier 4 is the application itself (which may consist of further tiers - httpd, app framework, DB, ...).

Traffic measurement and the scaling decision take place at the tier 2 gear HAproxy, but obviously this is affected by the other tiers. There are a number of issues to be conscious of.

Issue 1: httpd blocking
In my testing, I found that for long periods (on the order of 10 seconds) the Tier 1 httpd would stop sending requests to the Tier 2 HAproxy (503 response). It is not yet clear to me why this happened - perhaps something to do with how proxy workers or network sockets are allocated. I believe this is what caused the sort of flapping I saw where it would gear up twice and then down twice - periods of "0 traffic" at tier 2 alternated with more representative throughput. However, this does not match the results the customer saw, so it may have just been something peculiar to my deployment.

Issue 2: no keepalive
Keepalive requests to Tier 1 were not kept alive; the response comes back with a "Connection: close" header, effectively limiting each connection to one request. I believe this is because requests from the tier 1 proxy to tier 2 are not kept alive (which is probably the right choice), but it seems to me the client connection to tier 1 could remain open; it would be worth researching whether there is a config option for this, purely on the grounds of app performance. The side effect is that more time is spent creating connections to tier 1 when trying to drive traffic, meaning less time during which an actual request reaches tier 2, which makes the traffic measurement there look more erratic relative to the concurrency deployed against tier 1.
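
One quick way to observe this from the client side is a rough sketch like the one below; the hostname is just the test app from comment 3, substitute any scaled app URL:

# Rough client-side check for "Connection: close" on the tier 1 front end.
import http.client

HOST = "scaleme-demo.ose201.example.com"   # test app from comment 3
conn = http.client.HTTPConnection(HOST)
for i in range(3):
    conn.request("GET", "/")
    resp = conn.getresponse()
    resp.read()                            # drain the body so the socket could be reused
    print(i, resp.status, resp.getheader("Connection"))
    if (resp.getheader("Connection") or "").lower() == "close":
        conn.close()                       # server ended the connection; reconnect
        conn = http.client.HTTPConnection(HOST)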

Issue 3: haproxy stats
If we run our load testing directly against the tier 2 HAproxy (where keepalive is actually kept alive), the traffic measurement is still fairly erratic. How erratic has much to do with the length of the request; apparently, the only useful number we have for concurrent requests accounts only for requests actually in flight. It does not count current TCP *connections*. So when the actual request is extremely fast, as it generally would be if someone just created an example app and tested the index page, the time spent servicing a request from tier 2's perspective is a small fraction of the total time the client spends on the request (which includes forming the request, possibly connecting, sending it, receiving and processing the response, possibly disconnecting), and many TCP connections will not be counted as active requests.
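
A back-of-the-envelope estimate shows the effect; the specific timings here are illustrative assumptions, not measurements:

# Rough estimate: the HAproxy stat counts only requests actually in flight at
# tier 2, not open client connections (illustrative numbers).
client_concurrency = 100      # e.g. ab -c 100
service_time_ms    = 3        # time tier 2 spends servicing a fast index page
client_cycle_ms    = 12       # full client cycle: connect, send, receive, process

# Fraction of each client's cycle during which its request is visible at tier 2:
visible_fraction   = service_time_ms / client_cycle_ms
measured_sessions  = client_concurrency * visible_fraction
print(round(measured_sessions))   # ~25 "current sessions" despite 100 busy clients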

To get a more representative result here, it helps to introduce some delay in the app's response, say 50 or 100 ms instead of the 2-5 ms typically seen at the default app index. Measured request concurrency stays closer to client concurrency the longer the actual requests take to service; in the case of a real application experiencing overload and slowdown (not to mention contention with gears from other applications), this would perform reasonably well, but in the most natural load testing scenarios with short requests, it appears totally erratic. So PoCs would be advised to introduce some delay or use a real app with representative requests to test the auto-scaling. In many cases, manual scaling is a better fit. Work is also proceeding on allowing the cartridge user to plug in their own scaling algorithm, though this is apparently not ready for prime time just now.
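
If a synthetic target is all that's available, a trivial endpoint with an artificial delay is enough; the sketch below uses Flask purely as an example, and a sleep in any existing app's index page works the same way:

# Minimal test app with ~50 ms of artificial work per request (Flask chosen
# only for brevity; the point is the delay, not the framework).
import time
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    time.sleep(0.05)          # simulate ~50 ms of real request processing
    return "ok\n"

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)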

Even with requests on the order of 50ms, the kind of variation seen in the concurrency measurement at tier 2 could be enough to induce occasional flapping across gear up/down thresholds. For this reason, we've created a story to use a moving average of multiple measurements instead of the current "spot check" - https://trello.com/c/yAdClbTm/101-3-use-moving-averages-for-scale-up-scale-down-calculations-in-ha-proxy (should be public to all Trello users)
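
For reference, here is the gist of that story sketched in Python; the sampling window, readings, and thresholds are placeholders, not what haproxy_ctld will actually use:

# Placeholder sketch of "moving average instead of spot check" for scaling decisions.
from collections import deque

WINDOW = 10                        # number of recent samples to average (placeholder)
samples = deque(maxlen=WINDOW)

def record_sample(current_sessions):
    """Called on each sampling tick with the HAproxy current-sessions reading."""
    samples.append(current_sessions)
    return sum(samples) / len(samples)

def decide(avg, up_threshold, down_threshold):
    # Compare the smoothed value, not the latest spot reading, against thresholds.
    if avg >= up_threshold:
        return "scale up"
    if avg <= down_threshold:
        return "scale down"
    return "hold"

# Noisy readings hovering below the capacity-30 single-gear up threshold of 27:
for reading in (30, 22, 28, 20, 27, 29, 21, 26, 28, 30):
    avg = record_sample(reading)
print(decide(avg, up_threshold=27, down_threshold=24))
# "hold" -- avg is ~26.1 even though the last spot reading (30) crosses the up threshold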

Issue 4: blocking sockets
There were two suggested directions of inquiry around sockets. One is to see if there were a lot of sockets in TIME_WAIT, which there were whether sending traffic to tier 1 or tier 2. The other was an issue reported in OpenShift Online where the nofile ulimit actually blocks the tier 2 HAproxy from creating enough connections to tier 3, resulting in poor utilization of child gears (Bug 971610). It was not clear to me that these had any bearing at the level of concurrency I was testing with (50-100), but I mention them for completeness.
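
For anyone repeating the first check, counting TIME_WAIT sockets is straightforward; a rough sketch reading /proc/net/tcp (IPv4 only; ss or netstat give the same answer):

# Count sockets in TIME_WAIT by reading /proc/net/tcp directly.
# State code "06" is TCP_TIME_WAIT in the kernel's TCP state table.
def count_time_wait(path="/proc/net/tcp"):
    count = 0
    with open(path) as f:
        next(f)                      # skip the header line
        for line in f:
            fields = line.split()
            if fields[3] == "06":    # fourth column is the connection state
                count += 1
    return count

print("TIME_WAIT sockets:", count_time_wait())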

There are other tuning factors that could be examined, like the server limits on tier 1 and tier 4 httpd instances, tcp settings (http://serverfault.com/questions/212093/how-to-reduce-number-of-sockets-in-time-wait) etc.

Comment 6 Dan McPherson 2014-01-09 18:38:21 UTC
This PR should help a lot:

https://github.com/openshift/origin-server/pull/4438

Comment 8 Red Hat Bugzilla 2023-09-14 01:54:44 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

