Created attachment 831603 [details] Document containing their test results

Description of problem:

Hello All,

We had a call with Orange (doing a PoC evaluation) on Fri Nov 29, 2013, and here is a summary. They are facing a problem where OpenShift (OS) enters a loop state: it keeps adding and removing a gear even though the load is constant.

Here is our understanding of the issue, with the following parameters and assumptions:

- The application is hit with a constant 100 requests/second, without change, over a period of time.
- The gear threshold is 30 (i.e. the number of requests served per gear).

Given the above numbers, OpenShift will keep adding gears to the scalable application until it reaches 4: 3 gears serve only 90 req/sec, which is not enough, but with 4 gears the total capacity is much higher than needed, so it scales back down to 3. The result is that it constantly creates the fourth gear, removes it, creates it again, and so on.

A simple logical fix might be: if a scale-down event would immediately trigger a scale-up event based on the current load, then do not scale down.

Hope that helps clarify the situation they are facing. Please let me know how we would like to proceed here. I have convinced them that this fix is not needed for the Dec 11 event, and they are fine with that, but they still want and expect an answer from us on when we will be able to fix it.

Thanks,
Shashin.

Refer to the following bugzilla for how to test it:
https://bugzilla.redhat.com/show_bug.cgi?id=1035613
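To make the suggested guard concrete, here is a minimal sketch (not OpenShift's actual scaling code; the names `GEAR_THRESHOLD`, `SCALE_UP_FACTOR`, and `decide` are hypothetical, and the 90% scale-up factor is an assumption for illustration) of "don't scale down if the smaller deployment would immediately be over its own scale-up threshold":

```python
# Hypothetical sketch of the proposed guard, NOT the real algorithm.
GEAR_THRESHOLD = 30     # requests/sec one gear can serve (from the report)
SCALE_UP_FACTOR = 0.9   # assumed: scale up when load exceeds 90% of capacity

def decide(gears: int, load: float) -> int:
    """Return the new gear count for the observed load."""
    capacity = gears * GEAR_THRESHOLD
    if load > SCALE_UP_FACTOR * capacity:
        return gears + 1
    # The proposed guard: only scale down if the smaller deployment would
    # NOT itself be above its scale-up threshold for the same load.
    if gears > 1 and load <= SCALE_UP_FACTOR * (gears - 1) * GEAR_THRESHOLD:
        return gears - 1
    return gears

# With the guard, a constant 100 req/sec settles at 4 gears instead of flapping:
gears = 1
for _ in range(10):
    gears = decide(gears, 100)
# gears is now 4, and decide(4, 100) == 4 - a stable fixed point.
```

Under this sketch, at 4 gears and 100 req/sec neither branch fires (100 < 0.9 * 120, but 100 > 0.9 * 90), so the fourth gear is kept.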
If traffic is expected to be sustained for a long period of time, I think it would be best to set the minimum scale for the web cartridge to what they want. That way they can still allow OpenShift to autoscale whenever needed (beyond what they expect).
I'm not sure I'm getting the same results from looking at the math. Can you see something I'm doing wrong?

If the gear capacity is 16, here are the gear-up/down thresholds at each number of web framework gears:

  #       1     2     3     4     5     6     7  ...
  up     15    29    44    58    72    87   101
  down   na    14    28    40    54    67    81
  gdt       .405  .585  .630  .675  .705  .726

(Gear-up happens when 90% of capacity is taken. The gear-down threshold is more complicated; its multiplier is shown as gdt.)

Notice that the gear-down threshold is always below the previous gear-up threshold, meaning that if traffic is low enough to trigger a gear-down, it is too low to then trigger a gear-up, just as they expect.

With gear capacity at 30, the table looks different:

  #       1     2     3     4     5     6     7  ...
  up     27    54    81   108   135   162   189
  down   na    24    52    75   101   126   152
  gdt       .405  .585  .630  .675  .705  .726

... but the thresholds still look like they overlap, so there should be no flapping.

To test what happens in reality, I'm going to set gear capacity to 30 and run this for a while:

  ab -ki -c 100 -n 9999999 http://scaleme-demo.ose201.example.com/

I'll have a look at the logs tomorrow.

A couple of things to mention:

1. By setting the gear capacity to 30, they're modifying the algorithm. In general, if they're going to do that, they may need to fix the rest of the algorithm to match. It's not clear to me whether there's ever a capacity at which, mathematically, there are holes between the thresholds causing flapping, but I don't think it's wise to adjust one number without looking at all the effects.

2. The gears themselves decide how many clients to allow based on per-gear RAM. For those that leverage httpd (like PHP), a standard small gear actually sets MaxClients to 17 (have a look at php/conf/performance.conf in a PHP app). So gears will actually queue connections at 17, not serve 30 at once. This would be different for Node or JBoss, of course.
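The "thresholds never leave a gap" claim above can be checked mechanically. This sketch just hard-codes the two tables from the comment (capacities 16 and 30, gears 1-7) and asserts that the gear-down threshold at n gears is always below the gear-up threshold at n-1 gears:

```python
# Threshold tables copied verbatim from the comment; index i = i+1 gears.
thresholds = {
    16: {"up":   [15, 29, 44, 58, 72, 87, 101],
         "down": [None, 14, 28, 40, 54, 67, 81]},
    30: {"up":   [27, 54, 81, 108, 135, 162, 189],
         "down": [None, 24, 52, 75, 101, 126, 152]},
}

for cap, t in thresholds.items():
    for i in range(1, 7):
        # down at i+1 gears must be below up at i gears, or flapping is possible
        assert t["down"][i] < t["up"][i - 1], (cap, i + 1)
```

Both tables pass, which matches the observation that, on paper, crossing the gear-down threshold should never immediately put you back over the gear-up threshold.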
My logs showed that there was indeed a lot of flapping, and it wasn't even as orderly as their results suggested: under constant traffic, it would sometimes scale up or down more than once in a row. It's clearly not just crossing back and forth over a threshold. The reason appears to be that the statistic compared against the thresholds - the current HTTP session count on HAproxy - fluctuates wildly. There is another number reported in the HAproxy stats that appears to be the maximum connections over some period and is much more consistent, but I haven't been able to find how to configure the period length so that scale-down could happen in a reasonable interval. I also worry about using that number, as it would be pretty easy for an attacker to game it with small spurts of traffic. But I don't see a better number available. The fluctuation could be something specific to my test system, but it's a troubling result. I'll bring it up with the HAproxy experts on the team to see what might be done.
I just want to bring this up to date with some illumination from internal conversations.

As an overview, the path a request takes goes through several tiers, and each introduces potential complexities:

  Tier 1: the node host httpd proxy (for our purposes, anyway - other front-end proxies could be tested).
  Tier 2: the application LB gear HAproxy.
  Tier 3: the port proxy for the receiving gear (bypassed if routed to the same gear).
  Tier 4: the application itself (which may consist of further tiers - httpd, app framework, DB, ...).

Traffic measurement and the scaling decision take place at the tier 2 gear HAproxy, but obviously this is affected by the other tiers. There are a number of issues to be conscious of.

Issue 1: httpd blocking

In my testing, I found that for long periods (on the order of 10 seconds) the tier 1 httpd would stop sending requests to the tier 2 HAproxy (503 response). It is not yet clear to me why this happened - perhaps something to do with how proxy workers or network sockets are allocated. I believe this is what caused the sort of flapping I saw where it would gear up twice and then down twice: periods of "0 traffic" at tier 2 alternated with more representative throughput. However, this does not seem like the result the customer saw, so it may have just been something peculiar to my deployment.

Issue 2: no keepalive

Keepalive requests to tier 1 were not kept alive; the response comes back with a "Connection: close" header, effectively limiting each connection to one request. I believe this is because requests from the tier 1 proxy to tier 2 are not kept alive (which is probably the right choice), but it seems to me the connection to tier 1 could remain open; it would be worth researching whether there is a config option for this, if only on the grounds of app performance.
The side effect here is that more time is spent creating connections to tier 1 when trying to drive traffic, meaning less time in which an actual request reaches tier 2, and this increases how erratic the traffic measurement there seems compared to the concurrency deployed against tier 1.

Issue 3: haproxy stats

If we run our load testing directly against the tier 2 HAproxy (where keepalive is actually kept alive), the traffic measurement is still fairly erratic. How erratic has much to do with the length of the request: apparently, the only useful number we have for concurrent requests counts only the requests actually in flight; it does not count current TCP *connections*. So when the actual request is extremely fast, as it generally is when someone has just created an example app and tests the index page, the time spent servicing a request from tier 2's perspective is a fraction of the total time the client takes on the request (which includes forming the request, possibly connecting, sending it, receiving and processing the response, possibly disconnecting), and many TCP connections will not be counted as active requests.

To get a more representative result here, it helps to introduce some delay in the app's response - say 50 or 100 ms instead of the 2-5 ms typically seen at the default app index. Measured request concurrency stays closer to client concurrency the longer the actual requests take to service. In the case of a real application experiencing overload and slowdown (not to mention contention with gears from other applications), the measurement would perform reasonably well, but in the most natural load-testing scenarios with short requests, it appears totally erratic. So PoCs would be advised to introduce some delay, or use a real app with representative requests, to test the auto-scaling. In many cases, manual scaling is a better fit.
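A back-of-the-envelope sketch of why short requests make the measured concurrency look low, using Little's law (in-flight requests ~ arrival rate x service time). The specific millisecond figures are illustrative, not taken from the test logs:

```python
# Each of C clients issues one request per round trip, so the proxy only
# "sees" the fraction of that round trip it spends servicing the request.

def measured_concurrency(clients: int, service_ms: float, round_trip_ms: float) -> float:
    """Average number of requests in flight at the proxy (Little's law)."""
    return clients * service_ms / round_trip_ms

# 100 clients, 3 ms service time, ~20 ms total per request:
fast = measured_concurrency(100, 3, 20)     # ~15 in flight - proxy looks idle
# Same 100 clients, but the app sleeps ~100 ms (round trip ~117 ms):
slow = measured_concurrency(100, 100, 117)  # ~85 in flight - near client concurrency
```

This is why adding 50-100 ms of response delay makes the tier 2 session count track the real client concurrency much more closely.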
Work is also proceeding on allowing the cartridge user to plug in their own scaling algorithm, though this is apparently not ready for prime time just yet.

Even with requests on the order of 50 ms, the kind of variation seen in the concurrency measurement at tier 2 could be enough to induce occasional flapping across the gear up/down thresholds. For this reason, we've created a story to use a moving average of multiple measurements instead of the current "spot check": https://trello.com/c/yAdClbTm/101-3-use-moving-averages-for-scale-up-scale-down-calculations-in-ha-proxy (should be public to all Trello users)

Issue 4: blocking sockets

There were two suggested directions of inquiry around sockets. One was to see if there were a lot of sockets in TIME_WAIT - and there were, whether sending traffic to tier 1 or tier 2. The other was an issue reported in OpenShift Online where the nofile ulimit actually blocks the tier 2 HAproxy from creating enough connections to tier 3, resulting in poor utilization of child gears (Bug 971610). It was not clear to me that either had any bearing at the level of concurrency I was testing with (50-100), but I mention them for completeness.

There are other tuning factors that could be examined, like the server limits on the tier 1 and tier 4 httpd instances, TCP settings (http://serverfault.com/questions/212093/how-to-reduce-number-of-sockets-in-time-wait), etc.
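For illustration, here is a minimal sketch of the moving-average idea from that Trello story (this is not the implementation; the class name, window size, and sample series are all made up to show the smoothing effect):

```python
from collections import deque

class SessionSmoother:
    """Hypothetical smoother: average the last N HAproxy session spot checks."""

    def __init__(self, window: int = 5):
        self.samples = deque(maxlen=window)

    def update(self, sessions: int) -> float:
        """Record a spot check and return the current moving average."""
        self.samples.append(sessions)
        return sum(self.samples) / len(self.samples)

# A noisy series whose individual spot checks cross a gear-up threshold
# of 27, but whose moving average never does:
smoother = SessionSmoother(window=5)
readings = [20, 31, 18, 29, 22, 30, 19]
smoothed = [smoother.update(r) for r in readings]
```

Individual readings hit 30-31 (which would trigger a spurious gear-up on a spot check), while the moving average stays in the low-to-mid 20s, which is the dampening the story is after.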
This PR should help a lot: https://github.com/openshift/origin-server/pull/4438
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days