Description of problem:
While "monitoring" a scalable app with ab, manually scaling it up results in a lot of 503 errors being returned, both during and after the scaling.

How reproducible:
Not 100%, but quite frequent.

Steps to Reproduce:
1. rhc app create phps php-5.4 -s
2. (in another window) tail -f /var/log/httpd/openshift_log | grep ' 503 '
3. (in another window) ab -c 2 -n 100000 http://phps-lmeyer.dev.rhcloud.com/
4. rhc cartridge scale php -a phps --min 4

Actual results:
Sometimes it goes well. Often there are vast quantities of 503 errors in the logs, and "ab" reports errors (without being specific as to the response code). I have even seen this continue long after the scaling is complete.

Expected results:
No 503 errors.

Additional info:
I have enabled haproxy logging and it never shows an error. I have also pointed ab at the private gear IP and run into connection errors. So it seems that haproxy goes down for a while due to scaling, or just doesn't restart very gracefully.
Example logs: haproxy syslog is enabled and directed to /var/log/haproxy.log, and I observed that between 09:32:51.729 and 09:33:02.937 the frontend returned 503s while haproxy logged nothing. This was more than a minute after the app scaled up due to setting the min to 4.

/var/log/httpd/error_log:
[Tue Jun 17 09:33:02 2014] [error] (99)Cannot assign requested address: proxy: HTTP: attempt to connect to 127.4.83.130:8080 (*) failed

/var/log/httpd/openshift_log:
172.16.4.136 phpsmodrw-demo.ose211.example.com - - [17/Jun/2014:09:33:02 -0400] "GET / HTTP/1.0" 503 423 "-" "cmdlnab" (7917us) - 127.4.83.130:8080/

/var/log/haproxy.log, next entry after the outage (nothing indicating a haproxy reload/restart or anything similar):
Jun 17 09:33:02 localhost haproxy[31344]: 127.4.83.130:28414 [17/Jun/2014:09:33:02.937] express express/gear-53a041f255fae4e4230000e7-demo 0/0/0/1/1 200 39801 - - --NI 1/1/0/1/0 0/0 "GET / HTTP/1.1"

This suggests to me that haproxy is still listening all along, but the frontend is unable to connect to it, perhaps because of existing connections being held open for keepalive, or some other network condition. It doesn't seem related to recent scaling events.
Here are the logs for the same app, unchanged at 4 gears, many minutes later:

/var/log/haproxy.log:
Jun 17 10:13:52 localhost haproxy[31344]: 127.4.83.130:28547 [17/Jun/2014:10:13:52.618] express express/gear-53a041f255fae4e4230000e6-demo 0/0/0/0/1 200 39801 - - --NI 2/2/0/1/0 0/0 "GET / HTTP/1.1"
Jun 17 10:14:00 localhost haproxy[31344]: 127.4.83.130:30375 [17/Jun/2014:10:14:00.177] express express/gear-53a041f255fae4e4230000e7-demo 0/0/0/1/1 200 39801 - - --NI 1/1/0/1/0 0/0 "GET / HTTP/1.1"

/var/log/httpd/openshift_log:
172.16.4.136 phpsmodrw-demo.ose211.example.com - - [17/Jun/2014:10:13:52 -0400] "GET / HTTP/1.0" 200 39627 "-" "cmdlnab" (4842us) - 127.4.83.130:8080/
172.16.4.136 phpsmodrw-demo.ose211.example.com - - [17/Jun/2014:10:13:52 -0400] "GET / HTTP/1.0" 503 423 "-" "cmdlnab" (8436us) - 127.4.83.130:8080/
[...]
172.16.4.136 phpsmodrw-demo.ose211.example.com - - [17/Jun/2014:10:14:00 -0400] "GET / HTTP/1.0" 503 423 "-" "cmdlnab" (17244us) - 127.4.83.130:8080/
172.16.4.136 phpsmodrw-demo.ose211.example.com - - [17/Jun/2014:10:14:00 -0400] "GET / HTTP/1.0" 200 39627 "-" "cmdlnab" (3824us) - 127.4.83.130:8080/

/var/log/httpd/error_log:
[Tue Jun 17 10:13:45 2014] [error] (99)Cannot assign requested address: proxy: HTTP: attempt to connect to 127.4.83.130:8080 (*) failed
[Tue Jun 17 10:13:52 2014] [error] (99)Cannot assign requested address: proxy: HTTP: attempt to connect to 127.4.83.130:8080 (*) failed
[...]
[Tue Jun 17 10:14:00 2014] [error] (99)Cannot assign requested address: proxy: HTTP: attempt to connect to 127.4.83.130:8080 (*) failed

At least one user reported this with OSE 2.0 on our user@ list and on Stack Overflow: http://stackoverflow.com/questions/22645040/service-temporarily-unavailable-under-load-in-openshift-enterprise-2-0

Note that I am always running ab with a concurrency of just 2.
This never seems to happen with a single-gear app (scaled or non-scaled). Scaling up or down by multiple gears in succession seems to produce it more often.

Watching the number of TIME_WAIT connections under this load is interesting:

watch 'netstat -pan --tcp | grep -c ".*TIME_WAIT"'

Typically this count is on the order of 10,000; sometimes it climbs closer to 18,000 or 20,000. It may be noteworthy that the ephemeral port range has about 20,500 ports in it.
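An equivalent way to watch this count without netstat is to read /proc/net/tcp directly, where the "st" column holds the socket state and 06 is TIME_WAIT. This is just an illustrative sketch of that parsing, not part of the original report:

```python
# Sketch: count TIME_WAIT sockets by parsing /proc/net/tcp directly,
# equivalent to the netstat | grep pipeline above. In /proc/net/tcp the
# fourth whitespace-separated field ("st") is the socket state in hex;
# 06 is TIME_WAIT.
def count_time_wait(proc_net_tcp_text):
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) > 3 and fields[3] == "06":
            count += 1
    return count

if __name__ == "__main__":
    with open("/proc/net/tcp") as f:
        print(count_time_wait(f.read()))
```

Run in a watch loop, this gives the same trend as the netstat command, without the cost of resolving process info (-p).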
It doesn't seem to be a simple matter of the frontend running out of connections or some other resource, either. If I run the same ab against two apps, one that I'm scaling up and down repeatedly and one that is just static, the scaling one can have lots of errors while the other has none.
(Should mention an assumption: all of the gears I'm discussing are on the same node host.)
Two things:

1. The 503 happens when we see this apache error_log entry:

[error] (99)Cannot assign requested address: proxy: HTTP: attempt to connect to 127.4.83.130:8080 (*) failed

I have been able to reproduce this for the app even when it's not scaling. So, taking a cue from bz1085115, I did this:

sysctl net.ipv4.tcp_tw_reuse=1

No more errors. And yes, that is irrespective of whether the app is scalable or not.

2. After the fix in (1) above, the error still happens when a scalable app scales up/down. The number of errors comes down considerably (6 errors compared to hundreds of refused connections before). This is somewhat of a known limitation of haproxy trying to hand over port listening from the old process to the new one; see this:

http://stackoverflow.com/questions/21595534/hot-reconfiguration-of-haproxy-still-lead-to-failed-request-any-suggestions

So we have all been waiting for SO_REUSEPORT to get implemented, and it is there in kernel 3.9+, but our current dev-envs are RHEL 6 (2.6.32-431.17.1.el6.x86_64). RHEL 7 is based on 3.10, so I guess this issue will stay until we see RHEL 7. Has someone reported this on the latest Fedora?
(In reply to Rajat Chopra from comment #6)
> I have been able to reproduce this for the app even when its not scaling. So
> taking a cue from bz1085115 I did this : sysctl net.ipv4.tcp_tw_reuse=1
>
> No more errors. And yeah that is irrespective of whether the app is scalable
> or not as well.

If the (main) problem is simply the number of TIME_WAIT connections, then there are a few approaches to improving this; setting reuse is just one of them, and it feels a little risky to recommend doing it across the board on our nodes. There must be a reason why this is not the default setting.

I was unable to determine what exactly happens when a TIME_WAIT connection is reused "too soon" and extra packets come down the line after the first FIN/ACK is sent. Is the packet content rejected at the TCP layer, or injected randomly into the new connection's stream? The latter seems bad, the former manageable. http://serverfault.com/questions/234534/is-it-dangerous-to-change-the-value-of-proc-sys-net-ipv4-tcp-tw-reuse didn't get a conclusive response.

It would seem safer if we could scope the reuse to just the host proxies. With no keepalives we can be reasonably confident those connections are safe to reuse after close. I'm not sure we have the same confidence for connections in general (e.g. requests made by the gear haproxy, DB connections, random outgoing requests...). This SF answer indicated a possible way to scope connection reuse to an app: http://serverfault.com/a/449242

I'd also like to ensure we understand the exact scope of what is being exhausted when this problem arises. As I noted, you can send a continuous stream of traffic to two apps on the same node and have one go through many seconds of 503s while the other has no problem. So it can't be the case that connections are exhausted system-wide. Is the limit on connections scoped to the interface (so 127.1.2.3 can have all its connections used while 127.1.2.4 doesn't)?
Is it the limit on public-interface ephemeral ports that we're hitting (in which case it doesn't seem like you'd ever see this on a non-scaled app)? Or on public-interface connections? Or something else?

Another approach would be to lower the timeout on TIME_WAIT connections. The default is two minutes. If we lowered it to, say, 30 seconds, or 10, or 5, then used connections would become available much faster, with still very little risk of more packets showing up later. E.g.:

sysctl net.netfilter.nf_conntrack_tcp_timeout_time_wait=10

Also, why are connections staying in TIME_WAIT for so long? Shouldn't the second half of the duplex close happen pretty quickly on a local connection?

Aside from connection reuse at the system level, GSS has suggested implementing keepalives and connection pooling from the frontend. This is possible under the right circumstances, but those circumstances can be difficult to arrange and may come with another set of problems.

Any other thoughts on how to approach this? I don't think this is such a problem under real-life workloads (where the app is doing something substantial, we aren't just hitting it as fast as it can respond, and the scaled app is spread across nodes), but it certainly looks bad to customers testing scaling with trivial apps on one node.

> 2. After the fix in (1) above, the error still happens when a scalable app
> tries to scale up/down. The number of errors come down considerably (6
> errors compared to hundreds of refused connections before). This is somewhat

Actually, if I do "sysctl net.ipv4.tcp_tw_reuse=1" on the node, set up a loop to scale an app up and down between 1 and 6 gears, and run ab against it with -c 2, I *still* see large swaths of hundreds of 503 errors. It doesn't occur with 6 gears statically there, just with the scaling. If it were all about connection exhaustion, it doesn't seem like that should happen.
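If either knob above were adopted, it would presumably be made persistent rather than set ad hoc with sysctl. A sketch of what that fragment might look like (the values are the ones under discussion in this bug, not a recommendation; see the safety caveats in the surrounding comments):

```
# /etc/sysctl.conf fragment (sketch only)

# Allow TIME_WAIT sockets to be reused for new outgoing connections
net.ipv4.tcp_tw_reuse = 1

# Shorten the conntrack TIME_WAIT timeout from the 120s default
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 10
```

Note that the second setting only affects the netfilter conntrack table entry, not the kernel's own TCP TIME_WAIT state, which is one possible reason the conntrack tunables had no observable effect in later experiments.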
> of a known limitation with haproxy trying to handover port listening from
> old process to the new one, see this ->
>
> http://stackoverflow.com/questions/21595534/hot-reconfiguration-of-haproxy-
> still-lead-to-failed-request-any-suggestions
>
> So we have all been waiting for SO_REUSEPORT to get implemented and it is
> there in kernel 3.9+, but our current dev-envs are RHEL 6
> (2.6.32-431.17.1.el6.x86_64).

Actually, it looks like SO_REUSEPORT is available as of RHEL 6.5 (http://stackoverflow.com/a/24234083/262768). Does haproxy need to be rebuilt to take advantage of it? What about httpd? (I *think* httpd never drops the listening socket from the parent process during a graceful restart, so it's probably not necessary there.) This is of much less concern than having seconds of continuous errors, though.
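For context on what SO_REUSEPORT buys us: it lets a second socket bind an address/port that another live socket already holds, which is exactly what a reloading haproxy needs to take over listening without a gap. A minimal sketch (assuming a Linux kernel with SO_REUSEPORT support, i.e. 3.9+ or a backport):

```python
import socket

# Sketch: with SO_REUSEPORT, a second socket can bind the same
# address/port while the first still holds it. Without the option,
# the second bind() would fail with EADDRINUSE.
def bind_pair():
    s1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s1.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s1.bind(("127.0.0.1", 0))      # let the kernel pick a free port
    port = s1.getsockname()[1]

    s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s2.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s2.bind(("127.0.0.1", port))   # succeeds thanks to SO_REUSEPORT
    return s1, s2, port

if __name__ == "__main__":
    s1, s2, port = bind_pair()
    print("both sockets bound to port", port)
    s1.close()
    s2.close()
```

Both sockets must set the option before bind(), which is why existing binaries may need to be rebuilt (or at least patched) to use it.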
The haproxy error is not related to connection exhaustion. I don't think we need a recompiled version of haproxy, and anyway it seems there would still be a small window in which connections end up getting dropped.

Back to net.ipv4.tcp_tw_reuse=1: this was a helpful post: http://vincent.bernat.im/en/blog/2014-tcp-time-wait-state-linux.html

I experimented with several flags and values:

net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 20
net.netfilter.nf_conntrack_tcp_timeout_syn_recv = 10
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 20
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 10
net.netfilter.nf_conntrack_tcp_timeout_last_ack = 5
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 20
net.netfilter.nf_conntrack_tcp_timeout_close = 10

No luck. One more data point: we seem to suffer these errors only when 'ab' itself is run locally (which means it may itself be consuming a big share of the exhausted ephemeral ports).

Finally, here is why I think we can still try tw_reuse=1 (as opposed to tw_recycle=1): we seem to get the exhaustion only when our proxy httpd connects to the gears, and I doubt we will ever run into late packets on this channel. As you say, it would be best if only this particular process adhered to the reuse flag and not the rest, but setting it system-wide does seem to be harmless. The only risk is that outgoing connections from gears may be affected. From what I understand, that may still be OK because timestamping will solve the issue of late packets. I need to be sure it's not a case of the NAT tuple problem mentioned in the post above. Theoretically it seems all right, but I will get back with more data.
To summarize our discussions since the last entry:

I think we can reasonably conclude that enabling tw_reuse will go a long way toward addressing the issue of ephemeral port exhaustion. It appears to be totally safe in this context and we should simply always do so on OpenShift node hosts. Other netfilter-based options don't seem to have any effect (if I'm understanding your results and what the article says).

The other understood source of 503s is when haproxy gives up its binding briefly and reclaims it after reloading its configuration. This only results in a few milliseconds of outage at most per scaling event, and while we'd like even that to go away, it's not worth a great deal of effort, or at least it's less of a concern than the much larger outages seen.

This still leaves a non-understood source of 503s related specifically to scaling events. By scaling an app up and down repeatedly and sending it a continuous stream of traffic (under ideal conditions: tw_reuse enabled, ab running from outside the node), we still occasionally (5-10% of scaling events?) see many seconds of 503 responses. Meanwhile, continuous requests to another scaled app (not being actively scaled) on the same node host all succeed, indicating that ephemeral port exhaustion isn't the issue and that this is specifically related to scaling. Of course, this is exactly the situation customers are likely to encounter when running benchmark tests against a trivial scaled app.
I think I may have a handle on what's happening, and as it turns out, I think it's not something to worry about.

First, to be clear about the setup:

1. Broker is set up separately from a single node
2. Node is using the apache-vhost frontend
3. tw_reuse has been set to 1
4. "ab" is running on the broker against the index of a scaled PHP app on the node (no code, so it returns the default content with no delay)
5. App's haproxy is set up to log to node syslog via TCP, syslog configured to accept
6. Running rhc scale-cartridge --min --max in a loop to scale between 1 and 4 PHP gears

The problem was that I was *still* seeing long stretches of 503 errors in the frontend logs; they were just sporadic. It's actually illuminating to try this specific scenario *without* any load on the app. Here's what you see repeated every few minutes in the frontend log:

127.0.0.1 53bda3b055fae4910f00005e-demo.ose211.example.com - - [09/Jul/2014:16:19:57 -0400] "HEAD / HTTP/1.1" 503 - "-" "Ruby" (818us) -
127.0.0.1 53bda3b055fae4910f00005e-demo.ose211.example.com - - [09/Jul/2014:16:19:57 -0400] "HEAD / HTTP/1.1" 503 - "-" "Ruby" (830us) -
127.0.0.1 53bda3b055fae4910f00005e-demo.ose211.example.com - - [09/Jul/2014:16:19:57 -0400] "HEAD / HTTP/1.1" 503 - "-" "Ruby" (818us) -
127.0.0.1 53bda3b055fae4910f00005f-demo.ose211.example.com - - [09/Jul/2014:16:20:03 -0400] "HEAD / HTTP/1.1" 503 - "-" "Ruby" (1027us) -
127.0.0.1 53bda3b055fae4910f00005f-demo.ose211.example.com - - [09/Jul/2014:16:20:03 -0400] "HEAD / HTTP/1.1" 503 - "-" "Ruby" (799us) -
127.0.0.1 53bda3b055fae4910f00005f-demo.ose211.example.com - - [09/Jul/2014:16:20:03 -0400] "HEAD / HTTP/1.1" 503 - "-" "Ruby" (392us) -
127.0.0.1 53bda3b055fae4910f000060-demo.ose211.example.com - - [09/Jul/2014:16:20:06 -0400] "HEAD / HTTP/1.1" 503 - "-" "Ruby" (1655us) -
127.0.0.1 53bda3b055fae4910f000060-demo.ose211.example.com - - [09/Jul/2014:16:20:06 -0400] "HEAD / HTTP/1.1" 503 - "-" "Ruby" (359us) -
127.0.0.1 53bda3b055fae4910f000060-demo.ose211.example.com - - [09/Jul/2014:16:20:06 -0400] "HEAD / HTTP/1.1" 503 - "-" "Ruby" (330us) -

Three failed requests each to the three secondary gears of the app; "Ruby" is the listed agent. This is haproxy health-checking the gears (using their frontend, not the port-proxy) and getting three failures each, which of course results in haproxy taking them out of rotation. Here are the corresponding haproxy logs (not sure why the timestamps don't quite line up; "Layer4 connection problem" may indicate it is trying a port-proxy connection):

Jul 9 16:20:43 localhost haproxy[9347]: Server express/gear-53bda3b055fae4910f00005e-demo is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jul 9 16:20:47 localhost haproxy[9347]: Server express/gear-53bda3b055fae4910f00005f-demo is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jul 9 16:20:52 localhost haproxy[9347]: Server express/gear-53bda3b055fae4910f000060-demo is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jul 9 16:20:52 localhost haproxy[9347]: proxy express has no server available!
Jul 9 16:21:09 localhost haproxy[9347]: 127.7.115.2:18391 [09/Jul/2014:16:21:09.731] express express/<NOSRV> 0/-1/-1/-1/0 503 212 - - SCNN 0/0/0/0/0 0/0 "GET / HTTP/1.1"
Jul 9 16:21:09 localhost haproxy[10435]: Proxy stats started.
Jul 9 16:21:09 localhost haproxy[10435]: Proxy express started.
Jul 9 16:21:09 localhost haproxy[9347]: Stopping proxy stats in 0 ms.
Jul 9 16:21:09 localhost haproxy[9347]: Stopping proxy express in 0 ms.
Jul 9 16:21:09 localhost haproxy[9347]: Proxy stats stopped (FE: 1 conns, BE: 0 conns).
Jul 9 16:21:09 localhost haproxy[9347]: Proxy express stopped (FE: 1 conns, BE: 1 conns).
Jul 9 16:21:09 localhost haproxy[10435]: 127.7.115.3:34627 [09/Jul/2014:16:21:09.919] stats stats/<STATS> 0/-1/-1/-1/0 200 978 - - PR-- 0/0/0/0/0 0/0 "GET /haproxy-status/;csv HTTP/1.1"

This actually tells the whole story, somewhat obscurely. What seems to happen is that when scaling from 4 gears to 1, all three moribund gears are stopped before haproxy is informed of the updated configuration. With four gears active, the local gear backend has been disabled in the haproxy conf; with the other three gears disappearing at once, haproxy has nowhere to route requests and returns 503s until its configuration is rewritten and it starts up with the new configuration (removing the three defunct gears and allowing it to route locally). The handoff between configs happens above when haproxy[9347] hands off to haproxy[10435].

If, instead of scaling between 1 and 4, we scale between 2 and 5, then no interruption is seen at all. As far as I can tell, the haproxy handoff itself leaves no room for dropped connections.

Worthy of note: the same problem would occur in an HA app scaling directly from 5 gears to 2, since the first two gears are rotated out along with the 5th gear.

In any case, this would be an unlikely scenario to encounter in real life. It will essentially only happen when manually scaling an app from 3+ gears directly down to --max 1 (or, in an HA app, from 5 to 2). If we would like to fix it, it may make sense to have scaling down update the haproxy configuration prior to stopping the gears. That would be nice in order to avoid having to explain this to some customer who runs the same test I stumbled on. The problem would not occur with auto-scaling, because that scales down one gear at a time, and it doesn't happen at all when scaling up, which is the primary concern.
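The ordering issue described above can be sketched abstractly. This is a toy model, not OpenShift code (all names are made up for illustration); it shows why stopping the extra gears before rewriting haproxy's config leaves a window with zero routable backends, while reversing the two steps does not:

```python
# Toy model of routable backends during a 4 -> 1 scale-down.
# With 4 gears, the head gear's local backend is disabled in haproxy's
# config, so only the 3 remote gears are routable.

def routable(remote_gears_up, local_backend_enabled):
    """Number of backends haproxy can route to at a given moment."""
    return remote_gears_up + (1 if local_backend_enabled else 0)

def outage_window(steps, state):
    """Apply each state change in order; report whether any
    intermediate state left haproxy with zero routable backends."""
    outage = False
    for change in steps:
        state.update(change)
        if routable(**state) == 0:
            outage = True
    return outage

start = dict(remote_gears_up=3, local_backend_enabled=False)

# Current ordering: the 3 moribund gears stop first, config reload second.
current = outage_window(
    [dict(remote_gears_up=0),            # stop the 3 extra gears
     dict(local_backend_enabled=True)],  # then reload haproxy config
    dict(start))

# Proposed ordering: reload haproxy config first, then stop the gears.
proposed = outage_window(
    [dict(local_backend_enabled=True),   # reload haproxy config first
     dict(remote_gears_up=0)],           # then stop the gears
    dict(start))

print(current, proposed)  # prints: True False
```

The model also makes it clear why one-gear-at-a-time auto-scaling never hits the window: there is always at least one routable backend at every intermediate step.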
*** Bug 1071250 has been marked as a duplicate of this bug. ***