Description of problem:

The router service suffers short but frequent outages.

Command used for monitoring:

    watch -d -n 3 "echo show info | docker exec -i ebcef8c42d0b socat stdio /var/lib/haproxy/run/haproxy.sock | egrep '(Conn|Uptime|Pid)'; ./hanet.sh"

Contents of hanet.sh (counts haproxy-owned TCP sockets per connection state):

    #!/bin/bash
    count=0
    prevstat=
    netstat -tnp | grep haproxy | awk '{ print $6,$7,$5 }' | sort | {
        while read -r status pid remote; do
            if [[ "$prevstat" != "$status" ]]; then
                [[ "$prevstat" != "" ]] && echo "$prevstat $count"
                count=0
                prevstat="$status"
            fi
            (( ++count ))
        done
        # also print the count for the final state
        [[ -n "$prevstat" ]] && echo "$prevstat $count"
    }

Output after CurrConns reaches 2000:

    Every 3.0s: echo show info | docker exec -i ebcef8c42d0b socat stdio /var/lib/haproxy/run/haproxy....  Fri May 13 17:33:36 2016

    Pid: 93502
    Uptime: 0d 0h07m01s
    Uptime_sec: 421
    CurrConns: 2000
    CumConns: 9264
    MaxSslConns: 0
    CurrSslConns: 1
    CumSslConns: 4273
    ConnRate: 22
    ConnRateLimit: 0
    MaxConnRate: 208
    CLOSE_WAIT 9479
    ESTABLISHED 557
    FIN_WAIT1 124
    FIN_WAIT2 9360

Once 2000 connections are reached (after about 6 minutes), the HAProxy router gets stuck: requests to the API from the browser remain in the loading state. After some time a new haproxy process is forked and the cycle starts over. We also saw that while approaching 2000 connections, existing connections were not being released; they dropped only occasionally, which did not help.

We changed the following kernel settings, which did not help (see the sysctl sketch under Additional info below):

    net.ipv4.tcp_tw_recycle = 1
    net.ipv4.tcp_tw_reuse = 1
    net.core.somaxconn = 2048    (raised from the default of 128)

The question now is whether increasing maxconn in HAProxy would help. Doing that requires changing the router's haproxy template and rebuilding the image (see the template sketch under Additional info below).

maxconn 2000 is usually enough concurrent connections. Setting it to 5000 can be too high; the number cannot be raised arbitrarily, e.g. 5k connections on a machine with 4 GB of RAM is probably going to get it stuck. Something else is going on here: with roughly 160 ms per transaction, each connection can serve about 6 requests per second, so 2000 connections give a theoretical capacity of around 12,000 requests per second, far more than the observed connection rate of about 20/s would indicate. And yet we are somehow exhausting 2000 connections with 160 ms transactions.

We also did the following: ssh to the node, get the PID of the router container, run nsenter -t <pid> -n, and then run netstat -tunlp inside the router's network namespace. We also checked the netstat_aptn output, which is attached to the ticket.

Version-Release number of selected component (if applicable):
OpenShift version: v3.0.2.0-39-ga2d18e6
Router image: registry.access.redhat.com/openshift3/ose-haproxy-router:v3.1.1.6
The router runs on a dedicated node in a fourteen-node cluster where production applications are deployed.

How reproducible:
Always

Steps to Reproduce:
1. See the description above.

Actual results:
HAProxy gets stuck after reaching its maximum of 2000 connections.

Expected results:
The router keeps serving requests; finished connections are closed so the 2000-connection limit is not permanently exhausted.

Additional info:
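As a cross-check of hanet.sh, the per-state counts shown above can also be gathered with standard tooling from inside the router's network namespace. This is only a sketch: the docker inspect step for looking up the container PID is an assumption, and ebcef8c42d0b is the container ID from the description.

    # container ID taken from the description; PID lookup via docker inspect is an assumption
    PID=$(docker inspect --format '{{.State.Pid}}' ebcef8c42d0b)
    # count TCP sockets per state inside the router's network namespace
    nsenter -t "$PID" -n netstat -tan | awk 'NR>2 {print $6}' | sort | uniq -c | sort -rn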
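For completeness, a minimal sketch of how the kernel settings listed above could be applied on the router node; the exact way they were set is not in the ticket, so treat the commands as an assumption (the values themselves are the ones from the description).

    sysctl -w net.ipv4.tcp_tw_recycle=1
    sysctl -w net.ipv4.tcp_tw_reuse=1
    sysctl -w net.core.somaxconn=2048
    # confirm the running values
    sysctl net.ipv4.tcp_tw_recycle net.ipv4.tcp_tw_reuse net.core.somaxconn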
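On the maxconn question: raising the limit would mean editing the haproxy config template baked into the router image and rebuilding it. A minimal sketch of the relevant directives follows; the surrounding section contents are assumptions about the template, only the maxconn directive itself is standard HAProxy syntax.

    global
        maxconn 5000    # process-wide limit; raised from 2000 (note the RAM caveat above)

    defaults
        maxconn 5000    # per-proxy default, still bounded by the global limit

As noted above, this would only raise the ceiling: with 160 ms transactions and a connection rate of about 20/s the 2000 limit should never be reached, and the large CLOSE_WAIT and FIN_WAIT2 counts point to connections not being fully closed rather than to maxconn being too low.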
*** This bug has been marked as a duplicate of bug 1320233 ***