Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1336009

Summary: Router service suffers short but frequent outages
Product: OpenShift Container Platform
Component: Networking
Sub component: router
Reporter: Miheer Salunke <misalunk>
Assignee: Ben Bennett <bbennett>
QA Contact: zhaozhanqi <zzhao>
CC: aos-bugs, misalunk
Status: CLOSED DUPLICATE
Severity: medium
Priority: medium
Version: 3.1.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-06-14 14:28:49 UTC

Description Miheer Salunke 2016-05-13 19:47:46 UTC
Description of problem:
Router service suffers short but frequent outages

So we tried the following.

Command used for monitoring:
watch -d -n 3 "echo show info | docker exec -i ebcef8c42d0b socat stdio /var/lib/haproxy/run/haproxy.sock | egrep '(Conn|Uptime|Pid)'; ./hanet.sh"

Contents of hanet.sh:
#!/bin/bash
# Tally haproxy sockets by TCP state; prints "<STATE> <count>" for each state.

count=0
prevstat=

netstat -tnp | grep haproxy | awk '{ print $6, $7, $5 }' | sort | {
  while read -r status pid remote; do
    if [[ "$prevstat" != "$status" ]]; then
      [[ "$prevstat" != "" ]] && echo "$prevstat $count"
      count=0
      prevstat="$status"
    fi
    (( ++count ))
  done
  # Flush the final group, which the loop above never prints on its own.
  [[ "$prevstat" != "" ]] && echo "$prevstat $count"
}
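
For a quick cross-check, a rough one-liner that should produce the same per-state tally (output formatted as "count state" instead):

# Tally router sockets by TCP state, busiest states first.
netstat -tnp | awk '/haproxy/ { print $6 }' | sort | uniq -c | sort -rn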


After CurrConns reaches 2000:
Every 3.0s: echo show info | docker exec -i ebcef8c42d0b socat stdio /var/lib/haproxy/run/haproxy....  Fri May 13 17:33:36 2016

Pid: 93502
Uptime: 0d 0h07m01s
Uptime_sec: 421
CurrConns: 2000
CumConns: 9264
MaxSslConns: 0
CurrSslConns: 1
CumSslConns: 4273
ConnRate: 22
ConnRateLimit: 0
MaxConnRate: 208
CLOSE_WAIT 9479
ESTABLISHED 557
FIN_WAIT1 124
FIN_WAIT2 9360



After the 2000-connection limit is reached (at roughly the 6-minute mark), the HAProxy router effectively hangs: API requests from the browser just sit in the loading phase.

After some time a new haproxy process is forked and things start over again. We also noticed that as the count approached 2000, connections were not being released; they did drop occasionally, but not often enough to help.

We changed the following kernel settings:
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
net.core.somaxconn = 2048    (up from the default of 128)

This did not help.
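
For reference, a sketch of how those settings can be applied at runtime on the node (run as root; persisting them under /etc/sysctl.d is a separate step):

# Apply the kernel settings listed above.
sysctl -w net.ipv4.tcp_tw_recycle=1
sysctl -w net.ipv4.tcp_tw_reuse=1
sysctl -w net.core.somaxconn=2048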

But the question now is: even if we increase maxconn in HAProxy, will it help? Also, to do that you would have to change the router's haproxy template and rebuild the image.
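
As a sanity check, the effective limit can be read from the same admin socket used in the watch command above; a sketch, reusing the container id from that command:

# "show info" reports the process-wide Maxconn alongside CurrConns.
echo "show info" | docker exec -i ebcef8c42d0b socat stdio /var/lib/haproxy/run/haproxy.sock | grep -E 'Maxconn|CurrConns'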

A maxconn of 2000 concurrent connections is usually enough. Setting it to 5000 could be too high; you cannot raise the number arbitrarily. For example, 5k connections on a machine with 4 GB of RAM will probably get it stuck.

There's something else going on: at 160 ms per request, each connection can serve about 6 requests per second, so 2000 sockets provide far more parallel capacity than the ~20/sec we are seeing would indicate.

Yet we are exhausting 2000 connections with 160 ms transactions ... :)
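
A rough back-of-the-envelope check of that argument, using only numbers from this report:

# Theoretical ceiling: 2000 connections, each reusable every ~160 ms.
awk 'BEGIN { printf "%.0f requests/sec\n", 2000 / 0.160 }'
# -> 12500 requests/sec, versus the ~20/sec actually observed.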

So we also did the following: ssh to the node, get the PID of the router container, run nsenter -t <pid> -n, and then netstat -tunlp.
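
A sketch of that sequence on the node (getting the PID via docker inspect is an assumption; the container id placeholder is illustrative):

# Resolve the router container's PID, then run netstat inside its network namespace.
ROUTER_PID=$(docker inspect --format '{{ .State.Pid }}' <router-container-id>)
nsenter -t "$ROUTER_PID" -n netstat -tunlp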

We also checked the netstat_aptn output, which is attached to the ticket.

Version-Release number of selected component (if applicable):
OpenShift version: v3.0.2.0-39-ga2d18e6
Router image: registry.access.redhat.com/openshift3/ose-haproxy-router:v3.1.1.6
The router service runs on a dedicated node in a fourteen-node cluster where production applications are deployed.


How reproducible:
Always

Steps to Reproduce:
1. As described in the problem description above.

Actual results:
HAProxy gets stuck after reaching 2000 connections, i.e. the configured maxconn limit.

Expected results:
HAProxy keeps serving requests; connections are released in a timely manner so the maxconn limit is never exhausted.

Additional info:

Comment 8 Ben Bennett 2016-06-14 14:28:49 UTC

*** This bug has been marked as a duplicate of bug 1320233 ***