Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1336009

Summary: Router service suffers short but frequent outages
Product: OpenShift Container Platform
Component: Networking
Sub component: router
Reporter: Miheer Salunke <misalunk>
Assignee: Ben Bennett <bbennett>
QA Contact: zhaozhanqi <zzhao>
CC: aos-bugs, misalunk
Status: CLOSED DUPLICATE
Severity: medium
Priority: medium
Version: 3.1.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-06-14 14:28:49 UTC

Description Miheer Salunke 2016-05-13 19:47:46 UTC
Description of problem:
Router service suffers short but frequent outages

So we tried the following.

Command used for monitoring:
watch -d -n 3 "echo show info | docker exec -i ebcef8c42d0b socat stdio /var/lib/haproxy/run/haproxy.sock | egrep '(Conn|Uptime|Pid)'; ./hanet.sh"

Contents of hanet.sh:
#!/bin/bash
# Tally haproxy sockets by TCP state; prints "<STATE> <count>" for each state.

count=0
prevstat=

netstat -tnp | grep haproxy | awk '{ print $6, $7, $5 }' | sort | {
  while read -r status pid remote; do
    if [[ "$prevstat" != "$status" ]]; then
      [[ "$prevstat" != "" ]] && echo "$prevstat $count"
      count=0
      prevstat="$status"
    fi
    (( ++count ))
  done
  # Flush the final group, which the loop above never prints on its own.
  [[ "$prevstat" != "" ]] && echo "$prevstat $count"
}
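
For a quick cross-check, a rough one-liner that should produce the same per-state tally (output formatted as "count state" instead):

# Tally router sockets by TCP state, busiest states first.
netstat -tnp | awk '/haproxy/ { print $6 }' | sort | uniq -c | sort -rn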


After CurrConns reaches 2000:
Every 3.0s: echo show info | docker exec -i ebcef8c42d0b socat stdio /var/lib/haproxy/run/haproxy....  Fri May 13 17:33:36 2016

Pid: 93502
Uptime: 0d 0h07m01s
Uptime_sec: 421
CurrConns: 2000
CumConns: 9264
MaxSslConns: 0
CurrSslConns: 1
CumSslConns: 4273
ConnRate: 22
ConnRateLimit: 0
MaxConnRate: 208
CLOSE_WAIT 9479
ESTABLISHED 557
FIN_WAIT1 124
FIN_WAIT2 9360



After the 2000-connection limit is reached (at roughly the 6-minute mark), the HAProxy router effectively hangs: API requests from the browser just sit in the loading phase.

After some time a new haproxy process is forked and things start over again. We also noticed that as the count approached 2000, connections were not being released; they did drop occasionally, but not often enough to help.

We changed the following kernel settings:
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
net.core.somaxconn = 2048    (up from the default of 128)

This did not help.
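
For reference, a sketch of how those settings can be applied at runtime on the node (run as root; persisting them under /etc/sysctl.d is a separate step):

# Apply the kernel settings listed above.
sysctl -w net.ipv4.tcp_tw_recycle=1
sysctl -w net.ipv4.tcp_tw_reuse=1
sysctl -w net.core.somaxconn=2048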

But the question now is: even if we increase maxconn in HAProxy, will it help? Also, to do that you would have to change the router's haproxy template and rebuild the image.
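
As a sanity check, the effective limit can be read from the same admin socket used in the watch command above; a sketch, reusing the container id from that command:

# "show info" reports the process-wide Maxconn alongside CurrConns.
echo "show info" | docker exec -i ebcef8c42d0b socat stdio /var/lib/haproxy/run/haproxy.sock | grep -E 'Maxconn|CurrConns'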

A maxconn of 2000 concurrent connections is usually enough. Setting it to 5000 could be too high; you cannot raise the number arbitrarily. For example, 5k connections on a machine with 4 GB of RAM will probably get it stuck.

There's something else going on: at 160 ms per request, each connection can serve about 6 requests per second, so 2000 sockets provide far more parallel capacity than the ~20/sec we are seeing would indicate.

Yet we are exhausting 2000 connections with 160 ms transactions ... :)
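
A rough back-of-the-envelope check of that argument, using only numbers from this report:

# Theoretical ceiling: 2000 connections, each reusable every ~160 ms.
awk 'BEGIN { printf "%.0f requests/sec\n", 2000 / 0.160 }'
# -> 12500 requests/sec, versus the ~20/sec actually observed.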

So we also did the following: ssh to the node, get the PID of the router container, run nsenter -t <pid> -n, and then netstat -tunlp.
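
A sketch of that sequence on the node (getting the PID via docker inspect is an assumption; the container id placeholder is illustrative):

# Resolve the router container's PID, then run netstat inside its network namespace.
ROUTER_PID=$(docker inspect --format '{{ .State.Pid }}' <router-container-id>)
nsenter -t "$ROUTER_PID" -n netstat -tunlp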

We also checked the netstat_aptn output, which is attached to the ticket.

Version-Release number of selected component (if applicable):
OpenShift version: v3.0.2.0-39-ga2d18e6
Router image: registry.access.redhat.com/openshift3/ose-haproxy-router:v3.1.1.6
The router service runs on a dedicated node in a fourteen-node cluster where production applications are deployed.


How reproducible:
Always

Steps to Reproduce:
1. As described in the problem description above.

Actual results:
HAProxy gets stuck after reaching 2000 connections, i.e. the configured maxconn limit.

Expected results:
HAProxy keeps serving requests; connections are released in a timely manner so the maxconn limit is never exhausted.

Additional info:

Comment 8 Ben Bennett 2016-06-14 14:28:49 UTC

*** This bug has been marked as a duplicate of bug 1320233 ***