Description of problem:
Too many processes in the router-default pod causing high load average.

Version-Release number of selected component (if applicable):
OCP 4.5.14. It is hosted on vSphere. It is NOT a disconnected environment.

How reproducible:
NA

Steps to Reproduce:
1.
2.
3.

Actual results:
[QLF2542@tgm1r628 ~]$ oc rsh router-default-6dc44fdbb5-jw98m
sh-4.2$ ps aux | grep haproxy | wc -l
86
sh-4.2$ ps auxH | grep haproxy | wc -l
345
sh-4.2$ top
top - 16:38:44 up 9 days, 1:56, 0 users, load average: 31.06, 34.16, 30.21
Tasks:  90 total,   5 running,  68 sleeping,   0 stopped,   0 zombie
%Cpu(s): 67.9 us, 19.9 sy,  0.0 ni,  2.6 id,  0.0 wa,  3.8 hi,  5.8 si,  0.0 st
KiB Mem : 32934204 total,  3806568 free, 24923072 used,  4204564 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  8895904 avail Mem

Expected results:
The count of haproxy processes should be lower.

Additional info:
Attaching sosreport of the node and output of "ps -aux" and "ps -auxH" from the default router pod.
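If anyone else needs to check for the same symptom, a quick way to see whether old haproxy processes are accumulating after reloads is to list them with their elapsed time. This is a minimal sketch, not part of the original report; the pod name is the one from the output above and will differ per cluster:

----
# Lists each haproxy process with how long it has been running, then the
# total count. Many long-lived PIDs suggests reloads are leaving old
# processes behind while they drain connections.
oc exec router-default-6dc44fdbb5-jw98m -- ps -C haproxy -o pid,etime,args --sort=etime
oc exec router-default-6dc44fdbb5-jw98m -- sh -c 'ps -C haproxy --no-headers | wc -l'
----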
Hi Andrew, the customer has raised the severity to sev2 and needs urgent resolution. The requested data has been attached. Kindly do the needful.
Created attachment 1738229 [details]
Script to gather socket and haproxy connection info

Copy this script to /tmp and make it executable:

  chmod u+x /tmp/haproxy-established-connections.sh

Then run it to collect socket information related to haproxy in the container. For example, to run 20 times, sleeping 3s between invocations:

  N=20 INTERVAL=3 /tmp/haproxy-established-connections.sh

The output will be written into a new directory, e.g.:

  /tmp/haproxy-info-20201210-151811
  /var/lib/haproxy/conf

Tar up the directory and attach the results to this bugzilla.
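The attached script itself is not pasted here; purely for context, a capture loop along these lines would produce the N/INTERVAL behaviour and the ss-443 files referenced in the next comment. This is a hypothetical sketch, not the attachment:

----
#!/bin/bash
# Hypothetical sketch only -- the real haproxy-established-connections.sh
# attached above may differ. Captures ESTABLISHED/other TCP sockets on :443
# and the haproxy process list N times, INTERVAL seconds apart.
N="${N:-20}"
INTERVAL="${INTERVAL:-3}"
OUTDIR="/tmp/haproxy-info-$(date +%Y%m%d-%H%M%S)"

for i in $(seq 1 "$N"); do
    TS=$(date +%H%M%S)
    DIR="$OUTDIR/$TS/$i"
    mkdir -p "$DIR"
    ss -tan '( sport = :443 )' > "$DIR/ss-443"      # TCP sockets on the HTTPS frontend
    ps auxH | grep '[h]aproxy' > "$DIR/ps-haproxy"  # haproxy processes and threads
    sleep "$INTERVAL"
done
echo "output written to $OUTDIR"
----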
Created attachment 1739334 [details]
what-changed.pl script

Shows the diff between two captures:

$ perl ~/what-disconnected.pl ./router-default-6dc44fdbb5-brjh5/haproxy-info-20201211-120808/120812/2/ss-443 ./router-default-6dc44fdbb5-brjh5/haproxy-info-20201211-120808/120816/3/ss-443
ESTABLISHED connection GONE: 10.70.242.170:58020
ESTABLISHED connection GONE: 10.70.242.170:58022
ESTABLISHED connection GONE: 10.70.242.170:58023
ESTABLISHED connection GONE: 10.225.33.50:16962
ESTABLISHED connection GONE: 10.226.2.20:49006
ESTABLISHED connection GONE: 10.226.2.20:49018
ESTABLISHED connection GONE: 10.226.2.20:49020
ESTABLISHED connection GONE: 10.226.2.20:49042
ESTABLISHED connection GONE: 10.226.2.20:49046
ESTABLISHED connection GONE: 10.226.2.20:49050
ESTABLISHED connection GONE: 10.226.2.20:49054
ESTABLISHED connection GONE: 10.226.2.20:49090
ESTABLISHED connection GONE: 10.226.2.20:49096
ESTABLISHED connection GONE: 10.226.2.20:49196
ESTABLISHED connection GONE: 10.227.70.10:57090
ESTABLISHED connection GONE: 10.227.77.254:43384
ESTABLISHED connection GONE: 10.241.165.14:58742

./router-default-6dc44fdbb5-brjh5/haproxy-info-20201211-120808/120812/2/ss-443 has 168 ESTABLISHED connections
./router-default-6dc44fdbb5-brjh5/haproxy-info-20201211-120808/120816/3/ss-443 has 152 ESTABLISHED connections

17 ESTABLISHED connections have gone

10.70.242.170 -- 3 dropped connections
10.225.33.50 -- 1 dropped connections
10.226.2.20 -- 10 dropped connections
10.227.70.10 -- 1 dropped connections
10.227.77.254 -- 1 dropped connections
10.241.165.14 -- 1 dropped connections
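As a rough stand-in for the Perl script, the same "which ESTABLISHED peers disappeared between two ss-443 captures" check can be approximated with awk and comm. This is only a sketch, assuming the captures are plain `ss -tan` output; the script name is hypothetical:

----
#!/bin/bash
# Sketch: print ESTABLISHED peer addresses present in the first capture
# but missing from the second, i.e. connections that have gone away.
# Usage: ./gone-connections.sh <older ss-443 file> <newer ss-443 file>
OLD="$1"
NEW="$2"

peers() { awk '$1 == "ESTAB" {print $5}' "$1" | sort; }   # column 5 = peer addr:port

comm -23 <(peers "$OLD") <(peers "$NEW") | sed 's/^/ESTABLISHED connection GONE: /'
echo "$OLD has $(peers "$OLD" | wc -l) ESTABLISHED connections"
echo "$NEW has $(peers "$NEW" | wc -l) ESTABLISHED connections"
----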
PR that adds a new annotation to control the "timeout tunnel" value per route: https://github.com/openshift/router/pull/239
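For reference, once that change is available the new annotation is set per route like any other route annotation; the route name and 1h value below are only examples:

----
# Example only: allow tunnel/websocket connections on one route to idle for up to 1h.
oc annotate route my-websocket-route haproxy.router.openshift.io/timeout-tunnel=1h
----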
Also see: https://bugzilla.redhat.com/show_bug.cgi?id=1907941
*** Bug 1907941 has been marked as a duplicate of this bug. ***
Also note some previous debugging on a similar issue in openshift v3.11: https://bugzilla.redhat.com/show_bug.cgi?id=1743291#c5
Verified in the '4.7.0-0.nightly-2021-01-19-095812' release payload. With this version, the "timeout-tunnel" and "hard-stop-after" options work as intended: when the "haproxy.router.openshift.io/timeout-tunnel" annotation is applied along with "haproxy.router.openshift.io/timeout", both values are preserved in the haproxy configuration for clear/edge/re-encrypt routes:

------
$ oc get route
NAME         HOST/PORT                                                                         PATH   SERVICES         PORT    TERMINATION   WILDCARD
route-reen   route-reen-test2.apps.aiyengar-oc47-2001.qe.devcluster.openshift.com ... 1 more          service-secure   https   reencrypt     None

$ oc describe route route-reen
Name:                   route-reen
Namespace:              test2
Created:                2 minutes ago
Labels:                 name=service-secure
Annotations:            haproxy.router.openshift.io/timeout=10s
                        haproxy.router.openshift.io/timeout-tunnel=30s
                        openshift.io/host.generated=true
Requested Host:         route-reen-test2.apps.aiyengar-oc47-2001.qe.devcluster.openshift.com
                          exposed on router default (host apps.aiyengar-oc47-2001.qe.devcluster.openshift.com) 2 minutes ago
                          exposed on router internalapps (host internalapps.aiyengar-oc47-2001.qe.devcluster.openshift.com) 2 minutes ago
Path:                   <none>
TLS Termination:        reencrypt
Insecure Policy:        <none>
Endpoint Port:          https

Service:        service-secure
Weight:         100 (100%)
Endpoints:      10.129.2.30:8443

backend be_secure:test2:route-reen
  mode http
  option redispatch
  option forwardfor
  balance leastconn
  timeout server 10s   <---
  timeout tunnel 30s   <---
------

Whereas for passthrough routes, the "timeout-tunnel" value supersedes the 'timeout' value:

------
# Secure backend, pass through
backend be_tcp:test1:route-passth
  balance source
  timeout tunnel 30s   <----
  hash-type consistent
  timeout check 5000ms
  server pod:caddy-rc-l9tjs:service-secure:https:10.128.2.34:8443 10.128.2.34:8443 weight 256 check inter 5000ms
  server pod:caddy-rc-zjkfh:service-secure:https:10.129.2.19:8443 10.129.2.19:8443 weight 256 check inter 5000ms

$ oc describe route route-passth
Name:                   route-passth
Namespace:              test1
Created:                10 minutes ago
Labels:                 name=service-secure
Annotations:            haproxy.router.openshift.io/timeout=10s          <<----
                        haproxy.router.openshift.io/timeout-tunnel=30s   <<----
                        openshift.io/host.generated=true
Requested Host:         route-passth-test1.apps.aiyengar-oc47-2001.qe.devcluster.openshift.com
                          exposed on router default (host apps.aiyengar-oc47-2001.qe.devcluster.openshift.com) 10 minutes ago
                          exposed on router internalapps (host internalapps.aiyengar-oc47-2001.qe.devcluster.openshift.com) 3 minutes ago
Path:                   <none>
TLS Termination:        passthrough
Insecure Policy:        <none>
Endpoint Port:          https

Service:        service-secure
Weight:         100 (100%)
Endpoints:      10.128.2.34:8443, 10.129.2.19:8443
------

* The 'hard-stop-after' option gets applied globally with the annotation added to "ingresses.config/cluster":

----
$ oc -n openshift-ingress get pods
NAME                                   READY   STATUS    RESTARTS   AGE
router-default-776f5769f5-5ndbl        1/1     Running   0          3m58s
router-default-776f5769f5-l7tsh        1/1     Running   0          3m58s
router-internalapps-65d54ff47f-4jqv8   2/2     Running   0          3m59s

$ oc -n openshift-ingress get pods router-default-776f5769f5-l7tsh -o yaml | grep -i HARD -A1 | grep -iv "\{"
--
    - name: ROUTER_HARD_STOP_AFTER
      value: 1h
----

* It can now also be applied on a per-ingresscontroller basis, where a value applied directly to a controller supersedes the globally applied value:

----
$ oc -n openshift-ingress-operator annotate ingresscontrollers/default ingress.operator.openshift.io/hard-stop-after=30m
ingresscontroller.operator.openshift.io/default annotated

$ oc -n openshift-ingress get pods router-default-55897776c4-4m4tm -o yaml | grep -i HARD -A1 | grep -iv "\{"
--
    - name: ROUTER_HARD_STOP_AFTER
      value: 30m

$ oc -n openshift-ingress get pods router-default-776f5769f5-5ndbl -o yaml | grep -i HARD -A1 | grep -iv "\{"
--
    - name: ROUTER_HARD_STOP_AFTER
      value: 1h
----
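For completeness, the global form mentioned above is the same hard-stop-after annotation set on ingresses.config/cluster rather than on an individual ingresscontroller; the 1h value is just the one seen in the pod environment above:

----
# Example only: cluster-wide default, overridable per ingresscontroller as shown above.
oc annotate ingresses.config/cluster ingress.operator.openshift.io/hard-stop-after=1h
----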
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633