Bug 1918371 - Too many haproxy processes in default-router pod causing high load average
Summary: Too many haproxy processes in default-router pod causing high load average
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Routing
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.z
Assignee: Andrew McDermott
QA Contact: Arvind Iyengar
URL:
Whiteboard:
Depends On: 1905100
Blocks: 1920421 1920423
 
Reported: 2021-01-20 15:02 UTC by Andrew McDermott
Modified: 2021-05-05 11:11 UTC (History)
14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1905100
Environment:
Last Closed: 2021-02-01 15:24:36 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-ingress-operator pull 535 0 None closed Bug 1918371: Add "ingress.operator.openshift.io/hard-stop-after" annotation 2021-02-17 17:32:30 UTC
Github openshift router pull 249 0 None closed Bug 1918371: Add tunnel-timeout and hard-stop-after options to haproxy template 2021-02-17 17:32:30 UTC
Red Hat Product Errata RHBA-2021:0235 0 None None None 2021-02-01 15:24:54 UTC

Comment 3 Arvind Iyengar 2021-01-27 05:53:00 UTC
Verified in the '4.6.0-0.nightly-2021-01-22-111850' release payload. With this version, the "timeout-tunnel" and "hard-stop-after" options work as intended. When the "haproxy.router.openshift.io/timeout-tunnel" annotation is applied along with "haproxy.router.openshift.io/timeout", both values are preserved in the haproxy configuration for clear/edge/re-encrypt routes:
-----
$ oc get route -o wide
NAME                HOST/PORT                                                                                   PATH   SERVICES            PORT    TERMINATION   WILDCARD
edge-route          edge-route-test1.apps.aiyengar-oc46-patched.qe.devcluster.openshift.com ... 1 more                 service-unsecure    http    edge          None
reen-route          reen-route-test1.apps.aiyengar-oc46-patched.qe.devcluster.openshift.com ... 1 more                 service-secure      https   reencrypt     None
service-unsecure2   service-unsecure2-test1.apps.aiyengar-oc46-patched.qe.devcluster.openshift.com ... 1 more          service-unsecure2   http                  None

$ oc -n openshift-ingress exec router-default-d9855f598-kw7cr --  grep "test1:reen-route" haproxy.config  -A8         
backend be_secure:test1:reen-route
  mode http
  option redispatch
  option forwardfor
  balance leastconn
  timeout server  15s
  timeout tunnel  15s

$ oc -n openshift-ingress exec router-default-d9855f598-kw7cr --  grep "test1:edge-route" haproxy.config  -A8 
backend be_edge_http:test1:edge-route
  mode http
  option redispatch
  option forwardfor
  balance leastconn
  timeout server  15s
  timeout tunnel  5s

$ oc -n openshift-ingress exec router-default-d9855f598-kw7cr --  grep "test1:service-unsecure2" haproxy.config  -A8 
backend be_http:test1:service-unsecure2
  mode http
  option redispatch
  option forwardfor
  balance leastconn
  timeout server  5s
  timeout tunnel  15s
-----
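For reference, backend timeouts like those in the grep output above come from route annotations. A minimal sketch, assuming the route names and namespace (test1) from this cluster, with values chosen to match the edge-route output above (timeout server 15s, timeout tunnel 5s):

```shell
# Set both annotations on an edge route; for clear/edge/re-encrypt routes
# both values are kept in haproxy.config:
#   "timeout"        -> timeout server
#   "timeout-tunnel" -> timeout tunnel
oc -n test1 annotate route edge-route \
  haproxy.router.openshift.io/timeout=15s \
  haproxy.router.openshift.io/timeout-tunnel=5s
```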

* For passthrough routes, by contrast, the "timeout-tunnel" value supersedes the 'timeout' value:
-----
$ oc get route -o wide
NAME                HOST/PORT                                                                                   PATH   SERVICES            PORT    TERMINATION   WILDCARD
edge-route          edge-route-test1.apps.aiyengar-oc46-patched.qe.devcluster.openshift.com ... 1 more                 service-unsecure    http    edge          None
route-passth        route-passth-test1.apps.aiyengar-oc46-patched.qe.devcluster.openshift.com ... 1 more               service-secure      https   passthrough   None


$ oc -n openshift-ingress exec router-default-d9855f598-kw7cr --  grep "test1:route-passth" haproxy.config  -A8 
backend be_tcp:test1:route-passth
  balance source
  timeout tunnel  15s
-----
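A corresponding sketch for the passthrough case, again assuming the route name and namespace from this cluster and values matching the grep output above (timeout tunnel 15s):

```shell
# On a passthrough route the connection is tunneled end to end, so only the
# tunnel timeout is meaningful: "timeout-tunnel" supersedes "timeout" in the
# generated be_tcp backend.
oc -n test1 annotate route route-passth \
  haproxy.router.openshift.io/timeout=5s \
  haproxy.router.openshift.io/timeout-tunnel=15s
```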


* The 'hard-stop-after' option is applied globally when the annotation is added to "ingresses.config/cluster":
------
$ oc annotate ingresses.config/cluster ingress.operator.openshift.io/hard-stop-after=30m                                     
ingress.config.openshift.io/cluster annotated

$ oc -n openshift-ingress get pods -o wide
NAME                                   READY   STATUS    RESTARTS   AGE    IP            NODE                                         NOMINATED NODE   READINESS GATES
router-default-5d74864b6-ghzj5         1/1     Running   0          112s   10.128.2.18   ip-10-0-210-123.us-east-2.compute.internal   <none>           <none>
router-default-5d74864b6-j9w6k         1/1     Running   0          112s   10.131.0.19   ip-10-0-142-157.us-east-2.compute.internal   <none>           <none>
router-internalapps-69bc6dc7d8-7qkkh   2/2     Running   0          111s   10.129.2.28   ip-10-0-162-250.us-east-2.compute.internal   <none>           <none>

$ oc -n openshift-ingress get pods router-default-5d74864b6-ghzj5 -o yaml | grep -i HARD -A1 | grep -iv  "\{"
--
    - name: ROUTER_HARD_STOP_AFTER
      value: 30m

$ oc -n openshift-ingress get pods router-internalapps-69bc6dc7d8-7qkkh -o yaml | grep -i HARD -A1 | grep -iv  "\{"
--
    - name: ROUTER_HARD_STOP_AFTER
      value: 30m
------

* The option can also be applied on a per-ingresscontroller basis; a value applied directly on any controller supersedes the globally applied value:
------
$ oc -n openshift-ingress get pods -o wide                                                                                   
NAME                                   READY   STATUS    RESTARTS   AGE     IP            NODE                                         NOMINATED NODE   READINESS GATES
router-default-5d74864b6-ghzj5         1/1     Running   0          10m     10.128.2.18   ip-10-0-210-123.us-east-2.compute.internal   <none>           <none>
router-default-5d74864b6-j9w6k         1/1     Running   0          10m     10.131.0.19   ip-10-0-142-157.us-east-2.compute.internal   <none>           <none>
router-internalapps-684489659b-9mnpz   2/2     Running   0          7m58s   10.129.2.29   ip-10-0-162-250.us-east-2.compute.internal   <none>           <none>

$ oc -n openshift-ingress-operator annotate ingresscontrollers/internalapps ingress.operator.openshift.io/hard-stop-after=15m
ingresscontroller.operator.openshift.io/internalapps annotated

$ oc -n openshift-ingress get pods router-internalapps-684489659b-9mnpz -o yaml | grep -i HARD -A1 | grep -iv  "\{"
--
    - name: ROUTER_HARD_STOP_AFTER
      value: 15m

$ oc -n openshift-ingress get pods router-default-5d74864b6-ghzj5 -o yaml | grep -i HARD -A1 | grep -iv  "\{"
--
    - name: ROUTER_HARD_STOP_AFTER
      value: 30m
------
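The per-controller override can be removed with the standard annotation-removal syntax (a trailing `-` on the key). Presumably the controller then falls back to the cluster-wide value; that fallback is an assumption, not shown in the verification above:

```shell
# Remove the per-controller override (standard `oc annotate` removal syntax);
# the router deployment should then revert to the global hard-stop-after value
# set on ingresses.config/cluster (assumed behavior).
oc -n openshift-ingress-operator annotate ingresscontrollers/internalapps \
  ingress.operator.openshift.io/hard-stop-after-
```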

Comment 5 errata-xmlrpc 2021-02-01 15:24:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.15 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0235

