Bug 1918371 - Too many haproxy processes in default-router pod causing high load average
Summary: Too many haproxy processes in default-router pod causing high load average
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Routing
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.z
Assignee: Andrew McDermott
QA Contact: Arvind Iyengar
URL:
Whiteboard:
Depends On: 1905100
Blocks: 1920421 1920423
 
Reported: 2021-01-20 15:02 UTC by Andrew McDermott
Modified: 2021-05-05 11:11 UTC (History)
14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1905100
Environment:
Last Closed: 2021-02-01 15:24:36 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-ingress-operator pull 535 0 None closed Bug 1918371: Add "ingress.operator.openshift.io/hard-stop-after" annotation 2021-02-17 17:32:30 UTC
Github openshift router pull 249 0 None closed Bug 1918371: Add tunnel-timeout and hard-stop-after options to haproxy template 2021-02-17 17:32:30 UTC
Red Hat Product Errata RHBA-2021:0235 0 None None None 2021-02-01 15:24:54 UTC

Comment 3 Arvind Iyengar 2021-01-27 05:53:00 UTC
Verified in the '4.6.0-0.nightly-2021-01-22-111850' release payload. With this version, the "timeout-tunnel" and "hard-stop-after" options work as intended. When the "haproxy.router.openshift.io/timeout-tunnel" annotation is applied along with "haproxy.router.openshift.io/timeout", both values are preserved in the haproxy configuration for clear/edge/re-encrypt routes:
-----
$ oc get route -o wide
NAME                HOST/PORT                                                                                   PATH   SERVICES            PORT    TERMINATION   WILDCARD
edge-route          edge-route-test1.apps.aiyengar-oc46-patched.qe.devcluster.openshift.com ... 1 more                 service-unsecure    http    edge          None
reen-route          reen-route-test1.apps.aiyengar-oc46-patched.qe.devcluster.openshift.com ... 1 more                 service-secure      https   reencrypt     None
service-unsecure2   service-unsecure2-test1.apps.aiyengar-oc46-patched.qe.devcluster.openshift.com ... 1 more          service-unsecure2   http                  None

$ oc -n openshift-ingress exec router-default-d9855f598-kw7cr --  grep "test1:reen-route" haproxy.config  -A8         
backend be_secure:test1:reen-route
  mode http
  option redispatch
  option forwardfor
  balance leastconn
  timeout server  15s
  timeout tunnel  15s

$ oc -n openshift-ingress exec router-default-d9855f598-kw7cr --  grep "test1:edge-route" haproxy.config  -A8 
backend be_edge_http:test1:edge-route
  mode http
  option redispatch
  option forwardfor
  balance leastconn
  timeout server  15s
  timeout tunnel  5s

$ oc -n openshift-ingress exec router-default-d9855f598-kw7cr --  grep "test1:service-unsecure2" haproxy.config  -A8 
backend be_http:test1:service-unsecure2
  mode http
  option redispatch
  option forwardfor
  balance leastconn
  timeout server  5s
  timeout tunnel  15s
-----
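For reference, backend timeouts like those in the grep output above come from route annotations. A minimal sketch, assuming the route names and namespace (test1) from this cluster, with values chosen to match the edge-route output above (timeout server 15s, timeout tunnel 5s):

```shell
# Set both annotations on an edge route; for clear/edge/re-encrypt routes
# both values are kept in haproxy.config:
#   "timeout"        -> timeout server
#   "timeout-tunnel" -> timeout tunnel
oc -n test1 annotate route edge-route \
  haproxy.router.openshift.io/timeout=15s \
  haproxy.router.openshift.io/timeout-tunnel=5s
```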

* For passthrough routes, by contrast, the "timeout-tunnel" value supersedes the 'timeout' value:
-----
$ oc get route -o wide
NAME                HOST/PORT                                                                                   PATH   SERVICES            PORT    TERMINATION   WILDCARD
edge-route          edge-route-test1.apps.aiyengar-oc46-patched.qe.devcluster.openshift.com ... 1 more                 service-unsecure    http    edge          None
route-passth        route-passth-test1.apps.aiyengar-oc46-patched.qe.devcluster.openshift.com ... 1 more               service-secure      https   passthrough   None


$ oc -n openshift-ingress exec router-default-d9855f598-kw7cr --  grep "test1:route-passth" haproxy.config  -A8 
backend be_tcp:test1:route-passth
  balance source
  timeout tunnel  15s
-----
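A corresponding sketch for the passthrough case, again assuming the route name and namespace from this cluster and values matching the grep output above (timeout tunnel 15s):

```shell
# On a passthrough route the connection is tunneled end to end, so only the
# tunnel timeout is meaningful: "timeout-tunnel" supersedes "timeout" in the
# generated be_tcp backend.
oc -n test1 annotate route route-passth \
  haproxy.router.openshift.io/timeout=5s \
  haproxy.router.openshift.io/timeout-tunnel=15s
```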


* The 'hard-stop-after' option is applied globally when the annotation is added to "ingresses.config/cluster":
------
$ oc annotate ingresses.config/cluster ingress.operator.openshift.io/hard-stop-after=30m                                     
ingress.config.openshift.io/cluster annotated

$ oc -n openshift-ingress get pods -o wide
NAME                                   READY   STATUS    RESTARTS   AGE    IP            NODE                                         NOMINATED NODE   READINESS GATES
router-default-5d74864b6-ghzj5         1/1     Running   0          112s   10.128.2.18   ip-10-0-210-123.us-east-2.compute.internal   <none>           <none>
router-default-5d74864b6-j9w6k         1/1     Running   0          112s   10.131.0.19   ip-10-0-142-157.us-east-2.compute.internal   <none>           <none>
router-internalapps-69bc6dc7d8-7qkkh   2/2     Running   0          111s   10.129.2.28   ip-10-0-162-250.us-east-2.compute.internal   <none>           <none>

$ oc -n openshift-ingress get pods router-default-5d74864b6-ghzj5 -o yaml | grep -i HARD -A1 | grep -iv  "\{"
--
    - name: ROUTER_HARD_STOP_AFTER
      value: 30m

$ oc -n openshift-ingress get pods router-internalapps-69bc6dc7d8-7qkkh -o yaml | grep -i HARD -A1 | grep -iv  "\{"
--
    - name: ROUTER_HARD_STOP_AFTER
      value: 30m
------

* The option can also be applied on a per-ingresscontroller basis; a value applied directly on any controller supersedes the globally applied value:
------
$ oc -n openshift-ingress get pods -o wide                                                                                   
NAME                                   READY   STATUS    RESTARTS   AGE     IP            NODE                                         NOMINATED NODE   READINESS GATES
router-default-5d74864b6-ghzj5         1/1     Running   0          10m     10.128.2.18   ip-10-0-210-123.us-east-2.compute.internal   <none>           <none>
router-default-5d74864b6-j9w6k         1/1     Running   0          10m     10.131.0.19   ip-10-0-142-157.us-east-2.compute.internal   <none>           <none>
router-internalapps-684489659b-9mnpz   2/2     Running   0          7m58s   10.129.2.29   ip-10-0-162-250.us-east-2.compute.internal   <none>           <none>

$ oc -n openshift-ingress-operator annotate ingresscontrollers/internalapps ingress.operator.openshift.io/hard-stop-after=15m
ingresscontroller.operator.openshift.io/internalapps annotated

$ oc -n openshift-ingress get pods router-internalapps-684489659b-9mnpz -o yaml | grep -i HARD -A1 | grep -iv  "\{"
--
    - name: ROUTER_HARD_STOP_AFTER
      value: 15m

$ oc -n openshift-ingress get pods router-default-5d74864b6-ghzj5 -o yaml | grep -i HARD -A1 | grep -iv  "\{"
--
    - name: ROUTER_HARD_STOP_AFTER
      value: 30m
------
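The per-controller override can be removed with the standard annotation-removal syntax (a trailing `-` on the key). Presumably the controller then falls back to the cluster-wide value; that fallback is an assumption, not shown in the verification above:

```shell
# Remove the per-controller override (standard `oc annotate` removal syntax);
# the router deployment should then revert to the global hard-stop-after value
# set on ingresses.config/cluster (assumed behavior).
oc -n openshift-ingress-operator annotate ingresscontrollers/internalapps \
  ingress.operator.openshift.io/hard-stop-after-
```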

Comment 5 errata-xmlrpc 2021-02-01 15:24:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.15 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0235

