Bug 1920421

Summary: Too many haproxy processes in default-router pod causing high load average
Product: OpenShift Container Platform
Component: Networking (sub component: router)
Reporter: OpenShift BugZilla Robot <openshift-bugzilla-robot>
Assignee: Andrew McDermott <amcdermo>
QA Contact: Arvind Iyengar <aiyengar>
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
CC: aiyengar, amcdermo, aos-bugs, bperkins, ddelcian, dgautam, hongli, kpelc, ltitov, mrobson, obockows, sthakare, wking
Version: 4.5
Target Release: 4.5.z
Hardware: Unspecified
OS: Unspecified
Last Closed: 2021-03-03 04:40:35 UTC
Bug Depends On: 1918371

Comment 1 Andrew McDermott 2021-01-26 10:14:55 UTC
*** Bug 1920423 has been marked as a duplicate of this bug. ***

Comment 4 Arvind iyengar 2021-02-01 09:59:01 UTC
Verified in the '4.5.0-0.nightly-2021-01-30-093850' release payload. With this version, the "hard-stop-after" option works as intended: it is applied globally when the annotation is added to "ingresses.config/cluster", and it can also be applied on a per-ingresscontroller basis:
------
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2021-01-30-093850   True        False         87m     Cluster version is 4.5.0-0.nightly-2021-01-30-093850

$ oc annotate ingresses.config/cluster ingress.operator.openshift.io/hard-stop-after=30m     
ingress.config.openshift.io/cluster annotated

$ oc -n openshift-ingress get pods router-default-6c5bbf6476-qn8lv -o yaml | grep -i HARD -A1 | grep -iv "\{"
              k:{"name":"ROUTER_HARD_STOP_AFTER"}:
                .: {}
--
    - name: ROUTER_HARD_STOP_AFTER
      value: 30m


$ oc -n openshift-ingress get pods router-internalapps-574c9c47c5-bv2gw -o yaml | grep -i HARD -A1 | grep -iv "\{"
              k:{"name":"ROUTER_HARD_STOP_AFTER"}:
                .: {}
--
    - name: ROUTER_HARD_STOP_AFTER
      value: 30m
------

When applied on a per-ingresscontroller basis:
------
$ oc -n openshift-ingress-operator annotate ingresscontrollers/internalapps ingress.operator.openshift.io/hard-stop-after=15m
ingresscontroller.operator.openshift.io/internalapps annotated

$ oc -n openshift-ingress get pods router-default-6c5bbf6476-qn8lv -o yaml  | grep -i HARD -A1 | grep -iv  "\{"
--
    - name: ROUTER_HARD_STOP_AFTER
      value: 30m

$ oc -n openshift-ingress get pods router-internalapps-574c9c47c5-bv2gw -o yaml  | grep -i HARD -A1 | grep -iv  "\{"         
--
    - name: ROUTER_HARD_STOP_AFTER
      value: 15m
------
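The precedence shown above (a per-ingresscontroller "hard-stop-after" annotation overriding the cluster-wide one on ingresses.config/cluster) can be sketched as follows. This is a hypothetical illustration of the resolution rule, not the ingress operator's actual code; the helper name `resolve_hard_stop_after` is invented for this example:

```python
# Hypothetical sketch of how the hard-stop-after annotation resolves to the
# ROUTER_HARD_STOP_AFTER env var, assuming the per-ingresscontroller value
# overrides the cluster-wide one (as the transcript above demonstrates).

HARD_STOP_AFTER = "ingress.operator.openshift.io/hard-stop-after"

def resolve_hard_stop_after(cluster_annotations, controller_annotations):
    """Return the ROUTER_HARD_STOP_AFTER value for one router deployment.

    The annotation on the ingresscontroller wins; otherwise the annotation
    on ingresses.config/cluster applies; otherwise no value is set (None).
    """
    return (controller_annotations.get(HARD_STOP_AFTER)
            or cluster_annotations.get(HARD_STOP_AFTER))

# Mirrors the transcript: cluster-wide 30m, internalapps overridden to 15m.
cluster = {HARD_STOP_AFTER: "30m"}
print(resolve_hard_stop_after(cluster, {}))                          # 30m
print(resolve_hard_stop_after(cluster, {HARD_STOP_AFTER: "15m"}))    # 15m
```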

Comment 5 Andrew McDermott 2021-02-01 11:40:52 UTC
Moving this back to POST as it needs to include https://github.com/openshift/router/pull/250.

Comment 6 Arvind iyengar 2021-02-02 10:23:10 UTC
Verified in the '4.5.0-0.ci.test-2021-01-02-031712-ci-ln-dplk5kt' release payload. With this version, the "timeout-tunnel" option works as intended: when the "haproxy.router.openshift.io/timeout-tunnel" annotation is applied along with "haproxy.router.openshift.io/timeout", both values are preserved in the haproxy configuration for cleartext/edge/re-encrypt routes:
-----
$ oc get route -o wide
NAME               HOST/PORT                                                                             PATH   SERVICES            PORT    TERMINATION   WILDCARD
edge-route         edge-route-test1.apps.ci-ln-dplk5kt-f76d1.origin-ci-int-gce.dev.openshift.com                service-unsecure2   http    edge          None
reen-route         reen-route-test1.apps.ci-ln-dplk5kt-f76d1.origin-ci-int-gce.dev.openshift.com                service-secure      https   reencrypt     None
service-unsecure   service-unsecure-test1.apps.ci-ln-dplk5kt-f76d1.origin-ci-int-gce.dev.openshift.com          service-unsecure    http                  None


$  oc annotate route  edge-route  haproxy.router.openshift.io/timeout-tunnel=5s
route.route.openshift.io/edge-route annotated

$ oc annotate route  edge-route  haproxy.router.openshift.io/timeout=15s 
route.route.openshift.io/edge-route annotated

$ oc annotate route  reen-route  haproxy.router.openshift.io/timeout=15s
route.route.openshift.io/reen-route annotated

$ oc annotate route  reen-route  haproxy.router.openshift.io/timeout-tunnel=5s 
route.route.openshift.io/reen-route annotated

$ oc annotate route  service-unsecure  haproxy.router.openshift.io/timeout-tunnel=15s
route.route.openshift.io/service-unsecure annotated

$ oc annotate route  service-unsecure  haproxy.router.openshift.io/timeout=5s  
route.route.openshift.io/service-unsecure annotated


$ oc -n openshift-ingress exec router-default-864d8b5b76-4brsr -- grep "test1:reen-route" haproxy.config -A8
backend be_secure:test1:reen-route
  mode http
  option redispatch
  option forwardfor
  balance leastconn
  timeout server  15s
  timeout tunnel  5s

$ oc -n openshift-ingress exec router-default-864d8b5b76-4brsr -- grep "test1:edge-route" haproxy.config -A8
backend be_edge_http:test1:edge-route
  mode http
  option redispatch
  option forwardfor
  balance leastconn
  timeout server  15s
  timeout tunnel  5s

$ oc -n openshift-ingress exec router-default-864d8b5b76-4brsr --  grep "test1:service-unsecure" haproxy.config  -A8  
backend be_http:test1:service-unsecure
  mode http
  option redispatch
  option forwardfor
  balance leastconn
  timeout server  5s
  timeout tunnel  15s

-----

* Whereas for passthrough routes, the "timeout-tunnel" value supersedes the "timeout" value:
-----

$ oc get route -o wide
NAME               HOST/PORT                                                                             PATH   SERVICES            PORT    TERMINATION   WILDCARD
route-passth       route-passth-test1.apps.ci-ln-dplk5kt-f76d1.origin-ci-int-gce.dev.openshift.com              service-secure2     https   passthrough   None

$ oc annotate route  route-passth  haproxy.router.openshift.io/timeout-tunnel=15s                               
route.route.openshift.io/route-passth annotated

$ oc annotate route  route-passth  haproxy.router.openshift.io/timeout=5s         
route.route.openshift.io/route-passth annotated


backend be_tcp:test1:route-passth
  balance source
  timeout tunnel  15s
-----
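Taken together, the two cases above can be sketched as a small rendering rule: edge/re-encrypt/cleartext backends keep both `timeout server` and `timeout tunnel`, while a passthrough backend emits only `timeout tunnel`, with the timeout-tunnel annotation superseding timeout. This is a hypothetical illustration of the verified behaviour, not the router's actual template code; `backend_timeout_lines` is an invented helper:

```python
# Hypothetical sketch of the timeout/timeout-tunnel behaviour verified above,
# not the OpenShift router's actual config template.

TIMEOUT = "haproxy.router.openshift.io/timeout"
TIMEOUT_TUNNEL = "haproxy.router.openshift.io/timeout-tunnel"

def backend_timeout_lines(annotations, termination):
    """Return the haproxy.config timeout lines for one route backend."""
    lines = []
    if termination == "passthrough":
        # timeout-tunnel supersedes timeout; only "timeout tunnel" is emitted.
        value = annotations.get(TIMEOUT_TUNNEL) or annotations.get(TIMEOUT)
        if value:
            lines.append(f"  timeout tunnel  {value}")
    else:
        # edge / re-encrypt / cleartext: both values are preserved.
        if TIMEOUT in annotations:
            lines.append(f"  timeout server  {annotations[TIMEOUT]}")
        if TIMEOUT_TUNNEL in annotations:
            lines.append(f"  timeout tunnel  {annotations[TIMEOUT_TUNNEL]}")
    return lines

# Mirrors the reen-route and route-passth cases from the transcript.
print(backend_timeout_lines({TIMEOUT: "15s", TIMEOUT_TUNNEL: "5s"}, "reencrypt"))
print(backend_timeout_lines({TIMEOUT: "5s", TIMEOUT_TUNNEL: "15s"}, "passthrough"))
```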

Comment 10 Arvind iyengar 2021-02-08 08:04:32 UTC
Re-verified in the latest "4.5.0-0.nightly-2021-02-05-192721" release version. The "haproxy.router.openshift.io/timeout-tunnel" and "hard-stop-after" annotations are fully functional.

Comment 12 errata-xmlrpc 2021-03-03 04:40:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.5.33 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0428