Bug 1918371

Summary: Too many haproxy processes in default-router pod causing high load average
Product: OpenShift Container Platform Reporter: Andrew McDermott <amcdermo>
Component: Networking Assignee: Andrew McDermott <amcdermo>
Networking sub component: router QA Contact: Arvind iyengar <aiyengar>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: urgent CC: aiyengar, amcdermo, aos-bugs, bperkins, ddelcian, dgautam, hongli, kpelc, ltitov, mjoseph, mrobson, obockows, skanakal, sthakare, wking
Version: 4.5   
Target Milestone: ---   
Target Release: 4.6.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1905100 Environment:
Last Closed: 2021-02-01 15:24:36 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1905100    
Bug Blocks: 1920421, 1920423    

Comment 3 Arvind iyengar 2021-01-27 05:53:00 UTC
Verified in the '4.6.0-0.nightly-2021-01-22-111850' release payload. With this version, the "timeout-tunnel" and "hard-stop-after" options appear to work as intended. When the "haproxy.router.openshift.io/timeout-tunnel" annotation is applied along with "haproxy.router.openshift.io/timeout", both values are preserved in the haproxy configuration for clear/edge/re-encrypt routes:
-----
$ oc get route -o wide
NAME                HOST/PORT                                                                                   PATH   SERVICES            PORT    TERMINATION   WILDCARD
edge-route          edge-route-test1.apps.aiyengar-oc46-patched.qe.devcluster.openshift.com ... 1 more                 service-unsecure    http    edge          None
reen-route          reen-route-test1.apps.aiyengar-oc46-patched.qe.devcluster.openshift.com ... 1 more                 service-secure      https   reencrypt     None
service-unsecure2   service-unsecure2-test1.apps.aiyengar-oc46-patched.qe.devcluster.openshift.com ... 1 more          service-unsecure2   http                  None

$ oc -n openshift-ingress exec router-default-d9855f598-kw7cr --  grep "test1:reen-route" haproxy.config  -A8         
backend be_secure:test1:reen-route
  mode http
  option redispatch
  option forwardfor
  balance leastconn
  timeout server  15s
  timeout tunnel  15s

$ oc -n openshift-ingress exec router-default-d9855f598-kw7cr --  grep "test1:edge-route" haproxy.config  -A8 
backend be_edge_http:test1:edge-route
  mode http
  option redispatch
  option forwardfor
  balance leastconn
  timeout server  15s
  timeout tunnel  5s

$ oc -n openshift-ingress exec router-default-d9855f598-kw7cr --  grep "test1:service-unsecure2" haproxy.config  -A8 
backend be_http:test1:service-unsecure2
  mode http
  option redispatch
  option forwardfor
  balance leastconn
  timeout server  5s
  timeout tunnel  15s
-----
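
The same check can be scripted rather than read by eye. The following is a minimal sketch, not part of the verification itself: `backend_timeouts` and `sample_config` are hypothetical helpers, and the sample text is the edge-route stanza shown above. Against a live cluster the input would presumably come from the `oc exec ... grep` output instead of a here-document.

```shell
#!/bin/sh
# Hypothetical helper: print "server=<v> tunnel=<v>" for the named backend
# found in haproxy config text on stdin. Missing values print as "-".
backend_timeouts() {
  awk -v name="$1" '
    $1 == "backend" { inside = ($2 == name) }   # track whether we are in the wanted stanza
    inside && $1 == "timeout" && $2 == "server" { server = $3 }
    inside && $1 == "timeout" && $2 == "tunnel" { tunnel = $3 }
    END {
      printf "server=%s tunnel=%s\n", (server == "" ? "-" : server), (tunnel == "" ? "-" : tunnel)
    }
  '
}

# Sample input copied from the edge-route stanza in this comment.
sample_config() {
  cat <<'EOF'
backend be_edge_http:test1:edge-route
  mode http
  option redispatch
  option forwardfor
  balance leastconn
  timeout server  15s
  timeout tunnel  5s
EOF
}

sample_config | backend_timeouts "be_edge_http:test1:edge-route"
```

For the sample stanza this prints `server=15s tunnel=5s`, matching the two `timeout` lines in the grep output.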

* For passthrough routes, by contrast, the "timeout-tunnel" value supersedes the 'timeout' value:
-----
$ oc get route -o wide
NAME                HOST/PORT                                                                                   PATH   SERVICES            PORT    TERMINATION   WILDCARD
edge-route          edge-route-test1.apps.aiyengar-oc46-patched.qe.devcluster.openshift.com ... 1 more                 service-unsecure    http    edge          None
route-passth        route-passth-test1.apps.aiyengar-oc46-patched.qe.devcluster.openshift.com ... 1 more               service-secure      https   passthrough   None


$ oc -n openshift-ingress exec router-default-d9855f598-kw7cr --  grep "test1:route-passth" haproxy.config  -A8 
backend be_tcp:test1:route-passth
  balance source
  timeout tunnel  15s
-----
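
The difference between the two cases can be summarised as a small model. This is a sketch of the behaviour observed in this comment, not the router's actual template logic: `render_timeouts` is a hypothetical function, the fallback to 'timeout' when "timeout-tunnel" is unset on a passthrough route is an assumption, and the 5s 'timeout' value used for the passthrough call is illustrative only (the comment does not show what was set on route-passth).

```shell
#!/bin/sh
# Hypothetical model of which timeout lines end up in a backend stanza.
# Arguments: <termination> <timeout annotation> <timeout-tunnel annotation>
render_timeouts() {
  termination=$1
  timeout=$2
  tunnel=$3
  if [ "$termination" = "passthrough" ]; then
    # Passthrough (be_tcp) backends carry only "timeout tunnel";
    # timeout-tunnel wins, falling back to timeout (assumption).
    value=${tunnel:-$timeout}
    [ -n "$value" ] && echo "timeout tunnel  $value"
  else
    # Clear/edge/re-encrypt backends keep both values side by side.
    [ -n "$timeout" ] && echo "timeout server  $timeout"
    [ -n "$tunnel" ] && echo "timeout tunnel  $tunnel"
  fi
  return 0
}

render_timeouts edge 15s 5s          # values seen on edge-route
render_timeouts passthrough 5s 15s   # 5s is an illustrative timeout value
```

The first call yields both a `timeout server` and a `timeout tunnel` line; the second yields only `timeout tunnel  15s`, as in the be_tcp stanza above.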


* The 'hard-stop-after' option is applied globally when the annotation is added to "ingresses.config/cluster":
------
$ oc annotate ingresses.config/cluster ingress.operator.openshift.io/hard-stop-after=30m                                     
ingress.config.openshift.io/cluster annotated

$ oc -n openshift-ingress get pods -o wide
NAME                                   READY   STATUS    RESTARTS   AGE    IP            NODE                                         NOMINATED NODE   READINESS GATES
router-default-5d74864b6-ghzj5         1/1     Running   0          112s   10.128.2.18   ip-10-0-210-123.us-east-2.compute.internal   <none>           <none>
router-default-5d74864b6-j9w6k         1/1     Running   0          112s   10.131.0.19   ip-10-0-142-157.us-east-2.compute.internal   <none>           <none>
router-internalapps-69bc6dc7d8-7qkkh   2/2     Running   0          111s   10.129.2.28   ip-10-0-162-250.us-east-2.compute.internal   <none>           <none>

$ oc -n openshift-ingress get pods router-default-5d74864b6-ghzj5 -o yaml | grep -i HARD -A1 | grep -iv  "\{"
--
    - name: ROUTER_HARD_STOP_AFTER
      value: 30m

$ oc -n openshift-ingress get pods router-internalapps-69bc6dc7d8-7qkkh -o yaml | grep -i HARD -A1 | grep -iv  "\{"
--
    - name: ROUTER_HARD_STOP_AFTER
      value: 30m
------

* It can also be applied on a per-ingresscontroller basis; a value set directly on a controller supersedes the globally applied value:
------
$ oc -n openshift-ingress get pods -o wide                                                                                   
NAME                                   READY   STATUS    RESTARTS   AGE     IP            NODE                                         NOMINATED NODE   READINESS GATES
router-default-5d74864b6-ghzj5         1/1     Running   0          10m     10.128.2.18   ip-10-0-210-123.us-east-2.compute.internal   <none>           <none>
router-default-5d74864b6-j9w6k         1/1     Running   0          10m     10.131.0.19   ip-10-0-142-157.us-east-2.compute.internal   <none>           <none>
router-internalapps-684489659b-9mnpz   2/2     Running   0          7m58s   10.129.2.29   ip-10-0-162-250.us-east-2.compute.internal   <none>           <none>

$ oc -n openshift-ingress-operator annotate ingresscontrollers/internalapps ingress.operator.openshift.io/hard-stop-after=15m
ingresscontroller.operator.openshift.io/internalapps annotated

$ oc -n openshift-ingress get pods router-internalapps-684489659b-9mnpz -o yaml | grep -i HARD -A1 | grep -iv  "\{"
--
    - name: ROUTER_HARD_STOP_AFTER
      value: 15m

$ oc -n openshift-ingress get pods router-default-5d74864b6-ghzj5 -o yaml | grep -i HARD -A1 | grep -iv  "\{"
--
    - name: ROUTER_HARD_STOP_AFTER
      value: 30m
------
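
The precedence seen in the two transcripts above amounts to "controller value if present, otherwise the global value". A minimal sketch of that rule follows; `effective_hard_stop_after` is a hypothetical name used for illustration and is not an operator function.

```shell
#!/bin/sh
# Hypothetical model of the observed precedence for ROUTER_HARD_STOP_AFTER.
# Arguments: <global annotation value> <per-ingresscontroller annotation value>
effective_hard_stop_after() {
  global=$1
  controller=$2
  # A value set on the ingresscontroller wins; otherwise use the global one.
  echo "${controller:-$global}"
}

effective_hard_stop_after 30m 15m   # internalapps: annotated directly with 15m
effective_hard_stop_after 30m ""    # default: no annotation of its own
```

With the values from this comment, the first call prints `15m` and the second prints `30m`, matching the environment variables on the router-internalapps and router-default pods. If neither value is set, the sketch prints an empty line.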

Comment 5 errata-xmlrpc 2021-02-01 15:24:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.15 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0235