Description of problem:
Too many processes in the router-default pod causing high load average.

Version-Release number of selected component (if applicable):
OCP 4.5.14. It is hosted on vSphere. It is NOT a disconnected environment.

How reproducible:
NA

Steps to Reproduce:
1.
2.
3.

Actual results:
[QLF2542@tgm1r628 ~]$ oc rsh router-default-6dc44fdbb5-jw98m
sh-4.2$ ps aux | grep haproxy | wc -l
86
sh-4.2$ ps auxH | grep haproxy | wc -l
345
sh-4.2$ top
top - 16:38:44 up 9 days, 1:56, 0 users, load average: 31.06, 34.16, 30.21
Tasks:  90 total,   5 running,  68 sleeping,   0 stopped,   0 zombie
%Cpu(s): 67.9 us, 19.9 sy,  0.0 ni,  2.6 id,  0.0 wa,  3.8 hi,  5.8 si,  0.0 st
KiB Mem : 32934204 total,  3806568 free, 24923072 used,  4204564 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  8895904 avail Mem

Expected results:
The count of haproxy processes should be lower.

Additional info:
Attaching sosreport of the node and output of "ps -aux" and "ps -auxH" from the default router pod.
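If anyone else needs to check for the same symptom, a quick way to see whether old haproxy processes are accumulating after reloads is to list them with their elapsed time. This is a minimal sketch, not part of the original report; the pod name is the one from the output above and will differ per cluster:

----
# Lists each haproxy process with how long it has been running, then the
# total count. Many long-lived PIDs suggests reloads are leaving old
# processes behind while they drain connections.
oc exec router-default-6dc44fdbb5-jw98m -- ps -C haproxy -o pid,etime,args --sort=etime
oc exec router-default-6dc44fdbb5-jw98m -- sh -c 'ps -C haproxy --no-headers | wc -l'
----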
Hi Andrew, the customer has raised the severity to sev2 and needs urgent resolution. The requested data has been attached. Kindly do the needful.
Created attachment 1738229 [details]
Script to gather socket and haproxy connection info

Copy this script to /tmp and make it executable:

  chmod u+x /tmp/haproxy-established-connections.sh

Then run it to collect socket information related to haproxy in the container. For example, to run 20 times, sleeping 3s between invocations:

  N=20 INTERVAL=3 /tmp/haproxy-established-connections.sh

The output will be written into a new directory, e.g.:

  /tmp/haproxy-info-20201210-151811
  /var/lib/haproxy/conf

Tar up the directory and attach the results to this bugzilla.
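The attached script itself is not pasted here; purely for context, a capture loop along these lines would produce the N/INTERVAL behaviour and the ss-443 files referenced in the next comment. This is a hypothetical sketch, not the attachment:

----
#!/bin/bash
# Hypothetical sketch only -- the real haproxy-established-connections.sh
# attached above may differ. Captures ESTABLISHED/other TCP sockets on :443
# and the haproxy process list N times, INTERVAL seconds apart.
N="${N:-20}"
INTERVAL="${INTERVAL:-3}"
OUTDIR="/tmp/haproxy-info-$(date +%Y%m%d-%H%M%S)"

for i in $(seq 1 "$N"); do
    TS=$(date +%H%M%S)
    DIR="$OUTDIR/$TS/$i"
    mkdir -p "$DIR"
    ss -tan '( sport = :443 )' > "$DIR/ss-443"      # TCP sockets on the HTTPS frontend
    ps auxH | grep '[h]aproxy' > "$DIR/ps-haproxy"  # haproxy processes and threads
    sleep "$INTERVAL"
done
echo "output written to $OUTDIR"
----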
Created attachment 1739334 [details]
what-changed.pl script

Shows the diff between two captures:

$ perl ~/what-disconnected.pl ./router-default-6dc44fdbb5-brjh5/haproxy-info-20201211-120808/120812/2/ss-443 ./router-default-6dc44fdbb5-brjh5/haproxy-info-20201211-120808/120816/3/ss-443
ESTABLISHED connection GONE: 10.70.242.170:58020
ESTABLISHED connection GONE: 10.70.242.170:58022
ESTABLISHED connection GONE: 10.70.242.170:58023
ESTABLISHED connection GONE: 10.225.33.50:16962
ESTABLISHED connection GONE: 10.226.2.20:49006
ESTABLISHED connection GONE: 10.226.2.20:49018
ESTABLISHED connection GONE: 10.226.2.20:49020
ESTABLISHED connection GONE: 10.226.2.20:49042
ESTABLISHED connection GONE: 10.226.2.20:49046
ESTABLISHED connection GONE: 10.226.2.20:49050
ESTABLISHED connection GONE: 10.226.2.20:49054
ESTABLISHED connection GONE: 10.226.2.20:49090
ESTABLISHED connection GONE: 10.226.2.20:49096
ESTABLISHED connection GONE: 10.226.2.20:49196
ESTABLISHED connection GONE: 10.227.70.10:57090
ESTABLISHED connection GONE: 10.227.77.254:43384
ESTABLISHED connection GONE: 10.241.165.14:58742

./router-default-6dc44fdbb5-brjh5/haproxy-info-20201211-120808/120812/2/ss-443 has 168 ESTABLISHED connections
./router-default-6dc44fdbb5-brjh5/haproxy-info-20201211-120808/120816/3/ss-443 has 152 ESTABLISHED connections

17 ESTABLISHED connections have gone

10.70.242.170 -- 3 dropped connections
10.225.33.50 -- 1 dropped connections
10.226.2.20 -- 10 dropped connections
10.227.70.10 -- 1 dropped connections
10.227.77.254 -- 1 dropped connections
10.241.165.14 -- 1 dropped connections
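As a rough stand-in for the Perl script, the same "which ESTABLISHED peers disappeared between two ss-443 captures" check can be approximated with awk and comm. This is only a sketch, assuming the captures are plain `ss -tan` output; the script name is hypothetical:

----
#!/bin/bash
# Sketch: print ESTABLISHED peer addresses present in the first capture
# but missing from the second, i.e. connections that have gone away.
# Usage: ./gone-connections.sh <older ss-443 file> <newer ss-443 file>
OLD="$1"
NEW="$2"

peers() { awk '$1 == "ESTAB" {print $5}' "$1" | sort; }   # column 5 = peer addr:port

comm -23 <(peers "$OLD") <(peers "$NEW") | sed 's/^/ESTABLISHED connection GONE: /'
echo "$OLD has $(peers "$OLD" | wc -l) ESTABLISHED connections"
echo "$NEW has $(peers "$NEW" | wc -l) ESTABLISHED connections"
----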
PR that adds a new annotation to control the "timeout tunnel" value per route: https://github.com/openshift/router/pull/239
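For reference, once that change is available the new annotation is set per route like any other route annotation; the route name and 1h value below are only examples:

----
# Example only: allow tunnel/websocket connections on one route to idle for up to 1h.
oc annotate route my-websocket-route haproxy.router.openshift.io/timeout-tunnel=1h
----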
Also see: https://bugzilla.redhat.com/show_bug.cgi?id=1907941
*** Bug 1907941 has been marked as a duplicate of this bug. ***
Also note some previous debugging on a similar issue in openshift v3.11: https://bugzilla.redhat.com/show_bug.cgi?id=1743291#c5
Verified in the '4.7.0-0.nightly-2021-01-19-095812' release payload. With this version, the "timeout-tunnel" and "hard-stop-after" options work as intended: when the "haproxy.router.openshift.io/timeout-tunnel" annotation is applied along with "haproxy.router.openshift.io/timeout", both values are preserved in the haproxy configuration for clear/edge/re-encrypt routes:

------
$ oc get route
NAME         HOST/PORT                                                                         PATH   SERVICES         PORT    TERMINATION   WILDCARD
route-reen   route-reen-test2.apps.aiyengar-oc47-2001.qe.devcluster.openshift.com ... 1 more          service-secure   https   reencrypt     None

$ oc describe route route-reen
Name:                   route-reen
Namespace:              test2
Created:                2 minutes ago
Labels:                 name=service-secure
Annotations:            haproxy.router.openshift.io/timeout=10s
                        haproxy.router.openshift.io/timeout-tunnel=30s
                        openshift.io/host.generated=true
Requested Host:         route-reen-test2.apps.aiyengar-oc47-2001.qe.devcluster.openshift.com
                          exposed on router default (host apps.aiyengar-oc47-2001.qe.devcluster.openshift.com) 2 minutes ago
                          exposed on router internalapps (host internalapps.aiyengar-oc47-2001.qe.devcluster.openshift.com) 2 minutes ago
Path:                   <none>
TLS Termination:        reencrypt
Insecure Policy:        <none>
Endpoint Port:          https

Service:        service-secure
Weight:         100 (100%)
Endpoints:      10.129.2.30:8443

backend be_secure:test2:route-reen
  mode http
  option redispatch
  option forwardfor
  balance leastconn
  timeout server 10s   <---
  timeout tunnel 30s   <---
------

Whereas for passthrough routes, the "timeout-tunnel" value supersedes the 'timeout' value:

------
# Secure backend, pass through
backend be_tcp:test1:route-passth
  balance source
  timeout tunnel 30s   <----
  hash-type consistent
  timeout check 5000ms
  server pod:caddy-rc-l9tjs:service-secure:https:10.128.2.34:8443 10.128.2.34:8443 weight 256 check inter 5000ms
  server pod:caddy-rc-zjkfh:service-secure:https:10.129.2.19:8443 10.129.2.19:8443 weight 256 check inter 5000ms

$ oc describe route route-passth
Name:                   route-passth
Namespace:              test1
Created:                10 minutes ago
Labels:                 name=service-secure
Annotations:            haproxy.router.openshift.io/timeout=10s          <<----
                        haproxy.router.openshift.io/timeout-tunnel=30s   <<----
                        openshift.io/host.generated=true
Requested Host:         route-passth-test1.apps.aiyengar-oc47-2001.qe.devcluster.openshift.com
                          exposed on router default (host apps.aiyengar-oc47-2001.qe.devcluster.openshift.com) 10 minutes ago
                          exposed on router internalapps (host internalapps.aiyengar-oc47-2001.qe.devcluster.openshift.com) 3 minutes ago
Path:                   <none>
TLS Termination:        passthrough
Insecure Policy:        <none>
Endpoint Port:          https

Service:        service-secure
Weight:         100 (100%)
Endpoints:      10.128.2.34:8443, 10.129.2.19:8443
------

* The 'hard-stop-after' option gets applied globally with the annotation added to "ingresses.config/cluster":

----
$ oc -n openshift-ingress get pods
NAME                                   READY   STATUS    RESTARTS   AGE
router-default-776f5769f5-5ndbl        1/1     Running   0          3m58s
router-default-776f5769f5-l7tsh        1/1     Running   0          3m58s
router-internalapps-65d54ff47f-4jqv8   2/2     Running   0          3m59s

$ oc -n openshift-ingress get pods router-default-776f5769f5-l7tsh -o yaml | grep -i HARD -A1 | grep -iv "\{"
--
    - name: ROUTER_HARD_STOP_AFTER
      value: 1h
----

* It can now also be applied on a per-ingresscontroller basis, where a value applied directly to a controller supersedes the globally applied value:

----
$ oc -n openshift-ingress-operator annotate ingresscontrollers/default ingress.operator.openshift.io/hard-stop-after=30m
ingresscontroller.operator.openshift.io/default annotated

$ oc -n openshift-ingress get pods router-default-55897776c4-4m4tm -o yaml | grep -i HARD -A1 | grep -iv "\{"
--
    - name: ROUTER_HARD_STOP_AFTER
      value: 30m

$ oc -n openshift-ingress get pods router-default-776f5769f5-5ndbl -o yaml | grep -i HARD -A1 | grep -iv "\{"
--
    - name: ROUTER_HARD_STOP_AFTER
      value: 1h
----
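For completeness, the global form mentioned above is the same hard-stop-after annotation set on ingresses.config/cluster rather than on an individual ingresscontroller; the 1h value is just the one seen in the pod environment above:

----
# Example only: cluster-wide default, overridable per ingresscontroller as shown above.
oc annotate ingresses.config/cluster ingress.operator.openshift.io/hard-stop-after=1h
----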
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633