Bug 1905100
Summary: Too many haproxy processes in default-router pod causing high load average

Product: OpenShift Container Platform
Component: Networking
Networking sub component: router
Version: 4.5
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Reporter: Dhruv Gautam <dgautam>
Assignee: Andrew McDermott <amcdermo>
QA Contact: Arvind Iyengar <aiyengar>
CC: aiyengar, amcdermo, aos-bugs, bperkins, ddelcian, hongli, kpelc, ltitov, mjoseph, mrobson, obockows, sthakare, wking
Doc Type: If docs needed, set a value
Clones: 1918371 (view as bug list)
Bug Blocks: 1918371
Type: Bug
Last Closed: 2021-02-24 15:38:25 UTC
Description
Dhruv Gautam
2020-12-07 13:53:43 UTC
Hi Andrew, the customer has raised the severity to sev2 and needs urgent resolution. The requested data has been attached. Kindly do the needful.

Created attachment 1738229 [details]
Script to gather socket and haproxy connection info
Copy this script to /tmp:
chmod u+x /tmp/haproxy-established-connections.sh
Then run it to collect socket and haproxy-related connection information in the container.
For example, run 20 times, sleeping 3s between invocations:
N=20 INTERVAL=3 /tmp/haproxy-established-connections.sh
The output will be written into a new directory, eg:
/tmp/haproxy-info-20201210-151811
Tar up the directory and attach the results to this bugzilla.
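For reference, a minimal sketch of the kind of capture loop such a script might perform. The actual haproxy-established-connections.sh is in the attachment and may differ; the ss invocation and per-capture file names below are assumptions modelled on the output layout shown later in this bug:

------
#!/bin/bash
# Sketch of a capture loop: N samples, INTERVAL seconds apart.
N="${N:-20}"
INTERVAL="${INTERVAL:-3}"
OUTDIR="/tmp/haproxy-info-$(date +%Y%m%d-%H%M%S)"

for i in $(seq 1 "$N"); do
  CAPDIR="$OUTDIR/$(date +%H%M%S)/$i"
  mkdir -p "$CAPDIR"
  # Sockets on the router's HTTPS port (hypothetical file name ss-443).
  ss -tn 'sport = :443' > "$CAPDIR/ss-443"
  # haproxy processes currently alive in the container.
  ps -e -o pid,etime,args | grep '[h]aproxy' > "$CAPDIR/haproxy-ps"
  sleep "$INTERVAL"
done
echo "captures written to $OUTDIR"
------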
Created attachment 1739334 [details]
what-changed.pl script
Shows the diff between two captures:
$ perl ~/what-disconnected.pl ./router-default-6dc44fdbb5-brjh5/haproxy-info-20201211-120808/120812/2/ss-443 ./router-default-6dc44fdbb5-brjh5/haproxy-info-20201211-120808/120816/3/ss-443
ESTABLISHED connection GONE: 10.70.242.170:58020
ESTABLISHED connection GONE: 10.70.242.170:58022
ESTABLISHED connection GONE: 10.70.242.170:58023
ESTABLISHED connection GONE: 10.225.33.50:16962
ESTABLISHED connection GONE: 10.226.2.20:49006
ESTABLISHED connection GONE: 10.226.2.20:49018
ESTABLISHED connection GONE: 10.226.2.20:49020
ESTABLISHED connection GONE: 10.226.2.20:49042
ESTABLISHED connection GONE: 10.226.2.20:49046
ESTABLISHED connection GONE: 10.226.2.20:49050
ESTABLISHED connection GONE: 10.226.2.20:49054
ESTABLISHED connection GONE: 10.226.2.20:49090
ESTABLISHED connection GONE: 10.226.2.20:49096
ESTABLISHED connection GONE: 10.226.2.20:49196
ESTABLISHED connection GONE: 10.227.70.10:57090
ESTABLISHED connection GONE: 10.227.77.254:43384
ESTABLISHED connection GONE: 10.241.165.14:58742
./router-default-6dc44fdbb5-brjh5/haproxy-info-20201211-120808/120812/2/ss-443 has 168 ESTABLISHED connections
./router-default-6dc44fdbb5-brjh5/haproxy-info-20201211-120808/120816/3/ss-443 has 152 ESTABLISHED connections
17 ESTABLISHED connections have gone
10.70.242.170 -- 3 dropped connections
10.225.33.50 -- 1 dropped connections
10.226.2.20 -- 10 dropped connections
10.227.70.10 -- 1 dropped connections
10.227.77.254 -- 1 dropped connections
10.241.165.14 -- 1 dropped connections
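The attached helper is perl; as a rough bash illustration of the same capture-diff idea, something like the following would produce comparable output (this assumes plain `ss -tn` captures, where column 1 is the state and column 5 is the peer address:port; the script name is hypothetical):

------
#!/bin/bash
# Usage: ./what-changed.sh OLD_CAPTURE NEW_CAPTURE
old="$1"; new="$2"

# Peer address:port of every ESTABLISHED row, sorted for comm(1).
awk '$1 == "ESTAB" {print $5}' "$old" | sort > /tmp/old.estab
awk '$1 == "ESTAB" {print $5}' "$new" | sort > /tmp/new.estab

# Connections present in the old capture but gone from the new one.
comm -23 /tmp/old.estab /tmp/new.estab | sed 's/^/ESTABLISHED connection GONE: /'
echo "$old has $(wc -l < /tmp/old.estab) ESTABLISHED connections"
echo "$new has $(wc -l < /tmp/new.estab) ESTABLISHED connections"

# Per-peer count of dropped connections.
comm -23 /tmp/old.estab /tmp/new.estab | cut -d: -f1 | sort | uniq -c |
  awk '{print $2, "--", $1, "dropped connections"}'
------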
PR that adds a new annotation to control the "timeout tunnel" value per route: https://github.com/openshift/router/pull/239

*** Bug 1907941 has been marked as a duplicate of this bug. ***

Also note some previous debugging of a similar issue in OpenShift v3.11: https://bugzilla.redhat.com/show_bug.cgi?id=1743291#c5

Verified in the '4.7.0-0.nightly-2021-01-19-095812' release payload. With this version, the "timeout-tunnel" and "hard-stop-after" options work as intended. When the "haproxy.router.openshift.io/timeout-tunnel" annotation is applied along with "haproxy.router.openshift.io/timeout", both values are preserved in the haproxy configuration for clear/edge/re-encrypt routes:

------
$ oc get route
NAME         HOST/PORT                                                                        PATH   SERVICES         PORT    TERMINATION   WILDCARD
route-reen   route-reen-test2.apps.aiyengar-oc47-2001.qe.devcluster.openshift.com ... 1 more          service-secure   https   reencrypt     None

$ oc describe route route-reen
Name:                   route-reen
Namespace:              test2
Created:                2 minutes ago
Labels:                 name=service-secure
Annotations:            haproxy.router.openshift.io/timeout=10s
                        haproxy.router.openshift.io/timeout-tunnel=30s
                        openshift.io/host.generated=true
Requested Host:         route-reen-test2.apps.aiyengar-oc47-2001.qe.devcluster.openshift.com
                          exposed on router default (host apps.aiyengar-oc47-2001.qe.devcluster.openshift.com) 2 minutes ago
                          exposed on router internalapps (host internalapps.aiyengar-oc47-2001.qe.devcluster.openshift.com) 2 minutes ago
Path:                   <none>
TLS Termination:        reencrypt
Insecure Policy:        <none>
Endpoint Port:          https

Service:                service-secure
Weight:                 100 (100%)
Endpoints:              10.129.2.30:8443

backend be_secure:test2:route-reen
  mode http
  option redispatch
  option forwardfor
  balance leastconn
  timeout server 10s   <---
  timeout tunnel 30s   <---
------

For passthrough routes, by contrast, "timeout-tunnel" supersedes the "timeout" value:

------
# Secure backend, pass through
backend be_tcp:test1:route-passth
  balance source
  timeout tunnel 30s   <----
  hash-type consistent
  timeout check 5000ms
  server pod:caddy-rc-l9tjs:service-secure:https:10.128.2.34:8443 10.128.2.34:8443 weight 256 check inter 5000ms
  server pod:caddy-rc-zjkfh:service-secure:https:10.129.2.19:8443 10.129.2.19:8443 weight 256 check inter 5000ms

$ oc describe route route-passth
Name:                   route-passth
Namespace:              test1
Created:                10 minutes ago
Labels:                 name=service-secure
Annotations:            haproxy.router.openshift.io/timeout=10s          <<----
                        haproxy.router.openshift.io/timeout-tunnel=30s   <<----
                        openshift.io/host.generated=true
Requested Host:         route-passth-test1.apps.aiyengar-oc47-2001.qe.devcluster.openshift.com
                          exposed on router default (host apps.aiyengar-oc47-2001.qe.devcluster.openshift.com) 10 minutes ago
                          exposed on router internalapps (host internalapps.aiyengar-oc47-2001.qe.devcluster.openshift.com) 3 minutes ago
Path:                   <none>
TLS Termination:        passthrough
Insecure Policy:        <none>
Endpoint Port:          https

Service:                service-secure
Weight:                 100 (100%)
Endpoints:              10.128.2.34:8443, 10.129.2.19:8443
------
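For reference, the per-route annotations shown above can be set with oc annotate. A minimal example using the route, namespace, and values from this verification (the command form is standard oc, not taken verbatim from the bug):

------
# Apply both timeout annotations to the re-encrypt route used above.
oc -n test2 annotate route route-reen \
  haproxy.router.openshift.io/timeout=10s \
  haproxy.router.openshift.io/timeout-tunnel=30s
------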
* The "hard-stop-after" option is applied globally when the annotation is added to "ingresses.config/cluster":

----
$ oc -n openshift-ingress get pods
NAME                                   READY   STATUS    RESTARTS   AGE
router-default-776f5769f5-5ndbl        1/1     Running   0          3m58s
router-default-776f5769f5-l7tsh        1/1     Running   0          3m58s
router-internalapps-65d54ff47f-4jqv8   2/2     Running   0          3m59s

$ oc -n openshift-ingress get pods router-default-776f5769f5-l7tsh -o yaml | grep -i HARD -A1 | grep -iv "\{"
- name: ROUTER_HARD_STOP_AFTER
  value: 1h
----

* It can also be applied on a per-ingresscontroller basis, where a value applied directly on a controller supersedes the globally applied value:

----
$ oc -n openshift-ingress-operator annotate ingresscontrollers/default ingress.operator.openshift.io/hard-stop-after=30m
ingresscontroller.operator.openshift.io/default annotated

$ oc -n openshift-ingress get pods router-default-55897776c4-4m4tm -o yaml | grep -i HARD -A1 | grep -iv "\{"
- name: ROUTER_HARD_STOP_AFTER
  value: 30m

$ oc -n openshift-ingress get pods router-default-776f5769f5-5ndbl -o yaml | grep -i HARD -A1 | grep -iv "\{"
- name: ROUTER_HARD_STOP_AFTER
  value: 1h
----

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633