Description of problem:
1. The Ingress Operator performs health checks against the ingress canary route.
2. Once a health check is done, the Ingress Operator doesn't close the TCP connection to the LB.
3. For the next health check, a new connection is established to the LB instead of reusing the existing connection.
4. This causes connections to build up on the LB.
5. Over time, this exhausts the number of connections available on the LB.

How reproducible:
Yes. It is reproducible in any OpenShift 4.7+ cluster. Capture a TCP dump at the pod level of the Ingress Operator.

Steps to Debug:
1. Find out which node the Ingress Operator pod is running on.
   $ oc get pods -n openshift-ingress-operator -o wide
2. Start a debug session on the node where the Ingress Operator pod is running and collect a tcpdump.
   $ oc debug node/<Node-Name>
3. Capture the TCP dump as described in the following article:
   How to use tcpdump inside OpenShift v4 Pod [ https://access.redhat.com/solutions/4569211 ]

Actual results:
1. The TCP connection is kept alive.

Expected results:
1. The TCP connection should be closed once the health check is performed.
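To make the reported behaviour concrete, here is a minimal Python sketch. It is not the Ingress Operator's code, which this report does not show. It assumes a hypothetical local HTTP server standing in for the LB and the canary route, and compares a probe loop that leaves every connection open with one that closes the connection after each check. All names (`OkHandler`, `start_server`, `probe`, `run_probes`) are made up for illustration.

```python
import http.client
import http.server
import threading


class OkHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical stand-in for the canary endpoint behind the LB."""

    protocol_version = "HTTP/1.1"  # lets the server keep connections open

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet


def start_server():
    """Start the stand-in server on a free localhost port."""
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), OkHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def probe(host, port, close_after):
    """Run one health check on a brand-new connection."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    conn.request("GET", "/")
    resp = conn.getresponse()
    resp.read()
    if close_after:
        conn.close()  # the behaviour the report expects
    return resp.status, conn


def run_probes(host, port, count, close_after):
    """Return how many client sockets are still open after `count` probes."""
    conns = []
    for _ in range(count):
        status, conn = probe(host, port, close_after)
        if status != 200:
            raise RuntimeError(f"unexpected status {status}")
        conns.append(conn)
    still_open = sum(1 for c in conns if c.sock is not None)
    for c in conns:
        c.close()  # clean up so the demo itself does not leak
    return still_open
```

With `close_after=False`, every probe leaves one more open socket behind, which is the accumulation pattern described in points 2 to 4 above. With `close_after=True`, no sockets remain open between checks.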
Verified with 4.11.0-0.nightly-2022-02-18-121223; TCP keep-alive packets are no longer seen.

1.
% oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-02-18-121223   True        False         26m     Cluster version is 4.11.0-0.nightly-2022-02-18-121223

2.
% oc -n openshift-ingress-operator get pods -o wide
NAME                               READY   STATUS    RESTARTS      AGE   IP            NODE                                                        NOMINATED NODE   READINESS GATES
ingress-operator-6b97f96dd-sq2fw   2/2     Running   2 (39m ago)   50m   10.130.0.22   shudi-411-gcpc3001-m54dd-master-0.c.openshift-qe.internal   <none>           <none>

3.
% oc -n openshift-ingress-canary get route
NAME     HOST/PORT                                                                                 PATH   SERVICES         PORT   TERMINATION     WILDCARD
canary   canary-openshift-ingress-canary.apps.shudi-411-gcpc3001.qe.gcp.devcluster.openshift.com          ingress-canary   8080   edge/Redirect   None

% dig canary-openshift-ingress-canary.apps.shudi-411-gcpc3001.qe.gcp.devcluster.openshift.com

; <<>> DiG 9.10.6 <<>> canary-openshift-ingress-canary.apps.shudi-411-gcpc3001.qe.gcp.devcluster.openshift.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 38247
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1220
;; QUESTION SECTION:
;canary-openshift-ingress-canary.apps.shudi-411-gcpc3001.qe.gcp.devcluster.openshift.com. IN A

;; ANSWER SECTION:
canary-openshift-ingress-canary.apps.shudi-411-gcpc3001.qe.gcp.devcluster.openshift.com. 30 IN A 34.136.11.179

;; Query time: 79 msec
;; SERVER: 10.72.17.5#53(10.72.17.5)
;; WHEN: Wed Feb 23 14:51:08 CST 2022
;; MSG SIZE rcvd: 132

4.
% oc debug node/shudi-411-gcpc3001-m54dd-master-0.c.openshift-qe.internal
Starting pod/shudi-411-gcpc3001-m54dd-master-0copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.0.4
If you don't see a command prompt, try pressing enter.
sh-4.4# NAME=ingress-operator-6b97f96dd-sq2fw
sh-4.4# NAMESPACE=openshift-ingress-operator
sh-4.4# pod_id=$(chroot /host crictl pods --namespace ${NAMESPACE} --name ${NAME} -q)
sh-4.4# ns_path="/host/$(chroot /host bash -c "crictl inspectp $pod_id | jq '.info.runtimeSpec.linux.namespaces[]|select(.type==\"network\").path' -r")"
sh-4.4# nsenter_parameters="--net=${ns_path}"
sh-4.4# nsenter $nsenter_parameters -- tcpdump -i any host 34.136.11.179 -s 0 -w 411cap1.pcap

5. Copy the captured packet file to the local machine and inspect it; it contains no TCP keep-alive packets.
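For the final check of the capture, it can help to spell out what counts as a keep-alive packet. The Python sketch below is a rough illustration only and assumes the segments have already been decoded from the pcap into simple records. It applies the commonly used heuristic (the one Wireshark documents for its TCP keep-alive analysis flag): payload of zero or one byte, sequence number one less than the next expected one, and no SYN, FIN, or RST set. The `Segment` type, its `flow` label, and `find_keepalives` are hypothetical names, and sequence-number wraparound is ignored.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    flow: str          # hypothetical label for one direction, e.g. "pod->lb"
    seq: int           # TCP sequence number
    payload_len: int   # TCP payload size in bytes
    syn: bool = False
    fin: bool = False
    rst: bool = False


def find_keepalives(segments):
    """Return the indexes of segments that look like keep-alive probes."""
    next_seq = {}  # flow -> next expected sequence number
    hits = []
    for index, seg in enumerate(segments):
        expected = next_seq.get(seg.flow)
        is_probe = (
            expected is not None
            and seg.payload_len in (0, 1)
            and seg.seq == expected - 1
            and not (seg.syn or seg.fin or seg.rst)
        )
        if is_probe:
            hits.append(index)
            continue
        # SYN and FIN each consume one sequence number.
        end = seg.seq + seg.payload_len + (1 if seg.syn or seg.fin else 0)
        if expected is None or end > expected:
            next_seq[seg.flow] = end
    return hits
```

An empty result for every flow between the operator pod and 34.136.11.179 would correspond to the outcome reported in step 5.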
I copied the doc text from bug 2063283.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069