Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2037447

Summary: Ingress Operator is not closing TCP connections.
Product: OpenShift Container Platform Reporter: Akash Semil <asemil>
Component: NetworkingAssignee: Andrew McDermott <amcdermo>
Networking sub component: router QA Contact: Shudi Li <shudili>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: amcdermo, aos-bugs, bmehra, bpickard, hongli, mmasters, pwaghmod
Version: 4.7   
Target Milestone: ---   
Target Release: 4.11.0   
Hardware: x86_64   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: The Ingress Operator performs health checks against the ingress canary route. Once a health check is done, the Ingress Operator does not close the TCP connection to the load balancer (LB) because keepalives are enabled on the connection. When the next health check is performed, a new connection is established to the LB instead of reusing the existing one. Consequence: Connections build up on the LB, over time exhausting the number of connections it can handle. Fix: Disable keepalives when connecting to the canary route. Result: A new connection is made and closed each time the canary probe is run. With keepalives disabled there is no longer an accumulation of ESTABLISHED connections.
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-08-10 10:41:16 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2063283    

Description Akash Semil 2022-01-05 16:43:41 UTC
Description of problem:

1. The Ingress Operator performs health checks against the ingress canary route.
2. Once a health check is done, the Ingress Operator doesn't close the TCP connection to the LB.
3. When the next health check is performed, a new connection is established to the LB instead of reusing the existing one.
4. This causes connections to build up on the LB.
5. Over time this exhausts the number of connections on the LB (see the sketch below).
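
For illustration, a minimal Go sketch of the kind of change the fix describes (this is not the operator's actual code; the probeRoute helper and the route URL are hypothetical placeholders): an HTTP client whose transport has keep-alives disabled, so each canary probe opens its own connection and closes it once the response has been read, instead of leaving an ESTABLISHED connection on the LB.

    package main

    import (
        "fmt"
        "io"
        "net/http"
        "time"
    )

    // probeRoute performs a single canary-style probe. Hypothetical helper,
    // shown only to illustrate the keepalive behavior described in this bug.
    func probeRoute(url string) error {
        client := &http.Client{
            Timeout: 10 * time.Second,
            Transport: &http.Transport{
                // With keep-alives enabled (the default), the connection to the
                // LB stays ESTABLISHED after the probe; the next probe opens
                // another connection, and they accumulate on the LB over time.
                // Disabling keep-alives makes the client close the connection
                // as soon as the response has been consumed.
                DisableKeepAlives: true,
            },
        }

        resp, err := client.Get(url)
        if err != nil {
            return err
        }
        defer resp.Body.Close()

        // Drain the body so the connection can be torn down cleanly.
        if _, err := io.Copy(io.Discard, resp.Body); err != nil {
            return err
        }

        if resp.StatusCode != http.StatusOK {
            return fmt.Errorf("canary probe returned %d", resp.StatusCode)
        }
        return nil
    }

    func main() {
        // Hypothetical canary route host, for illustration only.
        if err := probeRoute("https://canary-openshift-ingress-canary.apps.example.com"); err != nil {
            fmt.Println("probe failed:", err)
        }
    }

With DisableKeepAlives set, net/http uses the connection for a single request and closes it afterwards, which matches the behavior verified in comment 9 (no more keep-alive packets, no accumulation of ESTABLISHED connections).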


How reproducible:

Yes, it is reproducible in any OpenShift 4.7+ cluster. Capture a tcpdump at the pod level of the Ingress Operator.

Steps to Debug:

1. Find out on which node the Ingress Operator pod is running.

$ oc get pods -n openshift-ingress-operator -o wide

2. Debug into the node on which the Ingress Operator pod is running and collect a tcpdump.

$ oc debug node/<Node-Name>

3. Capture the tcpdump by following this article:

How to use tcpdump inside OpenShift v4 Pod [ https://access.redhat.com/solutions/4569211 ]


Actual results:

1. The TCP connection is kept alive after the health check completes.

Expected results:

1. The TCP connection should be closed once the health check is performed.

Comment 9 Shudi Li 2022-02-23 08:24:03 UTC
Verified with 4.11.0-0.nightly-2022-02-18-121223; TCP keep-alive packets are no longer seen.

1.
% oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-02-18-121223   True        False         26m     Cluster version is 4.11.0-0.nightly-2022-02-18-121223
% 

2.
% oc -n openshift-ingress-operator get pods -o wide
NAME                               READY   STATUS    RESTARTS      AGE   IP            NODE                                                        NOMINATED NODE   READINESS GATES
ingress-operator-6b97f96dd-sq2fw   2/2     Running   2 (39m ago)   50m   10.130.0.22   shudi-411-gcpc3001-m54dd-master-0.c.openshift-qe.internal   <none>           <none>
% 

3.
% oc -n openshift-ingress-canary get route
NAME     HOST/PORT                                                                                 PATH   SERVICES         PORT   TERMINATION     WILDCARD
canary   canary-openshift-ingress-canary.apps.shudi-411-gcpc3001.qe.gcp.devcluster.openshift.com          ingress-canary   8080   edge/Redirect   None
% dig canary-openshift-ingress-canary.apps.shudi-411-gcpc3001.qe.gcp.devcluster.openshift.com

; <<>> DiG 9.10.6 <<>> canary-openshift-ingress-canary.apps.shudi-411-gcpc3001.qe.gcp.devcluster.openshift.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 38247
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1220
;; QUESTION SECTION:
;canary-openshift-ingress-canary.apps.shudi-411-gcpc3001.qe.gcp.devcluster.openshift.com. IN A

;; ANSWER SECTION:
canary-openshift-ingress-canary.apps.shudi-411-gcpc3001.qe.gcp.devcluster.openshift.com. 30 IN A 34.136.11.179

;; Query time: 79 msec
;; SERVER: 10.72.17.5#53(10.72.17.5)
;; WHEN: Wed Feb 23 14:51:08 CST 2022
;; MSG SIZE  rcvd: 132

%

4.
% oc debug node/shudi-411-gcpc3001-m54dd-master-0.c.openshift-qe.internal
Starting pod/shudi-411-gcpc3001-m54dd-master-0copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.0.4
If you don't see a command prompt, try pressing enter.
sh-4.4# NAME=ingress-operator-6b97f96dd-sq2fw
sh-4.4# NAMESPACE=openshift-ingress-operator
sh-4.4# pod_id=$(chroot /host crictl pods --namespace ${NAMESPACE} --name ${NAME} -q)
sh-4.4# ns_path="/host/$(chroot /host bash -c "crictl inspectp $pod_id | jq '.info.runtimeSpec.linux.namespaces[]|select(.type==\"network\").path' -r")"
sh-4.4# nsenter_parameters="--net=${ns_path}"
sh-4.4# nsenter $nsenter_parameters -- tcpdump -i any host 34.136.11.179 -s 0 -w 411cap1.pcap

5. Copy the captured packet file to a local machine and inspect it; there are no TCP keep-alive packets.

Comment 18 Miciah Dashiel Butler Masters 2022-06-17 13:22:39 UTC
I copied the doc text from bug 2063283.

Comment 19 errata-xmlrpc 2022-08-10 10:41:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069