Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2037447

Summary: Ingress Operator is not closing TCP connections.
Product: OpenShift Container Platform Reporter: Akash Semil <asemil>
Component: NetworkingAssignee: Andrew McDermott <amcdermo>
Networking sub component: router QA Contact: Shudi Li <shudili>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: amcdermo, aos-bugs, bmehra, bpickard, hongli, mmasters, pwaghmod
Version: 4.7   
Target Milestone: ---   
Target Release: 4.11.0   
Hardware: x86_64   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: The Ingress Operator performs health checks against the ingress canary route. Once a health check is done, the Ingress Operator does not close the TCP connection to the load balancer (LB) because keepalives are enabled on the connection. When the next health check is performed, a new connection is established to the LB instead of reusing the existing one. Consequence: Connections build up on the LB, over time exhausting the number of connections it can handle. Fix: Disable keepalives when connecting to the canary route. Result: A new connection is made and closed each time the canary probe is run. With keepalives disabled there is no longer an accumulation of ESTABLISHED connections.
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-08-10 10:41:16 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2063283    

Description Akash Semil 2022-01-05 16:43:41 UTC
Description of problem:

1. The Ingress Operator performs health checks against the ingress canary route.
2. Once a health check is done, the Ingress Operator doesn't close the TCP connection to the LB.
3. When the next health check is performed, a new connection is established to the LB instead of reusing the existing one.
4. This causes connections to build up on the LB.
5. Over time this exhausts the number of connections on the LB (see the sketch below).
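
For illustration, a minimal Go sketch of the kind of change the fix describes (this is not the operator's actual code; the probeRoute helper and the route URL are hypothetical placeholders): an HTTP client whose transport has keep-alives disabled, so each canary probe opens its own connection and closes it once the response has been read, instead of leaving an ESTABLISHED connection on the LB.

    package main

    import (
        "fmt"
        "io"
        "net/http"
        "time"
    )

    // probeRoute performs a single canary-style probe. Hypothetical helper,
    // shown only to illustrate the keepalive behavior described in this bug.
    func probeRoute(url string) error {
        client := &http.Client{
            Timeout: 10 * time.Second,
            Transport: &http.Transport{
                // With keep-alives enabled (the default), the connection to the
                // LB stays ESTABLISHED after the probe; the next probe opens
                // another connection, and they accumulate on the LB over time.
                // Disabling keep-alives makes the client close the connection
                // as soon as the response has been consumed.
                DisableKeepAlives: true,
            },
        }

        resp, err := client.Get(url)
        if err != nil {
            return err
        }
        defer resp.Body.Close()

        // Drain the body so the connection can be torn down cleanly.
        if _, err := io.Copy(io.Discard, resp.Body); err != nil {
            return err
        }

        if resp.StatusCode != http.StatusOK {
            return fmt.Errorf("canary probe returned %d", resp.StatusCode)
        }
        return nil
    }

    func main() {
        // Hypothetical canary route host, for illustration only.
        if err := probeRoute("https://canary-openshift-ingress-canary.apps.example.com"); err != nil {
            fmt.Println("probe failed:", err)
        }
    }

With DisableKeepAlives set, net/http uses the connection for a single request and closes it afterwards, which matches the behavior verified in comment 9 (no more keep-alive packets, no accumulation of ESTABLISHED connections).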


How reproducible:

Yes, it is reproducible in any OpenShift 4.7+ cluster. Capture a tcpdump at the pod level of the Ingress Operator.

Steps to Debug:

1. Find out on which node the Ingress Operator pod is running.

$ oc get pods -n openshift-ingress-operator -o wide

2. Debug into the node on which the Ingress Operator pod is running and collect a tcpdump.

$ oc debug node/<Node-Name>

3. Capture the tcpdump by following this article:

How to use tcpdump inside OpenShift v4 Pod [ https://access.redhat.com/solutions/4569211 ]


Actual results:

1. The TCP connection is kept alive after the health check completes.

Expected results:

1. The TCP connection should be closed once the health check is performed.

Comment 9 Shudi Li 2022-02-23 08:24:03 UTC
Verified with 4.11.0-0.nightly-2022-02-18-121223; TCP keep-alive packets are no longer seen.

1.
% oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-02-18-121223   True        False         26m     Cluster version is 4.11.0-0.nightly-2022-02-18-121223
% 

2.
% oc -n openshift-ingress-operator get pods -o wide
NAME                               READY   STATUS    RESTARTS      AGE   IP            NODE                                                        NOMINATED NODE   READINESS GATES
ingress-operator-6b97f96dd-sq2fw   2/2     Running   2 (39m ago)   50m   10.130.0.22   shudi-411-gcpc3001-m54dd-master-0.c.openshift-qe.internal   <none>           <none>
% 

3.
% oc -n openshift-ingress-canary get route
NAME     HOST/PORT                                                                                 PATH   SERVICES         PORT   TERMINATION     WILDCARD
canary   canary-openshift-ingress-canary.apps.shudi-411-gcpc3001.qe.gcp.devcluster.openshift.com          ingress-canary   8080   edge/Redirect   None
% dig canary-openshift-ingress-canary.apps.shudi-411-gcpc3001.qe.gcp.devcluster.openshift.com

; <<>> DiG 9.10.6 <<>> canary-openshift-ingress-canary.apps.shudi-411-gcpc3001.qe.gcp.devcluster.openshift.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 38247
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1220
;; QUESTION SECTION:
;canary-openshift-ingress-canary.apps.shudi-411-gcpc3001.qe.gcp.devcluster.openshift.com. IN A

;; ANSWER SECTION:
canary-openshift-ingress-canary.apps.shudi-411-gcpc3001.qe.gcp.devcluster.openshift.com. 30 IN A 34.136.11.179

;; Query time: 79 msec
;; SERVER: 10.72.17.5#53(10.72.17.5)
;; WHEN: Wed Feb 23 14:51:08 CST 2022
;; MSG SIZE  rcvd: 132

%

4.
% oc debug node/shudi-411-gcpc3001-m54dd-master-0.c.openshift-qe.internal
Starting pod/shudi-411-gcpc3001-m54dd-master-0copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.0.4
If you don't see a command prompt, try pressing enter.
sh-4.4# NAME=ingress-operator-6b97f96dd-sq2fw
sh-4.4# NAMESPACE=openshift-ingress-operator
sh-4.4# pod_id=$(chroot /host crictl pods --namespace ${NAMESPACE} --name ${NAME} -q)
sh-4.4# ns_path="/host/$(chroot /host bash -c "crictl inspectp $pod_id | jq '.info.runtimeSpec.linux.namespaces[]|select(.type==\"network\").path' -r")"
sh-4.4# nsenter_parameters="--net=${ns_path}"
sh-4.4# nsenter $nsenter_parameters -- tcpdump -i any host 34.136.11.179 -s 0 -w 411cap1.pcap

5. Copy the captured packet file to a local machine and inspect it; there are no TCP keep-alive packets.

Comment 18 Miciah Dashiel Butler Masters 2022-06-17 13:22:39 UTC
I copied the doc text from bug 2063283.

Comment 19 errata-xmlrpc 2022-08-10 10:41:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069