Bug 2037447 - Ingress Operator is not closing TCP connections.
Summary: Ingress Operator is not closing TCP connections.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: x86_64
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Andrew McDermott
QA Contact: Shudi Li
URL:
Whiteboard:
Depends On:
Blocks: 2063283
 
Reported: 2022-01-05 16:43 UTC by Akash Semil
Modified: 2022-10-12 07:03 UTC
CC: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The Ingress Operator performs health checks against the ingress canary route. Once a health check is done, the Ingress Operator does not close the TCP connection to the load balancer (LB) because keepalives are enabled on the connection. When the next health check is performed, a new connection is established to the LB instead of reusing the existing one.
Consequence: Connections build up on the LB, over time exhausting the number of connections available on the LB.
Fix: Keepalives are disabled when connecting to the canary route.
Result: A new connection is made and closed each time the canary probe is run. With keepalives disabled, there is no longer an accumulation of ESTABLISHED connections.
Clone Of:
Environment:
Last Closed: 2022-08-10 10:41:16 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift/cluster-ingress-operator pull 701 (open): BUG 2037447: Disable keepalive for canary probe (last updated 2022-02-08 19:21:06 UTC)
Red Hat Product Errata RHSA-2022:5069 (last updated 2022-08-10 10:41:49 UTC)

Description Akash Semil 2022-01-05 16:43:41 UTC
Description of problem:

1. The Ingress Operator performs health checks against the ingress canary route.
2. Once a health check is done, the Ingress Operator does not close the TCP connection to the load balancer (LB).
3. When the next health check is performed, a new connection is established to the LB instead of reusing the existing one.
4. This causes connections to build up on the LB.
5. Over time this exhausts the number of connections on the LB.
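
To make this failure mode concrete, here is a minimal, hypothetical Go sketch of the probe pattern described above (the package, function, and parameter names are illustrative, not taken from the operator's source): a fresh HTTP client with keepalives left enabled is built for every probe, so each probe's connection to the LB stays ESTABLISHED instead of being closed or reused.

package canary

import (
	"fmt"
	"net/http"
	"time"
)

// probeCanaryOnce is a hypothetical stand-in for a single canary health
// check. The zero-value http.Transport keeps TCP keepalives enabled and
// pools the connection as idle; since a new client is built per probe and
// CloseIdleConnections is never called, the connection to the LB remains
// ESTABLISHED after this function returns.
func probeCanaryOnce(routeURL string) error {
	client := &http.Client{
		Transport: &http.Transport{}, // keepalives enabled by default
		Timeout:   10 * time.Second,
	}
	resp, err := client.Get(routeURL)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("canary probe returned status %d", resp.StatusCode)
	}
	// The next probe calls this function again, builds a new client, and
	// opens a new connection, so connections accumulate on the LB.
	return nil
}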


How reproducible:

Yes. It is reproducible in any OpenShift 4.7+ cluster; capture a TCP dump at the pod level of the Ingress Operator.

Steps to Debug:

1. Find out on which node the Ingress Operator pod is running.

$ oc get pods -n openshift-ingress-operator -o wide

2. Debug into the node on which the Ingress Operator pod is running and collect a tcpdump.

$ oc debug node/<Node-Name>

3. Capture the tcpdump as described in the following article:

How to use tcpdump inside OpenShift v4 Pod [ https://access.redhat.com/solutions/4569211 ]


Actual results:

1. The TCP connection to the LB is kept alive after the health check.

Expected results:

1. The TCP connection should be closed once the health check is complete.
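
The linked PR is titled "Disable keepalive for canary probe"; under that assumption, the sketch below shows how keepalives can be disabled on a Go HTTP client so the expected result above is achieved (the constructor name is hypothetical, not the operator's actual code). With DisableKeepAlives set, the client sends Connection: close and the TCP connection is torn down as soon as each probe completes.

package canary

import (
	"net/http"
	"time"
)

// newCanaryProbeClient returns an HTTP client whose transport has
// keepalives disabled: every probe opens a connection, performs the
// request, and closes the connection once the response body is closed.
func newCanaryProbeClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			// Send "Connection: close" on each request and do not pool
			// idle connections afterwards.
			DisableKeepAlives: true,
		},
		Timeout: 10 * time.Second,
	}
}

The trade-off is a fresh TCP handshake per probe, but, as the doc text notes, the connection is made and closed each run, so ESTABLISHED connections no longer accumulate on the LB.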

Comment 9 Shudi Li 2022-02-23 08:24:03 UTC
Verified with 4.11.0-0.nightly-2022-02-18-121223; TCP keep-alive packets can no longer be seen.

1.
% oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-02-18-121223   True        False         26m     Cluster version is 4.11.0-0.nightly-2022-02-18-121223
% 

2.
% oc -n openshift-ingress-operator get pods -o wide
NAME                               READY   STATUS    RESTARTS      AGE   IP            NODE                                                        NOMINATED NODE   READINESS GATES
ingress-operator-6b97f96dd-sq2fw   2/2     Running   2 (39m ago)   50m   10.130.0.22   shudi-411-gcpc3001-m54dd-master-0.c.openshift-qe.internal   <none>           <none>
% 

3.
% oc -n openshift-ingress-canary get route
NAME     HOST/PORT                                                                                 PATH   SERVICES         PORT   TERMINATION     WILDCARD
canary   canary-openshift-ingress-canary.apps.shudi-411-gcpc3001.qe.gcp.devcluster.openshift.com          ingress-canary   8080   edge/Redirect   None
% dig canary-openshift-ingress-canary.apps.shudi-411-gcpc3001.qe.gcp.devcluster.openshift.com

; <<>> DiG 9.10.6 <<>> canary-openshift-ingress-canary.apps.shudi-411-gcpc3001.qe.gcp.devcluster.openshift.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 38247
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1220
;; QUESTION SECTION:
;canary-openshift-ingress-canary.apps.shudi-411-gcpc3001.qe.gcp.devcluster.openshift.com. IN A

;; ANSWER SECTION:
canary-openshift-ingress-canary.apps.shudi-411-gcpc3001.qe.gcp.devcluster.openshift.com. 30 IN A 34.136.11.179

;; Query time: 79 msec
;; SERVER: 10.72.17.5#53(10.72.17.5)
;; WHEN: Wed Feb 23 14:51:08 CST 2022
;; MSG SIZE  rcvd: 132

%

4.
% oc debug node/shudi-411-gcpc3001-m54dd-master-0.c.openshift-qe.internal
Starting pod/shudi-411-gcpc3001-m54dd-master-0copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.0.4
If you don't see a command prompt, try pressing enter.
sh-4.4# NAME=ingress-operator-6b97f96dd-sq2fw
sh-4.4# NAMESPACE=openshift-ingress-operator
sh-4.4# pod_id=$(chroot /host crictl pods --namespace ${NAMESPACE} --name ${NAME} -q)
sh-4.4# ns_path="/host/$(chroot /host bash -c "crictl inspectp $pod_id | jq '.info.runtimeSpec.linux.namespaces[]|select(.type==\"network\").path' -r")"
sh-4.4# nsenter_parameters="--net=${ns_path}"
sh-4.4# nsenter $nsenter_parameters -- tcpdump -i any host 34.136.11.179 -s 0 -w 411cap1.pcap

5. Copy the captured packets file to the local machine and inspect it; there are no TCP keepalive packets.

Comment 18 Miciah Dashiel Butler Masters 2022-06-17 13:22:39 UTC
I copied the doc text from bug 2063283.

Comment 19 errata-xmlrpc 2022-08-10 10:41:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

