Bug 1970966 - Please port fix for bugzilla 1902091 to 4.6 and 4.5
Summary: Please port fix for bugzilla 1902091 to 4.6 and 4.5
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Image Registry
Version: 4.6
Hardware: Unspecified
OS: Unspecified
medium
urgent
Target Milestone: ---
: 4.5.z
Assignee: Ricardo Maraschini
QA Contact: Wenjing Zheng
URL:
Whiteboard:
Depends On: 1959563
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-06-11 14:53 UTC by OpenShift BugZilla Robot
Modified: 2021-07-02 00:40 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-02 00:40:48 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-image-registry-operator pull 692 0 None open [release-4.5] Bug 1970966: Using http.DefaultTransport timeouts 2021-06-15 18:41:14 UTC
Red Hat Product Errata RHBA-2021:2430 0 None None None 2021-07-02 00:40:50 UTC

Description OpenShift BugZilla Robot 2021-06-11 14:53:25 UTC
+++ This bug was initially created as a clone of Bug #1959563 +++

Description of problem:

Can the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1902091 be ported back to 4.5 and 4.6 ? 4.6 is our (IBM Cloud) latest release at this time.

We have an IBM Cloud customer that appears to be hitting that bug.

The image-registry operator goes to `upgrading` status with
$ oc describe co image-registry
Message: Progressing: Unable to apply resources: unable to sync storage configuration: RequestError: send request failed
Progressing: caused by: Head "https://s3.xxx/xxx": dial tcp x.x.x.x:443: connect: cannot assign requested address

If they delete the cluster-image-registry-operator pod, the status immediately changes to available. Eventually the operator goes back to `upgrading`.


Version-Release number of selected component (if applicable):
4.6


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

--- Additional comment from brian_mckeown.com on 2021-05-24 14:00:53 UTC ---

hello team - has anyone had a chance to review this request - please let us know your thoughts - thanks a lot.

--- Additional comment from obulatov on 2021-05-31 14:40:52 UTC ---

4.5 is going to reach EOL soon, so it's unlikely that we'll be able to backport it to 4.5.

--- Additional comment from lisowski.com on 2021-06-09 19:01:22 UTC ---

Created attachment 1789644 [details]
Shows active connections to COS after 13 days

Shows active connections to COS over time (this is after 13 days of image operator running)

--- Additional comment from lisowski.com on 2021-06-09 19:07:11 UTC ---

Hi Oleg:

Just to confirm as well: It looks like the leak to COS happens independent of errors as well (happens during normal operation). I do think we need to backport this to 4.5 or else the leak can cause other downstream issues due to the ports these connections are consuming (port exhaustion, etc).

We want to confirm as well you have validated that even during normal operations the leak is fixed.

Here is an example showing during normal operations the connections growing over time (goes up 130 connections after 20 minutes. Continues to grow from there. 







[root@kube-c2a0njgd06pi318lnrl0-cp4appdso16-default-00000164 /]# find /proc/ 2>/dev/null | grep tcp | grep -v task | grep -v sys/net | xargs grep -v rem_address 2>/dev/null | awk '{x=strtonum("0x"substr($3,index($3,":")-2,2)); y=strtonum("0x"substr($4,index($4,":")-2,2)); for (i=5; i>0; i-=2) x = x"."strtonum("0x"substr($3,i,2)); for (i=5; i>0; i-=2) y = y"."strtonum("0x"substr($4,i,2))}{printf ("%s\t:%s\t ----> \t %s\t:%s\t%s\n",x,strtonum("0x"substr($3,index($3,":")+1,4)),y,strtonum("0x"substr($4,index($4,":")+1,4)),$1)}' | sort | uniq --check-chars=25 | grep "161.26.5.14" | grep "172.17.49.204" | wc -l
24
[root@kube-c2a0njgd06pi318lnrl0-cp4appdso16-default-00000164 /]# date; find /proc/ 2>/dev/null | grep tcp | grep -v task | grep -v sys/net | xargs grep -v rem_address 2>/dev/null | awk '{x=strtonum("0x"substr($3,index($3,":")-2,2)); y=strtonum("0x"substr($4,index($4,":")-2,2)); for (i=5; i>0; i-=2) x = x"."strtonum("0x"substr($3,i,2)); for (i=5; i>0; i-=2) y = y"."strtonum("0x"substr($4,i,2))}{printf ("%s\t:%s\t ----> \t %s\t:%s\t%s\n",x,strtonum("0x"substr($3,index($3,":")+1,4)),y,strtonum("0x"substr($4,index($4,":")+1,4)),$1)}' | sort | uniq --check-chars=25 | grep "161.26.5.14" | grep "172.17.49.204" | wc -l
Wed Jun  9 13:45:17 CDT 2021
27
[root@kube-c2a0njgd06pi318lnrl0-cp4appdso16-default-00000164 /]# date; find /proc/ 2>/dev/null | grep tcp | grep -v task | grep -v sys/net | xargs grep -v rem_address 2>/dev/null | awk '{x=strtonum("0x"substr($3,index($3,":")-2,2)); y=strtonum("0x"substr($4,index($4,":")-2,2)); for (i=5; i>0; i-=2) x = x"."strtonum("0x"substr($3,i,2)); for (i=5; i>0; i-=2) y = y"."strtonum("0x"substr($4,i,2))}{printf ("%s\t:%s\t ----> \t %s\t:%s\t%s\n",x,strtonum("0x"substr($3,index($3,":")+1,4)),y,strtonum("0x"substr($4,index($4,":")+1,4)),$1)}' | sort | uniq --check-chars=25 | grep "161.26.5.14" | grep "172.17.49.204" | wc -l
Wed Jun  9 13:45:29 CDT 2021
28
[root@kube-c2a0njgd06pi318lnrl0-cp4appdso16-default-00000164 /]# date; find /proc/ 2>/dev/null | grep tcp | grep -v task | grep -v sys/net | xargs grep -v rem_address 2>/dev/null | awk '{x=strtonum("0x"substr($3,index($3,":")-2,2)); y=strtonum("0x"substr($4,index($4,":")-2,2)); for (i=5; i>0; i-=2) x = x"."strtonum("0x"substr($3,i,2)); for (i=5; i>0; i-=2) y = y"."strtonum("0x"substr($4,i,2))}{printf ("%s\t:%s\t ----> \t %s\t:%s\t%s\n",x,strtonum("0x"substr($3,index($3,":")+1,4)),y,strtonum("0x"substr($4,index($4,":")+1,4)),$1)}' | sort | uniq --check-chars=25 | grep "161.26.5.14" | grep "172.17.49.204" | wc -l
Wed Jun  9 13:46:00 CDT 2021
32
[root@kube-c2a0njgd06pi318lnrl0-cp4appdso16-default-00000164 /]# date; find /proc/ 2>/dev/null | grep tcp | grep -v task | grep -v sys/net | xargs grep -v rem_address 2>/dev/null | awk '{x=strtonum("0x"substr($3,index($3,":")-2,2)); y=strtonum("0x"substr($4,index($4,":")-2,2)); for (i=5; i>0; i-=2) x = x"."strtonum("0x"substr($3,i,2)); for (i=5; i>0; i-=2) y = y"."strtonum("0x"substr($4,i,2))}{printf ("%s\t:%s\t ----> \t %s\t:%s\t%s\n",x,strtonum("0x"substr($3,index($3,":")+1,4)),y,strtonum("0x"substr($4,index($4,":")+1,4)),$1)}' | sort | uniq --check-chars=25 | grep "161.26.5.14" | grep "172.17.49.204" | wc -l
Wed Jun  9 13:46:09 CDT 2021
32
[root@kube-c2a0njgd06pi318lnrl0-cp4appdso16-default-00000164 /]# date; find /proc/ 2>/dev/null | grep tcp | grep -v task | grep -v sys/net | xargs grep -v rem_address 2>/dev/null | awk '{x=strtonum("0x"substr($3,index($3,":")-2,2)); y=strtonum("0x"substr($4,index($4,":")-2,2)); for (i=5; i>0; i-=2) x = x"."strtonum("0x"substr($3,i,2)); for (i=5; i>0; i-=2) y = y"."strtonum("0x"substr($4,i,2))}{printf ("%s\t:%s\t ----> \t %s\t:%s\t%s\n",x,strtonum("0x"substr($3,index($3,":")+1,4)),y,strtonum("0x"substr($4,index($4,":")+1,4)),$1)}' | sort | uniq --check-chars=25 | grep "161.26.5.14" | grep "172.17.49.204" | wc -l
Wed Jun  9 13:46:15 CDT 2021
33
[root@kube-c2a0njgd06pi318lnrl0-cp4appdso16-default-00000164 /]# date; find /proc/ 2>/dev/null | grep tcp | grep -v task | grep -v sys/net | xargs grep -v rem_address 2>/dev/null | awk '{x=strtonum("0x"substr($3,index($3,":")-2,2)); y=strtonum("0x"substr($4,index($4,":")-2,2)); for (i=5; i>0; i-=2) x = x"."strtonum("0x"substr($3,i,2)); for (i=5; i>0; i-=2) y = y"."strtonum("0x"substr($4,i,2))}{printf ("%s\t:%s\t ----> \t %s\t:%s\t%s\n",x,strtonum("0x"substr($3,index($3,":")+1,4)),y,strtonum("0x"substr($4,index($4,":")+1,4)),$1)}' | sort | uniq --check-chars=25 | grep "161.26.5.14" | grep "172.17.49.204" | wc -l
Wed Jun  9 13:46:31 CDT 2021
35
[root@kube-c2a0njgd06pi318lnrl0-cp4appdso16-default-00000164 /]# date; find /proc/ 2>/dev/null | grep tcp | grep -v task | grep -v sys/net | xargs grep -v rem_address 2>/dev/null | awk '{x=strtonum("0x"substr($3,index($3,":")-2,2)); y=strtonum("0x"substr($4,index($4,":")-2,2)); for (i=5; i>0; i-=2) x = x"."strtonum("0x"substr($3,i,2)); for (i=5; i>0; i-=2) y = y"."strtonum("0x"substr($4,i,2))}{printf ("%s\t:%s\t ----> \t %s\t:%s\t%s\n",x,strtonum("0x"substr($3,index($3,":")+1,4)),y,strtonum("0x"substr($4,index($4,":")+1,4)),$1)}' | sort | uniq --check-chars=25 | grep "161.26.5.14" | grep "172.17.49.204" | wc -l
Wed Jun  9 14:04:58 CDT 2021
154
```

--- Additional comment from lisowski.com on 2021-06-09 19:08:58 UTC ---

Created attachment 1789645 [details]
logs of cluster image registry operator

--- Additional comment from lisowski.com on 2021-06-09 19:09:36 UTC ---

The logs of the image registry operator during this time show no errors reported as well for documentation purposes.

--- Additional comment from lisowski.com on 2021-06-09 19:11:11 UTC ---

Openshift verison these logs are from is `4.6.28`

--- Additional comment from lisowski.com on 2021-06-09 19:26:00 UTC ---

Checking in a 4.5 cluster the logs line up with after 7 hours 174 connections opened up:
```
172.17.56.112	:33092	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:33094	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:33636	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:33646	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:33652	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:34294	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:34298	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:34932	 ----> 	 172.21.0.1	:443	/proc/19888/net/tcp:
172.17.56.112	:35208	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:35214	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:35416	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:35420	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:35578	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:35582	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:35612	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:35618	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:35680	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:35692	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:35696	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:35810	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:35818	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:36418	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:37392	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:37400	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:37774	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:37780	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:38268	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:38278	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:39576	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:39582	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:39588	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:39596	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:40164	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:40170	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:40298	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:40302	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:40416	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:40418	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:40464	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:40472	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:40488	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:40492	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:40514	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:40782	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:40786	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:40790	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:41116	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:41122	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:41128	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:41344	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:41522	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:41524	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:41754	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:42164	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:42168	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:42612	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:42620	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:42630	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:42670	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:43534	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:43540	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:43548	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:43556	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:43896	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:43900	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:43910	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:44192	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:44312	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:44320	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:44324	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:45244	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:45252	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:45430	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:45434	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:46290	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:46296	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:46302	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:47018	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:47218	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:47222	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:47268	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:47274	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:47328	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:47332	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:47406	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:47412	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:47414	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:47808	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:47820	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:49212	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:49216	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:49220	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:49280	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:49290	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:49294	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:49354	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:50462	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:50470	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:50504	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:50572	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:50576	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:51014	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:51022	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:51216	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:51230	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:51234	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:51936	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:51940	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:52150	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:52152	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:52162	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:52298	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:52302	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:52310	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:52360	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:52370	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:52374	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:53252	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:53258	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:53266	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:53274	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:53282	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:53484	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:53490	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:53556	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:53564	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:53572	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:53964	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:54130	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:54136	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:54142	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:54494	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:54500	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:54506	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:54568	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:54572	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:54866	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:55182	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:55192	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:55344	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:55348	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:56180	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:56190	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:56194	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:56232	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:56238	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:56454	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:56456	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:56457	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:56466	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:56470	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:56472	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:56742	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:56758	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:57086	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:57090	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:57218	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:57222	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:57230	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:57786	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:57788	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:58052	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:58104	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:58692	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:58698	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:58700	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:59822	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:59826	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:60578	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:60586	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:60846	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:60858	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
```

--- Additional comment from lisowski.com on 2021-06-09 19:26:23 UTC ---

```
 kubectl get pods -n openshift-image-registry -o wide
NAME                                               READY   STATUS    RESTARTS   AGE     IP              NODE            NOMINATED NODE   READINESS GATES
cluster-image-registry-operator-8564975cc6-m9tqx   2/2     Running   0          7h14m   172.17.56.112   10.240.128.50   <none>           <none>
```

--- Additional comment from lisowski.com on 2021-06-09 19:35:48 UTC ---

Openshift version for 4.5 is 
```
4.5.40
```

--- Additional comment from lisowski.com on 2021-06-09 19:50:39 UTC ---

Correction for Openshift 4.5 version it is 
```
4.5.39
```

I suspect this bug was introduced between:
4.5.38 -  4.5.39 

and for 4.6
4.6.24 - 4.6.28

--- Additional comment from lisowski.com on 2021-06-09 22:07:16 UTC ---

These timelines are wrong: this has appeared to be in older openshift releases than that range


One key note though:
```
    storage:
      managementState: Unmanaged
      s3:
        bucket: XXXXX
        region: us-standard
        regionEndpoint: https://s3.direct.us.cloud-object-storage.appdomain.cloud
        virtualHostedStyle: false
    unsupportedConfigOverrides: null
```

If `managementState: Unmanaged` is set than the problem goes away (might need to restart the operator pod to fix the existing connections.

--- Additional comment from lisowski.com on 2021-06-09 22:10:49 UTC ---

I also see no problems on a cluster with….
    storage:
      s3:
        bucket: roks-bsr0toow08i9rajtg1a0-cdbf
        region: us-standard
        regionEndpoint: https://s3.direct.us.cloud-object-storage.appdomain.cloud
        virtualHostedStyle: false
    storageManaged: false

Specifically `storageManaged: false`

--- Additional comment from lisowski.com on 2021-06-10 03:55:53 UTC ---

Just to note: 

After 12 days we see the connection count go up to 28231 active connections. 


Looking into mitigation options

--- Additional comment from lisowski.com on 2021-06-10 04:11:29 UTC ---

Created attachment 1789710 [details]
This is a conntrack table output showing the leaking connections

This is a conntrack table output showing the leaking connections. This is after 12 days of usage

--- Additional comment from lisowski.com on 2021-06-10 04:22:05 UTC ---

Mitigation for now is:

1) Set the management state of the registry to Unmanaged to prevent the operator from leaking S3 Connections
```
oc patch configs.imageregistry.operator.openshift.io/cluster --type='json' -p='[{"op": "replace", "path": "/spec/managementState", "value":"Unmanaged"}]'
```

2) To flush out existing connections restart the operator with
```
oc delete pod -n openshift-image-registry -l name=cluster-image-registry-operator
pod "cluster-image-registry-operator-7cbb7997cd-vxkp5" deleted
```

Then connections will not leak anymore with the existing code base. However, patches/updates will not roll for the downstream registry pods until the management state is set to managed with:
```
oc patch configs.imageregistry.operator.openshift.io/cluster --type='json' -p='[{"op": "replace", "path": "/spec/managementState", "value":"Managed"}]'
```

--- Additional comment from lisowski.com on 2021-06-10 04:28:27 UTC ---

Correction: The leak is bounded at 28231 connections which corresponds to the local port range cri-o sets for a given container

--- Additional comment from rmarasch on 2021-06-10 13:54:41 UTC ---

Keeping this as ASSIGNED until I investigate the new data.

--- Additional comment from lisowski.com on 2021-06-10 22:04:29 UTC ---

Due to the scale of this leak I think this is a higher severity than a medium fix.

--- Additional comment from lisowski.com on 2021-06-10 22:09:21 UTC ---

I also verified for those that are wondering that this leak is present in the latest openshift 4.3 and 4.4 versions (they are unsupported. But I think it's good information to have for those that might be sitting back on upgrading). I believe it is originally introduced on versions after July 3rd 2019 with this PR: 
https://github.com/openshift/cluster-image-registry-operator/commit/7e102747e43afd87b165491f4a08480d3375ea56#diff-15ce517731313[…]f349dee84826febR72

--- Additional comment from lisowski.com on 2021-06-11 05:21:50 UTC ---

Update: I have built the patch for 4.6 and the connections to the S3 backend do appear to be stable. It is hovering around 10-12 connections in use at a time over multiple minutes (will check back in a few hours to confirm it's stable). I believe this proves the patch is valid (this is the patch for 4.6) 
[root@kube-c1rg7j4o0vnsob07caag-vpcocosa1-default-00000e20 /]# conntrack -L | grep 161.26.85.12
tcp      6 86386 ESTABLISHED src=172.17.31.165 dst=161.26.85.12 sport=48106 dport=443 src=161.26.85.12 dst=10.180.0.18 sport=443 dport=48106 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp      6 86396 ESTABLISHED src=172.17.31.165 dst=161.26.85.12 sport=48196 dport=443 src=161.26.85.12 dst=10.180.0.18 sport=443 dport=48196 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp      6 86387 ESTABLISHED src=172.17.31.165 dst=161.26.85.12 sport=48376 dport=443 src=161.26.85.12 dst=10.180.0.18 sport=443 dport=48376 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp      6 86376 ESTABLISHED src=172.17.31.165 dst=161.26.85.12 sport=48012 dport=443 src=161.26.85.12 dst=10.180.0.18 sport=443 dport=48012 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp      6 86396 ESTABLISHED src=172.17.31.165 dst=161.26.85.12 sport=49090 dport=443 src=161.26.85.12 dst=10.180.0.18 sport=443 dport=49090 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp      6 86396 ESTABLISHED src=172.17.31.165 dst=161.26.85.12 sport=48786 dport=443 src=161.26.85.12 dst=10.180.0.18 sport=443 dport=48786 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp      6 86386 ESTABLISHED src=172.17.31.165 dst=161.26.85.12 sport=48996 dport=443 src=161.26.85.12 dst=10.180.0.18 sport=443 dport=48996 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp      6 86377 ESTABLISHED src=172.17.31.165 dst=161.26.85.12 sport=48278 dport=443 src=161.26.85.12 dst=10.180.0.18 sport=443 dport=48278 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp      6 86376 ESTABLISHED src=172.17.31.165 dst=161.26.85.12 sport=48898 dport=443 src=161.26.85.12 dst=10.180.0.18 sport=443 dport=48898 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp      6 6 CLOSE src=172.17.31.165 dst=161.26.85.12 sport=44176 dport=443 src=161.26.85.12 dst=10.180.0.18 sport=443 dport=44176 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
conntrack v1.4.4 (conntrack-tools): 156 flow entries have been shown.
[root@kube-c1rg7j4o0vnsob07caag-vpcocosa1-default-00000e20 /]# conntrack -L | grep 161.26.85.12 | wc -l
conntrack v1.4.4 (conntrack-tools): 151 flow entries have been shown.
10

Comment 3 Wenjing Zheng 2021-06-24 08:25:50 UTC
Verified on 4.5.0-0.nightly-2021-06-21-181416.

Comment 5 errata-xmlrpc 2021-07-02 00:40:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.41 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2430


Note You need to log in before you can comment on or make changes to this bug.