Bug 1959563

Summary: Please port fix for bugzilla 1902091 to 4.6 and 4.5
Product: OpenShift Container Platform Reporter: John McMeeking <jmcmeek>
Component: Image Registry    Assignee: Ricardo Maraschini <rmarasch>
Status: CLOSED ERRATA QA Contact: Wenjing Zheng <wzheng>
Severity: urgent Docs Contact:
Priority: medium    
Version: 4.6    CC: aos-bugs, brian_mckeown, lisowski, pducai
Target Milestone: ---   
Target Release: 4.6.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: Lack of proper timeouts when executing HTTP connections. Consequence: Connections remained open, piling up until the maximum limit was reached, leaving the operator unable to process incoming events. Fix: Added reasonable timeouts to the HTTP client used by the operator. Result: Open connections are closed once the timeout is reached.
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-06-22 08:29:44 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1902091    
Bug Blocks: 1970966    
Attachments:
Description    Flags
Shows active connections to COS after 13 days    none
logs of cluster image registry operator    none
This is a conntrack table output showing the leaking connections    none

Description John McMeeking 2021-05-11 19:07:06 UTC
Description of problem:

Can the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1902091 be ported back to 4.5 and 4.6? 4.6 is our (IBM Cloud) latest release at this time.

We have an IBM Cloud customer that appears to be hitting that bug.

The image-registry operator goes to `upgrading` status with
```
$ oc describe co image-registry
Message: Progressing: Unable to apply resources: unable to sync storage configuration: RequestError: send request failed
Progressing: caused by: Head "https://s3.xxx/xxx": dial tcp x.x.x.x:443: connect: cannot assign requested address
```

If they delete the cluster-image-registry-operator pod, the status immediately changes to available. Eventually the operator goes back to `upgrading`.


Version-Release number of selected component (if applicable):
4.6


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 brian_mckeown 2021-05-24 14:00:53 UTC
Hello team, has anyone had a chance to review this request? Please let us know your thoughts. Thanks a lot.

Comment 2 Oleg Bulatov 2021-05-31 14:40:52 UTC
4.5 is going to reach EOL soon, so it's unlikely that we'll be able to backport it to 4.5.

Comment 3 Tyler Lisowski 2021-06-09 19:01:22 UTC
Created attachment 1789644 [details]
Shows active connections to COS after 13 days

Shows active connections to COS over time (this is after 13 days of image operator running)

Comment 4 Tyler Lisowski 2021-06-09 19:07:11 UTC
Hi Oleg:

Just to confirm: it looks like the leak of connections to COS happens independent of errors as well (it happens during normal operation). I do think we need to backport this to 4.5, or else the leak can cause other downstream issues due to the ports these connections consume (port exhaustion, etc.).

We also want to confirm that you have validated that the leak is fixed even during normal operation.

Here is an example showing the connections growing over time during normal operation (up 130 connections after 20 minutes, and continuing to grow from there):

```
[root@kube-c2a0njgd06pi318lnrl0-cp4appdso16-default-00000164 /]# find /proc/ 2>/dev/null | grep tcp | grep -v task | grep -v sys/net | xargs grep -v rem_address 2>/dev/null | awk '{x=strtonum("0x"substr($3,index($3,":")-2,2)); y=strtonum("0x"substr($4,index($4,":")-2,2)); for (i=5; i>0; i-=2) x = x"."strtonum("0x"substr($3,i,2)); for (i=5; i>0; i-=2) y = y"."strtonum("0x"substr($4,i,2))}{printf ("%s\t:%s\t ----> \t %s\t:%s\t%s\n",x,strtonum("0x"substr($3,index($3,":")+1,4)),y,strtonum("0x"substr($4,index($4,":")+1,4)),$1)}' | sort | uniq --check-chars=25 | grep "161.26.5.14" | grep "172.17.49.204" | wc -l
24
[root@kube-c2a0njgd06pi318lnrl0-cp4appdso16-default-00000164 /]# date; find /proc/ 2>/dev/null | grep tcp | grep -v task | grep -v sys/net | xargs grep -v rem_address 2>/dev/null | awk '{x=strtonum("0x"substr($3,index($3,":")-2,2)); y=strtonum("0x"substr($4,index($4,":")-2,2)); for (i=5; i>0; i-=2) x = x"."strtonum("0x"substr($3,i,2)); for (i=5; i>0; i-=2) y = y"."strtonum("0x"substr($4,i,2))}{printf ("%s\t:%s\t ----> \t %s\t:%s\t%s\n",x,strtonum("0x"substr($3,index($3,":")+1,4)),y,strtonum("0x"substr($4,index($4,":")+1,4)),$1)}' | sort | uniq --check-chars=25 | grep "161.26.5.14" | grep "172.17.49.204" | wc -l
Wed Jun  9 13:45:17 CDT 2021
27
[root@kube-c2a0njgd06pi318lnrl0-cp4appdso16-default-00000164 /]# date; find /proc/ 2>/dev/null | grep tcp | grep -v task | grep -v sys/net | xargs grep -v rem_address 2>/dev/null | awk '{x=strtonum("0x"substr($3,index($3,":")-2,2)); y=strtonum("0x"substr($4,index($4,":")-2,2)); for (i=5; i>0; i-=2) x = x"."strtonum("0x"substr($3,i,2)); for (i=5; i>0; i-=2) y = y"."strtonum("0x"substr($4,i,2))}{printf ("%s\t:%s\t ----> \t %s\t:%s\t%s\n",x,strtonum("0x"substr($3,index($3,":")+1,4)),y,strtonum("0x"substr($4,index($4,":")+1,4)),$1)}' | sort | uniq --check-chars=25 | grep "161.26.5.14" | grep "172.17.49.204" | wc -l
Wed Jun  9 13:45:29 CDT 2021
28
[root@kube-c2a0njgd06pi318lnrl0-cp4appdso16-default-00000164 /]# date; find /proc/ 2>/dev/null | grep tcp | grep -v task | grep -v sys/net | xargs grep -v rem_address 2>/dev/null | awk '{x=strtonum("0x"substr($3,index($3,":")-2,2)); y=strtonum("0x"substr($4,index($4,":")-2,2)); for (i=5; i>0; i-=2) x = x"."strtonum("0x"substr($3,i,2)); for (i=5; i>0; i-=2) y = y"."strtonum("0x"substr($4,i,2))}{printf ("%s\t:%s\t ----> \t %s\t:%s\t%s\n",x,strtonum("0x"substr($3,index($3,":")+1,4)),y,strtonum("0x"substr($4,index($4,":")+1,4)),$1)}' | sort | uniq --check-chars=25 | grep "161.26.5.14" | grep "172.17.49.204" | wc -l
Wed Jun  9 13:46:00 CDT 2021
32
[root@kube-c2a0njgd06pi318lnrl0-cp4appdso16-default-00000164 /]# date; find /proc/ 2>/dev/null | grep tcp | grep -v task | grep -v sys/net | xargs grep -v rem_address 2>/dev/null | awk '{x=strtonum("0x"substr($3,index($3,":")-2,2)); y=strtonum("0x"substr($4,index($4,":")-2,2)); for (i=5; i>0; i-=2) x = x"."strtonum("0x"substr($3,i,2)); for (i=5; i>0; i-=2) y = y"."strtonum("0x"substr($4,i,2))}{printf ("%s\t:%s\t ----> \t %s\t:%s\t%s\n",x,strtonum("0x"substr($3,index($3,":")+1,4)),y,strtonum("0x"substr($4,index($4,":")+1,4)),$1)}' | sort | uniq --check-chars=25 | grep "161.26.5.14" | grep "172.17.49.204" | wc -l
Wed Jun  9 13:46:09 CDT 2021
32
[root@kube-c2a0njgd06pi318lnrl0-cp4appdso16-default-00000164 /]# date; find /proc/ 2>/dev/null | grep tcp | grep -v task | grep -v sys/net | xargs grep -v rem_address 2>/dev/null | awk '{x=strtonum("0x"substr($3,index($3,":")-2,2)); y=strtonum("0x"substr($4,index($4,":")-2,2)); for (i=5; i>0; i-=2) x = x"."strtonum("0x"substr($3,i,2)); for (i=5; i>0; i-=2) y = y"."strtonum("0x"substr($4,i,2))}{printf ("%s\t:%s\t ----> \t %s\t:%s\t%s\n",x,strtonum("0x"substr($3,index($3,":")+1,4)),y,strtonum("0x"substr($4,index($4,":")+1,4)),$1)}' | sort | uniq --check-chars=25 | grep "161.26.5.14" | grep "172.17.49.204" | wc -l
Wed Jun  9 13:46:15 CDT 2021
33
[root@kube-c2a0njgd06pi318lnrl0-cp4appdso16-default-00000164 /]# date; find /proc/ 2>/dev/null | grep tcp | grep -v task | grep -v sys/net | xargs grep -v rem_address 2>/dev/null | awk '{x=strtonum("0x"substr($3,index($3,":")-2,2)); y=strtonum("0x"substr($4,index($4,":")-2,2)); for (i=5; i>0; i-=2) x = x"."strtonum("0x"substr($3,i,2)); for (i=5; i>0; i-=2) y = y"."strtonum("0x"substr($4,i,2))}{printf ("%s\t:%s\t ----> \t %s\t:%s\t%s\n",x,strtonum("0x"substr($3,index($3,":")+1,4)),y,strtonum("0x"substr($4,index($4,":")+1,4)),$1)}' | sort | uniq --check-chars=25 | grep "161.26.5.14" | grep "172.17.49.204" | wc -l
Wed Jun  9 13:46:31 CDT 2021
35
[root@kube-c2a0njgd06pi318lnrl0-cp4appdso16-default-00000164 /]# date; find /proc/ 2>/dev/null | grep tcp | grep -v task | grep -v sys/net | xargs grep -v rem_address 2>/dev/null | awk '{x=strtonum("0x"substr($3,index($3,":")-2,2)); y=strtonum("0x"substr($4,index($4,":")-2,2)); for (i=5; i>0; i-=2) x = x"."strtonum("0x"substr($3,i,2)); for (i=5; i>0; i-=2) y = y"."strtonum("0x"substr($4,i,2))}{printf ("%s\t:%s\t ----> \t %s\t:%s\t%s\n",x,strtonum("0x"substr($3,index($3,":")+1,4)),y,strtonum("0x"substr($4,index($4,":")+1,4)),$1)}' | sort | uniq --check-chars=25 | grep "161.26.5.14" | grep "172.17.49.204" | wc -l
Wed Jun  9 14:04:58 CDT 2021
154
```
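
For readability, the same measurement can be sketched in a few lines of Python. This is a hypothetical helper, not part of the operator; the default IP and port are just the values from this report, and the hex decoding assumes a little-endian host, matching the outputs above.

```python
# Count this network namespace's TCP connections to a given remote
# endpoint by parsing /proc/net/tcp, where the remote address appears
# as a hex "ADDR:PORT" pair in the third column.
import socket
import struct

def count_connections(path="/proc/net/tcp", remote_ip="161.26.5.14", port=443):
    # /proc/net/tcp prints the IPv4 address as the native-endian 32-bit
    # word in hex; on little-endian hosts the bytes appear reversed.
    packed = socket.inet_aton(remote_ip)
    want = "%08X" % struct.unpack("<I", packed)[0]
    count = 0
    with open(path) as f:
        next(f)  # skip the header line
        for line in f:
            fields = line.split()
            rem_ip, rem_port = fields[2].split(":")
            if rem_ip == want and int(rem_port, 16) == port:
                count += 1
    return count
```

Run periodically (as the shell loop above does), a growing return value indicates the same leak.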

Comment 5 Tyler Lisowski 2021-06-09 19:08:58 UTC
Created attachment 1789645 [details]
logs of cluster image registry operator

Comment 6 Tyler Lisowski 2021-06-09 19:09:36 UTC
For documentation purposes: the logs of the image registry operator during this time show no errors reported.

Comment 7 Tyler Lisowski 2021-06-09 19:11:11 UTC
The OpenShift version these logs are from is `4.6.28`.

Comment 8 Tyler Lisowski 2021-06-09 19:26:00 UTC
Checking a 4.5 cluster, the logs line up: after 7 hours, 174 connections have been opened:
```
172.17.56.112	:33092	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:33094	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:33636	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:33646	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:33652	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:34294	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:34298	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:34932	 ----> 	 172.21.0.1	:443	/proc/19888/net/tcp:
172.17.56.112	:35208	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:35214	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:35416	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:35420	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:35578	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:35582	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:35612	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:35618	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:35680	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:35692	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:35696	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:35810	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:35818	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:36418	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:37392	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:37400	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:37774	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:37780	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:38268	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:38278	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:39576	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:39582	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:39588	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:39596	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:40164	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:40170	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:40298	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:40302	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:40416	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:40418	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:40464	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:40472	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:40488	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:40492	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:40514	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:40782	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:40786	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:40790	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:41116	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:41122	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:41128	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:41344	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:41522	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:41524	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:41754	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:42164	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:42168	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:42612	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:42620	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:42630	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:42670	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:43534	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:43540	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:43548	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:43556	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:43896	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:43900	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:43910	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:44192	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:44312	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:44320	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:44324	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:45244	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:45252	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:45430	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:45434	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:46290	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:46296	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:46302	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:47018	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:47218	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:47222	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:47268	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:47274	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:47328	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:47332	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:47406	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:47412	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:47414	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:47808	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:47820	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:49212	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:49216	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:49220	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:49280	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:49290	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:49294	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:49354	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:50462	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:50470	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:50504	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:50572	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:50576	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:51014	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:51022	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:51216	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:51230	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:51234	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:51936	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:51940	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:52150	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:52152	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:52162	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:52298	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:52302	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:52310	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:52360	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:52370	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:52374	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:53252	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:53258	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:53266	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:53274	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:53282	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:53484	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:53490	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:53556	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:53564	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:53572	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:53964	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:54130	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:54136	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:54142	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:54494	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:54500	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:54506	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:54568	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:54572	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:54866	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:55182	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:55192	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:55344	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:55348	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:56180	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:56190	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:56194	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:56232	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:56238	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:56454	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:56456	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:56457	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:56466	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:56470	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:56472	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:56742	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:56758	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:57086	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:57090	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:57218	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:57222	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:57230	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:57786	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:57788	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:58052	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:58104	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:58692	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:58698	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:58700	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:59822	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:59826	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:60578	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:60586	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:60846	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
172.17.56.112	:60858	 ----> 	 161.26.5.14	:443	/proc/19888/net/tcp:
```

Comment 9 Tyler Lisowski 2021-06-09 19:26:23 UTC
```
 kubectl get pods -n openshift-image-registry -o wide
NAME                                               READY   STATUS    RESTARTS   AGE     IP              NODE            NOMINATED NODE   READINESS GATES
cluster-image-registry-operator-8564975cc6-m9tqx   2/2     Running   0          7h14m   172.17.56.112   10.240.128.50   <none>           <none>
```

Comment 10 Tyler Lisowski 2021-06-09 19:35:48 UTC
The OpenShift version for 4.5 is:
```
4.5.40
```

Comment 11 Tyler Lisowski 2021-06-09 19:50:39 UTC
Correction: the OpenShift 4.5 version is:
```
4.5.39
```

I suspect this bug was introduced between:
4.5.38 - 4.5.39

and for 4.6 between:
4.6.24 - 4.6.28

Comment 12 Tyler Lisowski 2021-06-09 22:07:16 UTC
These timelines are wrong: this appears to be present in older OpenShift releases than that range.


One key note though:
```
    storage:
      managementState: Unmanaged
      s3:
        bucket: XXXXX
        region: us-standard
        regionEndpoint: https://s3.direct.us.cloud-object-storage.appdomain.cloud
        virtualHostedStyle: false
    unsupportedConfigOverrides: null
```

If `managementState: Unmanaged` is set, then the problem goes away (you might need to restart the operator pod to close the existing connections).

Comment 13 Tyler Lisowski 2021-06-09 22:10:49 UTC
I also see no problems on a cluster with:
```
    storage:
      s3:
        bucket: roks-bsr0toow08i9rajtg1a0-cdbf
        region: us-standard
        regionEndpoint: https://s3.direct.us.cloud-object-storage.appdomain.cloud
        virtualHostedStyle: false
    storageManaged: false
```

Specifically `storageManaged: false`

Comment 14 Tyler Lisowski 2021-06-10 03:55:53 UTC
Just to note: 

After 12 days we see the connection count go up to 28231 active connections. 


Looking into mitigation options

Comment 15 Tyler Lisowski 2021-06-10 04:11:29 UTC
Created attachment 1789710 [details]
This is a conntrack table output showing the leaking connections

This is a conntrack table output showing the leaking connections, after 12 days of usage.

Comment 16 Tyler Lisowski 2021-06-10 04:22:05 UTC
Mitigation for now is:

1) Set the management state of the registry to Unmanaged to prevent the operator from leaking S3 connections:
```
oc patch configs.imageregistry.operator.openshift.io/cluster --type='json' -p='[{"op": "replace", "path": "/spec/managementState", "value":"Unmanaged"}]'
```

2) To flush out existing connections, restart the operator with:
```
oc delete pod -n openshift-image-registry -l name=cluster-image-registry-operator
pod "cluster-image-registry-operator-7cbb7997cd-vxkp5" deleted
```

Then connections will no longer leak with the existing code base. However, patches/updates will not roll out to the downstream registry pods until the management state is set back to Managed with:
```
oc patch configs.imageregistry.operator.openshift.io/cluster --type='json' -p='[{"op": "replace", "path": "/spec/managementState", "value":"Managed"}]'
```

Comment 17 Tyler Lisowski 2021-06-10 04:28:27 UTC
Correction: the leak is bounded at 28231 connections, which corresponds to the local port range cri-o sets for a given container.
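
For context, that bound is the width of the container's ephemeral port range. A quick sketch of computing it from the standard Linux sysctl file (illustrative only; `ephemeral_port_capacity` is a hypothetical helper):

```python
# The leak plateaus once every local (ephemeral) port holds an open
# connection; the capacity is the width of this sysctl range.
def ephemeral_port_capacity(path="/proc/sys/net/ipv4/ip_local_port_range"):
    with open(path) as f:
        low, high = map(int, f.read().split())
    return high - low + 1
```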

Comment 18 Ricardo Maraschini 2021-06-10 13:54:41 UTC
Keeping this as ASSIGNED until I investigate the new data.

Comment 19 Tyler Lisowski 2021-06-10 22:04:29 UTC
Due to the scale of this leak, I think this warrants a higher severity than medium.

Comment 20 Tyler Lisowski 2021-06-10 22:09:21 UTC
For those wondering, I also verified that this leak is present in the latest OpenShift 4.3 and 4.4 versions (they are unsupported, but I think it's good information for those who might be holding back on upgrading). I believe it was originally introduced in versions after July 3rd, 2019, with this PR:
https://github.com/openshift/cluster-image-registry-operator/commit/7e102747e43afd87b165491f4a08480d3375ea56#diff-15ce517731313[…]f349dee84826febR72

Comment 21 Tyler Lisowski 2021-06-11 05:21:50 UTC
Update: I have built the patch for 4.6, and the connections to the S3 backend do appear to be stable, hovering around 10-12 connections in use at a time over multiple minutes (I will check back in a few hours to confirm it's stable). I believe this shows the patch is valid (this is the patch for 4.6):
[root@kube-c1rg7j4o0vnsob07caag-vpcocosa1-default-00000e20 /]# conntrack -L | grep 161.26.85.12
tcp      6 86386 ESTABLISHED src=172.17.31.165 dst=161.26.85.12 sport=48106 dport=443 src=161.26.85.12 dst=10.180.0.18 sport=443 dport=48106 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp      6 86396 ESTABLISHED src=172.17.31.165 dst=161.26.85.12 sport=48196 dport=443 src=161.26.85.12 dst=10.180.0.18 sport=443 dport=48196 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp      6 86387 ESTABLISHED src=172.17.31.165 dst=161.26.85.12 sport=48376 dport=443 src=161.26.85.12 dst=10.180.0.18 sport=443 dport=48376 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp      6 86376 ESTABLISHED src=172.17.31.165 dst=161.26.85.12 sport=48012 dport=443 src=161.26.85.12 dst=10.180.0.18 sport=443 dport=48012 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp      6 86396 ESTABLISHED src=172.17.31.165 dst=161.26.85.12 sport=49090 dport=443 src=161.26.85.12 dst=10.180.0.18 sport=443 dport=49090 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp      6 86396 ESTABLISHED src=172.17.31.165 dst=161.26.85.12 sport=48786 dport=443 src=161.26.85.12 dst=10.180.0.18 sport=443 dport=48786 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp      6 86386 ESTABLISHED src=172.17.31.165 dst=161.26.85.12 sport=48996 dport=443 src=161.26.85.12 dst=10.180.0.18 sport=443 dport=48996 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp      6 86377 ESTABLISHED src=172.17.31.165 dst=161.26.85.12 sport=48278 dport=443 src=161.26.85.12 dst=10.180.0.18 sport=443 dport=48278 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp      6 86376 ESTABLISHED src=172.17.31.165 dst=161.26.85.12 sport=48898 dport=443 src=161.26.85.12 dst=10.180.0.18 sport=443 dport=48898 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp      6 6 CLOSE src=172.17.31.165 dst=161.26.85.12 sport=44176 dport=443 src=161.26.85.12 dst=10.180.0.18 sport=443 dport=44176 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
conntrack v1.4.4 (conntrack-tools): 156 flow entries have been shown.
[root@kube-c1rg7j4o0vnsob07caag-vpcocosa1-default-00000e20 /]# conntrack -L | grep 161.26.85.12 | wc -l
conntrack v1.4.4 (conntrack-tools): 151 flow entries have been shown.
10

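
The actual patch is in the Go operator (see bug 1902091); as a language-neutral illustration of the principle it applies, namely giving every client request an explicit deadline and always releasing the socket, here is a minimal Python sketch. `head_with_timeout` is a hypothetical helper, and it uses plain HTTP for brevity, while the operator talks HTTPS to S3.

```python
# Sketch: bound an HTTP request with an explicit timeout so a stalled
# connection is closed instead of lingering until port exhaustion.
import http.client

def head_with_timeout(host, path="/", timeout=10.0):
    # timeout applies to connect and to each blocking socket operation
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request("HEAD", path)
        return conn.getresponse().status
    finally:
        conn.close()  # always release the socket, even on error
```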
Comment 22 Tyler Lisowski 2021-06-11 15:00:02 UTC
Confirmed it's still holding steady after 9 hours.

The OpenShift team has kindly agreed to backport this to 4.5 even though it's near EOL (thank you, OpenShift team).

Comment 25 Tyler Lisowski 2021-06-15 18:40:21 UTC
trying to get bug in place

Comment 26 Tyler Lisowski 2021-06-15 18:40:56 UTC
update to verified

Comment 27 Oleg Bulatov 2021-06-17 14:59:05 UTC
*** Bug 1973281 has been marked as a duplicate of this bug. ***

Comment 29 errata-xmlrpc 2021-06-22 08:29:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.35 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2410