Bug 1973281

Summary: cluster-image-registry-operator pod leaves connections open when it fails to connect to S3 storage (OCP 4.6.25)
Product: OpenShift Container Platform Reporter: peter ducai <pducai>
Component: Image Registry Assignee: Oleg Bulatov <obulatov>
Status: CLOSED DUPLICATE QA Contact: Wenjing Zheng <wzheng>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.6 CC: aos-bugs
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-06-17 14:59:06 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description peter ducai 2021-06-17 14:24:59 UTC
Description of problem:

The cluster image registry operator opens connections and fails to close them, which slowly exhausts our S3 provider. Killing the pod clears the connections, but the cluster won't let the deployment stay scaled down.

Here are our findings with our storage vendor (Pure, using S3 provided by a FlashBlade):

On the FlashBlade, we see clients opening TCP connections to the data VIP on port 443, making an S3 request, and receiving a response.

Once the response is received, the client continues to keep the session open by sending TCP keepalives. However, we do not see any further S3 requests on the same TCP connection.

The client goes on to open new TCP connections to make further S3 requests and continues to receive responses. All these connections are kept open. 

Over time, these connections accumulate and the total number of open TCP connections per blade exceeds the capacity it can handle. At that point we start seeing 503 errors.
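
The pattern described above (a new TCP connection per request, each one left open) can be reproduced in miniature. The following is only an illustrative sketch in Python against a local test server. It is not the operator's code, and the function name, the `/hypothetical-bucket` path, and the request count are assumptions made for the demo. It counts how many connections the server accepts when the client opens a fresh connection for every HEAD request versus when it reuses one.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def count_server_connections(reuse: bool, requests: int = 5) -> int:
    """Return how many TCP connections a local server accepted for `requests` HEAD calls."""
    accepted = 0
    lock = threading.Lock()

    class Handler(BaseHTTPRequestHandler):
        protocol_version = "HTTP/1.1"  # lets the server keep connections alive

        def setup(self):
            # setup() runs once per accepted TCP connection, not once per request.
            nonlocal accepted
            with lock:
                accepted += 1
            super().setup()

        def do_HEAD(self):
            self.send_response(200)
            self.send_header("Content-Length", "0")
            self.end_headers()

        def log_message(self, *args):
            pass  # keep the demo quiet

    server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    conns = []
    try:
        for _ in range(requests):
            if reuse and conns:
                conn = conns[0]  # reuse the one existing connection
            else:
                # Leaky pattern: a new connection per request, left open afterwards.
                conn = http.client.HTTPConnection(host, port)
                conns.append(conn)
            conn.request("HEAD", "/hypothetical-bucket")
            conn.getresponse().read()
        return accepted
    finally:
        # Clean up only after counting, so the leak is visible in the result.
        for conn in conns:
            conn.close()
        server.shutdown()
        server.server_close()
```

With `reuse=False` the server ends up holding one open connection per request until the client goes away, which mirrors the accumulation seen on the blades. With `reuse=True` a single connection serves every request.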

Here is a sample request we see:

2021-06-16 16:44:22.887225 ch1-fb7 http: INFO com.purestorage.fb.util.Logging log ::ffff:10.7.70.3 559f9081 head_bucket for nprd-icp-openshift4-registry
2021-06-16 16:44:22.887680 ch1-fb7 http: INFO com.purestorage.fb.util.Logging log ::ffff:10.7.70.3 559f9081 head_bucket for nprd-icp-openshift4-registry finished

These are the client IPs exhibiting this behavior:
::ffff:10.7.54.1 
::ffff:10.7.69.3
::ffff:10.7.70.3
(These IPs are no longer valid because we killed the cluster-image-registry-operator pods and they restarted on other nodes, but we have confirmed that the cluster-image-registry-operator pod is responsible for the connections.)
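
To see which clients issue these requests, the storage-side log lines can be tallied per client address. This is a rough sketch only. The regular expression assumes that the field after `log` is the client address, followed by a request identifier, the operation, and the bucket. That reading is inferred from the two sample lines above, not from vendor documentation, and `tally_requests` is a hypothetical helper name.

```python
import re
from collections import Counter

# Sample lines copied from the report above.
SAMPLE_LOG = """\
2021-06-16 16:44:22.887225 ch1-fb7 http: INFO com.purestorage.fb.util.Logging log ::ffff:10.7.70.3 559f9081 head_bucket for nprd-icp-openshift4-registry
2021-06-16 16:44:22.887680 ch1-fb7 http: INFO com.purestorage.fb.util.Logging log ::ffff:10.7.70.3 559f9081 head_bucket for nprd-icp-openshift4-registry finished
"""

# Assumed layout: "log <client> <request_id> <operation> for <bucket>[ finished]"
LINE_RE = re.compile(
    r"\blog (?P<client>\S+) (?P<request_id>\S+) (?P<operation>\S+)"
    r" for (?P<bucket>\S+)(?P<finished> finished)?$"
)


def tally_requests(log_text):
    """Count started requests per (client, operation), skipping 'finished' lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.search(line.strip())
        if match and not match.group("finished"):
            counts[(match.group("client"), match.group("operation"))] += 1
    return counts


if __name__ == "__main__":
    for (client, operation), n in tally_requests(SAMPLE_LOG).most_common():
        print(client, operation, n)
```

Run against a longer log excerpt, the clients with the highest counts are the ones to match against pod IPs on the cluster side.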




Version-Release number of selected component (if applicable):

4.6.25  

What information can you provide around timeframes and the business impact?

Once it exhausts the connection pool of our S3 provider, we are unable to interact with the registry, which includes building, pushing, and pulling images.


Additional info:

See the linked bugs for a possibly similar issue.

Comment 1 Oleg Bulatov 2021-06-17 14:59:06 UTC

*** This bug has been marked as a duplicate of bug 1959563 ***