Bug 1973281 - cluster-image-registry-operator pod leaves connections open when it fails connecting to S3 storage (OCP 4.6.25)
Keywords:
Status: CLOSED DUPLICATE of bug 1959563
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Image Registry
Version: 4.6
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Oleg Bulatov
QA Contact: Wenjing Zheng
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-17 14:24 UTC by peter ducai
Modified: 2021-06-17 14:59 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-17 14:59:06 UTC
Target Upstream Version:




Links:
- GitHub openshift/cluster-image-registry-operator issue 657 (open): "The s3 storage backend leaks a new https connection every 10 seconds" - last updated 2021-06-17 14:24:58 UTC
- Red Hat Bugzilla 1902091 - last updated 2021-06-17 14:24:58 UTC

Description peter ducai 2021-06-17 14:24:59 UTC
Description of problem:

The cluster image registry operator is opening connections and failing to close them, which slowly exhausts our S3 provider. Killing the pod clears the connections, but the cluster won't let the deployment stay scaled down.

Here are our findings with our storage vendor (Pure Storage; S3 is provided by a FlashBlade):

On the FlashBlade, we see clients opening TCP connections to the data VIP on port 443, making an S3 request, and receiving a response.

Once the response is received, the client continues to keep the session open by sending TCP keepalives. However, we do not see any further S3 requests on the same TCP connection.

The client goes on to open new TCP connections to make further S3 requests and continues to receive responses. All of these connections are kept open.

Over time, these connections accumulate and the total number of open TCP connections per blade exceeds the capacity the blade can handle. At that point we start seeing the 503 errors.

Here is a sample request we see:

2021-06-16 16:44:22.887225 ch1-fb7 http: INFO com.purestorage.fb.util.Logging log ::ffff:10.7.70.3 559f9081 head_bucket for nprd-icp-openshift4-registry
2021-06-16 16:44:22.887680 ch1-fb7 http: INFO com.purestorage.fb.util.Logging log ::ffff:10.7.70.3 559f9081 head_bucket for nprd-icp-openshift4-registry finished

These are the client IPs exhibiting this behavior:
::ffff:10.7.54.1 
::ffff:10.7.69.3
::ffff:10.7.70.3
(These IPs are no longer valid, as we have killed the cluster-image-registry-operator pods and they restarted on other nodes, but we've confirmed it is the cluster-image-registry-operator pod that is responsible for the connections.)




Version-Release number of selected component (if applicable):

4.6.25  

What information can you provide around timeframes and the business impact?

Once it exhausts the connection pool of our S3 provider, we are unable to interact with the registry, which includes building, pushing, and pulling images.


Additional info:

See the linked bugs for a possibly similar issue.

Comment 1 Oleg Bulatov 2021-06-17 14:59:06 UTC

*** This bug has been marked as a duplicate of bug 1959563 ***

