Description of problem:
The cluster image registry operator opens connections and fails to close them, which slowly exhausts our S3 provider. Killing the pod clears the connections, but the cluster won't let the deployment stay scaled down.
Here are our findings with our storage vendor (Pure, using S3 provided by a FlashBlade):
On the FlashBlade, we see clients opening TCP connections to the datavip on port 443, making an S3 request, and receiving a response.
Once the response is received, the client continues to keep the session open by sending TCP keepalives. However, we do not see any further S3 requests on the same TCP connection.
The client goes on to open new TCP connections to make further S3 requests and continues to receive responses. All these connections are kept open.
Over time, these connections accumulate and the total number of open TCP connections per blade exceeds the capacity it can handle. At that point we start seeing 503 errors.
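To make the pattern described above concrete, here is a minimal local sketch. It is not the operator's code, and it does not claim to identify the root cause. It starts a throwaway HTTP server on localhost and compares a client that opens a new connection for every HEAD request and holds it open against a client that reuses one connection. All names (CountingHandler, leaky_client, reusing_client, count_connections) are hypothetical, and plain HTTP stands in for the S3/TLS traffic on port 443.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    # HTTP/1.1 enables persistent connections, so reuse is possible.
    protocol_version = "HTTP/1.1"

    def do_HEAD(self):
        # Each distinct (ip, port) pair seen by the server is one TCP connection.
        self.server.seen_connections.add(self.client_address)
        self.send_response(200)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


def start_server():
    server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
    server.seen_connections = set()
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def leaky_client(port, n):
    """New connection per request, held open and never reused (the reported pattern)."""
    held = []
    for _ in range(n):
        conn = http.client.HTTPConnection("127.0.0.1", port)
        conn.request("HEAD", "/bucket")
        conn.getresponse().read()
        held.append(conn)  # stays open, like the idle sessions seen on the array
    return held


def reusing_client(port, n):
    """One connection shared by all requests, closed when done."""
    conn = http.client.HTTPConnection("127.0.0.1", port)
    for _ in range(n):
        conn.request("HEAD", "/bucket")
        conn.getresponse().read()
    conn.close()
    return []


def count_connections(client, n=5):
    """Run a client against a fresh server and return how many TCP connections it used."""
    server = start_server()
    held = client(server.server_address[1], n)
    count = len(server.seen_connections)
    for conn in held:
        conn.close()
    server.shutdown()
    server.server_close()
    return count
```

Calling count_connections(leaky_client, 5) and count_connections(reusing_client, 5) shows the difference in server-side connection count for the same number of requests. In the leaky case the count grows with every request, which is the accumulation the vendor observed per blade.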
Here is a sample request we see:
2021-06-16 16:44:22.887225 ch1-fb7 http: INFO com.purestorage.fb.util.Logging log ::ffff:10.7.70.3 559f9081 head_bucket for nprd-icp-openshift4-registry
2021-06-16 16:44:22.887680 ch1-fb7 http: INFO com.purestorage.fb.util.Logging log ::ffff:10.7.70.3 559f9081 head_bucket for nprd-icp-openshift4-registry finished
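For anyone triaging similar logs, the following sketch tallies started requests per client from lines in the format shown above. The regular expression is inferred only from these two sample lines, so other FlashBlade log lines may not match. The field names (request_id in particular) are our guesses, and requests_per_client and client_ipv4 are hypothetical helpers.

```python
import ipaddress
import re
from collections import Counter

# Pattern inferred from the two sample lines only; field names are assumptions.
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<host>\S+) http: INFO \S+ log "
    r"(?P<client>\S+) (?P<request_id>[0-9a-f]+) "
    r"(?P<operation>\w+) for (?P<bucket>\S+)(?P<finished> finished)?$"
)


def client_ipv4(addr):
    """Turn an IPv4-mapped IPv6 address such as ::ffff:10.7.70.3 into 10.7.70.3."""
    ip = ipaddress.ip_address(addr)
    mapped = getattr(ip, "ipv4_mapped", None)
    return str(mapped) if mapped else str(ip)


def requests_per_client(lines):
    """Count started requests (lines without the 'finished' suffix) per client IP."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and not match.group("finished"):
            counts[client_ipv4(match.group("client"))] += 1
    return counts


sample = [
    "2021-06-16 16:44:22.887225 ch1-fb7 http: INFO com.purestorage.fb.util.Logging log "
    "::ffff:10.7.70.3 559f9081 head_bucket for nprd-icp-openshift4-registry",
    "2021-06-16 16:44:22.887680 ch1-fb7 http: INFO com.purestorage.fb.util.Logging log "
    "::ffff:10.7.70.3 559f9081 head_bucket for nprd-icp-openshift4-registry finished",
]
counts = requests_per_client(sample)
```

Comparing these per-client request counts with the number of open TCP connections from the same client would show whether each request is arriving on its own connection.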
These are the client IPs exhibiting this behavior.
(These IPs are no longer valid, as we have killed the cluster-image-registry-operator pods and they have restarted on other nodes, but we've confirmed it's the cluster-image-registry-operator pod that is responsible for the connections.)
Version-Release number of selected component (if applicable):
What information can you provide around timeframes and the business impact?
Once the operator exhausts the connection pool of our S3 provider, we are unable to interact with the registry, which includes building, pushing, and pulling images.
See linked bugs for a possibly similar issue.
*** This bug has been marked as a duplicate of bug 1959563 ***