Bug 1948441

Summary: ImagePullBackOff: Source image rejected: Too many open files
Product: OpenShift Container Platform
Reporter: Andy Bartlett <andbartl>
Component: Node
Assignee: Peter Hunt <pehunt>
Node sub component: CRI-O
QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED ERRATA
Docs Contact:
Severity: urgent
Priority: urgent
CC: aos-bugs, bjarolim, dwalsh, fiezzi, jhou, jokerman, moddi, openshift-bugs-escalate, tsweeney
Version: 4.6   
Target Milestone: ---   
Target Release: 4.6.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1953071 (view as bug list) Environment:
Last Closed: 2021-05-12 12:18:10 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1953071    
Bug Blocks:    

Description Andy Bartlett 2021-04-12 07:51:28 UTC
Description of problem:

My customer is seeing the following error when pulling images from Artifactory:

OpenShift reports that it fails to pull images from their Artifactory. Looking at the events of the namespace, we see:

[root@control-host-01 ~]# oc get events -n argocd
LAST SEEN   TYPE      REASON              OBJECT                                      MESSAGE
46m         Normal    Pulling             pod/argocd-secret-hook-kjghj                Pulling image "<fqdn>/gitlab/ubi8m-oc:latest"
46m         Warning   Failed              pod/argocd-secret-hook-kjghj                Failed to pull image "<fqdn>/gitlab/ubi8m-oc:latest": rpc error: code = Unknown desc = Source image rejected: Too many open files
46m         Warning   Failed              pod/argocd-secret-hook-kjghj                Error: ErrImagePull

The customer cannot see anything odd in the OpenShift logs. Image pulls fail only on one particular cluster; other clusters pulling from the same Artifactory do not seem to have the same issue.

Further to this, during testing it was noted:

podman pull would work
crictl pull would fail with the above error.

Version-Release number of selected component (if applicable):

OCP 4.6.15


How reproducible:

Randomly reproducible; this does not happen all the time.

Actual results:


Expected results:


Additional info:

Comment 18 Peter Hunt 2021-04-23 19:47:37 UTC
The configured ulimits should be able to handle the number of open FDs CRI-O has, but I've also discovered a leak in CRI-O whose fix we forgot to backport to 4.6:
https://github.com/cri-o/cri-o/pull/4800

This should mitigate the situation (the leaked connections would eventually have been cleaned up, but that takes a while).

I believe upgrading to a version of CRI-O with this patch will make this situation not happen anymore (or be *much* harder to reproduce). As such, moving this to POST.
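
For anyone who wants to check how close a process is to its limit, here is a minimal sketch, not taken from CRI-O itself. It assumes a Linux node with /proc mounted and permission to read the target process's entries. The helper names (count_open_fds, soft_nofile_limit, fd_headroom) are made up for illustration, and the PID of the CRI-O process has to be looked up separately.

```python
import os


def count_open_fds(pid):
    """Count the entries in /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def soft_nofile_limit(pid):
    """Return the soft 'Max open files' limit from /proc/<pid>/limits.

    Returns None if the kernel reports the limit as 'unlimited'.
    """
    label = "Max open files"
    with open(f"/proc/{pid}/limits") as limits:
        for line in limits:
            if line.startswith(label):
                # Remaining columns are: soft limit, hard limit, units.
                soft = line[len(label):].split()[0]
                return None if soft == "unlimited" else int(soft)
    raise RuntimeError(f"'{label}' not found in /proc/{pid}/limits")


def fd_headroom(pid):
    """Return (open_fds, soft_limit) for the given process."""
    return count_open_fds(pid), soft_nofile_limit(pid)


if __name__ == "__main__":
    # Inspecting our own process here; substitute the PID of interest.
    used, limit = fd_headroom(os.getpid())
    print(f"open fds: {used}, soft limit: {limit}")
```

Sampling this periodically should show whether the FD count keeps growing between pulls, which is what a leak would look like.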

Comment 19 Peter Hunt 2021-04-23 20:27:44 UTC
Here's another PR that *may* help once integrated (and is a leak regardless, so worth picking up)

Comment 21 Peter Hunt 2021-04-28 13:45:34 UTC
Both attached PRs have merged and will be in the next z-stream.

Comment 24 Sunil Choudhary 2021-05-06 04:20:54 UTC
Tried to trigger the issue locally by setting the ulimit on a node just above what was currently in use and then pulling an image.
Could not reproduce the issue. Also, from the bug description I see that the issue happened randomly. I will mark it verified based on comments 18 and 19.
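
The general idea of that reproduction attempt (lower the open-file limit to just above current usage, then open more descriptors) can be sketched for a single process as follows. This is only an illustration of how the limit produces EMFILE, not the procedure used on the node. It assumes Linux with /proc mounted, and the function name and the margin parameter are hypothetical.

```python
import errno
import os
import resource


def open_until_emfile(margin=8):
    """Lower this process's soft RLIMIT_NOFILE to just above the highest
    fd in use, then open /dev/null until the kernel refuses.

    Returns (number_of_fds_opened, errno_of_the_failure).
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    highest = max(int(name) for name in os.listdir("/proc/self/fd"))
    new_soft = highest + 1 + margin
    if soft != resource.RLIM_INFINITY:
        new_soft = min(new_soft, soft)  # never raise the existing limit
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))

    held = []
    failure = None
    try:
        while True:
            held.append(os.open(os.devnull, os.O_RDONLY))
    except OSError as exc:
        failure = exc.errno
    finally:
        # Release everything and restore the original limit.
        for fd in held:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
    return len(held), failure


if __name__ == "__main__":
    opened, err = open_until_emfile()
    # On glibc systems EMFILE is typically rendered "Too many open files".
    print(f"opened {opened} fds before failing: {os.strerror(err)}")
```

A process that leaks descriptors reaches the same failure without any limit being lowered, just more slowly, which would be consistent with the random reproducibility described in this bug.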

Comment 26 errata-xmlrpc 2021-05-12 12:18:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.28 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1487