Bug 2030821

Summary: cluster-image-registry-operator image-pruner Job sometimes fails in single node clusters
Product: OpenShift Container Platform
Reporter: Omer Tuchfeld <otuchfel>
Component: Image Registry
Assignee: Oleg Bulatov <obulatov>
Status: CLOSED DUPLICATE
QA Contact: XiuJuan Wang <xiuwang>
Severity: unspecified
Priority: unspecified
Version: 4.9
CC: aos-bugs
Last Closed: 2021-12-14 19:15:20 UTC
Type: Bug

Description Omer Tuchfeld 2021-12-09 19:24:22 UTC
Description of problem:
The image-pruner Job created in the openshift-image-registry namespace by the cluster-image-registry-operator sometimes fails during the periodic single-node conformance test jobs.

Version-Release number of selected component (if applicable):
Observed so far in the 4.9 nightly periodic jobs, but it seems it could also happen on 4.10.

How reproducible:
See [2] [3] [4]

Steps to Reproduce:
See linked jobs

Actual results:
The job fails and the operator becomes degraded

Expected results:
The job should not fail, or it should keep retrying so that the operator stops being degraded.

Additional info:
It seems that the job has a backoff limit of 0 [1] (according to the message of the commit that changed it, this was done in order to preserve logs). As a result, the job is attempted only once, and if any issue occurs, the operator stays degraded forever.

I'm not sure about the exact circumstances that cause the job to fail, or why they occur on single-node clusters specifically. Those circumstances, whatever they may be, probably deserve their own separate ticket. Regardless, I would expect the Job to keep retrying even after failures, either by removing the backoff limit or by having the operator re-create the job if it fails.
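
To illustrate why a backoff limit of 0 turns a single transient error into a permanent failure, here is a minimal Python sketch. It is a simplified model and not the operator's or Kubernetes' actual code: `run_job` and `flaky_once` are hypothetical names, and the real Job controller also waits with an increasing delay between retries, which is omitted here. The only behaviour it assumes is that a Job is marked failed once its failures exceed the backoff limit.

```python
from typing import Callable, Tuple


def run_job(backoff_limit: int, attempt: Callable[[int], bool]) -> Tuple[bool, int]:
    """Simplified model of a Job with a backoff limit (hypothetical helper).

    Attempts run until one succeeds or the number of failures exceeds
    backoff_limit. Returns (succeeded, attempts_made).
    """
    failures = 0
    attempts = 0
    while failures <= backoff_limit:
        attempts += 1
        if attempt(attempts):
            return True, attempts
        failures += 1
    return False, attempts


def flaky_once(attempt_number: int) -> bool:
    # Hypothetical pruner run that hits a transient error on its first
    # attempt only, then succeeds on any later attempt.
    return attempt_number > 1


if __name__ == "__main__":
    # Backoff limit 0: one attempt, no retry, the job ends up failed.
    print(run_job(0, flaky_once))
    # A non-zero backoff limit lets the job recover on the second attempt.
    print(run_job(3, flaky_once))
```

In this model the same transient error leaves the job failed with a limit of 0 but is absorbed by one retry with any limit of 1 or more, which is the behaviour requested above.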

[1] https://github.com/openshift/cluster-image-registry-operator/blob/017bc0949bb6d5e57f3a31f4ab1d4f5213584d5c/pkg/resource/prunercronjob.go#L106

[2] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-single-node/1467640520682508288

[3] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-single-node-serial/1467278119713902592

[4] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-single-node/1467278118946344960

Comment 1 Oleg Bulatov 2021-12-14 19:15:20 UTC

*** This bug has been marked as a duplicate of bug 1990125 ***