Bug 2030821

Summary:	cluster-image-registry-operator image-pruner Job sometimes fails in single node clusters
Product:	OpenShift Container Platform	Reporter:	Omer Tuchfeld <otuchfel>
Component:	Image Registry	Assignee:	Oleg Bulatov <obulatov>
Status:	CLOSED DUPLICATE	QA Contact:	XiuJuan Wang <xiuwang>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	4.9	CC:	aos-bugs
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-12-14 19:15:20 UTC	Type:	Bug

Description Omer Tuchfeld 2021-12-09 19:24:22 UTC

Description of problem:
The image-pruner Job that the cluster-image-registry-operator creates in the openshift-image-registry namespace sometimes fails during the periodic single-node conformance test jobs.

Version-Release number of selected component (if applicable):
So far observed only in 4.9 nightly periodic jobs, but it seems it could also happen on 4.10.

How reproducible:
See [2] [3] [4]

Steps to Reproduce:
See linked jobs

Actual results:
The job fails and the operator becomes degraded

Expected results:
The job should not fail, or it should keep retrying so that the operator stops being degraded.

Additional info:
It seems [1] that the job has a backoff limit of 0 (according to the commit message of the change that set it, this was done to preserve logs). As a result, the job is attempted only once, and if any issue occurs, the operator stays degraded forever.

I'm not sure what exactly causes the job to fail, or why it happens on single-node clusters specifically (those circumstances, whatever they may be, probably deserve a separate ticket). Either way, I would expect the Job to keep retrying after a failure, either by removing the backoff limit or by having the operator re-create the job when it fails.
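
To illustrate the effect of the backoff limit described above, here is a minimal sketch in Go. It is not the operator's code and does not use the Kubernetes API; `runJob` and `failFirst` are hypothetical helpers that model only the retry accounting (a Job is considered failed once the number of failed attempts exceeds its backoff limit). The real Job controller also applies an increasing delay between retries, which is omitted here. For reference, when `backoffLimit` is left unset, Kubernetes defaults it to 6.

```go
package main

import (
	"errors"
	"fmt"
)

// runJob is a hypothetical, simplified model of Job retry accounting.
// It calls attempt until it succeeds or until the number of failures
// exceeds backoffLimit. It returns whether the job succeeded and how
// many attempts were made.
func runJob(backoffLimit int, attempt func(n int) error) (bool, int) {
	failures := 0
	for n := 1; ; n++ {
		if err := attempt(n); err == nil {
			return true, n
		}
		failures++
		if failures > backoffLimit {
			return false, n
		}
	}
}

// failFirst returns an attempt function that fails on its first k
// attempts and succeeds afterwards, standing in for a transient error.
func failFirst(k int) func(int) error {
	return func(n int) error {
		if n <= k {
			return errors.New("transient failure")
		}
		return nil
	}
}

func main() {
	// The same single transient failure, with and without retries allowed.
	for _, limit := range []int{0, 3} {
		ok, n := runJob(limit, failFirst(1))
		fmt.Printf("backoffLimit=%d succeeded=%t attempts=%d\n", limit, ok, n)
	}
}
```

In this model, a single transient failure is enough to fail a job whose backoff limit is 0, while the same failure is absorbed by a retry when the limit is greater than 0.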

[1] https://github.com/openshift/cluster-image-registry-operator/blob/017bc0949bb6d5e57f3a31f4ab1d4f5213584d5c/pkg/resource/prunercronjob.go#L106

[2] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-single-node/1467640520682508288

[3] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-single-node-serial/1467278119713902592

[4] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-single-node/1467278118946344960

Comment 1 Oleg Bulatov 2021-12-14 19:15:20 UTC


*** This bug has been marked as a duplicate of bug 1990125 ***