2030821 – cluster-image-registry-operator image-pruner Job sometimes fails in single node clusters

Keywords:
Status:	CLOSED DUPLICATE of bug 1990125
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Image Registry
Sub Component:
Version:	4.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	---
Assignee:	Oleg Bulatov
QA Contact:	XiuJuan Wang
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-12-09 19:24 UTC by Omer Tuchfeld
Modified:	2021-12-14 19:15 UTC
CC List:	1 user
Fixed In Version:
Doc Type:
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-12-14 19:15:20 UTC
Target Upstream Version:
Embargoed:

Description Omer Tuchfeld 2021-12-09 19:24:22 UTC

Description of problem:
The image-pruner Job created in the openshift-image-registry namespace by the cluster-image-registry-operator sometimes fails during the periodic single-node conformance test jobs.

Version-Release number of selected component (if applicable):
So far observed only in the 4.9 nightly periodics, but it seems it could also happen on 4.10.

How reproducible:
See [2] [3] [4]

Steps to Reproduce:
See linked jobs

Actual results:
The job fails and the operator becomes degraded

Expected results:
The job should not fail, or it should keep retrying so that the operator stops being degraded.

Additional info:
It seems the job has a backoff limit of 0 [1]. According to the commit message that introduced this value, it was chosen in order to preserve logs. As a result, the job is attempted only once, and any failure leaves the operator permanently degraded.

I'm not sure about the exact circumstances that cause the job to fail, or why they occur on single-node clusters specifically; those circumstances, whatever they may be, probably deserve a separate ticket. Regardless, I would expect the Job to keep retrying after failures, either by removing the backoff limit or by having the operator re-create the job when it fails.
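
To illustrate why a backoff limit of 0 is so unforgiving, here is a minimal Go sketch of the retry semantics. It is a simulation only, not the operator's code or the Kubernetes Job controller: runWithBackoffLimit, attemptFunc and failFirst are hypothetical names made up for this example, and the real controller also waits with an increasing delay between retries, which is omitted here. The only behavior it assumes is that a Job is marked failed once the number of failed attempts exceeds the backoff limit.

```go
package main

import "errors"

// attemptFunc is a hypothetical stand-in for a single pod run of the
// pruner Job. It receives the 1-based attempt number.
type attemptFunc func(attempt int) error

// runWithBackoffLimit is a simplified model of a Job's backoff limit:
// the job is given up on once the number of failed attempts exceeds
// backoffLimit, so a limit of N allows at most N+1 attempts.
// It returns the number of attempts made and the last error, if any.
func runWithBackoffLimit(backoffLimit int, run attemptFunc) (attempts int, err error) {
	for attempts = 1; ; attempts++ {
		if err = run(attempts); err == nil {
			return attempts, nil // the job succeeded on this attempt
		}
		if attempts > backoffLimit {
			return attempts, err // failures exceeded the limit: job is failed
		}
	}
}

// failFirst returns an attemptFunc that fails the first n attempts
// (simulating a transient problem) and succeeds afterwards.
func failFirst(n int) attemptFunc {
	return func(attempt int) error {
		if attempt <= n {
			return errors.New("transient failure")
		}
		return nil
	}
}
```

With a limit of 0, runWithBackoffLimit(0, failFirst(1)) gives up after the first attempt and returns an error, which matches what the linked jobs show. With any non-zero limit, for example the Kubernetes default of 6, the same single transient failure is absorbed and the second attempt succeeds.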

[1] https://github.com/openshift/cluster-image-registry-operator/blob/017bc0949bb6d5e57f3a31f4ab1d4f5213584d5c/pkg/resource/prunercronjob.go#L106

[2] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-single-node/1467640520682508288

[3] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-single-node-serial/1467278119713902592

[4] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-single-node/1467278118946344960

Comment 1 Oleg Bulatov 2021-12-14 19:15:20 UTC


*** This bug has been marked as a duplicate of bug 1990125 ***
