1879176 – image-registry operator is Degraded after 4.4->4.5 upgrade

Summary: image-registry operator is Degraded after 4.4->4.5 upgrade

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	ImageStreams
Sub Component:
Version:	4.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	4.5.z
Assignee:	Ricardo Maraschini
QA Contact:	XiuJuan Wang
Docs Contact:
URL:
Whiteboard:
Depends On:	1880054
Blocks:

Reported:	2020-09-15 15:35 UTC by Oleg Bulatov
Modified:	2020-10-12 15:48 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: The pruner job degraded the image registry operator whenever a pod with an invalid image reference existed. Consequence: Upgrades are not possible while the image registry operator is degraded, so users had to either remove the offending pods and wait for the next pruning run, or suspend the pruner job, before they could upgrade. Fix: 1. The pruner status is no longer taken into account when computing the operator status. 2. A metric and an alert for failing pruning were added. Result: Users are able to upgrade successfully.
Clone Of:
Environment:
Last Closed:	2020-10-12 15:47:56 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-image-registry-operator pull 612	0	None	closed	Bug 1879176: Alerting on failed image prune job	2021-02-01 10:29:19 UTC
Red Hat Product Errata	RHBA-2020:3843	0	None	None	None	2020-10-12 15:48:20 UTC

Description Oleg Bulatov 2020-09-15 15:35:12 UTC

Description of problem:

When a 4.4 cluster is upgraded to 4.5, the pruner can cause the operator to become Degraded.

It often happens because of invalid references that the pruner cannot parse.

Version-Release number of selected component (if applicable):

4.5.z

How reproducible:

Often

Steps to Reproduce:
1. Create a replicaset with an invalid image reference (for example, `TO_BE_REPLACED` or `FOO_BAR_BAZ`), directly or via a deployment.
2. Wait until a new job is created for the image pruner cronjob.
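
To illustrate why references such as `TO_BE_REPLACED` from step 1 are rejected, here is a minimal Python sketch. It uses a simplified approximation of the container image reference grammar, in which repository path components must be lowercase. The regular expressions and the `looks_like_image_reference` helper are assumptions made for illustration; this is not the parser the pruner actually uses.

```python
import re

# Simplified approximation of the image reference grammar (assumption:
# this is NOT the pruner's real parser, only an illustration).
# Path components must be lowercase alphanumerics joined by separators.
_COMPONENT = r"[a-z0-9]+(?:(?:[._]|__|-+)[a-z0-9]+)*"
# Optional registry host, e.g. "quay.io" or "registry.example.com:5000".
_LABEL = r"[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?"
_HOST = rf"{_LABEL}(?:\.{_LABEL})*(?::[0-9]+)?"
# Optional tag and digest suffixes.
_TAG = r"\w[\w.-]{0,127}"
_DIGEST = r"[A-Za-z][A-Za-z0-9]*(?:[-_+.][A-Za-z][A-Za-z0-9]*)*:[0-9a-fA-F]{32,}"

_REFERENCE = re.compile(
    rf"(?:{_HOST}/)?{_COMPONENT}(?:/{_COMPONENT})*(?::{_TAG})?(?:@{_DIGEST})?",
    re.ASCII,
)


def looks_like_image_reference(ref: str) -> bool:
    """Return True if ref matches the simplified grammar above."""
    return _REFERENCE.fullmatch(ref) is not None


if __name__ == "__main__":
    for candidate in ("TO_BE_REPLACED", "FOO_BAR_BAZ", "quay.io/openshift/origin-cli:latest"):
        print(candidate, looks_like_image_reference(candidate))
```

Under this simplified grammar, the placeholder strings fail because an uppercase name without a registry prefix cannot be a repository path component.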

Actual results:

The operator becomes degraded and blocks upgrades.

Expected results:

There should be a way to unblock upgrades.

Additional info:

Comment 3 XiuJuan Wang 2020-09-28 03:13:37 UTC

Verification steps:

1. Create a replicaset with an invalid image on 4.4.26.
pod-pull-by-digests   0/1     InvalidImageName   0          3m53s

2. Enable the image pruner.

3. Upgrade to 4.5.0-0.nightly-2020-09-26-194704.

4. After the upgrade completes successfully, check the image registry clusteroperator.
The image registry clusteroperator is not Degraded.

5. Check the Alerting page in the web console.
The ImagePrunerIsFailing alert is firing.

6. Query image_registry_operator_image_pruner_job_status on the Metrics page.
image_registry_operator_image_pruner_job_status{endpoint="60000",instance="10.129.0.10:60000",job="image-registry-operator",namespace="openshift-image-registry",pod="cluster-image-registry-operator-69bb48d877-kdfx6",service="image-registry-operator"}  1
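
For anyone scripting this check, the sample line above can be split into metric name, labels, and value with a short Python sketch. The `parse_sample` helper and its regular expressions are hypothetical and handle only a single sample line of this shape. The report does not say what the value encodes, so the sketch only extracts it.

```python
import re

# Hypothetical helper for illustration: parses one sample line of the form
#   metric_name{label="value",...}  number
_SAMPLE = re.compile(
    r'(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'   # metric name
    r'(?:\{(?P<labels>.*)\})?'               # optional {label="value",...}
    r'\s+(?P<value>\S+)'                     # sample value
)
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line: str):
    """Return (metric_name, labels_dict, value) for one sample line."""
    match = _SAMPLE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"not a recognizable sample line: {line!r}")
    labels = dict(_LABEL.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


if __name__ == "__main__":
    # Sample copied from the verification output in this comment.
    sample = (
        'image_registry_operator_image_pruner_job_status{endpoint="60000",'
        'instance="10.129.0.10:60000",job="image-registry-operator",'
        'namespace="openshift-image-registry",'
        'pod="cluster-image-registry-operator-69bb48d877-kdfx6",'
        'service="image-registry-operator"}  1'
    )
    name, labels, value = parse_sample(sample)
    print(name, labels["namespace"], value)
```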

Comment 6 errata-xmlrpc 2020-10-12 15:47:56 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.14 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3843
