Bug 1990125 - co/image-registry is degraded because ImagePrunerDegraded: Job has reached the specified backoff limit
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Image Registry
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Oleg Bulatov
QA Contact: XiuJuan Wang
URL:
Whiteboard:
Duplicates: 1985192 1999564 2002156 2030821
Depends On:
Blocks: 2051692
 
Reported: 2021-08-04 19:43 UTC by Hongkai Liu
Modified: 2024-12-20 20:37 UTC
CC: 13 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Feature: retry running the pruner if it fails.
Reason: if the pruner fails, the image-registry operator reports itself as Degraded until the next successful pruner run (by default the pruner runs once a day).
Result: the operator is more resilient to pruner failures.
Clone Of:
Environment:
Last Closed: 2022-08-10 10:36:53 UTC
Target Upstream Version:
Embargoed:


Attachments
co.object.and.prune.job.log (10.14 KB, application/zip)
2021-08-04 19:43 UTC, Hongkai Liu


Links
Github openshift cluster-image-registry-operator pull 754 (Merged): Bug 1990125: Retry on pruner failures (2022-02-07 19:49:44 UTC)
Red Hat Knowledge Base (Solution) 5367681 (2021-10-29 23:16:24 UTC)
Red Hat Knowledge Base (Solution) 6786021 (2022-03-07 07:03:25 UTC)
Red Hat Product Errata RHSA-2022:5069 (2022-08-10 10:37:07 UTC)

Internal Links: 2055857

Description Hongkai Liu 2021-08-04 19:43:11 UTC
Created attachment 1810990
co.object.and.prune.job.log

Description of problem:
App.CI is the OSD cluster hosting the Prow services for CI and the CI central registry.

oc get clusterversion version
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.3     True        False         2d3h    Error while reconciling 4.8.3: the cluster operator image-registry is degraded
oc get co image-registry
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
image-registry   4.8.3     True        False         True       116d
message: 'ImagePrunerDegraded: Job has reached the specified backoff limit'
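
To see what is actually failing behind that condition, the pruner job's pod logs and the imagepruner resource are the places to look. A minimal sketch (job names vary per run; substitute the failed one):

```
oc -n openshift-image-registry get jobs,pods
oc -n openshift-image-registry logs job/image-pruner-<id>   # substitute the failed job's name
oc get imagepruner/cluster -o yaml                          # pruner settings and reported status
```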


How reproducible:
Has happened very often since 4.8.

Expected results:
I would like to know how to alleviate the issue, because a degraded CO blocks cluster upgrades.
https://coreos.slack.com/archives/CCX9DB894/p1627677014209800?thread_ts=1627672834.195200&cid=CCX9DB894
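
Until the root cause is addressed, the condition can be cleared by getting one successful pruner run, since the operator reports Degraded only until the pruner succeeds again. A sketch of triggering a run manually (assumes the default cronjob name image-pruner):

```
oc -n openshift-image-registry create job --from=cronjob/image-pruner image-pruner-manual
oc -n openshift-image-registry wait --for=condition=complete job/image-pruner-manual --timeout=15m
oc get co image-registry   # DEGRADED should return to False after a successful run
```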

Additional info:
The current theory is that the failing pruner job is caused by busy API servers.
https://coreos.slack.com/archives/CHY2E1BL4/p1627480284014400?thread_ts=1627480035.013300&cid=CHY2E1BL4

which in turn might be caused by huge imagestream objects used by the release controller.
https://coreos.slack.com/archives/CCX9DB894/p1627938902430100?thread_ts=1627926355.395700&cid=CCX9DB894
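
If the huge-imagestream theory holds, a rough way to find oversized objects is to sort imagestreams by serialized size (a sketch; assumes jq is available and uses JSON length as a proxy for object size in etcd):

```
oc get imagestreams --all-namespaces -o json \
  | jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name) \(tojson | length)"' \
  | sort -k2 -rn | head
```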

Comment 1 Steve Kuznetsov 2021-08-06 15:31:42 UTC
Degraded cluster operator is blocking the test platform team from upgrading these clusters, setting to urgent.

Comment 6 Oleg Bulatov 2021-09-21 14:38:21 UTC
*** Bug 2002156 has been marked as a duplicate of this bug. ***

Comment 7 Oleg Bulatov 2021-10-11 14:34:41 UTC
*** Bug 1985192 has been marked as a duplicate of this bug. ***

Comment 8 Oleg Bulatov 2021-10-11 14:58:39 UTC
*** Bug 1999564 has been marked as a duplicate of this bug. ***

Comment 10 dofinn 2021-10-29 03:19:38 UTC
We have been seeing this across our OSD/ROSA fleet. The image-pruner job fails, potentially generating two alerts:

1. KubeJobFailed -> image-pruner
2. ClusterOperatorDegraded WARNING (image-registry-operator)

```
[~ {production} (rhmi2-staging2:default)]$ oc -n openshift-image-registry logs image-pruner-1635472800-trg4x

I1029 02:01:03.978205       1 prune.go:348] Creating image pruner with keepYoungerThan=24h0m0s, keepTagRevisions=3, pruneOverSizeLimit=<nil>, allImages=true
I1029 02:01:04.067657       1 prune.go:474] pod/delete-pipelineruns-1635471600-rfmrl namespace=psaggu-development: container delete-pipelineruns: invalid image reference "image-registry.openshift-image-registry.svc:5000/$environment/cssre-pipelines-builder": invalid reference format - skipping
I1029 02:01:04.068603       1 prune.go:474] pod/delete-pipelineruns-1635472200-fbvmk namespace=psaggu-development: container delete-pipelineruns: invalid image reference "image-registry.openshift-image-registry.svc:5000/$environment/cssre-pipelines-builder": invalid reference format - skipping
I1029 02:01:04.068764       1 prune.go:474] pod/process-pagerduty-services-1635472200-65kbl namespace=psaggu-development: container process-pagerduty-services: invalid image reference "image-registry.openshift-image-registry.svc:5000/$environment/cssre-pipelines-builder": invalid reference format - skipping
I1029 02:01:04.085061       1 prune.go:474] job/delete-pipelineruns-1635471000 namespace=psaggu-development: container delete-pipelineruns: invalid image reference "image-registry.openshift-image-registry.svc:5000/$environment/cssre-pipelines-builder": invalid reference format - skipping
I1029 02:01:04.085161       1 prune.go:474] job/delete-pipelineruns-1635471600 namespace=psaggu-development: container delete-pipelineruns: invalid image reference "image-registry.openshift-image-registry.svc:5000/$environment/cssre-pipelines-builder": invalid reference format - skipping
I1029 02:01:04.085212       1 prune.go:474] job/delete-pipelineruns-1635472200 namespace=psaggu-development: container delete-pipelineruns: invalid image reference "image-registry.openshift-image-registry.svc:5000/$environment/cssre-pipelines-builder": invalid reference format - skipping
I1029 02:01:04.085263       1 prune.go:474] job/process-pagerduty-services-1635471600 namespace=psaggu-development: container process-pagerduty-services: invalid image reference "image-registry.openshift-image-registry.svc:5000/$environment/cssre-pipelines-builder": invalid reference format - skipping
I1029 02:01:04.085314       1 prune.go:474] job/process-pagerduty-services-1635472200 namespace=psaggu-development: container process-pagerduty-services: invalid image reference "image-registry.openshift-image-registry.svc:5000/$environment/cssre-pipelines-builder": invalid reference format - skipping
I1029 02:01:04.085918       1 prune.go:474] cronjob/delete-pipelineruns namespace=psaggu-development: container delete-pipelineruns: invalid image reference "image-registry.openshift-image-registry.svc:5000/$environment/cssre-pipelines-builder": invalid reference format - skipping
I1029 02:01:04.085991       1 prune.go:474] cronjob/process-pagerduty-services namespace=psaggu-development: container process-pagerduty-services: invalid image reference "image-registry.openshift-image-registry.svc:5000/$environment/cssre-pipelines-builder": invalid reference format - skipping
Deleting blob sha256:4498f61e4ddd5a6b5356d252c9f530ca8081debb8d43a8ea1b666d61a9f30215
Deleting blob sha256:37811881e67e1cf0752b30f6dc7976d47f6f5b9f9df419d58b7d4acb6abc6132
Deleting blob sha256:e39535eda81cd415d40669c8e3129e290ef1661e20ba2f2fdbf9656a887fde4d
Deleting image sha256:e39535eda81cd415d40669c8e3129e290ef1661e20ba2f2fdbf9656a887fde4d
error deleting blob sha256:e39535eda81cd415d40669c8e3129e290ef1661e20ba2f2fdbf9656a887fde4d from the registry: 400 Bad Request
Summary: deleted 1 image object(s), deleted 2 blob(s)
imagestream openshift/fis-karaf-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io fis-karaf-openshift)
imagestream openshift/rhdm-kieserver-rhel8: the server is currently unable to handle the request (get imagestreams.image.openshift.io rhdm-kieserver-rhel8)
imagestream openshift/mongodb: the server is currently unable to handle the request (get imagestreams.image.openshift.io mongodb)
imagestream openshift/tools: the server is currently unable to handle the request (get imagestreams.image.openshift.io tools)
imagestream openshift/php: the server is currently unable to handle the request (get imagestreams.image.openshift.io php)
imagestream openshift/rhpam-businesscentral-rhel8: the server is currently unable to handle the request (get imagestreams.image.openshift.io rhpam-businesscentral-rhel8)
imagestream openshift/jboss-datagrid65-client-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io jboss-datagrid65-client-openshift)
imagestream openshift/postgresql: the server is currently unable to handle the request (get imagestreams.image.openshift.io postgresql)
imagestream openshift/ubi8-openjdk-11: the server is currently unable to handle the request (get imagestreams.image.openshift.io ubi8-openjdk-11)
imagestream openshift/jboss-webserver54-openjdk8-tomcat9-openshift-rhel7: the server is currently unable to handle the request (get imagestreams.image.openshift.io jboss-webserver54-openjdk8-tomcat9-openshift-rhel7)
imagestream openshift/redhat-openjdk18-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io redhat-openjdk18-openshift)
imagestream openshift/jboss-processserver64-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io jboss-processserver64-openshift)
imagestream openshift/eap-cd-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io eap-cd-openshift)
imagestream openshift/jboss-datagrid71-client-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io jboss-datagrid71-client-openshift)
imagestream openshift/jboss-datavirt64-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io jboss-datavirt64-openshift)
imagestream openshift/jboss-webserver30-tomcat7-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io jboss-webserver30-tomcat7-openshift)
imagestream cssre-pipelines-staging/cssre-pipelines-builder: the server is currently unable to handle the request (get imagestreams.image.openshift.io cssre-pipelines-builder)
imagestream cssre-pipelines-staging/webhook-proxy: the server is currently unable to handle the request (get imagestreams.image.openshift.io webhook-proxy)
imagestream openshift/java: the server is currently unable to handle the request (get imagestreams.image.openshift.io java)
imagestream openshift/jenkins: the server is currently unable to handle the request (get imagestreams.image.openshift.io jenkins)
imagestream openshift/fuse7-karaf-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io fuse7-karaf-openshift)
imagestream openshift/jboss-fuse70-karaf-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io jboss-fuse70-karaf-openshift)
imagestream openshift/installer: the server is currently unable to handle the request (get imagestreams.image.openshift.io installer)
imagestream openshift/ubi8-openjdk-8: the server is currently unable to handle the request (get imagestreams.image.openshift.io ubi8-openjdk-8)
imagestream openshift/golang: the server is currently unable to handle the request (get imagestreams.image.openshift.io golang)
imagestream openshift/jboss-eap64-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io jboss-eap64-openshift)
imagestream openshift/redhat-sso71-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io redhat-sso71-openshift)
imagestream openshift/openjdk-11-rhel8: the server is currently unable to handle the request (get imagestreams.image.openshift.io openjdk-11-rhel8)
imagestream cssre-pipelines/cssre-pipelines-builder: the server is currently unable to handle the request (get imagestreams.image.openshift.io cssre-pipelines-builder)
imagestream openshift/fis-java-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io fis-java-openshift)
imagestream openshift/jenkins-agent-nodejs: the server is currently unable to handle the request (get imagestreams.image.openshift.io jenkins-agent-nodejs)
imagestream psaggu-development/cssre-pipelines-builder: the server is currently unable to handle the request (get imagestreams.image.openshift.io cssre-pipelines-builder)
image sha256:e39535eda81cd415d40669c8e3129e290ef1661e20ba2f2fdbf9656a887fde4d: failed to delete manifest blob sha256:e39535eda81cd415d40669c8e3129e290ef1661e20ba2f2fdbf9656a887fde4d: 400 Bad Request
```
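
The "the server is currently unable to handle the request" lines above are 503s from the aggregated API that serves imagestreams (openshift-apiserver), which matches the busy-apiserver theory from the description. A quick health check to correlate (illustrative):

```
oc get co openshift-apiserver kube-apiserver
oc get apiservice v1.image.openshift.io   # Available condition of the imagestreams API
```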

Comment 11 W. Trevor King 2021-10-29 23:16:24 UTC
Heh, somehow I subscribed myself to this without dropping some useful links. Better late than never:

https://github.com/openshift/openshift-docs/pull/37229
https://access.redhat.com/solutions/5367681

This can possibly be closed as a dup of ... or bug 1871251?

Comment 12 Devan Goodwin 2021-11-26 12:55:11 UTC
Massive increase in hits for this error across all CI jobs starting Nov 24th in the afternoon. https://search.ci.openshift.org/chart?search=ImagePrunerDegraded%3A+Job+has+reached+the+specified+backoff+limit&maxAge=168h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Coincidentally, we have seen some other AWS issues start spiking hard at the same time (https://bugzilla.redhat.com/show_bug.cgi?id=1898265, for example).

Comment 13 Oleg Bulatov 2021-12-14 19:15:20 UTC
*** Bug 2030821 has been marked as a duplicate of this bug. ***

Comment 14 XiuJuan Wang 2021-12-21 06:26:43 UTC
Hit this issue during a 4.10 installation, but it reproduces at a much lower rate.

Comment 17 XiuJuan Wang 2022-02-08 06:52:19 UTC
Since the API timeout is hard to reproduce, I chose to set ignoreInvalidImageReferences: false to test on a 4.11.0-0.nightly-2022-02-07-232639 cluster.

1. Set
      ignoreInvalidImageReferences: false
      schedule: '* * * * *'
2. Create a pod with an invalid image name (see the command sketch below)
3. Check the image-pruner pod; it retries 5 times
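
Steps 1 and 2 as commands (a sketch; the namespace, pod name, and invalid digest are taken from the log below, and imagepruner/cluster is the cluster-scoped singleton the operator watches):

```
# 1. Fail fast on invalid image references and run the pruner every minute
oc patch imagepruner/cluster --type=merge \
  -p '{"spec":{"ignoreInvalidImageReferences":false,"schedule":"* * * * *"}}'

# 2. Create a pod whose image reference is syntactically invalid
#    ("sha:123" is not a valid digest, so the pruner cannot build its graph)
oc create namespace wxj
oc run prune1 -n wxj --image='quay.io/openshifttest/hello-pod@sha:123'

# 3. Follow the next pruner pod to watch the retries
oc -n openshift-image-registry get pods                    # find the latest image-pruner pod
oc -n openshift-image-registry logs -f image-pruner-<id>   # substitute the pod name
```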

$ oc logs -f image-pruner-27405016-9g2b2
I0208 06:16:03.650686       7 prune.go:347] Creating image pruner with keepYoungerThan=1h0m0s, keepTagRevisions=3, pruneOverSizeLimit=<nil>, allImages=true
Failed to build graph!

The following objects have invalid references:

  pod/prune1 namespace=wxj: container prune1: invalid image reference "quay.io/openshifttest/hello-pod@sha:123": invalid reference format

Either fix the references or delete the objects to make the pruner proceed.
error: failed to build graph - no changes made
attempt #1 has failed (exit code 1), going to make another attempt...
I0208 06:16:34.282692      16 prune.go:347] Creating image pruner with keepYoungerThan=1h0m0s, keepTagRevisions=3, pruneOverSizeLimit=<nil>, allImages=true
Failed to build graph!

The following objects have invalid references:

  pod/prune1 namespace=wxj: container prune1: invalid image reference "quay.io/openshifttest/hello-pod@sha:123": invalid reference format

Either fix the references or delete the objects to make the pruner proceed.
error: failed to build graph - no changes made
attempt #2 has failed (exit code 1), going to make another attempt...
I0208 06:17:34.866676      25 prune.go:347] Creating image pruner with keepYoungerThan=1h0m0s, keepTagRevisions=3, pruneOverSizeLimit=<nil>, allImages=true
Failed to build graph!

The following objects have invalid references:

  pod/prune1 namespace=wxj: container prune1: invalid image reference "quay.io/openshifttest/hello-pod@sha:123": invalid reference format

Either fix the references or delete the objects to make the pruner proceed.
error: failed to build graph - no changes made
attempt #3 has failed (exit code 1), going to make another attempt...
I0208 06:19:05.503129      34 prune.go:347] Creating image pruner with keepYoungerThan=1h0m0s, keepTagRevisions=3, pruneOverSizeLimit=<nil>, allImages=true
Failed to build graph!

The following objects have invalid references:

  pod/prune1 namespace=wxj: container prune1: invalid image reference "quay.io/openshifttest/hello-pod@sha:123": invalid reference format

Either fix the references or delete the objects to make the pruner proceed.
error: failed to build graph - no changes made
attempt #4 has failed (exit code 1), going to make another attempt...
I0208 06:21:06.101268      43 prune.go:347] Creating image pruner with keepYoungerThan=1h0m0s, keepTagRevisions=3, pruneOverSizeLimit=<nil>, allImages=true
Failed to build graph!

The following objects have invalid references:

  pod/prune1 namespace=wxj: container prune1: invalid image reference "quay.io/openshifttest/hello-pod@sha:123": invalid reference format

Either fix the references or delete the objects to make the pruner proceed.
error: failed to build graph - no changes made
attempt #5 has failed (exit code 1), going to make another attempt...
I0208 06:23:36.760677      52 prune.go:347] Creating image pruner with keepYoungerThan=1h0m0s, keepTagRevisions=3, pruneOverSizeLimit=<nil>, allImages=true
Failed to build graph!

The following objects have invalid references:

  pod/prune1 namespace=wxj: container prune1: invalid image reference "quay.io/openshifttest/hello-pod@sha:123": invalid reference format

Either fix the references or delete the objects to make the pruner proceed.
error: failed to build graph - no changes made
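
For reference, the behavior above (re-running the pruner in-pod with growing delays until a backoff limit is hit, rather than degrading on the first failure) is the retry logic added by PR 754. A shell sketch of the semantics only; the actual implementation is Go code in cluster-image-registry-operator, and the attempt limit and delays here are assumed:

```
#!/bin/bash
# Illustrative retry wrapper: keep re-running the pruner until it succeeds
# or the (assumed) attempt limit is reached, backing off between attempts.
max_attempts=6
attempt=0
while true; do
  oc adm prune images --confirm && exit 0   # a successful run clears ImagePrunerDegraded
  rc=$?
  attempt=$((attempt + 1))
  if [ "${attempt}" -ge "${max_attempts}" ]; then
    exit "${rc}"                            # retries exhausted; the Job records the failure
  fi
  echo "attempt #${attempt} has failed (exit code ${rc}), going to make another attempt..."
  sleep $((30 * attempt))                   # wait a little longer each time (assumed backoff)
done
```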

Comment 23 errata-xmlrpc 2022-08-10 10:36:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

