Bug 1990125
Summary: co/image-registry is degraded because ImagePrunerDegraded: Job has reached the specified backoff limit
Product: OpenShift Container Platform
Reporter: Hongkai Liu <hongkliu>
Component: Image Registry
Assignee: Oleg Bulatov <obulatov>
Status: CLOSED ERRATA
QA Contact: XiuJuan Wang <xiuwang>
Severity: high
Priority: high
Version: 4.8
CC: achernet, aos-bugs, bzhai, bzvonar, dgoodwin, dofinn, imatza, skuznets, sreber, travi, vyoganan, wking, xiuwang
Keywords: ServiceDeliveryImpact
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Enhancement
Doc Text:
Feature: Retry the pruner if it fails.
Reason: If the pruner fails, the image-registry operator reports itself as Degraded until the pruner next completes successfully (by default, the pruner runs once a day).
Result: The operator is more resilient to pruner failures.
Last Closed: 2022-08-10 10:36:53 UTC
Type: Bug
Bug Blocks: 2051692
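The Doc Text describes the fix only as "retry the pruner if it fails". The following is a minimal Python sketch of that behaviour under two stated assumptions:

- There is one initial run plus five retries. This matches the QA note further down that the pruner "will retry 5 times".
- The delay grows by 30 seconds per attempt. This is inferred only from the timestamps in that verification log (06:16:03, 06:16:34, 06:17:34, 06:19:05, 06:21:06, 06:23:36). Those times give gaps of roughly 30, 60, 90, 120 and 150 seconds.

`run_with_retries` and `run_command` are hypothetical helpers, not the operator's actual code. Only the message text is copied from the log.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_with_retries(
    run_once: Callable[[], int],
    max_retries: int = 5,
    step_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call run_once (one pruner run) and retry on a non-zero exit code.

    Hypothetical sketch: one initial attempt plus up to max_retries retries,
    waiting step_seconds * attempt between tries (assumed linear backoff).
    Returns the exit code of the last attempt, or 0 as soon as one succeeds.
    """
    exit_code = run_once()
    for attempt in range(1, max_retries + 1):
        if exit_code == 0:
            return 0
        print(
            f"attempt #{attempt} has failed (exit code {exit_code}), "
            "going to make another attempt..."
        )
        sleep(step_seconds * attempt)  # assumed delays: 30s, 60s, 90s, ...
        exit_code = run_once()
    # The final attempt's exit code becomes the Job container's exit code.
    return exit_code


def run_command(argv: Sequence[str]) -> Callable[[], int]:
    """Wrap a command line so that each call runs it once and returns its exit code."""
    return lambda: subprocess.run(list(argv), check=False).returncode
```

A real wrapper would pass `run_command([...])` with the pruner's actual command line, which this report does not show. The `sleep` parameter is injected so that the retry logic can be exercised without waiting minutes. Only after every attempt fails does the Job see a failure, and only then can it count toward the backoff limit named in the summary.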
Description
Hongkai Liu
2021-08-04 19:43:11 UTC
Degraded cluster operator is blocking the test platform team from upgrading these clusters, setting to urgent.

*** Bug 2002156 has been marked as a duplicate of this bug. ***

*** Bug 1985192 has been marked as a duplicate of this bug. ***

*** Bug 1999564 has been marked as a duplicate of this bug. ***

We have been seeing this across our OSD/ROSA fleet. The image-pruner job fails, potentially generating two alerts:

1. KubeJobFailed -> image-pruner
2. ClusterOperatorDegraded WARNING (image-registry-operator)

```
[~ {production} (rhmi2-staging2:default)]$ oc -n openshift-image-registry logs image-pruner-1635472800-trg4x
I1029 02:01:03.978205 1 prune.go:348] Creating image pruner with keepYoungerThan=24h0m0s, keepTagRevisions=3, pruneOverSizeLimit=<nil>, allImages=true
I1029 02:01:04.067657 1 prune.go:474] pod/delete-pipelineruns-1635471600-rfmrl namespace=psaggu-development: container delete-pipelineruns: invalid image reference "image-registry.openshift-image-registry.svc:5000/$environment/cssre-pipelines-builder": invalid reference format - skipping
I1029 02:01:04.068603 1 prune.go:474] pod/delete-pipelineruns-1635472200-fbvmk namespace=psaggu-development: container delete-pipelineruns: invalid image reference "image-registry.openshift-image-registry.svc:5000/$environment/cssre-pipelines-builder": invalid reference format - skipping
I1029 02:01:04.068764 1 prune.go:474] pod/process-pagerduty-services-1635472200-65kbl namespace=psaggu-development: container process-pagerduty-services: invalid image reference "image-registry.openshift-image-registry.svc:5000/$environment/cssre-pipelines-builder": invalid reference format - skipping
I1029 02:01:04.085061 1 prune.go:474] job/delete-pipelineruns-1635471000 namespace=psaggu-development: container delete-pipelineruns: invalid image reference "image-registry.openshift-image-registry.svc:5000/$environment/cssre-pipelines-builder": invalid reference format - skipping
I1029 02:01:04.085161 1 prune.go:474] job/delete-pipelineruns-1635471600 namespace=psaggu-development: container delete-pipelineruns: invalid image reference "image-registry.openshift-image-registry.svc:5000/$environment/cssre-pipelines-builder": invalid reference format - skipping
I1029 02:01:04.085212 1 prune.go:474] job/delete-pipelineruns-1635472200 namespace=psaggu-development: container delete-pipelineruns: invalid image reference "image-registry.openshift-image-registry.svc:5000/$environment/cssre-pipelines-builder": invalid reference format - skipping
I1029 02:01:04.085263 1 prune.go:474] job/process-pagerduty-services-1635471600 namespace=psaggu-development: container process-pagerduty-services: invalid image reference "image-registry.openshift-image-registry.svc:5000/$environment/cssre-pipelines-builder": invalid reference format - skipping
I1029 02:01:04.085314 1 prune.go:474] job/process-pagerduty-services-1635472200 namespace=psaggu-development: container process-pagerduty-services: invalid image reference "image-registry.openshift-image-registry.svc:5000/$environment/cssre-pipelines-builder": invalid reference format - skipping
I1029 02:01:04.085918 1 prune.go:474] cronjob/delete-pipelineruns namespace=psaggu-development: container delete-pipelineruns: invalid image reference "image-registry.openshift-image-registry.svc:5000/$environment/cssre-pipelines-builder": invalid reference format - skipping
I1029 02:01:04.085991 1 prune.go:474] cronjob/process-pagerduty-services namespace=psaggu-development: container process-pagerduty-services: invalid image reference "image-registry.openshift-image-registry.svc:5000/$environment/cssre-pipelines-builder": invalid reference format - skipping
Deleting blob sha256:4498f61e4ddd5a6b5356d252c9f530ca8081debb8d43a8ea1b666d61a9f30215
Deleting blob sha256:37811881e67e1cf0752b30f6dc7976d47f6f5b9f9df419d58b7d4acb6abc6132
Deleting blob sha256:e39535eda81cd415d40669c8e3129e290ef1661e20ba2f2fdbf9656a887fde4d
Deleting image sha256:e39535eda81cd415d40669c8e3129e290ef1661e20ba2f2fdbf9656a887fde4d
error deleting blob sha256:e39535eda81cd415d40669c8e3129e290ef1661e20ba2f2fdbf9656a887fde4d from the registry: 400 Bad Request
Summary: deleted 1 image object(s), deleted 2 blob(s)
imagestream openshift/fis-karaf-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io fis-karaf-openshift)
imagestream openshift/rhdm-kieserver-rhel8: the server is currently unable to handle the request (get imagestreams.image.openshift.io rhdm-kieserver-rhel8)
imagestream openshift/mongodb: the server is currently unable to handle the request (get imagestreams.image.openshift.io mongodb)
imagestream openshift/tools: the server is currently unable to handle the request (get imagestreams.image.openshift.io tools)
imagestream openshift/php: the server is currently unable to handle the request (get imagestreams.image.openshift.io php)
imagestream openshift/rhpam-businesscentral-rhel8: the server is currently unable to handle the request (get imagestreams.image.openshift.io rhpam-businesscentral-rhel8)
imagestream openshift/jboss-datagrid65-client-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io jboss-datagrid65-client-openshift)
imagestream openshift/postgresql: the server is currently unable to handle the request (get imagestreams.image.openshift.io postgresql)
imagestream openshift/ubi8-openjdk-11: the server is currently unable to handle the request (get imagestreams.image.openshift.io ubi8-openjdk-11)
imagestream openshift/jboss-webserver54-openjdk8-tomcat9-openshift-rhel7: the server is currently unable to handle the request (get imagestreams.image.openshift.io jboss-webserver54-openjdk8-tomcat9-openshift-rhel7)
imagestream openshift/redhat-openjdk18-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io redhat-openjdk18-openshift)
imagestream openshift/jboss-processserver64-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io jboss-processserver64-openshift)
imagestream openshift/eap-cd-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io eap-cd-openshift)
imagestream openshift/jboss-datagrid71-client-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io jboss-datagrid71-client-openshift)
imagestream openshift/jboss-datavirt64-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io jboss-datavirt64-openshift)
imagestream openshift/jboss-webserver30-tomcat7-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io jboss-webserver30-tomcat7-openshift)
imagestream cssre-pipelines-staging/cssre-pipelines-builder: the server is currently unable to handle the request (get imagestreams.image.openshift.io cssre-pipelines-builder)
imagestream cssre-pipelines-staging/webhook-proxy: the server is currently unable to handle the request (get imagestreams.image.openshift.io webhook-proxy)
imagestream openshift/java: the server is currently unable to handle the request (get imagestreams.image.openshift.io java)
imagestream openshift/jenkins: the server is currently unable to handle the request (get imagestreams.image.openshift.io jenkins)
imagestream openshift/fuse7-karaf-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io fuse7-karaf-openshift)
imagestream openshift/jboss-fuse70-karaf-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io jboss-fuse70-karaf-openshift)
imagestream openshift/installer: the server is currently unable to handle the request (get imagestreams.image.openshift.io installer)
imagestream openshift/ubi8-openjdk-8: the server is currently unable to handle the request (get imagestreams.image.openshift.io ubi8-openjdk-8)
imagestream openshift/golang: the server is currently unable to handle the request (get imagestreams.image.openshift.io golang)
imagestream openshift/jboss-eap64-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io jboss-eap64-openshift)
imagestream openshift/redhat-sso71-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io redhat-sso71-openshift)
imagestream openshift/openjdk-11-rhel8: the server is currently unable to handle the request (get imagestreams.image.openshift.io openjdk-11-rhel8)
imagestream cssre-pipelines/cssre-pipelines-builder: the server is currently unable to handle the request (get imagestreams.image.openshift.io cssre-pipelines-builder)
imagestream openshift/fis-java-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io fis-java-openshift)
imagestream openshift/jenkins-agent-nodejs: the server is currently unable to handle the request (get imagestreams.image.openshift.io jenkins-agent-nodejs)
imagestream psaggu-development/cssre-pipelines-builder: the server is currently unable to handle the request (get imagestreams.image.openshift.io cssre-pipelines-builder)
image sha256:e39535eda81cd415d40669c8e3129e290ef1661e20ba2f2fdbf9656a887fde4d: failed to delete manifest blob sha256:e39535eda81cd415d40669c8e3129e290ef1661e20ba2f2fdbf9656a887fde4d: 400 Bad Request
```

Heh, somehow I subscribed myself to this without dropping some useful links.
Better late than never:
https://github.com/openshift/openshift-docs/pull/37229
https://access.redhat.com/solutions/5367681

This can possibly be closed as a dup of ... or bug 1871251?

There was a massive increase in hits for this error across all CI jobs, starting on the afternoon of Nov 24th:
https://search.ci.openshift.org/chart?search=ImagePrunerDegraded%3A+Job+has+reached+the+specified+backoff+limit&maxAge=168h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Coincidentally, I have seen some other AWS issues start spiking hard at the same time (for example, https://bugzilla.redhat.com/show_bug.cgi?id=1898265).

*** Bug 2030821 has been marked as a duplicate of this bug. ***

I hit this issue during a 4.10 installation, but it reproduces only rarely. Because the API timeout is hard to reproduce, I forced the pruner to fail by setting `ignoreInvalidImageReferences: false`, and tested on a 4.11.0-0.nightly-2022-02-07-232639 cluster:

1. Set `ignoreInvalidImageReferences: false` and `schedule: '* * * * *'`.
2. Create a pod with an invalid image name.
3. Check the image-pruner pod. It retries 5 times:

```
$ oc logs -f image-pruner-27405016-9g2b2
I0208 06:16:03.650686 7 prune.go:347] Creating image pruner with keepYoungerThan=1h0m0s, keepTagRevisions=3, pruneOverSizeLimit=<nil>, allImages=true
Failed to build graph!
The following objects have invalid references:
pod/prune1 namespace=wxj: container prune1: invalid image reference "quay.io/openshifttest/hello-pod@sha:123": invalid reference format
Either fix the references or delete the objects to make the pruner proceed.
error: failed to build graph - no changes made
attempt #1 has failed (exit code 1), going to make another attempt...
I0208 06:16:34.282692 16 prune.go:347] Creating image pruner with keepYoungerThan=1h0m0s, keepTagRevisions=3, pruneOverSizeLimit=<nil>, allImages=true
Failed to build graph!
The following objects have invalid references:
pod/prune1 namespace=wxj: container prune1: invalid image reference "quay.io/openshifttest/hello-pod@sha:123": invalid reference format
Either fix the references or delete the objects to make the pruner proceed.
error: failed to build graph - no changes made
attempt #2 has failed (exit code 1), going to make another attempt...
I0208 06:17:34.866676 25 prune.go:347] Creating image pruner with keepYoungerThan=1h0m0s, keepTagRevisions=3, pruneOverSizeLimit=<nil>, allImages=true
Failed to build graph!
The following objects have invalid references:
pod/prune1 namespace=wxj: container prune1: invalid image reference "quay.io/openshifttest/hello-pod@sha:123": invalid reference format
Either fix the references or delete the objects to make the pruner proceed.
error: failed to build graph - no changes made
attempt #3 has failed (exit code 1), going to make another attempt...
I0208 06:19:05.503129 34 prune.go:347] Creating image pruner with keepYoungerThan=1h0m0s, keepTagRevisions=3, pruneOverSizeLimit=<nil>, allImages=true
Failed to build graph!
The following objects have invalid references:
pod/prune1 namespace=wxj: container prune1: invalid image reference "quay.io/openshifttest/hello-pod@sha:123": invalid reference format
Either fix the references or delete the objects to make the pruner proceed.
error: failed to build graph - no changes made
attempt #4 has failed (exit code 1), going to make another attempt...
I0208 06:21:06.101268 43 prune.go:347] Creating image pruner with keepYoungerThan=1h0m0s, keepTagRevisions=3, pruneOverSizeLimit=<nil>, allImages=true
Failed to build graph!
The following objects have invalid references:
pod/prune1 namespace=wxj: container prune1: invalid image reference "quay.io/openshifttest/hello-pod@sha:123": invalid reference format
Either fix the references or delete the objects to make the pruner proceed.
error: failed to build graph - no changes made
attempt #5 has failed (exit code 1), going to make another attempt...
I0208 06:23:36.760677 52 prune.go:347] Creating image pruner with keepYoungerThan=1h0m0s, keepTagRevisions=3, pruneOverSizeLimit=<nil>, allImages=true
Failed to build graph!
The following objects have invalid references:
pod/prune1 namespace=wxj: container prune1: invalid image reference "quay.io/openshifttest/hello-pod@sha:123": invalid reference format
Either fix the references or delete the objects to make the pruner proceed.
error: failed to build graph - no changes made
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069