Created attachment 1810990
co.object.and.prune.job.log

Description of problem:
App.CI is the OSD cluster hosting the Prow services for CI and the CI central registry.

oc get clusterversion version
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.3     True        False         2d3h    Error while reconciling 4.8.3: the cluster operator image-registry is degraded

oc get co image-registry
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
image-registry   4.8.3     True        False         True       116d

message: 'ImagePrunerDegraded: Job has reached the specified backoff limit'

Version-Release number of selected component (if applicable):

How reproducible:
Has happened very often since 4.8.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:
I would like to know how to alleviate the issue, because a degraded CO blocks cluster upgrades.
https://coreos.slack.com/archives/CCX9DB894/p1627677014209800?thread_ts=1627672834.195200&cid=CCX9DB894

Additional info:
The current theory is that the pruner job fails because the API servers are busy:
https://coreos.slack.com/archives/CHY2E1BL4/p1627480284014400?thread_ts=1627480035.013300&cid=CHY2E1BL4
This in turn might be caused by the huge imagestream objects used by the release controller:
https://coreos.slack.com/archives/CCX9DB894/p1627938902430100?thread_ts=1627926355.395700&cid=CCX9DB894
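For anyone wanting to detect this state in a script rather than by reading the table output, the Degraded condition can be read from the ClusterOperator object. This is a minimal sketch, assuming the output of `oc get co image-registry -o json` has already been loaded into a dict. The `sample` object is hypothetical and heavily trimmed; only the message text is taken from this report.

```python
from typing import Optional


def degraded_message(cluster_operator: dict) -> Optional[str]:
    """Return the Degraded condition's message if its status is "True", else None."""
    conditions = cluster_operator.get("status", {}).get("conditions", [])
    for cond in conditions:
        if cond.get("type") == "Degraded" and cond.get("status") == "True":
            return cond.get("message", "")
    return None


# Hypothetical, trimmed object shaped like a ClusterOperator resource.
sample = {
    "metadata": {"name": "image-registry"},
    "status": {
        "conditions": [
            {"type": "Available", "status": "True"},
            {"type": "Progressing", "status": "False"},
            {
                "type": "Degraded",
                "status": "True",
                "message": "ImagePrunerDegraded: Job has reached the specified backoff limit",
            },
        ]
    },
}

if __name__ == "__main__":
    print(degraded_message(sample))
```

Returning `None` for a healthy operator keeps "not degraded" distinct from "degraded with an empty message".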
The degraded cluster operator is blocking the test platform team from upgrading these clusters; setting to urgent.
*** Bug 2002156 has been marked as a duplicate of this bug. ***
*** Bug 1985192 has been marked as a duplicate of this bug. ***
*** Bug 1999564 has been marked as a duplicate of this bug. ***
We have been seeing this across our OSD/ROSA fleet. When the image-pruner job fails, it can generate two alerts:
1. KubeJobFailed -> image-pruner
2. ClusterOperatorDegraded WARNING (image-registry-operator)

```
[~ {production} (rhmi2-staging2:default)]$ oc -n openshift-image-registry logs image-pruner-1635472800-trg4x
I1029 02:01:03.978205 1 prune.go:348] Creating image pruner with keepYoungerThan=24h0m0s, keepTagRevisions=3, pruneOverSizeLimit=<nil>, allImages=true
I1029 02:01:04.067657 1 prune.go:474] pod/delete-pipelineruns-1635471600-rfmrl namespace=psaggu-development: container delete-pipelineruns: invalid image reference "image-registry.openshift-image-registry.svc:5000/$environment/cssre-pipelines-builder": invalid reference format - skipping
I1029 02:01:04.068603 1 prune.go:474] pod/delete-pipelineruns-1635472200-fbvmk namespace=psaggu-development: container delete-pipelineruns: invalid image reference "image-registry.openshift-image-registry.svc:5000/$environment/cssre-pipelines-builder": invalid reference format - skipping
I1029 02:01:04.068764 1 prune.go:474] pod/process-pagerduty-services-1635472200-65kbl namespace=psaggu-development: container process-pagerduty-services: invalid image reference "image-registry.openshift-image-registry.svc:5000/$environment/cssre-pipelines-builder": invalid reference format - skipping
I1029 02:01:04.085061 1 prune.go:474] job/delete-pipelineruns-1635471000 namespace=psaggu-development: container delete-pipelineruns: invalid image reference "image-registry.openshift-image-registry.svc:5000/$environment/cssre-pipelines-builder": invalid reference format - skipping
I1029 02:01:04.085161 1 prune.go:474] job/delete-pipelineruns-1635471600 namespace=psaggu-development: container delete-pipelineruns: invalid image reference "image-registry.openshift-image-registry.svc:5000/$environment/cssre-pipelines-builder": invalid reference format - skipping
I1029 02:01:04.085212 1 prune.go:474] job/delete-pipelineruns-1635472200 namespace=psaggu-development: container delete-pipelineruns: invalid image reference "image-registry.openshift-image-registry.svc:5000/$environment/cssre-pipelines-builder": invalid reference format - skipping
I1029 02:01:04.085263 1 prune.go:474] job/process-pagerduty-services-1635471600 namespace=psaggu-development: container process-pagerduty-services: invalid image reference "image-registry.openshift-image-registry.svc:5000/$environment/cssre-pipelines-builder": invalid reference format - skipping
I1029 02:01:04.085314 1 prune.go:474] job/process-pagerduty-services-1635472200 namespace=psaggu-development: container process-pagerduty-services: invalid image reference "image-registry.openshift-image-registry.svc:5000/$environment/cssre-pipelines-builder": invalid reference format - skipping
I1029 02:01:04.085918 1 prune.go:474] cronjob/delete-pipelineruns namespace=psaggu-development: container delete-pipelineruns: invalid image reference "image-registry.openshift-image-registry.svc:5000/$environment/cssre-pipelines-builder": invalid reference format - skipping
I1029 02:01:04.085991 1 prune.go:474] cronjob/process-pagerduty-services namespace=psaggu-development: container process-pagerduty-services: invalid image reference "image-registry.openshift-image-registry.svc:5000/$environment/cssre-pipelines-builder": invalid reference format - skipping
Deleting blob sha256:4498f61e4ddd5a6b5356d252c9f530ca8081debb8d43a8ea1b666d61a9f30215
Deleting blob sha256:37811881e67e1cf0752b30f6dc7976d47f6f5b9f9df419d58b7d4acb6abc6132
Deleting blob sha256:e39535eda81cd415d40669c8e3129e290ef1661e20ba2f2fdbf9656a887fde4d
Deleting image sha256:e39535eda81cd415d40669c8e3129e290ef1661e20ba2f2fdbf9656a887fde4d
error deleting blob sha256:e39535eda81cd415d40669c8e3129e290ef1661e20ba2f2fdbf9656a887fde4d from the registry: 400 Bad Request
Summary: deleted 1 image object(s), deleted 2 blob(s)
imagestream openshift/fis-karaf-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io fis-karaf-openshift)
imagestream openshift/rhdm-kieserver-rhel8: the server is currently unable to handle the request (get imagestreams.image.openshift.io rhdm-kieserver-rhel8)
imagestream openshift/mongodb: the server is currently unable to handle the request (get imagestreams.image.openshift.io mongodb)
imagestream openshift/tools: the server is currently unable to handle the request (get imagestreams.image.openshift.io tools)
imagestream openshift/php: the server is currently unable to handle the request (get imagestreams.image.openshift.io php)
imagestream openshift/rhpam-businesscentral-rhel8: the server is currently unable to handle the request (get imagestreams.image.openshift.io rhpam-businesscentral-rhel8)
imagestream openshift/jboss-datagrid65-client-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io jboss-datagrid65-client-openshift)
imagestream openshift/postgresql: the server is currently unable to handle the request (get imagestreams.image.openshift.io postgresql)
imagestream openshift/ubi8-openjdk-11: the server is currently unable to handle the request (get imagestreams.image.openshift.io ubi8-openjdk-11)
imagestream openshift/jboss-webserver54-openjdk8-tomcat9-openshift-rhel7: the server is currently unable to handle the request (get imagestreams.image.openshift.io jboss-webserver54-openjdk8-tomcat9-openshift-rhel7)
imagestream openshift/redhat-openjdk18-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io redhat-openjdk18-openshift)
imagestream openshift/jboss-processserver64-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io jboss-processserver64-openshift)
imagestream openshift/eap-cd-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io eap-cd-openshift)
imagestream openshift/jboss-datagrid71-client-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io jboss-datagrid71-client-openshift)
imagestream openshift/jboss-datavirt64-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io jboss-datavirt64-openshift)
imagestream openshift/jboss-webserver30-tomcat7-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io jboss-webserver30-tomcat7-openshift)
imagestream cssre-pipelines-staging/cssre-pipelines-builder: the server is currently unable to handle the request (get imagestreams.image.openshift.io cssre-pipelines-builder)
imagestream cssre-pipelines-staging/webhook-proxy: the server is currently unable to handle the request (get imagestreams.image.openshift.io webhook-proxy)
imagestream openshift/java: the server is currently unable to handle the request (get imagestreams.image.openshift.io java)
imagestream openshift/jenkins: the server is currently unable to handle the request (get imagestreams.image.openshift.io jenkins)
imagestream openshift/fuse7-karaf-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io fuse7-karaf-openshift)
imagestream openshift/jboss-fuse70-karaf-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io jboss-fuse70-karaf-openshift)
imagestream openshift/installer: the server is currently unable to handle the request (get imagestreams.image.openshift.io installer)
imagestream openshift/ubi8-openjdk-8: the server is currently unable to handle the request (get imagestreams.image.openshift.io ubi8-openjdk-8)
imagestream openshift/golang: the server is currently unable to handle the request (get imagestreams.image.openshift.io golang)
imagestream openshift/jboss-eap64-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io jboss-eap64-openshift)
imagestream openshift/redhat-sso71-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io redhat-sso71-openshift)
imagestream openshift/openjdk-11-rhel8: the server is currently unable to handle the request (get imagestreams.image.openshift.io openjdk-11-rhel8)
imagestream cssre-pipelines/cssre-pipelines-builder: the server is currently unable to handle the request (get imagestreams.image.openshift.io cssre-pipelines-builder)
imagestream openshift/fis-java-openshift: the server is currently unable to handle the request (get imagestreams.image.openshift.io fis-java-openshift)
imagestream openshift/jenkins-agent-nodejs: the server is currently unable to handle the request (get imagestreams.image.openshift.io jenkins-agent-nodejs)
imagestream psaggu-development/cssre-pipelines-builder: the server is currently unable to handle the request (get imagestreams.image.openshift.io cssre-pipelines-builder)
image sha256:e39535eda81cd415d40669c8e3129e290ef1661e20ba2f2fdbf9656a887fde4d: failed to delete manifest blob sha256:e39535eda81cd415d40669c8e3129e290ef1661e20ba2f2fdbf9656a887fde4d: 400 Bad Request
```
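The log above mixes three different kinds of messages: invalid image references that the pruner skips, imagestream requests rejected by the API server, and 400 responses from the registry. When triaging many of these logs across a fleet, a small classifier can help show which kind dominates. The following is only a sketch: the category names and the `summarize` helper are made up for illustration, and the patterns match nothing more than the message text visible in this log.

```python
import re
from collections import Counter

# Category names are illustrative; patterns come from the log text above.
PATTERNS = [
    ("invalid_reference_skipped", re.compile(r"invalid reference format - skipping")),
    ("api_unavailable", re.compile(r"the server is currently unable to handle the request")),
    ("registry_bad_request", re.compile(r"400 Bad Request")),
]


def classify(line: str) -> str:
    """Return the first matching category for a log line, or "other"."""
    for name, pattern in PATTERNS:
        if pattern.search(line):
            return name
    return "other"


def summarize(log_text: str) -> Counter:
    """Count non-empty log lines per category."""
    return Counter(classify(line) for line in log_text.splitlines() if line.strip())


# A few lines copied from the pruner log in this comment.
SAMPLE_LOG = '''
I1029 02:01:04.067657 1 prune.go:474] pod/delete-pipelineruns-1635471600-rfmrl namespace=psaggu-development: container delete-pipelineruns: invalid image reference "image-registry.openshift-image-registry.svc:5000/$environment/cssre-pipelines-builder": invalid reference format - skipping
error deleting blob sha256:e39535eda81cd415d40669c8e3129e290ef1661e20ba2f2fdbf9656a887fde4d from the registry: 400 Bad Request
Summary: deleted 1 image object(s), deleted 2 blob(s)
imagestream openshift/php: the server is currently unable to handle the request (get imagestreams.image.openshift.io php)
'''

if __name__ == "__main__":
    for category, count in summarize(SAMPLE_LOG).most_common():
        print(f"{category}: {count}")
```

In this particular log the skipped invalid references are informational only, so a high `api_unavailable` count would point toward the busy-API-server theory from the original report.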
Heh, somehow I subscribed myself to this without dropping some useful links. Better late than never:
https://github.com/openshift/openshift-docs/pull/37229
https://access.redhat.com/solutions/5367681
This can possibly be closed as a dup of ... or bug 1871251?
Massive increase in hits for this error across all CI jobs, starting in the afternoon of Nov 24th:
https://search.ci.openshift.org/chart?search=ImagePrunerDegraded%3A+Job+has+reached+the+specified+backoff+limit&maxAge=168h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Coincidentally, we have seen some other AWS issues start spiking hard at the same time (https://bugzilla.redhat.com/show_bug.cgi?id=1898265, for example).
*** Bug 2030821 has been marked as a duplicate of this bug. ***
Hit this issue during a 4.10 installation, but it reproduces less often there.
Since the API timeout is hard to reproduce, I chose to set ignoreInvalidImageReferences: false to test on a 4.11.0-0.nightly-2022-02-07-232639 cluster.

1. Set ignoreInvalidImageReferences: false and schedule: '* * * * *'
2. Create a pod with an invalid image name.
3. Check the image pruner pod; it retries 5 times.

$ oc logs -f image-pruner-27405016-9g2b2
I0208 06:16:03.650686 7 prune.go:347] Creating image pruner with keepYoungerThan=1h0m0s, keepTagRevisions=3, pruneOverSizeLimit=<nil>, allImages=true
Failed to build graph!
The following objects have invalid references:
pod/prune1 namespace=wxj: container prune1: invalid image reference "quay.io/openshifttest/hello-pod@sha:123": invalid reference format
Either fix the references or delete the objects to make the pruner proceed.
error: failed to build graph - no changes made
attempt #1 has failed (exit code 1), going to make another attempt...

I0208 06:16:34.282692 16 prune.go:347] Creating image pruner with keepYoungerThan=1h0m0s, keepTagRevisions=3, pruneOverSizeLimit=<nil>, allImages=true
Failed to build graph!
The following objects have invalid references:
pod/prune1 namespace=wxj: container prune1: invalid image reference "quay.io/openshifttest/hello-pod@sha:123": invalid reference format
Either fix the references or delete the objects to make the pruner proceed.
error: failed to build graph - no changes made
attempt #2 has failed (exit code 1), going to make another attempt...

I0208 06:17:34.866676 25 prune.go:347] Creating image pruner with keepYoungerThan=1h0m0s, keepTagRevisions=3, pruneOverSizeLimit=<nil>, allImages=true
Failed to build graph!
The following objects have invalid references:
pod/prune1 namespace=wxj: container prune1: invalid image reference "quay.io/openshifttest/hello-pod@sha:123": invalid reference format
Either fix the references or delete the objects to make the pruner proceed.
error: failed to build graph - no changes made
attempt #3 has failed (exit code 1), going to make another attempt...

I0208 06:19:05.503129 34 prune.go:347] Creating image pruner with keepYoungerThan=1h0m0s, keepTagRevisions=3, pruneOverSizeLimit=<nil>, allImages=true
Failed to build graph!
The following objects have invalid references:
pod/prune1 namespace=wxj: container prune1: invalid image reference "quay.io/openshifttest/hello-pod@sha:123": invalid reference format
Either fix the references or delete the objects to make the pruner proceed.
error: failed to build graph - no changes made
attempt #4 has failed (exit code 1), going to make another attempt...

I0208 06:21:06.101268 43 prune.go:347] Creating image pruner with keepYoungerThan=1h0m0s, keepTagRevisions=3, pruneOverSizeLimit=<nil>, allImages=true
Failed to build graph!
The following objects have invalid references:
pod/prune1 namespace=wxj: container prune1: invalid image reference "quay.io/openshifttest/hello-pod@sha:123": invalid reference format
Either fix the references or delete the objects to make the pruner proceed.
error: failed to build graph - no changes made
attempt #5 has failed (exit code 1), going to make another attempt...

I0208 06:23:36.760677 52 prune.go:347] Creating image pruner with keepYoungerThan=1h0m0s, keepTagRevisions=3, pruneOverSizeLimit=<nil>, allImages=true
Failed to build graph!
The following objects have invalid references:
pod/prune1 namespace=wxj: container prune1: invalid image reference "quay.io/openshifttest/hello-pod@sha:123": invalid reference format
Either fix the references or delete the objects to make the pruner proceed.
error: failed to build graph - no changes made
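The spacing of the six attempts in this log can be checked with a little arithmetic on the timestamps of the "Creating image pruner" lines. The sketch below computes the gap between consecutive attempt starts; the `attempt_gaps` helper is hypothetical, and the timestamps are copied from the log. The gaps come out at roughly 30, 60, 90, 120, and 150 seconds, which is consistent with a wait that grows by about 30 seconds per retry. That reading is an inference from this one log, not something confirmed against the pruner's source.

```python
from datetime import datetime

# Start times of the six attempts, copied from the log above.
ATTEMPT_STARTS = [
    "06:16:03.650686",
    "06:16:34.282692",
    "06:17:34.866676",
    "06:19:05.503129",
    "06:21:06.101268",
    "06:23:36.760677",
]


def attempt_gaps(timestamps):
    """Return the whole-second gap between each pair of consecutive timestamps."""
    times = [datetime.strptime(t, "%H:%M:%S.%f") for t in timestamps]
    return [round((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    for attempt, gap in enumerate(attempt_gaps(ATTEMPT_STARTS), start=1):
        print(f"attempt #{attempt} -> next attempt: {gap}s")
```

Each gap includes the time the failed attempt itself took (well under a second here), so the actual sleep between attempts is slightly shorter than the printed value.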
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069