Bug 1890828 - Intermittent prune job failures causing operator degradation
Summary: Intermittent prune job failures causing operator degradation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Image Registry
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Oleg Bulatov
QA Contact: Wenjing Zheng
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-10-23 00:48 UTC by Matt Bargenquast
Modified: 2021-07-27 22:34 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The image pruner aborted its run when it failed to delete an image.
Consequence: When two pruners tried to delete the same image concurrently, one of them failed with a "not found" error.
Fix: The pruner now ignores "not found" errors when deleting images.
Result: The pruner tolerates concurrent deletions.
Clone Of:
Environment:
Last Closed: 2021-07-27 22:33:58 UTC
Target Upstream Version:
Embargoed:


Attachments
Final error dump from failed image-prune job (11.30 KB, text/plain)
2020-10-23 00:48 UTC, Matt Bargenquast


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift oc pull 805 | 0 | None | open | Bug 1890828: Skip images that has already been deleted | 2021-04-08 13:49:20 UTC
Red Hat Product Errata | RHSA-2021:2438 | 0 | None | None | None | 2021-07-27 22:34:13 UTC

Description Matt Bargenquast 2020-10-23 00:48:12 UTC
Created attachment 1723639
Final error dump from failed image-prune job

Description of problem:

The nightly image-pruner job occasionally fails when pruning images. When this occurs, the image-registry operator remains in a degraded state until the failed job is removed.

Version-Release number of selected component (if applicable):

Observed in 4.6.0rc3 and 4.6.0rc4 clusters.

How reproducible:

It seems to be unpredictable. Over the course of the last week I've seen it occur on three separate clusters, not always on the same days.

On one cluster it failed two days in a row, succeeded on the third, then failed again on the fourth.

On another cluster, it failed on one day, succeeded on the subsequent day, and failed again on the day after that.

Steps to Reproduce:

No deterministic steps are known. The failure can occur during any run of the nightly pruning job.

Actual results:

The prune job fails, and the job log contains the error below. The image SHA-256 digest differs from cluster to cluster. The full stack dump from one cluster is included as an attachment.

The image-registry operator subsequently goes into a degraded state until the job is deleted.

F1023 00:00:09.700432       1 helpers.go:115] error: image sha256:e57c71e6bb180424e5d4a27d629609847686385880105fb52fc8fe190ff57a4a: failed to delete image sha256:e57c71e6bb180424e5d4a27d629609847686385880105fb52fc8fe190ff57a4a: images.image.openshift.io "sha256:e57c71e6bb180424e5d4a27d629609847686385880105fb52fc8fe190ff57a4a" not found
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc000012001, 0xc00014a700, 0x152, 0x301)
        /go/src/github.com/openshift/oc/vendor/k8s.io/klog/v2/klog.go:996 +0xb9
k8s.io/klog/v2.(*loggingT).output(0x4d2aec0, 0xc000000003, 0x0, 0x0, 0xc0000d03f0, 0x48a9af4, 0xa, 0x73, 0x41d400)
        /go/src/github.com/openshift/oc/vendor/k8s.io/klog/v2/klog.go:945 +0x191
k8s.io/klog/v2.(*loggingT).printDepth(0x4d2aec0, 0x3, 0x0, 0x0, 0x2, 0xc00280d738, 0x1, 0x1)
        /go/src/github.com/openshift/oc/vendor/k8s.io/klog/v2/klog.go:718 +0x165
k8s.io/klog/v2.FatalDepth(...)


Expected results:

The prune job should succeed.

Additional info:

Comment 9 Wenjing Zheng 2021-04-22 08:25:14 UTC
Verified with 4.8.0-0.nightly-2021-04-22-013545:
1. # oc edit imagepruner
spec:
  failedJobsHistoryLimit: 3
  ignoreInvalidImageReferences: false
  keepTagRevisions: 0
  keepYoungerThan: 0
  logLevel: Normal
  schedule: '*/1 * * * *'
  successfulJobsHistoryLimit: 3
  suspend: false
2. $ cat bug
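#!/bin/bash
# Repeatedly build an app, untag its image, and prune by hand while the
# scheduled pruner (every minute, per step 1) runs at the same time.
for (( i=1; i<=100; i++ ))
do
  ./oc new-project "wzhengc${i}"
  ./oc new-app ruby~https://github.com/openshift/ruby-ex
  sleep 20
  # Extra builds push additional image revisions to ruby-ex:latest.
  ./oc start-build ruby-ex
  ./oc start-build ruby-ex
  ./oc start-build ruby-ex
  sleep 80
  # Remove the tag so its images are no longer referenced by it.
  ./oc delete imagestreamtag ruby-ex:latest
  ./oc adm prune images --keep-younger-than=0 --keep-tag-revisions=1 --prune-registry=true --confirm=true  --registry-url=default-route-openshift-image-registry.apps.qe-groupd-0422.qe.devcluster.openshift.com
done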

Comment 12 errata-xmlrpc 2021-07-27 22:33:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

