Bug 1890828

Summary: Intermittent prune job failures causing operator degradation
Product: OpenShift Container Platform Reporter: Matt Bargenquast <mbargenq>
Component: Image Registry Assignee: Oleg Bulatov <obulatov>
Status: CLOSED ERRATA QA Contact: Wenjing Zheng <wzheng>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.6 CC: aos-bugs, jeder, nmalik, rmarasch, travi, wking
Target Milestone: --- Keywords: ServiceDeliveryImpact
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: The image pruner aborted its run when it failed to delete an image.
Consequence: When two pruners tried to delete the same image concurrently, one of them failed with a "not found" error.
Fix: The pruner now ignores "not found" errors.
Result: The pruner tolerates concurrent deletions.
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-07-27 22:33:58 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Final error dump from failed image-prune job none

Description Matt Bargenquast 2020-10-23 00:48:12 UTC
Created attachment 1723639
Final error dump from failed image-prune job

Description of problem:

The nightly image-pruner job occasionally fails when pruning images. When this occurs, the image-registry operator remains in a degraded state until the failed job is removed.

Version-Release number of selected component (if applicable):

Observed in 4.6.0rc3 and 4.6.0rc4 clusters.

How reproducible:

It seems to be unpredictable. Over the course of the last week I've seen it occur on three separate clusters, not always on the same days.

On one cluster it failed two days in a row, succeeded on the third, then failed again on the fourth.

On another cluster, it failed on one day, succeeded on the subsequent day, and failed again on the day after that.

Steps to Reproduce:

No deterministic steps are known. The failure can occur during any run of the nightly pruning job.

Actual results:

The prune job fails, and the job log contains the error below. The image SHA256 digest differs from cluster to cluster. The full stack dump from one cluster is included as an attachment.

The image-registry operator subsequently goes into a degraded state until the job is deleted.

F1023 00:00:09.700432       1 helpers.go:115] error: image sha256:e57c71e6bb180424e5d4a27d629609847686385880105fb52fc8fe190ff57a4a: failed to delete image sha256:e57c71e6bb180424e5d4a27d629609847686385880105fb52fc8fe190ff57a4a: images.image.openshift.io "sha256:e57c71e6bb180424e5d4a27d629609847686385880105fb52fc8fe190ff57a4a" not found
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc000012001, 0xc00014a700, 0x152, 0x301)
        /go/src/github.com/openshift/oc/vendor/k8s.io/klog/v2/klog.go:996 +0xb9
k8s.io/klog/v2.(*loggingT).output(0x4d2aec0, 0xc000000003, 0x0, 0x0, 0xc0000d03f0, 0x48a9af4, 0xa, 0x73, 0x41d400)
        /go/src/github.com/openshift/oc/vendor/k8s.io/klog/v2/klog.go:945 +0x191
k8s.io/klog/v2.(*loggingT).printDepth(0x4d2aec0, 0x3, 0x0, 0x0, 0x2, 0xc00280d738, 0x1, 0x1)
        /go/src/github.com/openshift/oc/vendor/k8s.io/klog/v2/klog.go:718 +0x165
k8s.io/klog/v2.FatalDepth(...)


Expected results:

The prune job should succeed.

Additional info:

Comment 9 Wenjing Zheng 2021-04-22 08:25:14 UTC
Verified with 4.8.0-0.nightly-2021-04-22-013545:
1. # oc edit imagepruner
spec:
  failedJobsHistoryLimit: 3
  ignoreInvalidImageReferences: false
  keepTagRevisions: 0
  keepYoungerThan: 0
  logLevel: Normal
  schedule: '*/1 * * * *'
  successfulJobsHistoryLimit: 3
  suspend: false
2. $ cat bug
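#!/bin/bash
# Repeat 100 times: build images in a new project, then prune them manually.
for (( i=1; i<=100; i++ ))
do
  ./oc new-project wzhengc$i
  # Create the sample app from the ruby-ex repository.
  ./oc new-app ruby~https://github.com/openshift/ruby-ex
  sleep 20
  # Start three more builds so that several image revisions are pushed.
  ./oc start-build ruby-ex
  ./oc start-build ruby-ex
  ./oc start-build ruby-ex
  sleep 80
  # Delete the tag so that the images it referenced can be pruned.
  ./oc delete imagestreamtag ruby-ex:latest
  # Prune manually while the pruner job from step 1 is scheduled every minute.
  ./oc adm prune images --keep-younger-than=0 --keep-tag-revisions=1 --prune-registry=true --confirm=true  --registry-url=default-route-openshift-image-registry.apps.qe-groupd-0422.qe.devcluster.openshift.com
done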

Comment 12 errata-xmlrpc 2021-07-27 22:33:58 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438