1890828 – Intermittent prune job failures causing operator degradation

Summary: Intermittent prune job failures causing operator degradation

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Image Registry
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Oleg Bulatov
QA Contact:	Wenjing Zheng
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-10-23 00:48 UTC by Matt Bargenquast
Modified:	2021-07-27 22:34 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: The image pruner aborted its run when it failed to delete an image. Consequence: When two pruners tried to delete the same image concurrently, one of them failed with a "not found" error. Fix: The pruner ignores "not found" errors when deleting images. Result: The pruner can tolerate concurrent deletions.
Clone Of:
Environment:
Last Closed:	2021-07-27 22:33:58 UTC
Target Upstream Version:
Embargoed:

Attachments
Final error dump from failed image-prune job (11.30 KB, text/plain) 2020-10-23 00:48 UTC, Matt Bargenquast	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift oc pull 805	0	None	open	Bug 1890828: Skip images that has already been deleted	2021-04-08 13:49:20 UTC
Red Hat Product Errata	RHSA-2021:2438	0	None	None	None	2021-07-27 22:34:13 UTC

Description Matt Bargenquast 2020-10-23 00:48:12 UTC

Created attachment 1723639
Final error dump from failed image-prune job

Description of problem:

The nightly image-pruner job occasionally fails when pruning images. When this occurs, the image-registry operator stays in a degraded state until the failed job is removed.

Version-Release number of selected component (if applicable):

Observed in 4.6.0rc3 and 4.6.0rc4 clusters.

How reproducible:

It seems to be unpredictable. Over the course of the last week I've seen it occur on three separate clusters, not always on the same days.

On one cluster it failed two days in a row, succeeded on the third, then failed again on the fourth.

On another cluster, it failed on one day, succeeded on the subsequent day, and failed again on the day after that.

Steps to Reproduce:

No deterministic steps are known. The failure can occur during any run of the nightly pruning job.

Actual results:

The prune job fails. The job log contains the error below. The image SHA256 digest differs from cluster to cluster. The full stack dump from one cluster is included as an attachment.

The image-registry operator subsequently goes into a degraded state until the job is deleted.

F1023 00:00:09.700432       1 helpers.go:115] error: image sha256:e57c71e6bb180424e5d4a27d629609847686385880105fb52fc8fe190ff57a4a: failed to delete image sha256:e57c71e6bb180424e5d4a27d629609847686385880105fb52fc8fe190ff57a4a: images.image.openshift.io "sha256:e57c71e6bb180424e5d4a27d629609847686385880105fb52fc8fe190ff57a4a" not found
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc000012001, 0xc00014a700, 0x152, 0x301)
        /go/src/github.com/openshift/oc/vendor/k8s.io/klog/v2/klog.go:996 +0xb9
k8s.io/klog/v2.(*loggingT).output(0x4d2aec0, 0xc000000003, 0x0, 0x0, 0xc0000d03f0, 0x48a9af4, 0xa, 0x73, 0x41d400)
        /go/src/github.com/openshift/oc/vendor/k8s.io/klog/v2/klog.go:945 +0x191
k8s.io/klog/v2.(*loggingT).printDepth(0x4d2aec0, 0x3, 0x0, 0x0, 0x2, 0xc00280d738, 0x1, 0x1)
        /go/src/github.com/openshift/oc/vendor/k8s.io/klog/v2/klog.go:718 +0x165
k8s.io/klog/v2.FatalDepth(...)


Expected results:

The prune job should succeed.

Additional info:

Comment 9 Wenjing Zheng 2021-04-22 08:25:14 UTC

Verified with 4.8.0-0.nightly-2021-04-22-013545:
1. # oc edit imagepruner
spec:
  failedJobsHistoryLimit: 3
  ignoreInvalidImageReferences: false
  keepTagRevisions: 0
  keepYoungerThan: 0
  logLevel: Normal
  schedule: '*/1 * * * *'
  successfulJobsHistoryLimit: 3
  suspend: false
2. $ cat bug
#!/bin/bash
for (( i=1; i<=100; i++ ))
do
  ./oc new-project wzhengc$i
  ./oc new-app ruby~https://github.com/openshift/ruby-ex
  sleep 20
  ./oc start-build ruby-ex
  ./oc start-build ruby-ex
  ./oc start-build ruby-ex
  sleep 80
  ./oc delete imagestreamtag ruby-ex:latest
  ./oc adm prune images --keep-younger-than=0 --keep-tag-revisions=1 --prune-registry=true --confirm=true  --registry-url=default-route-openshift-image-registry.apps.qe-groupd-0422.qe.devcluster.openshift.com
done

Comment 12 errata-xmlrpc 2021-07-27 22:33:58 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
