1857684 – operator interprets running pruning job as success

Summary: operator interprets running pruning job as success

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Image Registry
Sub Component:
Version:	4.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Ricardo Maraschini
QA Contact:	Wenjing Zheng
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1857687
Depends On:
Blocks:	1873496

Reported:	2020-07-16 10:58 UTC by Oleg Bulatov
Modified:	2020-10-27 16:15 UTC
CC List:	3 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: The operator took "running" jobs into account when deriving its own status. Consequence: A running job may not have failed yet, so the operator could report itself as healthy while a persistently failing job was still running. Fix: The operator now ignores running jobs when deriving its status. Result: The operator always uses the status of the last finished job and therefore reports its status correctly.
Clone Of:
Clones:	1873534
Environment:
Last Closed:	2020-10-27 16:15:14 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-image-registry-operator pull 586	0	None	closed	Bug 1857684: Using last finished job status and disabling retries	2021-02-01 02:47:29 UTC
Red Hat Product Errata	RHBA-2020:4196	0	None	None	None	2020-10-27 16:15:45 UTC

Description Oleg Bulatov 2020-07-16 10:58:58 UTC

Description of problem:

When the pruner job has a persistent problem, the operator can intermittently report that the pruner is healthy. This happens when the currently running job hasn't failed yet. Another problem is that failed pods are automatically removed, so we can't check their log output.
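
The flapping described above can be illustrated with a small sketch. It is not the operator's code (the operator is not written in Python); `PrunerJob` and both functions are hypothetical names, and the "old" rule shown here (judging by the newest job even while it is running) is only an assumed model of how a running job could mask an earlier failure. The second function follows the fix described in this bug: ignore running jobs and use the last finished one.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class PrunerJob:
    """Hypothetical, simplified stand-in for a pruner job (not the Kubernetes Job API)."""
    name: str
    started: datetime
    finished: Optional[datetime]  # None while the job is still running
    failed: bool = False          # only meaningful once `finished` is set


def degraded_counting_running(jobs: List[PrunerJob]) -> bool:
    """Assumed model of the buggy behaviour: judge by the newest job,
    even if it is still running and therefore has not failed yet."""
    if not jobs:
        return False
    newest = max(jobs, key=lambda j: j.started)
    return newest.failed


def degraded_last_finished(jobs: List[PrunerJob]) -> bool:
    """Behaviour described by the fix: ignore running jobs and use the
    status of the most recently finished job."""
    finished = [j for j in jobs if j.finished is not None]
    if not finished:
        return False
    last = max(finished, key=lambda j: j.finished)
    return last.failed


# Example data: yesterday's run failed, today's run is still in progress.
jobs = [
    PrunerJob("pruner-1", datetime(2020, 7, 16, 0, 0), datetime(2020, 7, 16, 0, 5), failed=True),
    PrunerJob("pruner-2", datetime(2020, 7, 17, 0, 0), None),
]

print(degraded_counting_running(jobs))  # False: the running job hides the failure
print(degraded_last_finished(jobs))     # True: the last finished job failed
```

With the second rule, the reported state only changes when a job actually finishes, so a persistently failing pruner keeps the operator Degraded between runs.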

Version-Release number of selected component (if applicable):

4.4+?

How reproducible:

Always.

Steps to Reproduce:

1. Create a deployment with an image reference that the pruner cannot parse.
2. Wait until the pruner fails to parse it.
3. Watch the operator conditions.
4. After the failure, try to locate the job pod and read its output (it won't exist).


Actual results:

The operator flaps between healthy and Degraded, and we can't read the job output log.

Expected results:

The operator stays Degraded and we can see why it is degraded (by inspecting the pod log).

Additional info:

Comment 4 Wenjing Zheng 2020-08-20 09:37:05 UTC

Verified on 4.6.0-0.nightly-2020-08-18-165040:
1. Make the image pruner degraded.
2. Create a deployment with an invalid image name.
3. Watch the image registry status: it remains degraded.

Comment 5 Ricardo Maraschini 2020-08-28 13:57:17 UTC

*** Bug 1857687 has been marked as a duplicate of this bug. ***

Comment 7 errata-xmlrpc 2020-10-27 16:15:14 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
