Bug 1857684 - operator interprets running pruning job as success
Summary: operator interprets running pruning job as success
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Image Registry
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Ricardo Maraschini
QA Contact: Wenjing Zheng
URL:
Whiteboard:
Duplicates: 1857687
Depends On:
Blocks: 1873496
Reported: 2020-07-16 10:58 UTC by Oleg Bulatov
Modified: 2020-10-27 16:15 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The operator took "running" jobs into account when deriving its own status. Consequence: A running job may not have failed yet, so the operator could report itself as healthy while a job that would eventually fail was still running. Fix: Running jobs are now ignored when deriving the operator status. Result: The operator always uses the status of the last finished job and therefore reports its status correctly.
Clone Of:
Clones: 1873534
Environment:
Last Closed: 2020-10-27 16:15:14 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
GitHub | openshift cluster-image-registry-operator pull 586 | 0 | None | closed | Bug 1857684: Using last finished job status and disabling retries | 2021-02-01 02:47:29 UTC
Red Hat Product Errata | RHBA-2020:4196 | 0 | None | None | None | 2020-10-27 16:15:45 UTC

Description Oleg Bulatov 2020-07-16 10:58:58 UTC
Description of problem:

When the pruner job has a persistent problem, the operator can intermittently report that the pruner is healthy. This happens while the currently running job hasn't failed yet. Another problem is that failed pods are automatically removed, so we can't check their log output.

Version-Release number of selected component (if applicable):

4.4+?

How reproducible:

Always.

Steps to Reproduce:

1. Create a deployment with an image reference that the pruner cannot parse.
2. Wait until the pruner fails to parse it.
3. Watch the operator conditions.
4. After the failure, try to locate the job pod and read its output (it won't exist).


Actual results:

The operator status flaps between healthy and Degraded, and we can't read the job output log.

Expected results:

The operator stays Degraded and we can see why it is degraded (by inspecting the pod log).

Additional info:
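
The behaviour asked for above (derive the status only from the last finished job and ignore a job that is still running) can be sketched roughly as follows. This is an illustrative Python sketch, not the operator's actual code: the `PrunerJob` type, its fields, and the `pruner_degraded` helper are hypothetical names invented for this example and do not mirror the Kubernetes Job API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class PrunerJob:
    """Hypothetical, simplified view of a pruner job (not the Kubernetes Job API)."""
    name: str
    started: datetime
    finished: Optional[datetime] = None  # None while the job is still running
    failed: bool = False


def pruner_degraded(jobs: List[PrunerJob]) -> Optional[bool]:
    """Return the failure state of the most recently finished job.

    Running jobs are skipped, so a job that has not failed *yet* cannot
    make the pruner look healthy. Returns None if no job has finished.
    """
    finished = [job for job in jobs if job.finished is not None]
    if not finished:
        return None
    last = max(finished, key=lambda job: job.finished)
    return last.failed


jobs = [
    # Finished and failed (for example, an unparsable image reference).
    PrunerJob("pruner-1", datetime(2020, 7, 16, 0, 0),
              finished=datetime(2020, 7, 16, 0, 5), failed=True),
    # Still running: must not influence the reported status.
    PrunerJob("pruner-2", datetime(2020, 7, 17, 0, 0)),
]
print(pruner_degraded(jobs))  # True
```

With this rule the status stays Degraded until a later job actually finishes successfully, instead of flipping to healthy each time a new job starts.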

Comment 4 Wenjing Zheng 2020-08-20 09:37:05 UTC
Verified on 4.6.0-0.nightly-2020-08-18-165040:
1. Make the image pruner degraded.
2. Create a deployment with an invalid image name.
3. Watch the image registry status: it remains degraded.

Comment 5 Ricardo Maraschini 2020-08-28 13:57:17 UTC
*** Bug 1857687 has been marked as a duplicate of this bug. ***

Comment 7 errata-xmlrpc 2020-10-27 16:15:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

