Bug 1889921

Summary: Reported Degraded=False Available=False pair does not make sense
Product: OpenShift Container Platform Reporter: W. Trevor King <wking>
Component: Image Registry Assignee: Ricardo Maraschini <rmarasch>
Status: CLOSED ERRATA QA Contact: Wenjing Zheng <wzheng>
Severity: medium Docs Contact:
Priority: high    
Version: 4.6 CC: aos-bugs, obulatov
Target Milestone: --- Keywords: Reopened
Target Release: 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: The operator did not take the status of the operand Deployment into account when defining its own status.
Consequence: In some scenarios the Image Registry Operator presented two contradictory pieces of information: it reported that it was not Available and, at the same time, not Degraded. Both conditions were still reported even after the Deployment had stopped retrying to get the image registry up and running. In that scenario the Operator should set the Degraded flag.
Fix: The operator now takes the image registry Deployment into account and sets itself to Degraded if the operand Deployment reaches its progress deadline while trying to get the application up.
Result: When the Deployment fails (after the progress deadline has been reached), the operator sets itself to Degraded. The operator still reports itself as Progressing while the operator Deployment is progressing.
Last Closed: 2021-02-24 15:27:07 UTC Type: Bug

Description W. Trevor King 2020-10-20 22:46:24 UTC
Seen in 4.6.0-rc.4 shortly after an update from 4.5:

NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
image-registry                             4.6.0-rc.4   False       True          False      5m9s

which doesn't make sense, because if an operator is Available=False, it is definitely degraded.  The code managing Degraded is currently completely separate from the code managing Available [1].  We should probably have a case like:

  else if operatorAvailable.Status == operatorapiv1.ConditionFalse {
    // An unavailable registry is also reported as degraded.
    operatorDegraded.Status = operatorapiv1.ConditionTrue
    operatorDegraded.Message = "Image registry is not available."
    operatorDegraded.Reason = "Unavailable"
  }

While we're rerolling conditions, we might want to adjust the current Available default [2], because being Available=False with no reason or message isn't very helpful.  Either default to Available=True and turn it off if you find an issue, or make the default:

  operatorAvailable := operatorapiv1.OperatorCondition{
    Status:  operatorapiv1.ConditionFalse,
    Message: "Please open a support case with a must-gather for this cluster, so we can set a useful condition reason in this case",
    Reason:  "ShouldNeverHappen",
  }

or whatever.

[1]: https://github.com/openshift/cluster-image-registry-operator/blob/f18c38544598b6c0f115d9fd6916efab9793a5d6/pkg/operator/status.go#L297-L314
[2]: https://github.com/openshift/cluster-image-registry-operator/blob/f18c38544598b6c0f115d9fd6916efab9793a5d6/pkg/operator/status.go#L217-L221
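For illustration, the proposed Degraded case can be written as a small standalone Go program. This is only a sketch of the suggestion in this comment, not the operator's code: the ConditionStatus and OperatorCondition types are local stand-ins for the operatorapiv1 types, and degradedFromAvailable and the "NoReplicasAvailable" reason are hypothetical names used only for this example.

```go
package main

import "fmt"

// ConditionStatus and OperatorCondition are local stand-ins for the
// operatorapiv1 types so that this sketch compiles on its own.
type ConditionStatus string

const (
	ConditionTrue  ConditionStatus = "True"
	ConditionFalse ConditionStatus = "False"
)

type OperatorCondition struct {
	Type    string
	Status  ConditionStatus
	Reason  string
	Message string
}

// degradedFromAvailable is a hypothetical helper. It leaves an already
// Degraded=True condition untouched and otherwise marks the operator as
// degraded whenever Available is False.
func degradedFromAvailable(available, degraded OperatorCondition) OperatorCondition {
	if degraded.Status == ConditionTrue {
		return degraded
	}
	if available.Status == ConditionFalse {
		degraded.Status = ConditionTrue
		degraded.Reason = "Unavailable"
		degraded.Message = "Image registry is not available."
	}
	return degraded
}

func main() {
	// "NoReplicasAvailable" is a placeholder reason for this example.
	available := OperatorCondition{Type: "Available", Status: ConditionFalse, Reason: "NoReplicasAvailable"}
	degraded := OperatorCondition{Type: "Degraded", Status: ConditionFalse}
	fmt.Printf("%+v\n", degradedFromAvailable(available, degraded))
}
```

Keeping the rule in one function makes the coupling between the two conditions explicit, which is the point of the request: Available=False should never be published next to Degraded=False.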

Comment 1 Oleg Bulatov 2020-10-21 08:17:40 UTC
Conditions in OpenShift 4 are independent by design. Available=False Progressing=True Degraded=False is a valid state for the operator:

Available=False means that the registry is expected to be running, but it doesn't have available replicas.
Progressing=True means that operators/controllers still have work to do.
Degraded=False means that if everything else is healthy on the cluster, the operator should eventually finish its job and the registry should become available.

That's exactly the state of the operator during bootstrap. If you kill all its pods, it won't be available, but eventually it'll restore its state (i.e. it's not Degraded).

Re: Available=False with no reason or message: you are nitpicking; Reason and Message are set in all cases.

Comment 2 Oleg Bulatov 2020-10-24 12:31:58 UTC
After some discussion I'm reopening this BZ. The meaning of conditions is a bit different [1]. We need to change the operator so that its conditions match their description.

[1]: https://github.com/openshift/api/blob/7756477c8346e54590741b198b2cf71c84803b6d/config/v1/types_cluster_operator.go#L153-L168
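The Doc Text of this bug describes the eventual fix as setting Degraded once the operand Deployment reaches its progress deadline. A minimal sketch of that check follows. It relies on the Kubernetes behavior that a Deployment which exceeds its progress deadline gets a Progressing condition with status False and reason ProgressDeadlineExceeded. The DeploymentCondition type is a local stand-in for the Kubernetes type, degradedFromDeployment is a hypothetical helper, and whether the operator inspects exactly this condition is an assumption.

```go
package main

import "fmt"

// DeploymentCondition is a local stand-in for the Kubernetes Deployment
// condition type, reduced to the fields this sketch needs.
type DeploymentCondition struct {
	Type   string
	Status string
	Reason string
}

// degradedFromDeployment is a hypothetical helper. It reports true when
// the Deployment controller has given up on the rollout, which Kubernetes
// signals with Progressing=False and reason ProgressDeadlineExceeded.
func degradedFromDeployment(conds []DeploymentCondition) bool {
	for _, c := range conds {
		if c.Type == "Progressing" && c.Status == "False" && c.Reason == "ProgressDeadlineExceeded" {
			return true
		}
	}
	return false
}

func main() {
	// A rollout that is still being retried: not degraded yet.
	rolling := []DeploymentCondition{{Type: "Progressing", Status: "True", Reason: "ReplicaSetUpdated"}}
	// A rollout that has passed its progress deadline: degraded.
	failed := []DeploymentCondition{{Type: "Progressing", Status: "False", Reason: "ProgressDeadlineExceeded"}}

	fmt.Println("rolling degraded:", degradedFromDeployment(rolling))
	fmt.Println("failed degraded:", degradedFromDeployment(failed))
}
```

With a rule of this shape, Available=False with Degraded=False remains possible only while the Deployment is still within its progress deadline.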

Comment 13 Wenjing Zheng 2020-12-29 08:21:47 UTC
When the image registry is degraded, Available is now False instead of True:
$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
image-registry                             4.7.0-0.nightly-2020-12-21-131655   False       True          True       4m42s

Comment 15 errata-xmlrpc 2021-02-24 15:27:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633