Bug 1889921 - Reported Degraded=False Available=False pair does not make sense
Summary: Reported Degraded=False Available=False pair does not make sense
Keywords:
Status: POST
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Image Registry
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Ricardo Maraschini
QA Contact: Wenjing Zheng
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-10-20 22:46 UTC by W. Trevor King
Modified: 2020-11-26 19:01 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-21 08:17:40 UTC
Target Upstream Version:




Links
System   ID                                                   Priority   Status   Summary                                                   Last Updated
Github   openshift cluster-image-registry-operator pull 644   None       open     WIP - Bug 1889921: Reporting degraded if not available   2020-11-24 13:47:39 UTC

Description W. Trevor King 2020-10-20 22:46:24 UTC
Seen in 4.6.0-rc.4 shortly after an update from 4.5:

NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
image-registry                             4.6.0-rc.4   False       True          False      5m9s

which doesn't make sense, because if an operator is Available=False, it is definitely degraded.  The code managing Degraded is currently completely separate from the code managing Available [1].  We should probably have a case like:

  else if operatorAvailable.Status == operatorapiv1.ConditionFalse {
    // An unavailable registry is also reported as degraded.
    operatorDegraded.Status = operatorapiv1.ConditionTrue
    operatorDegraded.Message = "Image registry is not available."
    operatorDegraded.Reason = "Unavailable"
  }
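
As a rough, self-contained sketch of that idea (not the operator's actual code): the types below are local stand-ins for the operatorapiv1 ones, and degradedFor is a hypothetical helper name. It assumes an existing Degraded=True condition should win, and that Available=False otherwise forces Degraded=True.

```go
package main

// ConditionStatus and OperatorCondition are local stand-ins for the
// operatorapiv1 types, so this sketch compiles without the OpenShift modules.
type ConditionStatus string

const (
	ConditionTrue  ConditionStatus = "True"
	ConditionFalse ConditionStatus = "False"
)

type OperatorCondition struct {
	Status  ConditionStatus
	Reason  string
	Message string
}

// degradedFor is a hypothetical helper. It keeps an existing Degraded=True
// condition untouched, and otherwise reports Degraded=True whenever the
// operator is Available=False.
func degradedFor(available, degraded OperatorCondition) OperatorCondition {
	if degraded.Status == ConditionTrue {
		// A more specific failure was already recorded; do not overwrite it.
		return degraded
	}
	if available.Status == ConditionFalse {
		return OperatorCondition{
			Status:  ConditionTrue,
			Reason:  "Unavailable",
			Message: "Image registry is not available.",
		}
	}
	return degraded
}
```

With this shape, the Available=False, Degraded=False pair from the table above could not be reported, because the Degraded condition is derived after Available has been computed.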

While we're rerolling conditions, we might want to adjust the current Available default [2], because being Available=False with no reason or message isn't very helpful.  Either default to Available=True and turn it off if you find an issue, or make the default:

  operatorAvailable := operatorapiv1.OperatorCondition{
    Status:  operatorapiv1.ConditionFalse,
    Message: "Please open a support case with a must-gather for this cluster, so we can set a useful condition reason in this case",
    Reason:  "ShouldNeverHappen",
  }

or whatever.
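
The first option (default to Available=True and turn it off when an issue is found) might look roughly like the sketch below. The types are local stand-ins for the operatorapiv1 ones; availableFor, the readyReplicas input, and the reason strings are all hypothetical names chosen for illustration, not taken from the operator.

```go
package main

// Local stand-ins for the operatorapiv1 types, so the sketch is self-contained.
type ConditionStatus string

const (
	ConditionTrue  ConditionStatus = "True"
	ConditionFalse ConditionStatus = "False"
)

type OperatorCondition struct {
	Status  ConditionStatus
	Reason  string
	Message string
}

// availableFor is a hypothetical helper. It starts from Available=True and
// only flips to False when a concrete problem is detected, so every
// Available=False condition carries a reason and a message.
func availableFor(readyReplicas int) OperatorCondition {
	cond := OperatorCondition{
		Status:  ConditionTrue,
		Reason:  "AsExpected", // assumed reason string
		Message: "Image registry is available.",
	}
	if readyReplicas == 0 {
		cond.Status = ConditionFalse
		cond.Reason = "NoReplicasAvailable" // assumed reason string
		cond.Message = "The image registry deployment has no available replicas."
	}
	return cond
}
```

The trade-off is that an unanticipated failure would be reported as Available=True, whereas the "ShouldNeverHappen" default errs in the other direction.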

[1]: https://github.com/openshift/cluster-image-registry-operator/blob/f18c38544598b6c0f115d9fd6916efab9793a5d6/pkg/operator/status.go#L297-L314
[2]: https://github.com/openshift/cluster-image-registry-operator/blob/f18c38544598b6c0f115d9fd6916efab9793a5d6/pkg/operator/status.go#L217-L221

Comment 1 Oleg Bulatov 2020-10-21 08:17:40 UTC
Conditions in OpenShift 4 are independent by design. Available=False Progressing=True Degraded=False is a valid state for the operator:

Available=False means that the registry is expected to be running, but it doesn't have available replicas.
Progressing=True means that operators/controllers still have work to do.
Degraded=False means that if everything else is healthy on the cluster, the operator should eventually finish its job and the registry should become available.

That's exactly the state of the operator during bootstrap. If you kill all its pods, it won't be available, but it will eventually restore its state (i.e. it's not Degraded).

Re: Available=False with no reason or message: you are nitpicking; Reason and Message are set in all cases.

Comment 2 Oleg Bulatov 2020-10-24 12:31:58 UTC
After some discussion I'm reopening this BZ. The meaning of conditions is a bit different [1]. We need to change the operator so that its conditions match their description.

[1]: https://github.com/openshift/api/blob/7756477c8346e54590741b198b2cf71c84803b6d/config/v1/types_cluster_operator.go#L153-L168

