Cause:
The operator did not take the status of the operand Deployment into account when determining its own status.
Consequence:
In some scenarios the Image Registry Operator presented two contradictory pieces of information: it reported itself as not Available and, at the same time, not Degraded. Both conditions persisted even after the Deployment had stopped retrying to get the image registry up and running. In that scenario the Operator should set the Degraded condition.
Fix:
By taking the image registry Deployment into account, the operator now sets itself to Degraded if the operand Deployment reaches its progress deadline while trying to get the application up.
Result:
Now, when the Deployment fails (after the progress deadline has been reached), the operator sets itself to Degraded. The operator still reports itself as Progressing while the operand Deployment is progressing.
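The check described in the fix can be sketched roughly as follows. This is a minimal illustration, not the operator's actual code: the DeploymentCondition type and the deploymentDeadlineExceeded helper are hypothetical stand-ins for the Kubernetes appsv1 types. It relies only on the Kubernetes convention that a Deployment which exceeds its progress deadline carries a Progressing condition with status False and reason ProgressDeadlineExceeded.

```go
package main

import "fmt"

// DeploymentCondition is a simplified, hypothetical stand-in for
// appsv1.DeploymentCondition; only the fields used here are modeled.
type DeploymentCondition struct {
	Type   string
	Status string
	Reason string
}

// deploymentDeadlineExceeded (hypothetical helper) reports whether the
// Deployment has given up: Kubernetes marks this with
// Progressing=False and reason ProgressDeadlineExceeded.
func deploymentDeadlineExceeded(conds []DeploymentCondition) bool {
	for _, c := range conds {
		if c.Type == "Progressing" && c.Status == "False" && c.Reason == "ProgressDeadlineExceeded" {
			return true
		}
	}
	return false
}

func main() {
	// A Deployment that is still rolling out: the operator is not Degraded.
	rolling := []DeploymentCondition{
		{Type: "Available", Status: "False", Reason: "MinimumReplicasUnavailable"},
		{Type: "Progressing", Status: "True", Reason: "ReplicaSetUpdated"},
	}
	// A Deployment that hit its progress deadline: the operator should be Degraded.
	failed := []DeploymentCondition{
		{Type: "Available", Status: "False", Reason: "MinimumReplicasUnavailable"},
		{Type: "Progressing", Status: "False", Reason: "ProgressDeadlineExceeded"},
	}
	fmt.Println("rolling degraded:", deploymentDeadlineExceeded(rolling))
	fmt.Println("failed degraded:", deploymentDeadlineExceeded(failed))
}
```

In this sketch the operator's Degraded condition would simply follow the helper's return value; the real operator may also consider other inputs.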
Seen in 4.6.0-rc.4 shortly after an update from 4.5:
NAME             VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
image-registry   4.6.0-rc.4   False       True          False      5m9s
which doesn't make sense, because if an operator is Available=False, it is definitely degraded. The code managing Degraded is currently completely separate from the code managing Available [1]. We should probably have a case like:
    else if operatorAvailable.Status == operatorapiv1.ConditionFalse {
        operatorDegraded.Status = operatorapiv1.ConditionTrue
        operatorDegraded.Message = "Image registry is not available."
        operatorDegraded.Reason = "Unavailable"
    }
While we're rerolling conditions, we might want to adjust the current Available default [2], because being Available=False with no reason or message isn't very helpful. Either default to Available=True and turn it off if you find an issue, or make the default:
    operatorAvailable := operatorapiv1.OperatorCondition{
        Status:  operatorapiv1.ConditionFalse,
        Message: "Please open a support case with a must-gather for this cluster, so we can set a useful condition reason in this case",
        Reason:  "ShouldNeverHappen",
    }
or whatever.
[1]: https://github.com/openshift/cluster-image-registry-operator/blob/f18c38544598b6c0f115d9fd6916efab9793a5d6/pkg/operator/status.go#L297-L314
[2]: https://github.com/openshift/cluster-image-registry-operator/blob/f18c38544598b6c0f115d9fd6916efab9793a5d6/pkg/operator/status.go#L217-L221
Conditions in OpenShift 4 by design are independent. It's a valid state of the operator when it has Available=False Progressing=True Degraded=False:
Available=False means that the registry is expected to be running, but it doesn't have available replicas.
Progressing=True means that operators/controllers still have work to do.
Degraded=False means that if everything else is healthy on the cluster, the operator should eventually finish its job and the registry should become available.
That's exactly the state of the operator during the bootstrap. If you kill all its pods, it won't be available, but eventually it'll restore its state (i.e. it's not Degraded).
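To make the independence of the three conditions concrete, here is a small sketch. The RegistryState and Conditions types and the computeConditions function are hypothetical and only model the definitions given above; they are not taken from the operator's source.

```go
package main

import "fmt"

// RegistryState is a hypothetical, simplified view of what the operator observes.
type RegistryState struct {
	AvailableReplicas        int  // replicas currently serving
	WorkRemaining            bool // controllers still have work to do
	ProgressDeadlineExceeded bool // the Deployment gave up retrying
}

// Conditions holds the three operator conditions as plain booleans.
type Conditions struct {
	Available   bool
	Progressing bool
	Degraded    bool
}

// computeConditions derives each condition from its own input only,
// so no condition is implied by another.
func computeConditions(s RegistryState) Conditions {
	return Conditions{
		Available:   s.AvailableReplicas > 0,
		Progressing: s.WorkRemaining,
		Degraded:    s.ProgressDeadlineExceeded,
	}
}

func main() {
	// Bootstrap, or right after all registry pods were killed:
	// nothing is serving yet, but recovery is still expected.
	bootstrap := RegistryState{AvailableReplicas: 0, WorkRemaining: true}
	fmt.Printf("%+v\n", computeConditions(bootstrap))
}
```

Under these assumptions the bootstrap state yields Available=False, Progressing=True, Degraded=False, and Degraded only becomes True once the Deployment stops making progress.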
Re: Available=False with no reason or message — you are nitpicking, Reason and Message are set in all cases.
When the image registry is degraded, Available won't be True; it is now False:
$ oc get co
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
image-registry   4.7.0-0.nightly-2020-12-21-131655   False       True          True       4m42s
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2020:5633