Bug 2058416
| Summary: | ClusterVersion Failing=True and Available=False should trigger alerts | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
| Component: | Cluster Version Operator | Assignee: | W. Trevor King <wking> |
| Status: | CLOSED DEFERRED | QA Contact: | Yang Yang <yanyang> |
| Severity: | low | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.6 | CC: | aos-bugs, yanyang |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-03-09 01:13:34 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
W. Trevor King
2022-02-24 20:50:19 UTC
Working on a reproducer in a cluster-bot cluster:
$ oc get -o jsonpath='{.status.desired.version}{"\n"}' clusterversion version
4.11.0-0.nightly-2022-02-23-185405
Break something:
$ oc -n openshift-config delete secret pull-secret
secret "pull-secret" deleted
Machine-config didn't notice?
$ oc -n openshift-machine-config-operator logs -l k8s-app=machine-config-controller --tail 2
I0224 21:26:17.666549 1 template_controller.go:137] Re-syncing ControllerConfig due to secret pull-secret change
I0224 21:59:14.450371 1 template_controller.go:137] Re-syncing ControllerConfig due to secret pull-secret change
Registry did, but it's not yet mad enough to go Degraded=True, Available=False, or anything that the CVO would be concerned about:
$ oc get -o json clusteroperator image-registry | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2022-02-24T21:22:02Z Available=True Ready: Available: The registry is ready
NodeCADaemonAvailable: The daemon set node-ca has available replicas
ImagePrunerAvailable: Pruner CronJob has been created
2022-02-24T22:11:00Z Progressing=True Error: Progressing: Unable to apply resources: unable to apply objects: failed to update object *v1.Secret, Namespace=openshift-image-registry, Name=installation-pull-secrets: Secret "installation-pull-secrets" is invalid: data[.dockerconfigjson]: Required value
2022-02-24T21:21:05Z Degraded=False AsExpected:
Trying to break things harder, I'll remove the machine-config controller pod so the replacement can't rely on anything cached in memory and has to load the pull secret afresh:
$ oc -n openshift-machine-config-operator delete pod -l k8s-app=machine-config-controller
pod "machine-config-controller-688cc846-sph5p" deleted
Promising:
$ oc -n openshift-machine-config-operator logs -l k8s-app=machine-config-controller --tail 2
I0224 22:25:23.591925 1 render_controller.go:377] Error syncing machineconfigpool master: ControllerConfig has not completed: completed(false) running(false) failing(true)
I0224 22:25:23.591930 1 render_controller.go:377] Error syncing machineconfigpool worker: ControllerConfig has not completed: completed(false) running(false) failing(true)
Yeah, it's pretty mad now:
$ oc get -o json clusteroperator machine-config | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2022-02-24T21:13:55Z Progressing=False : Cluster version is 4.11.0-0.nightly-2022-02-23-185405
2022-02-24T22:25:10Z Degraded=True MachineConfigControllerFailed: Failed to resync 4.11.0-0.nightly-2022-02-23-185405 because: timed out waiting for the condition during waitForControllerConfigToBeCompleted: controllerconfig is not completed: ControllerConfig has not completed: completed(false) running(false) failing(true)
2022-02-24T22:25:10Z Available=False : Cluster not available for [{operator 4.11.0-0.nightly-2022-02-23-185405}]
2022-02-24T21:22:38Z Upgradeable=True AsExpected:
I dunno if I would have gone straight to Available=False for this, but Degraded=True is absolutely appropriate, and Available=False is convenient for testing this particular bug. After a minute or ten, the CVO notices:
$ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2022-02-24T21:09:37Z RetrievedUpdates=False NoChannel: The update channel has not been configured.
2022-02-24T21:09:37Z ReleaseAccepted=True PayloadLoaded: Payload loaded version="4.11.0-0.nightly-2022-02-23-185405" image="registry.build01.ci.openshift.org/ci-ln-h2ztlr2/release@sha256:ba95a556da080b887baa8801e1f020d97c985f52a4de52c1027e1a710738fd97"
2022-02-24T21:31:19Z Available=True : Done applying 4.11.0-0.nightly-2022-02-23-185405
2022-02-24T22:33:04Z Failing=True ClusterOperatorNotAvailable: Cluster operator machine-config is not available
2022-02-24T21:31:19Z Progressing=False ClusterOperatorNotAvailable: Error while reconciling 4.11.0-0.nightly-2022-02-23-185405: the cluster operator machine-config has not yet successfully rolled out
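The condition listings above all use the same jq filter, so it can be wrapped in a small shell function. The following is a minimal sketch: the function names `format_conditions` and `needs_alert` are hypothetical (they are not part of `oc`), and `needs_alert` simply encodes the two states named in this bug's summary, Failing=True or Available=False.

```shell
# Hypothetical helper (not part of oc): read a ClusterVersion or
# ClusterOperator JSON object on stdin and print one line per condition,
# using the same jq filter as the commands in this report.
format_conditions() {
  jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
}

# Hypothetical helper: exit 0 if the object on stdin has Failing=True or
# Available=False, and non-zero otherwise. "jq -e" derives its exit status
# from the boolean that the filter produces.
needs_alert() {
  jq -e '[.status.conditions[]
          | select((.type == "Failing" and .status == "True")
                or (.type == "Available" and .status == "False"))]
         | length > 0' >/dev/null
}
```

Against a live cluster, this would be used as `oc get -o json clusterversion version | format_conditions`, or `oc get -o json clusterversion version | needs_alert && echo "ClusterVersion needs attention"`.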
Checking alerts in the cluster's console (/monitoring/query-browser):
group by (alertname,alertstate,name) (ALERTS{alertname=~"ClusterOperator.*"})
gives:
ClusterOperatorDegraded pending machine-config 1
ClusterOperatorDown firing machine-config 1
Depending on how long you wait before checking, those may be pending or firing, but that doesn't matter for this bug. There's nothing with name=version, so folks got alerts in this case, but nothing from the ClusterVersion level of this stack. And if something broke in one of the other CVO-managed resources and sent ClusterVersion to Failing=True without impacting any managed ClusterOperators, we'd need this bug fixed to get alert coverage.
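For illustration only, the missing coverage might look something like the rules printed by the sketch below. This is not the fix that was proposed or shipped for this bug. It assumes the CVO exports ClusterVersion conditions through the `cluster_operator_conditions` metric with `name="version"`, which should be checked against the cluster's metrics before use. The function name, rule names, namespace, `for` durations, and severities are all hypothetical.

```shell
# Hypothetical helper: print a PrometheusRule manifest on stdout.
# ASSUMPTION: ClusterVersion conditions are exported as
# cluster_operator_conditions{name="version", condition="..."} with
# 1 for True and 0 for False. Verify this in the query browser first.
# All names, durations, and severities below are placeholders.
clusterversion_rule_sketch() {
  cat <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: clusterversion-conditions-sketch
  namespace: openshift-cluster-version
spec:
  groups:
  - name: clusterversion-conditions-sketch
    rules:
    - alert: ClusterVersionFailingSketch
      expr: max by (name, reason) (cluster_operator_conditions{name="version", condition="Failing"} == 1)
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: ClusterVersion has been Failing=True for 10 minutes.
    - alert: ClusterVersionUnavailableSketch
      expr: max by (name, reason) (cluster_operator_conditions{name="version", condition="Available"} == 0)
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: ClusterVersion has been Available=False for 10 minutes.
EOF
}
```

On a throwaway test cluster, the output could be reviewed and then piped to `oc apply -f -`. With rules along these lines, the `ALERTS` query above would be expected to show entries with name=version alongside the ClusterOperator ones.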
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira. https://issues.redhat.com/browse/OCPBUGS-9133