Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2058416

Summary: ClusterVersion Failing=True and Available=False should trigger alerts
Product: OpenShift Container Platform
Reporter: W. Trevor King <wking>
Component: Cluster Version Operator
Assignee: W. Trevor King <wking>
Status: CLOSED DEFERRED
QA Contact: Yang Yang <yanyang>
Severity: low
Docs Contact:
Priority: unspecified
Version: 4.6
CC: aos-bugs, yanyang
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-03-09 01:13:34 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description W. Trevor King 2022-02-24 20:50:19 UTC
We already have the ClusterOperatorDown and ClusterOperatorDegraded alerts covering ClusterOperator conditions.  We should wire up equivalent alerts for the ClusterVersion conditions (Failing=True and Available=False) as well.

Comment 1 W. Trevor King 2022-02-24 22:50:00 UTC
Working on a reproducer in a cluster-bot cluster:

  $ oc get -o jsonpath='{.status.desired.version}{"\n"}' clusterversion version
  4.11.0-0.nightly-2022-02-23-185405

Break something:

  $ oc -n openshift-config delete secret pull-secret
  secret "pull-secret" deleted

Machine-config didn't notice?

  $ oc -n openshift-machine-config-operator logs -l k8s-app=machine-config-controller --tail 2
  I0224 21:26:17.666549       1 template_controller.go:137] Re-syncing ControllerConfig due to secret pull-secret change
  I0224 21:59:14.450371       1 template_controller.go:137] Re-syncing ControllerConfig due to secret pull-secret change

Registry did, but it's not yet mad enough to go Degraded=True, Available=False, or anything that the CVO would be concerned about:

  $ oc get -o json clusteroperator image-registry | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
  2022-02-24T21:22:02Z Available=True Ready: Available: The registry is ready
  NodeCADaemonAvailable: The daemon set node-ca has available replicas
  ImagePrunerAvailable: Pruner CronJob has been created
  2022-02-24T22:11:00Z Progressing=True Error: Progressing: Unable to apply resources: unable to apply objects: failed to update object *v1.Secret, Namespace=openshift-image-registry, Name=installation-pull-secrets: Secret "installation-pull-secrets" is invalid: data[.dockerconfigjson]: Required value
  2022-02-24T21:21:05Z Degraded=False AsExpected: 

Trying to break things harder, I'll delete the machine-config controller pod, so the replacement has to load the pull secret afresh instead of relying on a copy held in memory:

  $ oc -n openshift-machine-config-operator delete pod -l k8s-app=machine-config-controller
  pod "machine-config-controller-688cc846-sph5p" deleted

Promising:

  $ oc -n openshift-machine-config-operator logs -l k8s-app=machine-config-controller --tail 2
  I0224 22:25:23.591925       1 render_controller.go:377] Error syncing machineconfigpool master: ControllerConfig has not completed: completed(false) running(false) failing(true)
  I0224 22:25:23.591930       1 render_controller.go:377] Error syncing machineconfigpool worker: ControllerConfig has not completed: completed(false) running(false) failing(true)

Yeah, it's pretty mad now:

  $ oc get -o json clusteroperator machine-config | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
  2022-02-24T21:13:55Z Progressing=False : Cluster version is 4.11.0-0.nightly-2022-02-23-185405
  2022-02-24T22:25:10Z Degraded=True MachineConfigControllerFailed: Failed to resync 4.11.0-0.nightly-2022-02-23-185405 because: timed out waiting for the condition during waitForControllerConfigToBeCompleted: controllerconfig is not completed: ControllerConfig has not completed: completed(false) running(false) failing(true)
  2022-02-24T22:25:10Z Available=False : Cluster not available for [{operator 4.11.0-0.nightly-2022-02-23-185405}]
  2022-02-24T21:22:38Z Upgradeable=True AsExpected: 

I dunno if I would have gone straight to Available=False for this, but Degraded=True is absolutely appropriate, and Available=False is convenient for testing this particular bug.  After a minute or ten, the CVO notices:

  $ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
  2022-02-24T21:09:37Z RetrievedUpdates=False NoChannel: The update channel has not been configured.
  2022-02-24T21:09:37Z ReleaseAccepted=True PayloadLoaded: Payload loaded version="4.11.0-0.nightly-2022-02-23-185405" image="registry.build01.ci.openshift.org/ci-ln-h2ztlr2/release@sha256:ba95a556da080b887baa8801e1f020d97c985f52a4de52c1027e1a710738fd97"
  2022-02-24T21:31:19Z Available=True : Done applying 4.11.0-0.nightly-2022-02-23-185405
  2022-02-24T22:33:04Z Failing=True ClusterOperatorNotAvailable: Cluster operator machine-config is not available
  2022-02-24T21:31:19Z Progressing=False ClusterOperatorNotAvailable: Error while reconciling 4.11.0-0.nightly-2022-02-23-185405: the cluster operator machine-config has not yet successfully rolled out

Checking alerts in the cluster's console ( /monitoring/query-browser ):

  group by (alertname,alertstate,name) (ALERTS{alertname=~"ClusterOperator.*"})

gives:

  ClusterOperatorDegraded  pending  machine-config  1
  ClusterOperatorDown  firing  machine-config  1

Depending on how long you wait before checking, those may be pending or firing, but that doesn't matter for this bug.  There's nothing with name=version, so folks get ClusterOperator alerts in this case, but nothing from the ClusterVersion level of the stack.  And if something broke in one of the other CVO-managed resources and sent ClusterVersion to Failing=True without impacting any managed ClusterOperators, we'd need this bug fixed to get alert coverage.
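
Until such alerts exist, the ClusterVersion conditions have to be checked by hand.  A minimal sketch of that check, assuming condition lines in the "<lastTransitionTime> <Type>=<Status> <Reason>: <message>" format produced by the jq filter above.  The function name clusterversion_alert_candidates is hypothetical, not something oc or the CVO ships, and the sample input is copied from the ClusterVersion output earlier in this comment:

```shell
#!/bin/sh
# Hypothetical helper (an assumption for illustration, not part of oc or the CVO).
# Reads condition lines shaped like
#   <lastTransitionTime> <Type>=<Status> <Reason>: <message>
# on stdin and prints only the ClusterVersion conditions this bug proposes
# alerting on: Failing=True and Available=False.
clusterversion_alert_candidates() {
  # The second whitespace-separated field is "<Type>=<Status>".
  awk '$2 == "Failing=True" || $2 == "Available=False" { print }'
}

# Sample input taken from the ClusterVersion conditions shown above.
clusterversion_alert_candidates <<'EOF'
2022-02-24T21:09:37Z RetrievedUpdates=False NoChannel: The update channel has not been configured.
2022-02-24T21:31:19Z Available=True : Done applying 4.11.0-0.nightly-2022-02-23-185405
2022-02-24T22:33:04Z Failing=True ClusterOperatorNotAvailable: Cluster operator machine-config is not available
EOF
```

With the sample input, only the Failing=True line is printed.  Against a live cluster, the output of the 'oc get -o json clusterversion version | jq -r ...' command above could be piped into the same function.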

Comment 4 Shiftzilla 2023-03-09 01:13:34 UTC
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira.

https://issues.redhat.com/browse/OCPBUGS-9133