Bug 1929917 - Bare-metal operator is firing for ClusterOperatorDown for 15m during 4.6 to 4.7 upgrade
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.8.0
Assignee: W. Trevor King
QA Contact: Yang Yang
Depends On:
Blocks: 1941217
Reported: 2021-02-17 22:43 UTC by Clayton Coleman
Modified: 2021-07-27 22:47 UTC
CC List: 8 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Last Closed: 2021-07-27 22:45:14 UTC
Target Upstream Version:


System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-version-operator pull 531 | 0 | None | open | Bug 1929917: pkg/cvo/sync_worker: Skip precreation of baremetal ClusterOperator | 2021-03-16 18:02:09 UTC
Red Hat Product Errata | RHSA-2021:2438 | 0 | None | None | None | 2021-07-27 22:47:27 UTC

Description Clayton Coleman 2021-02-17 22:43:14 UTC

The ClusterOperatorDown alert, which is a critical alert, considered the baremetal cluster operator down from 18:14Z to 18:32Z.

baremetal is not even relevant for AWS clusters, so it should never go degraded.

The alert ClusterOperatorDown is defined as

        cluster_operator_up{job="cluster-version-operator"} == 0

today, but the operator is not running any instances, so of course it won't have a listening endpoint.

The fix here needs to scale to other operators.  In general, an operator that has a CO but is not running any instances needs to disable its service monitor somehow.

This is a blocker because it causes a critical alert to fire during upgrade.

Comment 1 Clayton Coleman 2021-02-17 22:44:08 UTC
ALERTS{alertname="ClusterOperatorDown", alertstate="firing", endpoint="metrics", instance="", job="cluster-version-operator", name="baremetal", namespace="openshift-cluster-version", pod="cluster-version-operator-757d457d65-4b794", service="cluster-version-operator", severity="critical"}

is the firing alert.

Comment 2 Clayton Coleman 2021-02-17 22:48:36 UTC
The team informed me that there is an instance, and I realized that if you are scaled to zero the alert won't fire, so the fix is not to crash.

Comment 3 W. Trevor King 2021-02-17 23:01:22 UTC
cluster_operator_up should be "ClusterOperator is Available=True and Degraded!=True" [1].  What does that have to do with scaling or crashing?  (As an aside, I don't see any information in cluster_operator_up that is not already covered more granularly in cluster_operator_conditions.)

[1]: https://github.com/openshift/cluster-version-operator/blob/3640003453ad3186ce9ce360e4788bcfbacbd5e0/pkg/cvo/metrics.go#L374-L382
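
For concreteness, here is a minimal Go sketch (not the actual CVO code at [1]) of the "Available=True and Degraded!=True" rule that cluster_operator_up is meant to encode, assuming the ClusterOperator types from github.com/openshift/api/config/v1.  It also shows why a freshly pre-created ClusterOperator with no conditions at all scores 0:

package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
)

// clusterOperatorUp returns 1 when the ClusterOperator reports Available=True
// and Degraded is anything other than True, and 0 otherwise.
func clusterOperatorUp(co *configv1.ClusterOperator) float64 {
	available := false
	degraded := false
	for _, c := range co.Status.Conditions {
		switch c.Type {
		case configv1.OperatorAvailable:
			available = c.Status == configv1.ConditionTrue
		case configv1.OperatorDegraded:
			degraded = c.Status == configv1.ConditionTrue
		}
	}
	if available && !degraded {
		return 1
	}
	return 0
}

func main() {
	// A stub ClusterOperator, as pre-created by the CVO, has no conditions yet.
	stub := &configv1.ClusterOperator{}
	fmt.Println(clusterOperatorUp(stub)) // prints 0, so ClusterOperatorDown can fire
}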

Comment 4 W. Trevor King 2021-02-17 23:05:50 UTC
On the other hand, nothing alarming in:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci/1362089565350793216/artifacts/e2e-aws-upgrade/openshift-e2e-test/build-log.txt | grep clusteroperator/baremetal
Feb 17 18:02:59.679 I clusteroperator/baremetal created
Feb 17 18:32:10.085 W clusteroperator/baremetal changed Available to True: AsExpected: Operational
Feb 17 18:32:10.085 W clusteroperator/baremetal changed Disabled to True: UnsupportedPlatform: Nothing to do on this Platform

So I'm clearly missing something.

Comment 5 W. Trevor King 2021-02-17 23:08:59 UTC
Ahh, just the long window between CVO pre-creating the ClusterOperator and the metal operator coming around to fill it in.  This might actually be a CVO bug or a test-suite bug.  I wonder when we created the metal pod...

Comment 6 W. Trevor King 2021-02-17 23:33:05 UTC
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci/1362089565350793216/artifacts/e2e-aws-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-machine-api" and (.involvedObject.name | startswith("cluster-baremetal-operator"))) | .lastTimestamp + " " + (.count | tostring) + " " + .reason + ": " + .message' | sort
 null Scheduled: Successfully assigned openshift-machine-api/cluster-baremetal-operator-66dd945996-72lxc to ip-10-0-243-68.ec2.internal
 null Scheduled: Successfully assigned openshift-machine-api/cluster-baremetal-operator-66dd945996-gfzgr to ip-10-0-179-43.ec2.internal
2021-02-17T18:32:02Z 1 ScalingReplicaSet: Scaled up replica set cluster-baremetal-operator-66dd945996 to 1
2021-02-17T18:32:03Z 1 SuccessfulCreate: Created pod: cluster-baremetal-operator-66dd945996-72lxc

So yeah, not a metal bug if CVO is freaking out before it has even scheduled the metal operator.  Moving to the CVO.

Comment 7 Clayton Coleman 2021-02-17 23:33:50 UTC
Agree this would be a CVO bug if baremetal is filling out its info like everyone else, but ONLY if baremetal operator is filling itself out exactly like insights operator.

Comment 8 W. Trevor King 2021-02-17 23:37:37 UTC
The CVO creation of ClusterOperator landed in [1], which went out with 4.4's GA.  We've had the existing cluster_operator_up and ClusterOperatorDown forever.  So the CVO's vulnerability to this sort of thing is old.  The baremetal operator is new in 4.7, and it takes long enough in 4.6->4.7 update for the CVO to get to the metal operator that the alert fires.

[1]: https://github.com/openshift/cluster-version-operator/pull/318

Comment 9 W. Trevor King 2021-02-17 23:54:20 UTC
The metal trigger only impacts 4.6->4.7 updates, because that's the only situation where the CVO creates the metal ClusterOperator (very early in the update) but delays the metal operator deployment until later in the update manifest graph [1].  So marking blocker-, because we are likely to get to 4.7.1 or higher before we have large volumes of production clusters flowing from 4.6 -> 4.7, and while false-positive critical alerts are annoying, they aren't bad enough to make us push back GA, and we're about out of time to get things into 4.7 before GA.

[1]: https://github.com/openshift/cluster-version-operator/blob/bac69e03599f8c57f46585246618e462c31c2e93/docs/user/reconciliation.md#manifest-graph

Comment 10 Clayton Coleman 2021-02-18 06:07:34 UTC
In the future this behavior will be prohibited for new CO entering the payload as we tighten what is allowed to fire (this is pure noise).  I’m actually not clear why a new CO should be moved to the front of an upgrade - that’s not obviously correct.

Comment 11 W. Trevor King 2021-02-18 06:27:08 UTC
The CVO didn't used to create ClusterOperators at all.  It learned to do that in [1], where you did a good job summarizing the motivation:

> The must-gather and insights operator depend on cluster operators and related objects in order to identify resources to create. Because cluster operators are delegated to the operator install and upgrade failures of new operators can fail to gather the requisite info if the cluster degrades before those steps.

So there's not much utility in adding the ClusterOperators before we get to the operator namespace, but outside of this one alerting issue it shouldn't hurt either.  And a critical ClusterOperatorDown firing because the ClusterOperator has no conditions on it is an overly jumpy alert.  We shouldn't be paging admins at midnight unless we have a significant signal like ClusterOperatorDegraded firing because ClusterVersion Failing=True because a 2nd level operator's Deployment is mad.  Or ClusterOperatorDown firing because a ClusterOperator has Available=False.  "Degraded unset for a few minutes" isn't a big deal.  "Degraded unset for days" might be, but that's not what's going on here.

[1]: https://github.com/openshift/cluster-version-operator/pull/318
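
For illustration, a hedged sketch of the pre-creation skip that PR 531 (linked in the tracker table above) takes for the baremetal ClusterOperator; the identifiers skipClusterOperatorPrecreation and precreateClusterOperator are hypothetical, not the actual names in pkg/cvo/sync_worker:

package main

import "fmt"

// Hypothetical sketch: the CVO pre-creates ClusterOperator stubs early in an
// update so must-gather and the insights operator can find them.  Names listed
// here would instead be left for their own operator to create, avoiding a
// condition-less stub that trips ClusterOperatorDown.
var skipClusterOperatorPrecreation = map[string]bool{
	"baremetal": true, // its operator is deployed late in the 4.7 manifest graph
}

// precreateClusterOperator reports whether the sync worker should create a
// stub ClusterOperator before the named operator's manifests are applied.
func precreateClusterOperator(name string) bool {
	return !skipClusterOperatorPrecreation[name]
}

func main() {
	fmt.Println(precreateClusterOperator("baremetal")) // false
	fmt.Println(precreateClusterOperator("etcd"))      // true
}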

Comment 12 W. Trevor King 2021-02-24 16:43:26 UTC
And this can happen on 4.7+ installs too, due to the time it takes to bring up control-plane nodes to schedule the baremetal operator [1]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.8/1364579724289904640/artifacts/e2e-gcp/clusteroperators.json | jq -r '.items[].metadata | select(.name == "baremetal").creationTimestamp'
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.8/1364579724289904640/artifacts/e2e-gcp/deployments.json | jq -r '.items[].metadata | select(.name | contains("metal")).creationTimestamp'
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.8/1364579724289904640/artifacts/e2e-gcp/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-machine-api") | .firstTimestamp + " " + (.count | tostring) + " " + .reason + ": " + .message' | grep metal | sort
2021-02-24T14:22:14Z 1 ScalingReplicaSet: Scaled up replica set cluster-baremetal-operator-9fbdd6bc4 to 1
2021-02-24T14:22:14Z 1 SuccessfulCreate: Created pod: cluster-baremetal-operator-9fbdd6bc4-xbltg
2021-02-24T14:28:28Z 1 Scheduled: Successfully assigned openshift-machine-api/cluster-baremetal-operator-9fbdd6bc4-xbltg to ci-op-yfm17yzy-df208-55dcw-master-1
2021-02-24T14:28:28Z 7 FailedMount: MountVolume.SetUp failed for volume "cluster-baremetal-operator-tls" : secret "cluster-baremetal-operator-tls" not found
2021-02-24T14:30:04Z 1 Created: Created container cluster-baremetal-operator
2021-02-24T14:30:05Z 1 Started: Started container cluster-baremetal-operator

So ClusterOperator at 2021-02-24T14:21:31Z but operator container doesn't come up until 14:30:05Z, and then it maybe takes a little longer to set ClusterOperator conditions, and the alert fires.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.8/1364579724289904640

Comment 15 Yang Yang 2021-03-18 07:35:16 UTC
Reproducing it by upgrading a cluster from 4.6.17 to 4.7.0

Steps to reproduce:
1. Install a gcp cluster with 4.6.17
2. Scale up machines
# oc get machinesets.machine.openshift.io -n openshift-machine-api
NAME                            DESIRED   CURRENT   READY   AVAILABLE   AGE
yangyang0318-1-2qf75-worker-a   1         1         1       1           67m
yangyang0318-1-2qf75-worker-b   1         1         1       1           67m
yangyang0318-1-2qf75-worker-c   1         1         1       1           67m
yangyang0318-1-2qf75-worker-f   0         0                             67m

# for i in {a,b,c}; do oc -n openshift-machine-api patch machineset yangyang0318-1-2qf75-worker-$i --type json -p '[{"op": "add", "path": "/spec/replicas", "value": 18}]'; done

Wait for the machineset to scale up

# oc get machinesets.machine.openshift.io -n openshift-machine-api
NAME                            DESIRED   CURRENT   READY   AVAILABLE   AGE
yangyang0318-1-2qf75-worker-a   18        18        18      18          81m
yangyang0318-1-2qf75-worker-b   18        18        18      18          81m
yangyang0318-1-2qf75-worker-c   18        18        18      18          81m
yangyang0318-1-2qf75-worker-f   0         0                             81m

3. Change the channel to candidate-4.7
# oc edit clusterversion
clusterversion.config.openshift.io/version edited

4. Upgrade the cluster
# oc adm upgrade --to=4.7.0
Updating to 4.7.0

5. Check clusterversion
# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.17    True        True          12m     Working towards 4.7.0: 96 of 668 done (14% complete)

6. Check the alert
# curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.yangyang0318-1.qe.gcp.devcluster.openshift.com/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDown")'
  "labels": {
    "alertname": "ClusterOperatorDown",
    "endpoint": "metrics",
    "instance": "",
    "job": "cluster-version-operator",
    "name": "baremetal",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-6d599b9d58-8k56q",
    "service": "cluster-version-operator",
    "severity": "critical"
  "annotations": {
    "message": "Cluster operator baremetal has not been available for 10 minutes. Operator may be down or disabled, cluster will not be kept up to date and upgrades will not be possible."
  "state": "firing",
  "activeAt": "2021-03-18T07:10:59.21303266Z",
  "value": "0e+00"

A critical ClusterOperatorDown alert for baremetal is firing.

Comment 17 Yang Yang 2021-03-19 09:16:57 UTC
Verifying with 4.8.0-0.nightly-2021-03-19-061759

Install a cluster with 4.8.0-0.nightly-2021-03-19-061759 successfully.

# oc get co | grep baremetal
baremetal           4.8.0-0.nightly-2021-03-19-061759   True        False         False      102m

[root@preserve-yangyangmerrn-1 resources]# cat clusteroperators.json | jq -r '.items[].metadata | select(.name == "baremetal").creationTimestamp'

[root@preserve-yangyangmerrn-1 resources]# cat events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-machine-api") | .firstTimestamp + " " + (.count | tostring) + " " + .reason + ": " + .message' | grep metal | sort
2021-03-19T07:24:47Z 1 ScalingReplicaSet: Scaled up replica set cluster-baremetal-operator-7bd74c5579 to 1
2021-03-19T07:24:47Z 1 SuccessfulCreate: Created pod: cluster-baremetal-operator-7bd74c5579-2mkjf
2021-03-19T07:30:46Z 7 FailedMount: MountVolume.SetUp failed for volume "cert" : secret "cluster-baremetal-webhook-server-cert" not found
2021-03-19T07:30:46Z 7 FailedMount: MountVolume.SetUp failed for volume "cluster-baremetal-operator-tls" : secret "cluster-baremetal-operator-tls" not found
2021-03-19T07:32:16Z 1 Created: Created container cluster-baremetal-operator
2021-03-19T07:32:17Z 1 Started: Started container cluster-baremetal-operator
 null Scheduled: Successfully assigned openshift-machine-api/cluster-baremetal-operator-7bd74c5579-2mkjf to yangyag0319-1-888f6-master-0.c.openshift-qe.internal

Looks like baremetal is not precreated. Moving it to verified state.

Comment 18 W. Trevor King 2021-03-29 03:24:42 UTC
Turns out it will be the 4.6.z PR that has an impact, so no need for a note on this one.

Comment 21 errata-xmlrpc 2021-07-27 22:45:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

https://access.redhat.com/errata/RHSA-2021:2438

If the solution does not work for you, open a new bug report.

