Bug 1929917
Summary: | Bare-metal operator is firing for ClusterOperatorDown for 15m during 4.6 to 4.7 upgrade | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman>
Component: | Cluster Version Operator | Assignee: | W. Trevor King <wking>
Status: | CLOSED ERRATA | QA Contact: | Yang Yang <yanyang>
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | 4.7 | CC: | aos-bugs, dhellmann, hongkliu, jokerman, lmohanty, travi, wking, yanyang
Target Milestone: | --- | |
Target Release: | 4.8.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | No Doc Update
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2021-07-27 22:45:14 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1941217 | |
Description Clayton Coleman 2021-02-17 22:43:14 UTC
ALERTS{alertname="ClusterOperatorDown", alertstate="firing", endpoint="metrics", instance="10.0.191.240:9099", job="cluster-version-operator", name="baremetal", namespace="openshift-cluster-version", pod="cluster-version-operator-757d457d65-4b794", service="cluster-version-operator", severity="critical"} is the firing alert. Team informed me that there is an instance, and I realized if you are scaled to zero the alert won't fire, so the fix is not to crash.

cluster_operator_up should be "ClusterOperator is Available=True and Degraded!=True" [1]. What does that have to do with scaling or crashing? (As an aside, I don't see any information in cluster_operator_up that is not already covered more granularly in cluster_operator_conditions.)

[1]: https://github.com/openshift/cluster-version-operator/blob/3640003453ad3186ce9ce360e4788bcfbacbd5e0/pkg/cvo/metrics.go#L374-L382

On the other hand, nothing alarming in:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci/1362089565350793216/artifacts/e2e-aws-upgrade/openshift-e2e-test/build-log.txt | grep clusteroperator/baremetal
Feb 17 18:02:59.679 I clusteroperator/baremetal created
Feb 17 18:32:10.085 W clusteroperator/baremetal changed Available to True: AsExpected: Operational
Feb 17 18:32:10.085 W clusteroperator/baremetal changed Disabled to True: UnsupportedPlatform: Nothing to do on this Platform

So I'm clearly missing something. Ahh, just the long window between the CVO pre-creating the ClusterOperator and the metal operator coming around to fill it in. This might actually be a CVO bug or a test-suite bug. I wonder when we created the metal pod...

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci/1362089565350793216/artifacts/e2e-aws-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-machine-api" and (.involvedObject.name | startswith("cluster-baremetal-operator"))) | .lastTimestamp + " " + (.count | tostring) + " " + .reason + ": " + .message' | sort
null Scheduled: Successfully assigned openshift-machine-api/cluster-baremetal-operator-66dd945996-72lxc to ip-10-0-243-68.ec2.internal
null Scheduled: Successfully assigned openshift-machine-api/cluster-baremetal-operator-66dd945996-gfzgr to ip-10-0-179-43.ec2.internal
2021-02-17T18:32:02Z 1 ScalingReplicaSet: Scaled up replica set cluster-baremetal-operator-66dd945996 to 1
2021-02-17T18:32:03Z 1 SuccessfulCreate: Created pod: cluster-baremetal-operator-66dd945996-72lxc
...

So yeah, not a metal bug if the CVO is freaking out before it has even scheduled the metal operator. Moving to the CVO.

Agree this would be a CVO bug if baremetal is filling out its info like everyone else, but ONLY if the baremetal operator is filling itself out exactly like the insights operator.

The CVO creation of ClusterOperator landed in [1], which went out with 4.4's GA. We've had the existing cluster_operator_up and ClusterOperatorDown forever, so the CVO's vulnerability to this sort of thing is old. The baremetal operator is new in 4.7, and it takes long enough in a 4.6->4.7 update for the CVO to get to the metal operator that the alert fires.

[1]: https://github.com/openshift/cluster-version-operator/pull/318
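Since the alert is tripping on a pre-created, condition-less ClusterOperator, one quick way to spot operators still in that state on a live cluster is to filter on an empty status.conditions array. This is just a sketch of mine, not something from the CI tooling above:

$ oc get clusteroperators -o json \
    | jq -r '.items[] | select((.status.conditions // []) | length == 0) | .metadata.name'

Anything this prints has had its ClusterOperator object created (typically by the CVO) but has not yet been claimed by its operator, which is exactly the window discussed here.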
The metal trigger only impacts 4.6->4.7 updates, because that's the only situation where the CVO creates the metal ClusterOperator (very early in the update) but delays the metal operator deployment until later in the update manifest graph [1]. So marking blocker-, because we are likely to get to 4.7.1 or higher before we have large volumes of production clusters flowing from 4.6 -> 4.7, and while false-positive critical alerts are annoying, they aren't bad enough to make us push back GA, and we're about out of time to get things into 4.7 before GA.

[1]: https://github.com/openshift/cluster-version-operator/blob/bac69e03599f8c57f46585246618e462c31c2e93/docs/user/reconciliation.md#manifest-graph

In the future this behavior will be prohibited for new ClusterOperators entering the payload as we tighten what is allowed to fire (this is pure noise). I'm actually not clear why a new ClusterOperator should be moved to the front of an upgrade; that's not obviously correct.

The CVO didn't use to create ClusterOperators at all. It learned to do that in [1], where you did a good job summarizing the motivation:

> The must-gather and insights operator depend on cluster operators and related objects in order to identify resources to create. Because cluster operators are delegated to the operator install and upgrade failures of new operators can fail to gather the requisite info if the cluster degrades before those steps.

So there's not much utility in adding the ClusterOperators before we get to the operator namespace, but outside of this one alerting issue it shouldn't hurt either. And a critical ClusterOperatorDown firing because the ClusterOperator has no conditions on it is an overly jumpy alert. We shouldn't be paging admins at midnight unless we have a significant signal, like ClusterOperatorDegraded firing because ClusterVersion is Failing=True because a 2nd-level operator's Deployment is mad, or ClusterOperatorDown firing because a ClusterOperator has Available=False. "Degraded unset for a few minutes" isn't a big deal. "Degraded unset for days" might be, but that's not what's going on here.

[1]: https://github.com/openshift/cluster-version-operator/pull/318
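To separate the real signal from the noise case above, you can ask the per-condition metric which operators are explicitly reporting Available=False. A hedged sketch: cluster_operator_conditions and its name/condition labels come from the CVO metrics linked earlier, but treat the exact query as an assumption and substitute your own cluster's monitoring route and token:

$ curl -s -k -H "Authorization: Bearer $token" \
    "https://prometheus-k8s-openshift-monitoring.apps.<cluster-domain>/api/v1/query" \
    --data-urlencode 'query=cluster_operator_conditions{condition="Available"} == 0' \
    | jq -r '.data.result[].metric.name'

A freshly pre-created ClusterOperator with no conditions at all has no Available series to report, which is why an alert keyed to an explicit Available=False would be much less jumpy than one that fires while the object is still empty.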
And this can happen on 4.7+ installs too, due to the time it takes to bring up control-plane nodes to schedule the baremetal operator [1]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.8/1364579724289904640/artifacts/e2e-gcp/clusteroperators.json | jq -r '.items[].metadata | select(.name == "baremetal").creationTimestamp'
2021-02-24T14:21:31Z
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.8/1364579724289904640/artifacts/e2e-gcp/deployments.json | jq -r '.items[].metadata | select(.name | contains("metal")).creationTimestamp'
2021-02-24T14:22:14Z
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.8/1364579724289904640/artifacts/e2e-gcp/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-machine-api") | .firstTimestamp + " " + (.count | tostring) + " " + .reason + ": " + .message' | grep metal | sort
2021-02-24T14:22:14Z 1 ScalingReplicaSet: Scaled up replica set cluster-baremetal-operator-9fbdd6bc4 to 1
2021-02-24T14:22:14Z 1 SuccessfulCreate: Created pod: cluster-baremetal-operator-9fbdd6bc4-xbltg
2021-02-24T14:28:28Z 1 Scheduled: Successfully assigned openshift-machine-api/cluster-baremetal-operator-9fbdd6bc4-xbltg to ci-op-yfm17yzy-df208-55dcw-master-1
2021-02-24T14:28:28Z 7 FailedMount: MountVolume.SetUp failed for volume "cluster-baremetal-operator-tls" : secret "cluster-baremetal-operator-tls" not found
2021-02-24T14:30:04Z 1 Created: Created container cluster-baremetal-operator
2021-02-24T14:30:05Z 1 Started: Started container cluster-baremetal-operator

So the ClusterOperator exists at 2021-02-24T14:21:31Z, but the operator container doesn't come up until 14:30:05Z, and then it maybe takes a little longer to set ClusterOperator conditions, and the alert fires.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.8/1364579724289904640

Reproducing it by upgrading a cluster from 4.6.17 to 4.7.0.

Steps to reproduce:

1. Install a GCP cluster with 4.6.17.

2. Scale up machines:

# oc get machinesets.machine.openshift.io -n openshift-machine-api
NAME                            DESIRED   CURRENT   READY   AVAILABLE   AGE
yangyang0318-1-2qf75-worker-a   1         1         1       1           67m
yangyang0318-1-2qf75-worker-b   1         1         1       1           67m
yangyang0318-1-2qf75-worker-c   1         1         1       1           67m
yangyang0318-1-2qf75-worker-f   0         0                             67m

# for i in {a,b,c}; do oc -n openshift-machine-api patch machineset yangyang0318-1-2qf75-worker-$i --type json -p '[{"op": "add", "path": "/spec/replicas", "value": 18}]'; done

Wait for the machinesets to scale up:

# oc get machinesets.machine.openshift.io -n openshift-machine-api
NAME                            DESIRED   CURRENT   READY   AVAILABLE   AGE
yangyang0318-1-2qf75-worker-a   18        18        18      18          81m
yangyang0318-1-2qf75-worker-b   18        18        18      18          81m
yangyang0318-1-2qf75-worker-c   18        18        18      18          81m
yangyang0318-1-2qf75-worker-f   0         0                             81m

3. Change the channel to candidate-4.7:

# oc edit clusterversion
clusterversion.config.openshift.io/version edited

4. Upgrade the cluster:

# oc adm upgrade --to=4.7.0
Updating to 4.7.0

5. Check clusterversion:

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.17    True        True          12m     Working towards 4.7.0: 96 of 668 done (14% complete)

6. Check the alert:

# curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.yangyang0318-1.qe.gcp.devcluster.openshift.com/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDown")'
{
  "labels": {
    "alertname": "ClusterOperatorDown",
    "endpoint": "metrics",
    "instance": "10.0.0.5:9099",
    "job": "cluster-version-operator",
    "name": "baremetal",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-6d599b9d58-8k56q",
    "service": "cluster-version-operator",
    "severity": "critical"
  },
  "annotations": {
    "message": "Cluster operator baremetal has not been available for 10 minutes. Operator may be down or disabled, cluster will not be kept up to date and upgrades will not be possible."
  },
  "state": "firing",
  "activeAt": "2021-03-18T07:10:59.21303266Z",
  "value": "0e+00"
}

A critical ClusterOperatorDown alert for baremetal is firing.
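To measure the window the alert has to ride out during a reproduction like this, compare the baremetal ClusterOperator's creation time with the moment its operator container actually starts. A rough sketch of mine; the field paths mirror the jq queries above but are an assumption, not taken from the bug:

# oc get clusteroperator baremetal -o jsonpath='{.metadata.creationTimestamp}{"\n"}'
# oc get pods -n openshift-machine-api -o json \
    | jq -r '.items[] | select(.metadata.name | startswith("cluster-baremetal-operator")) | .status.containerStatuses[0].state.running.startedAt'

If the gap between the two timestamps (plus however long the operator then takes to publish Available=True) exceeds the alert's ten-minute window, ClusterOperatorDown fires.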
Verifying with 4.8.0-0.nightly-2021-03-19-061759.

Install a cluster with 4.8.0-0.nightly-2021-03-19-061759 successfully.

# oc get co | grep baremetal
baremetal   4.8.0-0.nightly-2021-03-19-061759   True   False   False   102m

[root@preserve-yangyangmerrn-1 resources]# cat clusteroperators.json | jq -r '.items[].metadata | select(.name == "baremetal").creationTimestamp'
2021-03-19T07:32:19Z

[root@preserve-yangyangmerrn-1 resources]# cat events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-machine-api") | .firstTimestamp + " " + (.count | tostring) + " " + .reason + ": " + .message' | grep metal | sort
2021-03-19T07:24:47Z 1 ScalingReplicaSet: Scaled up replica set cluster-baremetal-operator-7bd74c5579 to 1
2021-03-19T07:24:47Z 1 SuccessfulCreate: Created pod: cluster-baremetal-operator-7bd74c5579-2mkjf
2021-03-19T07:30:46Z 7 FailedMount: MountVolume.SetUp failed for volume "cert" : secret "cluster-baremetal-webhook-server-cert" not found
2021-03-19T07:30:46Z 7 FailedMount: MountVolume.SetUp failed for volume "cluster-baremetal-operator-tls" : secret "cluster-baremetal-operator-tls" not found
2021-03-19T07:32:16Z 1 Created: Created container cluster-baremetal-operator
2021-03-19T07:32:17Z 1 Started: Started container cluster-baremetal-operator
null Scheduled: Successfully assigned openshift-machine-api/cluster-baremetal-operator-7bd74c5579-2mkjf to yangyag0319-1-888f6-master-0.c.openshift-qe.internal

Looks like the baremetal ClusterOperator is no longer pre-created: its creationTimestamp (2021-03-19T07:32:19Z) now trails the operator container start (2021-03-19T07:32:17Z). Moving it to verified state.

Turns out it will be the 4.6.z PR that has an impact, so no need for a note on this one.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438