Bug 1929917
Summary: | Bare-metal operator is firing for ClusterOperatorDown for 15m during 4.6 to 4.7 upgrade | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman>
Component: | Cluster Version Operator | Assignee: | W. Trevor King <wking>
Status: | CLOSED ERRATA | QA Contact: | Yang Yang <yanyang>
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | 4.7 | CC: | aos-bugs, dhellmann, hongkliu, jokerman, lmohanty, travi, wking, yanyang
Target Milestone: | --- | |
Target Release: | 4.8.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | No Doc Update
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2021-07-27 22:45:14 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1941217 | |
Description Clayton Coleman 2021-02-17 22:43:14 UTC
ALERTS{alertname="ClusterOperatorDown", alertstate="firing", endpoint="metrics", instance="10.0.191.240:9099", job="cluster-version-operator", name="baremetal", namespace="openshift-cluster-version", pod="cluster-version-operator-757d457d65-4b794", service="cluster-version-operator", severity="critical"} is the firing alert. Team informed me that there is an instance, and I realized if you are scaled to zero the alert won't fire, so the fix is not to crash.

cluster_operator_up should be "ClusterOperator is Available=True and Degraded!=True" [1]. What does that have to do with scaling or crashing? (As an aside, I don't see any information in cluster_operator_up that is not already covered more granularly in cluster_operator_conditions.)

[1]: https://github.com/openshift/cluster-version-operator/blob/3640003453ad3186ce9ce360e4788bcfbacbd5e0/pkg/cvo/metrics.go#L374-L382

On the other hand, nothing alarming in:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci/1362089565350793216/artifacts/e2e-aws-upgrade/openshift-e2e-test/build-log.txt | grep clusteroperator/baremetal
Feb 17 18:02:59.679 I clusteroperator/baremetal created
Feb 17 18:32:10.085 W clusteroperator/baremetal changed Available to True: AsExpected: Operational
Feb 17 18:32:10.085 W clusteroperator/baremetal changed Disabled to True: UnsupportedPlatform: Nothing to do on this Platform

So I'm clearly missing something. Ahh, just the long window between the CVO pre-creating the ClusterOperator and the metal operator coming around to fill it in. This might actually be a CVO bug or a test-suite bug. I wonder when we created the metal pod...

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci/1362089565350793216/artifacts/e2e-aws-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-machine-api" and (.involvedObject.name | startswith("cluster-baremetal-operator"))) | .lastTimestamp + " " + (.count | tostring) + " " + .reason + ": " + .message' | sort
null Scheduled: Successfully assigned openshift-machine-api/cluster-baremetal-operator-66dd945996-72lxc to ip-10-0-243-68.ec2.internal
null Scheduled: Successfully assigned openshift-machine-api/cluster-baremetal-operator-66dd945996-gfzgr to ip-10-0-179-43.ec2.internal
2021-02-17T18:32:02Z 1 ScalingReplicaSet: Scaled up replica set cluster-baremetal-operator-66dd945996 to 1
2021-02-17T18:32:03Z 1 SuccessfulCreate: Created pod: cluster-baremetal-operator-66dd945996-72lxc
...

So yeah, not a metal bug if the CVO is freaking out before it has even scheduled the metal operator. Moving to the CVO.

Agree this would be a CVO bug if baremetal is filling out its info like everyone else, but ONLY if the baremetal operator is filling itself out exactly like the insights operator.

The CVO creation of ClusterOperator landed in [1], which went out with 4.4's GA. We've had the existing cluster_operator_up and ClusterOperatorDown forever, so the CVO's vulnerability to this sort of thing is old. The baremetal operator is new in 4.7, and it takes long enough in a 4.6->4.7 update for the CVO to get to the metal operator that the alert fires.

[1]: https://github.com/openshift/cluster-version-operator/pull/318
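Since the alert is tripping on a pre-created, condition-less ClusterOperator, one quick way to spot operators still in that state on a live cluster is to filter on an empty status.conditions array. This is just a sketch of mine, not something from the CI tooling above:

$ oc get clusteroperators -o json \
    | jq -r '.items[] | select((.status.conditions // []) | length == 0) | .metadata.name'

Anything this prints has had its ClusterOperator object created (typically by the CVO) but has not yet been claimed by its operator, which is exactly the window discussed here.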
The metal trigger only impacts 4.6->4.7 updates, because that's the only situation where the CVO creates the metal ClusterOperator (very early in the update) but delays the metal operator deployment until later in the update manifest graph [1]. So marking blocker-, because we are likely to get to 4.7.1 or higher before we have large volumes of production clusters flowing from 4.6 -> 4.7, and while false-positive critical alerts are annoying, they aren't bad enough to make us push back GA, and we're about out of time to get things into 4.7 before GA.

[1]: https://github.com/openshift/cluster-version-operator/blob/bac69e03599f8c57f46585246618e462c31c2e93/docs/user/reconciliation.md#manifest-graph

In the future this behavior will be prohibited for new ClusterOperators entering the payload as we tighten what is allowed to fire (this is pure noise). I'm actually not clear why a new ClusterOperator should be moved to the front of an upgrade; that's not obviously correct.

The CVO didn't use to create ClusterOperators at all. It learned to do that in [1], where you did a good job summarizing the motivation:

> The must-gather and insights operator depend on cluster operators and related objects in order to identify resources to create. Because cluster operators are delegated to the operator install and upgrade failures of new operators can fail to gather the requisite info if the cluster degrades before those steps.

So there's not much utility in adding the ClusterOperators before we get to the operator namespace, but outside of this one alerting issue it shouldn't hurt either. And a critical ClusterOperatorDown firing because the ClusterOperator has no conditions on it is an overly jumpy alert. We shouldn't be paging admins at midnight unless we have a significant signal, like ClusterOperatorDegraded firing because ClusterVersion is Failing=True because a 2nd-level operator's Deployment is mad, or ClusterOperatorDown firing because a ClusterOperator has Available=False. "Degraded unset for a few minutes" isn't a big deal. "Degraded unset for days" might be, but that's not what's going on here.

[1]: https://github.com/openshift/cluster-version-operator/pull/318
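To separate the real signal from the noise case above, you can ask the per-condition metric which operators are explicitly reporting Available=False. A hedged sketch: cluster_operator_conditions and its name/condition labels come from the CVO metrics linked earlier, but treat the exact query as an assumption and substitute your own cluster's monitoring route and token:

$ curl -s -k -H "Authorization: Bearer $token" \
    "https://prometheus-k8s-openshift-monitoring.apps.<cluster-domain>/api/v1/query" \
    --data-urlencode 'query=cluster_operator_conditions{condition="Available"} == 0' \
    | jq -r '.data.result[].metric.name'

A freshly pre-created ClusterOperator with no conditions at all has no Available series to report, which is why an alert keyed to an explicit Available=False would be much less jumpy than one that fires while the object is still empty.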
And this can happen on 4.7+ installs too, due to the time it takes to bring up control-plane nodes to schedule the baremetal operator [1]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.8/1364579724289904640/artifacts/e2e-gcp/clusteroperators.json | jq -r '.items[].metadata | select(.name == "baremetal").creationTimestamp'
2021-02-24T14:21:31Z
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.8/1364579724289904640/artifacts/e2e-gcp/deployments.json | jq -r '.items[].metadata | select(.name | contains("metal")).creationTimestamp'
2021-02-24T14:22:14Z
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.8/1364579724289904640/artifacts/e2e-gcp/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-machine-api") | .firstTimestamp + " " + (.count | tostring) + " " + .reason + ": " + .message' | grep metal | sort
2021-02-24T14:22:14Z 1 ScalingReplicaSet: Scaled up replica set cluster-baremetal-operator-9fbdd6bc4 to 1
2021-02-24T14:22:14Z 1 SuccessfulCreate: Created pod: cluster-baremetal-operator-9fbdd6bc4-xbltg
2021-02-24T14:28:28Z 1 Scheduled: Successfully assigned openshift-machine-api/cluster-baremetal-operator-9fbdd6bc4-xbltg to ci-op-yfm17yzy-df208-55dcw-master-1
2021-02-24T14:28:28Z 7 FailedMount: MountVolume.SetUp failed for volume "cluster-baremetal-operator-tls" : secret "cluster-baremetal-operator-tls" not found
2021-02-24T14:30:04Z 1 Created: Created container cluster-baremetal-operator
2021-02-24T14:30:05Z 1 Started: Started container cluster-baremetal-operator

So the ClusterOperator exists at 2021-02-24T14:21:31Z, but the operator container doesn't come up until 14:30:05Z, and then it maybe takes a little longer to set ClusterOperator conditions, and the alert fires.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.8/1364579724289904640

Reproducing it by upgrading a cluster from 4.6.17 to 4.7.0.

Steps to reproduce:

1. Install a GCP cluster with 4.6.17.

2. Scale up machines:

# oc get machinesets.machine.openshift.io -n openshift-machine-api
NAME                            DESIRED   CURRENT   READY   AVAILABLE   AGE
yangyang0318-1-2qf75-worker-a   1         1         1       1           67m
yangyang0318-1-2qf75-worker-b   1         1         1       1           67m
yangyang0318-1-2qf75-worker-c   1         1         1       1           67m
yangyang0318-1-2qf75-worker-f   0         0                             67m

# for i in {a,b,c}; do oc -n openshift-machine-api patch machineset yangyang0318-1-2qf75-worker-$i --type json -p '[{"op": "add", "path": "/spec/replicas", "value": 18}]'; done

Wait for the machinesets to scale up:

# oc get machinesets.machine.openshift.io -n openshift-machine-api
NAME                            DESIRED   CURRENT   READY   AVAILABLE   AGE
yangyang0318-1-2qf75-worker-a   18        18        18      18          81m
yangyang0318-1-2qf75-worker-b   18        18        18      18          81m
yangyang0318-1-2qf75-worker-c   18        18        18      18          81m
yangyang0318-1-2qf75-worker-f   0         0                             81m

3. Change the channel to candidate-4.7:

# oc edit clusterversion
clusterversion.config.openshift.io/version edited

4. Upgrade the cluster:

# oc adm upgrade --to=4.7.0
Updating to 4.7.0

5. Check clusterversion:

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.17    True        True          12m     Working towards 4.7.0: 96 of 668 done (14% complete)

6. Check the alert:

# curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.yangyang0318-1.qe.gcp.devcluster.openshift.com/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDown")'
{
  "labels": {
    "alertname": "ClusterOperatorDown",
    "endpoint": "metrics",
    "instance": "10.0.0.5:9099",
    "job": "cluster-version-operator",
    "name": "baremetal",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-6d599b9d58-8k56q",
    "service": "cluster-version-operator",
    "severity": "critical"
  },
  "annotations": {
    "message": "Cluster operator baremetal has not been available for 10 minutes. Operator may be down or disabled, cluster will not be kept up to date and upgrades will not be possible."
  },
  "state": "firing",
  "activeAt": "2021-03-18T07:10:59.21303266Z",
  "value": "0e+00"
}

A critical ClusterOperatorDown alert for baremetal is firing.
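To measure the window the alert has to ride out during a reproduction like this, compare the baremetal ClusterOperator's creation time with the moment its operator container actually starts. A rough sketch of mine; the field paths mirror the jq queries above but are an assumption, not taken from the bug:

# oc get clusteroperator baremetal -o jsonpath='{.metadata.creationTimestamp}{"\n"}'
# oc get pods -n openshift-machine-api -o json \
    | jq -r '.items[] | select(.metadata.name | startswith("cluster-baremetal-operator")) | .status.containerStatuses[0].state.running.startedAt'

If the gap between the two timestamps (plus however long the operator then takes to publish Available=True) exceeds the alert's ten-minute window, ClusterOperatorDown fires.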
Verifying with 4.8.0-0.nightly-2021-03-19-061759.

Install a cluster with 4.8.0-0.nightly-2021-03-19-061759 successfully.

# oc get co | grep baremetal
baremetal   4.8.0-0.nightly-2021-03-19-061759   True   False   False   102m

[root@preserve-yangyangmerrn-1 resources]# cat clusteroperators.json | jq -r '.items[].metadata | select(.name == "baremetal").creationTimestamp'
2021-03-19T07:32:19Z

[root@preserve-yangyangmerrn-1 resources]# cat events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-machine-api") | .firstTimestamp + " " + (.count | tostring) + " " + .reason + ": " + .message' | grep metal | sort
2021-03-19T07:24:47Z 1 ScalingReplicaSet: Scaled up replica set cluster-baremetal-operator-7bd74c5579 to 1
2021-03-19T07:24:47Z 1 SuccessfulCreate: Created pod: cluster-baremetal-operator-7bd74c5579-2mkjf
2021-03-19T07:30:46Z 7 FailedMount: MountVolume.SetUp failed for volume "cert" : secret "cluster-baremetal-webhook-server-cert" not found
2021-03-19T07:30:46Z 7 FailedMount: MountVolume.SetUp failed for volume "cluster-baremetal-operator-tls" : secret "cluster-baremetal-operator-tls" not found
2021-03-19T07:32:16Z 1 Created: Created container cluster-baremetal-operator
2021-03-19T07:32:17Z 1 Started: Started container cluster-baremetal-operator
null Scheduled: Successfully assigned openshift-machine-api/cluster-baremetal-operator-7bd74c5579-2mkjf to yangyag0319-1-888f6-master-0.c.openshift-qe.internal

Looks like the baremetal ClusterOperator is no longer pre-created: its creationTimestamp (2021-03-19T07:32:19Z) now trails the operator container start (2021-03-19T07:32:17Z). Moving it to verified state.

Turns out it will be the 4.6.z PR that has an impact, so no need for a note on this one.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438