https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1387809347593048064

machine-config-operator was unavailable for 23m and fired critical alerts as a result. Operators must not go unavailable during upgrade (and Degraded should only be reported when abnormal behavior persists during the upgrade; normal machine rollout does not qualify).

Urgent because 30% of 4.8 gcp-upgrade runs report this failure:
https://search.ci.openshift.org/?search=alert+ClusterOperatorDown+fired.*machine-config&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
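For anyone wanting to check a live cluster for the same condition by hand, here is a minimal sketch against the Prometheus HTTP API. It assumes the CVO-exported metric is cluster_operator_up (0 = operator unavailable) and that PROM_URL/PROM_TOKEN are set for your cluster; adjust as needed.

    # Minimal sketch: ask Prometheus which operators the CVO currently reports as down.
    # Assumes the metric name cluster_operator_up and env vars PROM_URL / PROM_TOKEN.
    import os
    import requests

    PROM_URL = os.environ["PROM_URL"]   # e.g. the openshift-monitoring Prometheus route
    TOKEN = os.environ["PROM_TOKEN"]    # e.g. output of `oc whoami -t`

    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": 'cluster_operator_up{name="machine-config"} == 0'},
        headers={"Authorization": f"Bearer {TOKEN}"},
        verify=False,  # CI clusters often use self-signed certs; tighten for real use
    )
    resp.raise_for_status()
    for result in resp.json()["data"]["result"]:
        # Each result carries the operator name label and the sample timestamp.
        print(result["metric"]["name"], "reported down at", result["value"][0])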
*** Bug 1948088 has been marked as a duplicate of this bug. ***
Dropping some snippets from [1] for Sippy and other searchers:

disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success 1h8m30s
Apr 29 19:06:20.432: Unexpected alerts fired or pending during the upgrade:

alert ClusterOperatorDown fired for 510 seconds with labels: {endpoint="metrics", instance="10.0.0.5:9099", job="cluster-version-operator", name="machine-config", namespace="openshift-cluster-version", pod="cluster-version-operator-86bd9bcc78-qjqzq", service="cluster-version-operator", severity="critical", version="4.8.0-0.ci-2021-04-29-134412"}

[bz-Machine Config Operator] clusteroperator/machine-config should not change condition/Available
Run #0: Failed 0s
2 unexpected clusteroperator state transitions during e2e test run

Apr 29 18:40:18.533 - 1394s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.8.0-0.ci-2021-04-29-134412

1 tests failed during this blip (2021-04-29 18:40:18.533880956 +0000 UTC to 2021-04-29 18:40:18.533880956 +0000 UTC): [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1387809347593048064
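If you want to watch those Available condition transitions outside the e2e framework while reproducing an upgrade, a rough polling helper along these lines works (assumes `oc` is on PATH and logged into the cluster; the loop is just a sketch, not how the origin test records blips):

    # Rough helper: poll the machine-config clusteroperator and print whenever
    # the Available or Degraded condition changes.
    import json
    import subprocess
    import time

    last = None
    while True:
        out = subprocess.run(
            ["oc", "get", "clusteroperator", "machine-config", "-o", "json"],
            capture_output=True, text=True, check=True,
        ).stdout
        conditions = {
            c["type"]: c["status"] for c in json.loads(out)["status"]["conditions"]
        }
        snapshot = (conditions.get("Available"), conditions.get("Degraded"))
        if snapshot != last:
            print(time.strftime("%H:%M:%S"),
                  "Available:", snapshot[0], "Degraded:", snapshot[1])
            last = snapshot
        time.sleep(10)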
This really should have been a 4.8.0 blocker, but that intent was never conveyed to the assignees. I'm marking this as a blocker for 4.9.0 and would request that we backport the fix to 4.8 as soon as reasonable. We really need to get rid of the negative signal we generate during upgrades from operators going Degraded or unavailable during normal operation.
For my own ref: https://search.ci.openshift.org/?search=alert+ClusterOperatorDown+fired.*machine-config&maxAge=168h&context=1&type=bug%2Bjunit&name=upgrade&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Across 7 days:
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 13 runs, 62% failed, 25% of failures match = 15% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade (all) - 27 runs, 56% failed, 33% of failures match = 19% impact

Across 14 days:
periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade (all) - 64 runs, 64% failed, 12% of failures match = 8% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 23 runs, 61% failed, 21% of failures match = 13% impact
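For anyone double-checking the arithmetic behind these impact figures: the two percentages simply multiply (the search tool works from raw run counts and rounds its percentages, so recomputed values land within a point of what it reports).

    # impact = fraction of runs that failed * fraction of those failures matching the search
    def impact(failed_pct, match_pct):
        """Percent of all runs affected by this particular failure."""
        return failed_pct * match_pct / 100

    # 4.8 nightly aws-upgrade, 7 days: 62% failed, 25% of failures match
    print(impact(62, 25))  # ~15.5, reported as 15% impact
    # 4.9 nightly aws-upgrade, 7 days: 56% failed, 33% of failures match
    print(impact(56, 33))  # ~18.5, reported as 19% impact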
*** Bug 1943289 has been marked as a duplicate of this bug. ***
7 days:
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 14 runs, 43% failed, 17% of failures match = 7% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade (all) - 25 runs, 72% failed, 6% of failures match = 4% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 408 runs, 86% failed, 1% of failures match = 1% impact

14 days:
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 36 runs, 47% failed, 24% of failures match = 11% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade (all) - 59 runs, 66% failed, 10% of failures match = 7% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 803 runs, 78% failed, 1% of failures match = 1% impact
Discussed with Kirsten. A lot of this is iterative: there were likely other contributing factors at the time this bug was filed that drove the level of failure originally observed. The more recent numbers show improvement, partly from general platform stability (comparing earlier 4.8 results against current 4.8 results), and Kirsten has further improvements she will land for 4.9 code freeze. As long as those continue to show improvement we should consider this VERIFIED, and a Jira will be opened to track additional improvements in 4.10 and the effort to backport the whole set to 4.8. In other words, for 4.9 we may still see some MCO unavailability.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056
*** Bug 2000937 has been marked as a duplicate of this bug. ***