Bug 1955300
Summary: | Machine config operator reports unavailable for 23m during upgrade | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
Component: | Machine Config Operator | Assignee: | Kirsten Garrison <kgarriso> |
Machine Config Operator sub component: | Machine Config Operator | QA Contact: | Rio Liu <rioliu> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | urgent | ||
Priority: | medium | CC: | airshad, aos-bugs, jchaloup, jerzhang, kgarriso, mbargenq, mkrejci, mrobson, rioliu, sdodson, sregidor, travi, wking, yanyang |
Version: | 4.8 | ||
Target Milestone: | --- | ||
Target Release: | 4.10.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | No Doc Update | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2022-03-10 16:03:07 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 2050911 |
Description
Clayton Coleman
2021-04-29 20:02:55 UTC
*** Bug 1948088 has been marked as a duplicate of this bug. *** Dropping some snippets from [1] for Sippy and other searchers: disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success 1h8m30s Apr 29 19:06:20.432: Unexpected alerts fired or pending during the upgrade: alert ClusterOperatorDown fired for 510 seconds with labels: {endpoint="metrics", instance="10.0.0.5:9099", job="cluster-version-operator", name="machine-config", namespace="openshift-cluster-version", pod="cluster-version-operator-86bd9bcc78-qjqzq", service="cluster-version-operator", severity="critical", version="4.8.0-0.ci-2021-04-29-134412"} [bz-Machine Config Operator] clusteroperator/machine-config should not change condition/Available Run #0: Failed expand_less 0s 2 unexpected clusteroperator state transitions during e2e test run Apr 29 18:40:18.533 - 1394s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.8.0-0.ci-2021-04-29-134412 1 tests failed during this blip (2021-04-29 18:40:18.533880956 +0000 UTC to 2021-04-29 18:40:18.533880956 +0000 UTC): [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] [1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1387809347593048064 This really should've been a 4.8.0 blocker but that intent was never conferred to assignees. I'm marking this as a blocker for 4.9.0 and would request that we backport this to 4.8 as soon as reasonable. We really need to get rid of negative signal that we generate during upgrades by operators going degraded or unavailable during normal operations. For my own ref: https://search.ci.openshift.org/?search=alert+ClusterOperatorDown+fired.*machine-config&maxAge=168h&context=1&type=bug%2Bjunit&name=upgrade&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job Across 7 days: periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 13 runs, 62% failed, 25% of failures match = 15% impact periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade (all) - 27 runs, 56% failed, 33% of failures match = 19% impact Across 14days: periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade (all) - 64 runs, 64% failed, 12% of failures match = 8% impact periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 23 runs, 61% failed, 21% of failures match = 13% impact *** Bug 1943289 has been marked as a duplicate of this bug. *** 7 days: periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 14 runs, 43% failed, 17% of failures match = 7% impact periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade (all) - 25 runs, 72% failed, 6% of failures match = 4% impact periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 408 runs, 86% failed, 1% of failures match = 1% impact 14 days: periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 36 runs, 47% failed, 24% of failures match = 11% impact periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade (all) - 59 runs, 66% failed, 10% of failures match = 7% impact periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 803 runs, 78% failed, 1% of failures match = 1% impact Discussed with Kirsten. A lot of this is iterative in that there were likely other contributing factors at the time this bug was filed which led to the level of failure originally observed. The more recent numbers show improvement both due to general platform stability (by comparing 4.8 against 4.8) and Kirsten has some improvements that she'll get in for 4.9 code freeze. As long as those continue to show improvement we should consider this VERIFIED and a Jira will be opened to track additional improvements in 4.10 and efforts to backport the whole set to 4.8. In other words, for 4.9 we may still see some MCO unavailability. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056 *** Bug 2000937 has been marked as a duplicate of this bug. *** |