Bug 1955300

Summary:	Machine config operator reports unavailable for 23m during upgrade
Product:	OpenShift Container Platform	Reporter:	Clayton Coleman <ccoleman>
Component:	Machine Config Operator	Assignee:	Kirsten Garrison <kgarriso>
Machine Config Operator sub component:	Machine Config Operator	QA Contact:	Rio Liu <rioliu>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	urgent
Priority:	medium	CC:	airshad, aos-bugs, jchaloup, jerzhang, kgarriso, mbargenq, mkrejci, mrobson, rioliu, sdodson, sregidor, travi, wking, yanyang
Version:	4.8
Target Milestone:	---
Target Release:	4.10.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-03-10 16:03:07 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	2050911

Description Clayton Coleman 2021-04-29 20:02:55 UTC

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1387809347593048064

machine-config-operator was unavailable for 23m and fired critical alerts as a result.

Operators must not report unavailable during an upgrade (and degraded should only be reported when abnormal behavior persists during the upgrade, which does not include normal machine rollout).

Urgent because 30% of gcp-upgrade runs in 4.8 report this failure https://search.ci.openshift.org/?search=alert+ClusterOperatorDown+fired.*machine-config&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
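
For anyone re-running the query above with a different window or job filter, here is a minimal sketch that rebuilds the same search.ci URL from its query parameters. The parameter names and values are copied from the link itself; the `search_url` helper is hypothetical and not part of any search.ci tooling.

```python
from urllib.parse import urlencode

BASE = "https://search.ci.openshift.org/"


def search_url(pattern, max_age="48h", name=""):
    """Build a search.ci query URL (hypothetical helper, not an official client)."""
    params = {
        "search": pattern,      # regular expression matched against CI output
        "maxAge": max_age,      # look-back window, e.g. "48h" or "168h"
        "context": "1",
        "type": "bug+junit",    # urlencode escapes '+' as %2B
        "name": name,           # optional job-name filter, e.g. "upgrade"
        "excludeName": "",
        "maxMatches": "5",
        "maxBytes": "20971520",
        "groupBy": "job",
    }
    return BASE + "?" + urlencode(params)


url = search_url("alert ClusterOperatorDown fired.*machine-config")
print(url)
```

Passing `max_age="168h"` and `name="upgrade"` gives a 7-day query restricted to upgrade jobs. The `*` in the pattern comes out percent-encoded, which decodes to the same query string.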

Comment 1 Kirsten Garrison 2021-06-03 18:39:04 UTC

*** Bug 1948088 has been marked as a duplicate of this bug. ***

Comment 2 W. Trevor King 2021-06-03 19:29:08 UTC

Dropping some snippets from [1] for Sippy and other searchers:

  disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success	1h8m30s
    Apr 29 19:06:20.432: Unexpected alerts fired or pending during the upgrade:

    alert ClusterOperatorDown fired for 510 seconds with labels: {endpoint="metrics", instance="10.0.0.5:9099", job="cluster-version-operator", name="machine-config", namespace="openshift-cluster-version", pod="cluster-version-operator-86bd9bcc78-qjqzq", service="cluster-version-operator", severity="critical", version="4.8.0-0.ci-2021-04-29-134412"}

  [bz-Machine Config Operator] clusteroperator/machine-config should not change condition/Available
    Run #0: Failed	0s
    2 unexpected clusteroperator state transitions during e2e test run 

    Apr 29 18:40:18.533 - 1394s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.8.0-0.ci-2021-04-29-134412
    1 tests failed during this blip (2021-04-29 18:40:18.533880956 +0000 UTC to 2021-04-29 18:40:18.533880956 +0000 UTC): [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1387809347593048064
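
To tie the 1394s interval in the snippet above back to the "23m" in the summary, here is a rough sketch that parses the Available=False line and computes when the condition cleared. The field layout (start time, duration, level, locator, message) is inferred from this one line only, and the year is supplied by hand because the log line omits it; both are assumptions.

```python
import re
from datetime import datetime, timedelta

LINE = (
    "Apr 29 18:40:18.533 - 1394s E clusteroperator/machine-config "
    "condition/Available status/False "
    "reason/Cluster not available for 4.8.0-0.ci-2021-04-29-134412"
)

# Assumed layout: "<Mon DD HH:MM:SS.mmm> - <N>s <level> <locator> <message>"
PATTERN = re.compile(
    r"^(?P<start>\w{3} \d{1,2} \d{2}:\d{2}:\d{2}\.\d{3}) - (?P<secs>\d+)s "
    r"(?P<level>[A-Z]) (?P<locator>\S+) (?P<message>.*)$"
)


def parse_interval(line, year):
    """Return (start, duration, locator, message) for one interval line."""
    m = PATTERN.match(line.strip())
    if m is None:
        raise ValueError("line does not match the assumed interval layout")
    start = datetime.strptime(f"{year} {m['start']}", "%Y %b %d %H:%M:%S.%f")
    duration = timedelta(seconds=int(m["secs"]))
    return start, duration, m["locator"], m["message"]


start, duration, locator, message = parse_interval(LINE, 2021)
minutes, seconds = divmod(int(duration.total_seconds()), 60)
print(f"{locator} unavailable for {minutes}m{seconds}s, until {start + duration}")
```

For this line the duration works out to 23m14s, starting at 18:40:18 and ending at 19:03:32 UTC.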

Comment 3 Scott Dodson 2021-07-15 18:55:14 UTC

This really should've been a 4.8.0 blocker, but that intent was never conveyed to the assignees. I'm marking this as a blocker for 4.9.0 and request that we backport the fix to 4.8 as soon as is reasonable. We really need to get rid of the negative signal that operators generate during upgrades by going degraded or unavailable during normal operations.

Comment 4 Kirsten Garrison 2021-07-15 19:35:04 UTC

For my own ref:

https://search.ci.openshift.org/?search=alert+ClusterOperatorDown+fired.*machine-config&maxAge=168h&context=1&type=bug%2Bjunit&name=upgrade&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Across 7 days:
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 13 runs, 62% failed, 25% of failures match = 15% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade (all) - 27 runs, 56% failed, 33% of failures match = 19% impact

Across 14 days:
periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade (all) - 64 runs, 64% failed, 12% of failures match = 8% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 23 runs, 61% failed, 21% of failures match = 13% impact
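
A note on reading the "impact" column: it appears to be the share of all runs that both failed and matched the query, roughly the failure rate times the match rate. The sketch below reproduces two of the 7-day rows from whole-run counts. Those counts (8 failed and 2 matched of 13; 15 failed and 5 matched of 27) are back-calculated guesses consistent with the rounded percentages, not values taken from the search output, and `summarize` is a hypothetical helper.

```python
def summarize(runs, failed, matched):
    """Return rounded (failed %, match % of failures, impact % of all runs)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    failed_pct = round(100 * failed / runs)
    # Share of failing runs that matched the query; 0 when nothing failed.
    match_pct = round(100 * matched / failed) if failed else 0
    # Impact: matching runs as a share of every run in the window.
    impact_pct = round(100 * matched / runs)
    return failed_pct, match_pct, impact_pct


# Assumed counts for the 7-day rows (4.8 nightly, then 4.9 nightly).
print(summarize(13, 8, 2))
print(summarize(27, 15, 5))
```

Multiplying the already-rounded percentages for the 4.9 row (56% and 33%) gives about 18% rather than the reported 19%, which suggests the figure is computed from raw counts and rounded afterwards.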

Comment 5 Kirsten Garrison 2021-07-30 21:12:03 UTC

*** Bug 1943289 has been marked as a duplicate of this bug. ***

Comment 6 Kirsten Garrison 2021-08-10 16:57:27 UTC

7 days:
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 14 runs, 43% failed, 17% of failures match = 7% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade (all) - 25 runs, 72% failed, 6% of failures match = 4% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 408 runs, 86% failed, 1% of failures match = 1% impact

14 days:
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 36 runs, 47% failed, 24% of failures match = 11% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade (all) - 59 runs, 66% failed, 10% of failures match = 7% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 803 runs, 78% failed, 1% of failures match = 1% impact

Comment 7 Scott Dodson 2021-08-17 18:23:16 UTC

Discussed with Kirsten. A lot of this is iterative: there were likely other contributing factors at the time this bug was filed that led to the level of failure originally observed. The more recent numbers show improvement due to general platform stability (comparing 4.8 against 4.8), and Kirsten has some further improvements that she'll get in before the 4.9 code freeze. As long as the numbers continue to improve, we should consider this VERIFIED; a Jira will be opened to track additional improvements in 4.10 and efforts to backport the whole set to 4.8. In other words, for 4.9 we may still see some MCO unavailability.

Comment 17 errata-xmlrpc 2022-03-10 16:03:07 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

Comment 18 Kirsten Garrison 2022-05-18 19:47:43 UTC

*** Bug 2000937 has been marked as a duplicate of this bug. ***