https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1387809347593048064

machine-config-operator was unavailable for 23m and fired critical alerts as a result. Operators must not go unavailable during upgrade (and Degraded should only be reported when abnormal behavior persists during the upgrade; normal machine rollout does not qualify).

Urgent because 30% of 4.8 gcp-upgrade runs report this failure:
https://search.ci.openshift.org/?search=alert+ClusterOperatorDown+fired.*machine-config&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
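For anyone wanting to check a live cluster for the same condition by hand, here is a minimal sketch against the Prometheus HTTP API. It assumes the CVO-exported metric is cluster_operator_up (0 = operator unavailable) and that PROM_URL/PROM_TOKEN are set for your cluster; adjust as needed.

    # Minimal sketch: ask Prometheus which operators the CVO currently reports as down.
    # Assumes the metric name cluster_operator_up and env vars PROM_URL / PROM_TOKEN.
    import os
    import requests

    PROM_URL = os.environ["PROM_URL"]   # e.g. the openshift-monitoring Prometheus route
    TOKEN = os.environ["PROM_TOKEN"]    # e.g. output of `oc whoami -t`

    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": 'cluster_operator_up{name="machine-config"} == 0'},
        headers={"Authorization": f"Bearer {TOKEN}"},
        verify=False,  # CI clusters often use self-signed certs; tighten for real use
    )
    resp.raise_for_status()
    for result in resp.json()["data"]["result"]:
        # Each result carries the operator name label and the sample timestamp.
        print(result["metric"]["name"], "reported down at", result["value"][0])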
*** Bug 1948088 has been marked as a duplicate of this bug. ***
Dropping some snippets from [1] for Sippy and other searchers:

disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success 1h8m30s
Apr 29 19:06:20.432: Unexpected alerts fired or pending during the upgrade:

alert ClusterOperatorDown fired for 510 seconds with labels: {endpoint="metrics", instance="10.0.0.5:9099", job="cluster-version-operator", name="machine-config", namespace="openshift-cluster-version", pod="cluster-version-operator-86bd9bcc78-qjqzq", service="cluster-version-operator", severity="critical", version="4.8.0-0.ci-2021-04-29-134412"}

[bz-Machine Config Operator] clusteroperator/machine-config should not change condition/Available
Run #0: Failed 0s
2 unexpected clusteroperator state transitions during e2e test run

Apr 29 18:40:18.533 - 1394s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.8.0-0.ci-2021-04-29-134412

1 tests failed during this blip (2021-04-29 18:40:18.533880956 +0000 UTC to 2021-04-29 18:40:18.533880956 +0000 UTC): [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1387809347593048064
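If you want to watch those Available condition transitions outside the e2e framework while reproducing an upgrade, a rough polling helper along these lines works (assumes `oc` is on PATH and logged into the cluster; the loop is just a sketch, not how the origin test records blips):

    # Rough helper: poll the machine-config clusteroperator and print whenever
    # the Available or Degraded condition changes.
    import json
    import subprocess
    import time

    last = None
    while True:
        out = subprocess.run(
            ["oc", "get", "clusteroperator", "machine-config", "-o", "json"],
            capture_output=True, text=True, check=True,
        ).stdout
        conditions = {
            c["type"]: c["status"] for c in json.loads(out)["status"]["conditions"]
        }
        snapshot = (conditions.get("Available"), conditions.get("Degraded"))
        if snapshot != last:
            print(time.strftime("%H:%M:%S"),
                  "Available:", snapshot[0], "Degraded:", snapshot[1])
            last = snapshot
        time.sleep(10)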
This really should have been a 4.8.0 blocker, but that intent was never conveyed to the assignees. I'm marking this as a blocker for 4.9.0 and would request that we backport the fix to 4.8 as soon as reasonable. We really need to get rid of the negative signal we generate during upgrades from operators going Degraded or unavailable during normal operation.
For my own ref: https://search.ci.openshift.org/?search=alert+ClusterOperatorDown+fired.*machine-config&maxAge=168h&context=1&type=bug%2Bjunit&name=upgrade&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Across 7 days:
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 13 runs, 62% failed, 25% of failures match = 15% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade (all) - 27 runs, 56% failed, 33% of failures match = 19% impact

Across 14 days:
periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade (all) - 64 runs, 64% failed, 12% of failures match = 8% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 23 runs, 61% failed, 21% of failures match = 13% impact
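For anyone double-checking the arithmetic behind these impact figures: the two percentages simply multiply (the search tool works from raw run counts and rounds its percentages, so recomputed values land within a point of what it reports).

    # impact = fraction of runs that failed * fraction of those failures matching the search
    def impact(failed_pct, match_pct):
        """Percent of all runs affected by this particular failure."""
        return failed_pct * match_pct / 100

    # 4.8 nightly aws-upgrade, 7 days: 62% failed, 25% of failures match
    print(impact(62, 25))  # ~15.5, reported as 15% impact
    # 4.9 nightly aws-upgrade, 7 days: 56% failed, 33% of failures match
    print(impact(56, 33))  # ~18.5, reported as 19% impact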
*** Bug 1943289 has been marked as a duplicate of this bug. ***
7 days:
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 14 runs, 43% failed, 17% of failures match = 7% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade (all) - 25 runs, 72% failed, 6% of failures match = 4% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 408 runs, 86% failed, 1% of failures match = 1% impact

14 days:
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 36 runs, 47% failed, 24% of failures match = 11% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade (all) - 59 runs, 66% failed, 10% of failures match = 7% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 803 runs, 78% failed, 1% of failures match = 1% impact
Discussed with Kirsten. A lot of this is iterative: there were likely other contributing factors at the time this bug was filed that drove the level of failure originally observed. The more recent numbers show improvement, partly from general platform stability (comparing earlier 4.8 results against current 4.8 results), and Kirsten has further improvements she will land for 4.9 code freeze. As long as those continue to show improvement we should consider this VERIFIED, and a Jira will be opened to track additional improvements in 4.10 and the effort to backport the whole set to 4.8. In other words, for 4.9 we may still see some MCO unavailability.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056
*** Bug 2000937 has been marked as a duplicate of this bug. ***