Bug 1955300 - Machine config operator reports unavailable for 23m during upgrade
Summary: Machine config operator reports unavailable for 23m during upgrade
Keywords:
Status: VERIFIED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: urgent
Target Milestone: ---
Target Release: 4.10.0
Assignee: Kirsten Garrison
QA Contact: Rio Liu
URL:
Whiteboard:
Duplicates: 1943289 1948088
Depends On:
Blocks:
 
Reported: 2021-04-29 20:02 UTC by Clayton Coleman
Modified: 2021-10-08 10:08 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:


Attachments


Links
Github openshift/machine-config-operator pull 2721 (Merged): Bug 1955300: tighten operator availability conditions (last updated 2021-10-04 18:56:36 UTC)
Github openshift/machine-config-operator pull 2728 (Merged): Bug 1955300: operator: add event on degraded and unavailable status (last updated 2021-10-04 18:56:40 UTC)

Description Clayton Coleman 2021-04-29 20:02:55 UTC
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1387809347593048064

machine-config-operator was unavailable for 23m and fired critical alerts as a result.

Operators must not report Available=False during upgrade (and Degraded should only be reported when abnormal behavior persists during the upgrade, which does not include a normal machine rollout).

Urgent because 30% of gcp-upgrade runs in 4.8 report this failure: https://search.ci.openshift.org/?search=alert+ClusterOperatorDown+fired.*machine-config&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
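
Illustrative sketch only (not the merged change from the linked PR 2721, whose title is "tighten operator availability conditions"): one generic way to avoid flapping Available=False during a normal rollout is to debounce the condition behind a grace period. The function name, reasons, and 10-minute threshold below are all invented for illustration.

// Hypothetical sketch, not the actual machine-config-operator code: only report
// Available=False once a failure has persisted past a grace period, so a normal
// machine rollout during upgrade does not trip ClusterOperatorDown.
package main

import (
    "fmt"
    "time"
)

// unavailableGracePeriod is an invented threshold used only for illustration.
const unavailableGracePeriod = 10 * time.Minute

// computeAvailable decides the Available condition from when the operator first
// observed a failure (a zero time means no failure is currently observed).
func computeAvailable(failingSince, now time.Time) (available bool, reason string) {
    if failingSince.IsZero() {
        return true, "AsExpected"
    }
    if now.Sub(failingSince) < unavailableGracePeriod {
        // A failure was seen, but it may just be an in-progress rollout; keep
        // reporting Available until it persists beyond the grace period.
        return true, "UpdatingWithinGracePeriod"
    }
    return false, "FailurePersisted"
}

func main() {
    now := time.Now()
    fmt.Println(computeAvailable(time.Time{}, now))              // true AsExpected
    fmt.Println(computeAvailable(now.Add(-3*time.Minute), now))  // true UpdatingWithinGracePeriod
    fmt.Println(computeAvailable(now.Add(-25*time.Minute), now)) // false FailurePersisted
}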

Comment 1 Kirsten Garrison 2021-06-03 18:39:04 UTC
*** Bug 1948088 has been marked as a duplicate of this bug. ***

Comment 2 W. Trevor King 2021-06-03 19:29:08 UTC
Dropping some snippets from [1] for Sippy and other searchers:

  disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success	1h8m30s
    Apr 29 19:06:20.432: Unexpected alerts fired or pending during the upgrade:

    alert ClusterOperatorDown fired for 510 seconds with labels: {endpoint="metrics", instance="10.0.0.5:9099", job="cluster-version-operator", name="machine-config", namespace="openshift-cluster-version", pod="cluster-version-operator-86bd9bcc78-qjqzq", service="cluster-version-operator", severity="critical", version="4.8.0-0.ci-2021-04-29-134412"}

  [bz-Machine Config Operator] clusteroperator/machine-config should not change condition/Available
    Run #0: Failed (0s)
    2 unexpected clusteroperator state transitions during e2e test run 

    Apr 29 18:40:18.533 - 1394s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.8.0-0.ci-2021-04-29-134412
    1 tests failed during this blip (2021-04-29 18:40:18.533880956 +0000 UTC to 2021-04-29 18:40:18.533880956 +0000 UTC): [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1387809347593048064

Comment 3 Scott Dodson 2021-07-15 18:55:14 UTC
This really should've been a 4.8.0 blocker, but that intent was never conveyed to the assignees. I'm marking this as a blocker for 4.9.0 and would request that we backport this to 4.8 as soon as reasonable. We really need to get rid of the negative signal we generate during upgrades by operators going Degraded or Unavailable during normal operations.

Comment 4 Kirsten Garrison 2021-07-15 19:35:04 UTC
For my own ref:

https://search.ci.openshift.org/?search=alert+ClusterOperatorDown+fired.*machine-config&maxAge=168h&context=1&type=bug%2Bjunit&name=upgrade&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Across 7 days:
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 13 runs, 62% failed, 25% of failures match = 15% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade (all) - 27 runs, 56% failed, 33% of failures match = 19% impact

Across 14 days:
periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade (all) - 64 runs, 64% failed, 12% of failures match = 8% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 23 runs, 61% failed, 21% of failures match = 13% impact

Comment 5 Kirsten Garrison 2021-07-30 21:12:03 UTC
*** Bug 1943289 has been marked as a duplicate of this bug. ***

Comment 6 Kirsten Garrison 2021-08-10 16:57:27 UTC
7 days:
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 14 runs, 43% failed, 17% of failures match = 7% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade (all) - 25 runs, 72% failed, 6% of failures match = 4% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 408 runs, 86% failed, 1% of failures match = 1% impact

14 days:
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 36 runs, 47% failed, 24% of failures match = 11% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade (all) - 59 runs, 66% failed, 10% of failures match = 7% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 803 runs, 78% failed, 1% of failures match = 1% impact

Comment 7 Scott Dodson 2021-08-17 18:23:16 UTC
Discussed with Kirsten. A lot of this is iterative: there were likely other contributing factors at the time this bug was filed that led to the level of failure originally observed. The more recent numbers show improvement, both from general platform stability (comparing the 4.8 job's failure rate then and now) and from improvements Kirsten will land before the 4.9 code freeze. As long as those continue to show improvement we should consider this VERIFIED; a Jira will be opened to track additional improvements in 4.10 and efforts to backport the whole set to 4.8. In other words, for 4.9 we may still see some MCO unavailability.
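
Illustrative sketch only (not the merged change from the linked PR 2728, whose title is "operator: add event on degraded and unavailable status"): the idea of recording a Kubernetes Event on these transitions, so the unavailability window is visible in the event stream (oc get events), could look roughly like the following. The event reasons and messages are invented; the fake recorder just keeps the sketch self-contained and runnable without a cluster.

// Illustrative only: emit a Kubernetes Event whenever the operator reports
// Degraded or Available=False; reasons and messages here are invented.
package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/client-go/tools/record"
)

// emitStatusEvent records a Warning event against obj when the operator is in
// a bad state, and a Normal event when it is healthy.
func emitStatusEvent(rec record.EventRecorder, obj runtime.Object, degraded, available bool) {
    switch {
    case degraded:
        rec.Event(obj, corev1.EventTypeWarning, "OperatorDegraded", "machine-config operator is Degraded")
    case !available:
        rec.Event(obj, corev1.EventTypeWarning, "OperatorUnavailable", "machine-config operator is not Available")
    default:
        rec.Event(obj, corev1.EventTypeNormal, "OperatorAvailable", "machine-config operator is Available and not Degraded")
    }
}

func main() {
    // A fake recorder keeps the sketch self-contained; a real operator would
    // wire an EventBroadcaster to the API server instead.
    rec := record.NewFakeRecorder(10)
    ref := &corev1.ObjectReference{Kind: "ClusterOperator", Name: "machine-config"}
    emitStatusEvent(rec, ref, false, false)
    fmt.Println(<-rec.Events) // e.g. "Warning OperatorUnavailable machine-config operator is not Available"
}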

