Bug 1952268

Summary: etcd operator should not set Degraded=True EtcdMembersDegraded on healthy machine-config node reboots
Product: OpenShift Container Platform
Reporter: W. Trevor King <wking>
Component: Etcd
Assignee: Sam Batschelet <sbatsche>
Status: CLOSED ERRATA
QA Contact: ge liu <geliu>
Severity: high
Docs Contact:
Priority: high
Version: 4.8
CC: kgarriso, rphillips, sbatsche, wlewis
Target Milestone: ---
Keywords: Upgrades
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Last Closed: 2021-07-27 23:02:36 UTC
Type: Bug
Bug Blocks: 1976988

Description W. Trevor King 2021-04-21 22:01:59 UTC
From CI runs like [1]:

  : [bz-Etcd] clusteroperator/etcd should not change condition/Degraded
    Run #0: Failed	0s
    3 unexpected clusteroperator state transitions during e2e test run 

    Apr 21 14:28:20.460 - 112s  E clusteroperator/etcd condition/Degraded status/True reason/EtcdMembersDegraded: 2 of 3 members are available, ip-10-0-243-65.us-east-2.compute.internal is unhealthy
    Apr 21 14:34:21.216 - 68s   E clusteroperator/etcd condition/Degraded status/True reason/EtcdMembersDegraded: 2 of 3 members are available, ip-10-0-156-170.us-east-2.compute.internal is unhealthy
    Apr 21 14:39:43.199 - 45s   E clusteroperator/etcd condition/Degraded status/True reason/EtcdMembersDegraded: 2 of 3 members are available, ip-10-0-177-210.us-east-2.compute.internal is unhealthy

The interval chart at the bottom of [1] shows that each Degraded=True interval coincides with the machine-config operator rolling one of the control-plane nodes.
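
For anyone who wants to pull these intervals out of a job log, here is a minimal Python sketch. The regular expression is only inferred from the three lines quoted above and is an assumption, not a documented format; `EVENT_RE`, `NODE_RE`, and `parse_events` are hypothetical names.

```python
import re

# Pattern inferred from the three event lines quoted in this bug (an assumption,
# not a documented format): "<start> - <N>s  <level> clusteroperator/<name> ...".
EVENT_RE = re.compile(
    r"(?P<start>[A-Z][a-z]{2} \d{1,2} \d{2}:\d{2}:\d{2}\.\d{3}) - (?P<seconds>\d+)s\s+"
    r"(?P<level>[A-Z]) clusteroperator/(?P<operator>\S+) "
    r"condition/(?P<condition>\S+) status/(?P<status>\S+) "
    r"reason/(?P<reason>[^:\s]+): (?P<message>.*)"
)
NODE_RE = re.compile(r"(?P<node>\S+) is unhealthy")

# The three lines from the description, copied verbatim.
LOG = """\
    Apr 21 14:28:20.460 - 112s  E clusteroperator/etcd condition/Degraded status/True reason/EtcdMembersDegraded: 2 of 3 members are available, ip-10-0-243-65.us-east-2.compute.internal is unhealthy
    Apr 21 14:34:21.216 - 68s   E clusteroperator/etcd condition/Degraded status/True reason/EtcdMembersDegraded: 2 of 3 members are available, ip-10-0-156-170.us-east-2.compute.internal is unhealthy
    Apr 21 14:39:43.199 - 45s   E clusteroperator/etcd condition/Degraded status/True reason/EtcdMembersDegraded: 2 of 3 members are available, ip-10-0-177-210.us-east-2.compute.internal is unhealthy
"""


def parse_events(text):
    """Return one dict per matching event line; lines that do not match are skipped."""
    events = []
    for line in text.splitlines():
        match = EVENT_RE.search(line)
        if match is None:
            continue
        event = match.groupdict()
        event["seconds"] = int(event["seconds"])
        node = NODE_RE.search(event["message"])
        event["node"] = node["node"] if node else None
        events.append(event)
    return events


events = parse_events(LOG)
total_degraded = sum(e["seconds"] for e in events)
for e in events:
    print(f'{e["start"]}  {e["seconds"]:>4}s  {e["node"]}')
print(f"total Degraded=True time: {total_degraded}s across {len(events)} intervals")
```

Each interval names a different control-plane node, which is consistent with one Degraded=True blip per node reboot.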

This failure is very common in update jobs that end in 4.8 or later:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&name=^periodic.*upgrade&type=junit&search=clusteroperator/etcd+should+not+change+condition/Degraded' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 15 runs, 93% failed, 93% of failures match = 87% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-compact-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 17 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 16 runs, 75% failed, 133% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-openstack-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-vsphere-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 15 runs, 100% failed, 87% of failures match = 87% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 9 runs, 89% failed, 50% of failures match = 44% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 7 runs, 57% failed, 75% of failures match = 43% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 7 runs, 86% failed, 100% of failures match = 86% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 9 runs, 100% failed, 89% of failures match = 89% impact
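
The summary lines above are easier to compare once parsed. The following is a rough Python sketch, assuming the line layout seen in this output; `SUMMARY_RE`, `JobImpact`, and `parse_summary` are hypothetical names, and the sample is four lines copied from the list above. Jobs with a single run can only report 0% or 100%, so the sketch ranks only jobs with at least five runs (the threshold is arbitrary).

```python
import re
from typing import NamedTuple, Optional


class JobImpact(NamedTuple):
    job: str
    runs: int
    failed_pct: int
    match_pct: int
    impact_pct: int


# Layout inferred from the output quoted in this bug (an assumption).
SUMMARY_RE = re.compile(
    r"^(?P<job>\S+) \(all\) - (?P<runs>\d+) runs, (?P<failed>\d+)% failed, "
    r"(?P<match>\d+)% of failures match = (?P<impact>\d+)% impact$"
)


def parse_summary(line: str) -> Optional[JobImpact]:
    """Parse one summary line, or return None if it does not match."""
    m = SUMMARY_RE.match(line.strip())
    if m is None:
        return None
    return JobImpact(
        m["job"], int(m["runs"]), int(m["failed"]), int(m["match"]), int(m["impact"])
    )


SAMPLE = """\
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 15 runs, 93% failed, 93% of failures match = 87% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 16 runs, 75% failed, 133% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 9 runs, 89% failed, 50% of failures match = 44% impact
"""

jobs = [j for j in (parse_summary(line) for line in SAMPLE.splitlines()) if j]

# Rank the jobs that have enough runs for the percentage to mean something.
MIN_RUNS = 5
ranked = sorted(
    (j for j in jobs if j.runs >= MIN_RUNS), key=lambda j: j.impact_pct, reverse=True
)
for j in ranked:
    print(f"{j.impact_pct:>3}% impact over {j.runs:>2} runs  {j.job}")
```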

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1384851693719523328

Comment 3 Suresh Kolichala 2021-05-07 16:58:55 UTC
*** Bug 1946784 has been marked as a duplicate of this bug. ***

Comment 8 errata-xmlrpc 2021-07-27 23:02:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

Comment 9 Kirsten Garrison 2021-07-30 22:30:56 UTC
I'm looking into another bug but am still seeing this get hit a lot:
https://search.ci.openshift.org/?search=clusteroperator%2Fetcd+should+not+change+condition%2FDegraded&maxAge=336h&context=1&type=junit&name=%5Eperiodic.*upgrade&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

periodic-ci-openshift-release-master-nightly-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade (all) - 73 runs, 85% failed, 115% of failures match = 97% impact
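
To rerun the same search with a different window or job filter, the link above can be rebuilt from its query parameters. This is a small Python sketch using only the parameter names and values visible in that link; what each parameter does on the server side is inferred from its name, not taken from documentation.

```python
from urllib.parse import urlencode

BASE_URL = "https://search.ci.openshift.org/"

# Names and values copied from the link above; adjust maxAge or name as needed.
params = {
    "search": "clusteroperator/etcd should not change condition/Degraded",
    "maxAge": "336h",
    "context": "1",
    "type": "junit",
    "name": "^periodic.*upgrade",
    "excludeName": "",
    "maxMatches": "5",
    "maxBytes": "20971520",
    "groupBy": "job",
}

# safe="*" keeps the "*" in the job-name regex unescaped, as in the original link.
url = BASE_URL + "?" + urlencode(params, safe="*")
print(url)
```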

Comment 10 Red Hat Bugzilla 2023-09-15 01:05:28 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days