Bug 1952268

Summary:	etcd operator should not set Degraded=True EtcdMembersDegraded on healthy machine-config node reboots
Product:	OpenShift Container Platform	Reporter:	W. Trevor King <wking>
Component:	Etcd	Assignee:	Sam Batschelet <sbatsche>
Status:	CLOSED ERRATA	QA Contact:	ge liu <geliu>
Severity:	high	Docs Contact:
Priority:	high
Version:	4.8	CC:	kgarriso, rphillips, sbatsche, wlewis
Target Milestone:	---	Keywords:	Upgrades
Target Release:	4.8.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1976988		Environment:
Last Closed:	2021-07-27 23:02:36 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1976988

Description W. Trevor King 2021-04-21 22:01:59 UTC

From CI runs like [1]:

  : [bz-Etcd] clusteroperator/etcd should not change condition/Degraded
    Run #0: Failed	0s
    3 unexpected clusteroperator state transitions during e2e test run 

    Apr 21 14:28:20.460 - 112s  E clusteroperator/etcd condition/Degraded status/True reason/EtcdMembersDegraded: 2 of 3 members are available, ip-10-0-243-65.us-east-2.compute.internal is unhealthy
    Apr 21 14:34:21.216 - 68s   E clusteroperator/etcd condition/Degraded status/True reason/EtcdMembersDegraded: 2 of 3 members are available, ip-10-0-156-170.us-east-2.compute.internal is unhealthy
    Apr 21 14:39:43.199 - 45s   E clusteroperator/etcd condition/Degraded status/True reason/EtcdMembersDegraded: 2 of 3 members are available, ip-10-0-177-210.us-east-2.compute.internal is unhealthy

The interval chart at the bottom of [1] shows that the Degraded=True intervals occur as the machine-config operator rolls each of the control-plane nodes.
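As a rough illustration of how the three transitions quoted above line up with one reboot per control-plane node, here is a minimal Python sketch that pulls the start time, duration, and node name out of each line. The regular expression and the names `LINE_RE` and `parse_degraded_intervals` are assumptions made for this example; the field layout is inferred only from the three lines in this report, not from a documented monitor format.

```python
import re

# Layout inferred from the three lines quoted in this report (an assumption,
# not a documented format): "<timestamp> - <N>s  E clusteroperator/etcd ...".
LINE_RE = re.compile(
    r"^(?P<start>\w{3} \d{1,2} \d{2}:\d{2}:\d{2}\.\d{3}) - (?P<seconds>\d+)s\s+"
    r"E clusteroperator/etcd condition/Degraded status/True "
    r"reason/(?P<reason>\w+): .*?, (?P<node>\S+) is unhealthy$"
)

SAMPLE = """
Apr 21 14:28:20.460 - 112s  E clusteroperator/etcd condition/Degraded status/True reason/EtcdMembersDegraded: 2 of 3 members are available, ip-10-0-243-65.us-east-2.compute.internal is unhealthy
Apr 21 14:34:21.216 - 68s   E clusteroperator/etcd condition/Degraded status/True reason/EtcdMembersDegraded: 2 of 3 members are available, ip-10-0-156-170.us-east-2.compute.internal is unhealthy
Apr 21 14:39:43.199 - 45s   E clusteroperator/etcd condition/Degraded status/True reason/EtcdMembersDegraded: 2 of 3 members are available, ip-10-0-177-210.us-east-2.compute.internal is unhealthy
"""


def parse_degraded_intervals(text):
    """Return (start, seconds, node) for every line that matches LINE_RE."""
    intervals = []
    for raw in text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match:
            intervals.append(
                (match["start"], int(match["seconds"]), match["node"])
            )
    return intervals


if __name__ == "__main__":
    for start, seconds, node in parse_degraded_intervals(SAMPLE):
        print(f"{start}  {seconds:>4}s  {node}")
```

Each interval names a different node, which is consistent with the operator reporting Degraded=True once per rolling reboot rather than for a persistent member failure.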

This failure is very common in update jobs that end on 4.8 or later:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&name=^periodic.*upgrade&type=junit&search=clusteroperator/etcd+should+not+change+condition/Degraded' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 15 runs, 93% failed, 93% of failures match = 87% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-compact-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 17 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 16 runs, 75% failed, 133% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-openstack-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-vsphere-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 15 runs, 100% failed, 87% of failures match = 87% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 9 runs, 89% failed, 50% of failures match = 44% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 7 runs, 57% failed, 75% of failures match = 43% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 7 runs, 86% failed, 100% of failures match = 86% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 9 runs, 100% failed, 89% of failures match = 89% impact
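For reading the summary lines above: the impact figure appears to be roughly the failed percentage multiplied by the match percentage, that is, the share of all runs in which the search string was found. This is an assumption based on the numbers shown here, not on the search tool's documentation, and the helper name `approx_impact` is made up for this sketch. Because the displayed percentages are already rounded, the product can differ from the reported impact by a point or so (for example, 93% x 93% gives about 86.5% where the report shows 87%).

```python
def approx_impact(failed_pct, match_pct):
    """Approximate impact as the share of all runs that match the search.

    failed_pct: percent of runs that failed (as displayed, already rounded).
    match_pct:  percent of failures that match (as displayed, already rounded).
    Assumption: impact is close to failed_pct * match_pct / 100.
    """
    return failed_pct * match_pct / 100


if __name__ == "__main__":
    # "16 runs, 75% failed, 133% of failures match = 100% impact"
    print(approx_impact(75, 133))
    # "7 runs, 57% failed, 75% of failures match = 43% impact"
    print(approx_impact(57, 75))
```

Under the same assumption, a match rate above 100% (such as the 133% line) would mean the test case also matched in runs that were not counted as failed overall.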

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1384851693719523328

Comment 3 Suresh Kolichala 2021-05-07 16:58:55 UTC

*** Bug 1946784 has been marked as a duplicate of this bug. ***

Comment 8 errata-xmlrpc 2021-07-27 23:02:36 UTC

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

Comment 9 Kirsten Garrison 2021-07-30 22:30:56 UTC

I'm looking into another bug, but I am still seeing this failure hit frequently:
https://search.ci.openshift.org/?search=clusteroperator%2Fetcd+should+not+change+condition%2FDegraded&maxAge=336h&context=1&type=junit&name=%5Eperiodic.*upgrade&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

periodic-ci-openshift-release-master-nightly-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade (all) - 73 runs, 85% failed, 115% of failures match = 97% impact

Comment 10 Red Hat Bugzilla 2023-09-15 01:05:28 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.