Bug 1952268 - etcd operator should not set Degraded=True EtcdMembersDegraded on healthy machine-config node reboots
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Duplicates: 1946784
Depends On:
Blocks: 1976988
 
Reported: 2021-04-21 22:01 UTC by W. Trevor King
Modified: 2021-07-30 22:30 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1976988 (view as bug list)
Environment:
Last Closed: 2021-07-27 23:02:36 UTC
Target Upstream Version:
Flags: skolicha: needinfo? (rphillips)




Links
  GitHub openshift/cluster-etcd-operator pull 579 (open): Bug 1952268: Increase inertia duration for the EtcdMembersDegraded condition (last updated 2021-04-28 03:15:14 UTC)
  Red Hat Product Errata RHSA-2021:2438 (last updated 2021-07-27 23:02:54 UTC)

Description W. Trevor King 2021-04-21 22:01:59 UTC
From CI runs like [1]:

  : [bz-Etcd] clusteroperator/etcd should not change condition/Degraded
    Run #0: Failed	0s
    3 unexpected clusteroperator state transitions during e2e test run 

    Apr 21 14:28:20.460 - 112s  E clusteroperator/etcd condition/Degraded status/True reason/EtcdMembersDegraded: 2 of 3 members are available, ip-10-0-243-65.us-east-2.compute.internal is unhealthy
    Apr 21 14:34:21.216 - 68s   E clusteroperator/etcd condition/Degraded status/True reason/EtcdMembersDegraded: 2 of 3 members are available, ip-10-0-156-170.us-east-2.compute.internal is unhealthy
    Apr 21 14:39:43.199 - 45s   E clusteroperator/etcd condition/Degraded status/True reason/EtcdMembersDegraded: 2 of 3 members are available, ip-10-0-177-210.us-east-2.compute.internal is unhealthy

The handy interval chart at the bottom of [1] shows the Degraded=True intervals occurring as the machine-config operator rolls each of the control-plane nodes.
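
The fix linked above (openshift/cluster-etcd-operator pull 579, "Increase inertia duration for the EtcdMembersDegraded condition") makes the operator wait longer before a transiently unhealthy members check is surfaced as Degraded=True on the ClusterOperator, so a single planned reboot no longer flips the condition. Below is a minimal Go sketch of that inertia idea; it is not the operator's actual code, the degradedGate type and its method are invented purely for illustration, and the 5 minute window is an arbitrary example rather than the value from the PR.

// degradedGate illustrates "inertia" for a Degraded-style condition: a
// failure is only reported once it has persisted for longer than a grace
// window, so short planned outages (like a machine-config node reboot)
// never reach the ClusterOperator status. Hypothetical code, not the
// cluster-etcd-operator implementation.
package main

import (
    "fmt"
    "time"
)

type degradedGate struct {
    inertia  time.Duration // how long a failure must persist before it is reported
    badSince time.Time     // start of the current failure streak; zero while healthy
}

// observe is called on every health check and returns true when
// Degraded should be reported as True.
func (g *degradedGate) observe(healthy bool, now time.Time) bool {
    if healthy {
        g.badSince = time.Time{} // healthy again: reset the streak
        return false
    }
    if g.badSince.IsZero() {
        g.badSince = now // failure streak starts now
    }
    return now.Sub(g.badSince) >= g.inertia
}

func main() {
    gate := &degradedGate{inertia: 5 * time.Minute}
    start := time.Now()

    // A ~2 minute blip (one node rebooting) never trips the gate.
    fmt.Println(gate.observe(false, start))                    // false: streak just started
    fmt.Println(gate.observe(false, start.Add(2*time.Minute))) // false: still inside the window
    fmt.Println(gate.observe(true, start.Add(3*time.Minute)))  // false: member came back, streak reset

    // A failure that outlives the window is reported.
    fmt.Println(gate.observe(false, start.Add(10*time.Minute))) // false: new streak
    fmt.Println(gate.observe(false, start.Add(16*time.Minute))) // true: unhealthy for >= 5 minutes
}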

Very common in update jobs targeting 4.8 and later:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&name=^periodic.*upgrade&type=junit&search=clusteroperator/etcd+should+not+change+condition/Degraded' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 15 runs, 93% failed, 93% of failures match = 87% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-compact-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 17 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 16 runs, 75% failed, 133% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-openstack-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-vsphere-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 15 runs, 100% failed, 87% of failures match = 87% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 9 runs, 89% failed, 50% of failures match = 44% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 7 runs, 57% failed, 75% of failures match = 43% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 7 runs, 86% failed, 100% of failures match = 86% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 9 runs, 100% failed, 89% of failures match = 89% impact

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1384851693719523328

Comment 3 Suresh Kolichala 2021-05-07 16:58:55 UTC
*** Bug 1946784 has been marked as a duplicate of this bug. ***

Comment 8 errata-xmlrpc 2021-07-27 23:02:36 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

Comment 9 Kirsten Garrison 2021-07-30 22:30:56 UTC
I'm looking into another bug but am still seeing this get hit a lot:
https://search.ci.openshift.org/?search=clusteroperator%2Fetcd+should+not+change+condition%2FDegraded&maxAge=336h&context=1&type=junit&name=%5Eperiodic.*upgrade&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

periodic-ci-openshift-release-master-nightly-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade (all) - 73 runs, 85% failed, 115% of failures match = 97% impact

