During an upgrade of a cluster in the CI build farm, we saw a sequence of alerts and failure messages from clusterversion:

oc --context build01 adm upgrade --allow-explicit-upgrade --to-image registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2020-04-13-190424 --force=true

Eventually the upgrade completed successfully (which is nice), but those alerts and messages are frightening. I would like to create a bug for each of them so I can feel better about the next upgrade.

https://coreos.slack.com/archives/CHY2E1BL4/p1587060345456100

[FIRING:1] etcdMembersDown etcd (openshift-monitoring/k8s critical) etcd cluster "etcd": members are down (1).

must-gather after upgrade: http://file.rdu.redhat.com/~hongkliu/test_result/upgrade/upgrade0416/aaa/
When the upgrade rolls through the control plane, each master node is rebooted. During that reboot cycle the conditions for etcdMembersDown can be triggered. I suppose we could explore silencing the alert during the upgrade, but this is critical information when the outage is not expected.
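For anyone who needs a workaround in the meantime, a planned upgrade window can be covered by POSTing a silence to the Alertmanager v2 API (`/api/v2/silences`). This is only a sketch of the request payload; the timestamps, author, and comment are placeholders, not values from this cluster:

```json
{
  "matchers": [
    { "name": "alertname", "value": "etcdMembersDown", "isRegex": false }
  ],
  "startsAt": "2020-04-16T18:00:00Z",
  "endsAt": "2020-04-16T20:00:00Z",
  "createdBy": "upgrade-automation",
  "comment": "Planned control-plane reboots during cluster upgrade"
}
```

The trade-off is the one noted above: a silence hides expected firings during the window, but it would also hide a real, unexpected member outage that happens at the same time.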
> During that reboot cycle the conditions for etcdMembersDown can be triggered.

I don't think anyone is arguing for a blanket removal, but I expect the alert can be softened to tolerate outages that cover most node reboots before the 'critical' sirens start wailing. I know there's no hard cap on reboot time, but tuning the alert so that it does not fire in 99% of reboots across supported providers, or some such target, seems reasonable. The machine-config folks should be able to provide metrics on how long node reboots generally take to inform the tuning.
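One concrete tuning knob is the rule's `for:` duration, which controls how long the condition must hold before the alert fires. A hypothetical sketch of a softened rule follows; the expression, threshold, and 15m window are illustrative assumptions, not the rule actually shipped with the etcd operator:

```yaml
groups:
- name: etcd
  rules:
  - alert: etcdMembersDown
    # Illustrative expression: fire when any etcd member is reported down.
    expr: max without (endpoint) (sum without (instance) (up{job=~".*etcd.*"} == bool 0)) > 0
    # Raising "for" above the typical master reboot time (informed by
    # machine-config metrics) would keep routine upgrade reboots from
    # paging as critical, while still catching sustained member outages.
    for: 15m
    labels:
      severity: critical
    annotations:
      message: 'etcd cluster "{{ $labels.job }}": members are down ({{ $value }}).'
```

The cost of a longer `for:` window is slower notification of a genuine sustained outage, so the value would need to balance reboot-time percentiles against acceptable detection latency.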
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.
We want to improve the alerting here, but we can't do it in 4.5. Moving to 4.6.
This bug hasn't had any activity in the 7 days since it was marked as LifecycleStale, so we are closing it as WONTFIX. If you consider this bug still valuable, please reopen it or create a new bug.
Hongkai beat me to reopening this, but didn't provide motivation. My motivation was going to be "why should you expect weekly bumps on a bug that has UpcomingSprint and which has been punted to 4.6?".
This bug is about a critical-level alert firing, telling us that etcd is down, in a situation that is entirely self-resolving and not actionable. @mfojtik it should not really need a justification to remain open, IMO.
We just saw it during a 4.4.7 --> 4.4.8 upgrade.
This is being tracked in https://issues.redhat.com/browse/ETCD-95 now.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days