1929924 – etcdMembersDown should not fire during upgrade if the down pod has been on an unschedulable node for less than a reasonable time

Bug 1929924 - etcdMembersDown should not fire during upgrade if the down pod has been on an unschedulable node for less than a reasonable time [NEEDINFO]


Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Etcd
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Clayton Coleman
QA Contact:	ge liu
Docs Contact:
URL:
Whiteboard:	LifecycleStale
Depends On:
Blocks:

Reported:	2021-02-17 23:16 UTC by Clayton Coleman
Modified:	2021-06-23 14:35 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-06-23 14:35:12 UTC
Target Upstream Version:
Embargoed:
Flags:	mfojtik: needinfo?


Description Clayton Coleman 2021-02-17 23:16:38 UTC

etcdInsufficientMembers is a normal part of upgrade. It should not fire during an upgrade if the down member is on a master that has been marked unschedulable for drain for less than some threshold in the 15-30m range. Once that threshold is exceeded, it should fire.

So the alert should not fire while the affected node has been unschedulable for less than 25m, which is roughly the amount of time a slow bare metal node should take to drain the master, get rebooted, and upgrade (measured as about 10m total on cloud with a 2m reboot). If 25m is not enough in practice, we should consider extending the grace period.

The current alert query is

sum(up{job=~".*etcd.*"} == bool 1) without (instance) < ((count(up{job=~".*etcd.*"}) without (instance) + 1) / 2)

The alert for OpenShift should be:

(count of instances that are up) < ((instances that could be up and are not on nodes that have been continuously unschedulable for less than X m + 1) / 2)

The quorum lost alerts will handle the rest.
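
For illustration, the arithmetic of the current query and of the proposed condition can be sketched in Python. This is only a model of the comparison, not a PromQL rule: the Member type, its fields, and both function names are hypothetical, and it assumes the time each node has been continuously unschedulable is already known.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

GRACE_SECONDS = 25 * 60  # the 25m grace period suggested in the description


@dataclass
class Member:
    """Hypothetical view of one etcd member as seen by the alert."""
    up: bool
    # Seconds the member's node has been continuously unschedulable,
    # or None if the node is schedulable.
    unschedulable_for: Optional[float] = None


def current_alert_fires(members: Sequence[Member]) -> bool:
    # Mirrors the current query: (members up) < (all members + 1) / 2
    up = sum(1 for m in members if m.up)
    return up < (len(members) + 1) / 2


def proposed_alert_fires(members: Sequence[Member],
                         grace: float = GRACE_SECONDS) -> bool:
    # Members on nodes that have been unschedulable for less than the
    # grace period are left out of the right-hand side of the comparison.
    up = sum(1 for m in members if m.up)
    counted = sum(
        1 for m in members
        if m.unschedulable_for is None or m.unschedulable_for >= grace
    )
    return up < (counted + 1) / 2


if __name__ == "__main__":
    draining = [Member(True), Member(False, 300), Member(False, 300)]
    print(current_alert_fires(draining), proposed_alert_fires(draining))
```

In this model, members on recently cordoned nodes lower the threshold until the grace period runs out, after which the comparison is the same as the current one.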

Comment 1 Clayton Coleman 2021-02-17 23:30:15 UTC

Oops, I meant etcdMembersDown.  Insufficient members is usually "lost quorum".

Comment 2 Clayton Coleman 2021-02-18 01:59:51 UTC

Discovered https://bugzilla.redhat.com/show_bug.cgi?id=1929944 while investigating this.

Comment 3 Michal Fojtik 2021-03-20 02:20:21 UTC

This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.
