1937907 – The etcdInsufficientMembers alert fires incorrectly when any instance is down and not when quorum is lost

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Etcd
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	4.6.z
Assignee:	Sam Batschelet
QA Contact:	ge liu
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1930226
Depends On:	1930876
Blocks:

Reported:	2021-03-11 17:44 UTC by Sam Batschelet
Modified:	2024-10-01 17:40 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1930224
Environment:
Last Closed:	2021-03-30 17:03:16 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-monitoring-operator pull 1080	0	None	open	Bug 1937907: etcdInsufficientMembers is wrong when etcd is in a pod	2021-03-11 17:58:29 UTC
Red Hat Product Errata	RHBA-2021:0952	0	None	None	None	2021-03-30 17:03:24 UTC

Description Sam Batschelet 2021-03-11 17:44:42 UTC

+++ This bug was initially created as a clone of Bug #1930224 +++

+++ This bug was initially created as a clone of Bug #1929944 +++

etcdInsufficientMembers is supposed to fire when quorum is potentially lost. However, upstream expects etcd to be configured so that only the instance label is unique per member, whereas OpenShift runs etcd in pods, so both the instance and pod labels are unique. Because of this difference, the alert fires spuriously during upgrades, that is, too eagerly.
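
The label mismatch described above can be illustrated with a small sketch. It assumes, as a simplification not stated in this report, that the alert counts scrape targets per group after dropping the labels that distinguish individual members. The target list, addresses, pod names, and the helper `up_counts_by_group` are all hypothetical.

```python
from collections import defaultdict


def up_counts_by_group(targets, drop_labels):
    """Group targets by their labels minus drop_labels.

    Returns {group_key: (targets_up, targets_total)}.
    """
    counts = defaultdict(lambda: [0, 0])
    for labels, is_up in targets:
        # The group key is every label that was not dropped.
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop_labels))
        counts[key][0] += int(is_up)
        counts[key][1] += 1
    return {key: tuple(value) for key, value in counts.items()}


# Hypothetical three-member cluster with one member down (e.g. during an upgrade).
targets = [
    ({"job": "etcd", "instance": "10.0.0.1:2379", "pod": "etcd-master-0"}, True),
    ({"job": "etcd", "instance": "10.0.0.2:2379", "pod": "etcd-master-1"}, True),
    ({"job": "etcd", "instance": "10.0.0.3:2379", "pod": "etcd-master-2"}, False),
]

# Dropping only "instance" leaves "pod" in the key: three groups of one target each.
per_pod = up_counts_by_group(targets, {"instance"})

# Dropping both labels yields a single group covering the whole cluster.
per_cluster = up_counts_by_group(targets, {"instance", "pod"})
```

Under that assumption, `per_pod` contains a group with 0 of 1 targets up, so any per-group check treats a single down member as a total loss, while `per_cluster` sees 2 of 3 targets up.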

The alert, once corrected, should fire only if we have reason to believe quorum is lost, i.e., when a majority of instances are no longer up (at least (N+1)/2 of N instances must be up), where N is inferred from the expected number of scrape targets. The alert should also have a better description and suggest possible areas to investigate, namely down control plane nodes or broken networking.
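
As a worked illustration of the (N+1)/2 threshold, here is a minimal sketch. The function name `quorum_possibly_lost` and its input shape are hypothetical, and it is not the alert rule itself, only the arithmetic it is meant to encode.

```python
def quorum_possibly_lost(up_flags):
    """Return True when fewer than (N+1)/2 of the expected targets are up.

    up_flags holds one boolean per expected scrape target; N is its length.
    """
    n = len(up_flags)
    if n == 0:
        # No expected targets: this sketch has nothing to compare against.
        return False
    up = sum(1 for flag in up_flags if flag)
    # Multiply through by 2 to avoid fractional thresholds for even N
    # (N=4 gives a threshold of 2.5, so 3 targets must be up).
    return 2 * up < n + 1
```

With three expected targets, one member down gives 2 up against a threshold of 2, so the check stays quiet; two members down gives 1 up, and the check reports possible quorum loss.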

A runbook addition will come later.

The fix will be backported to 4.7 and 4.6.

--- Additional comment from Sam Batschelet on 2021-03-11 16:56:38 UTC ---

Comment 1 Sam Batschelet 2021-03-11 18:02:51 UTC

*** Bug 1930226 has been marked as a duplicate of this bug. ***

Comment 6 errata-xmlrpc 2021-03-30 17:03:16 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.23 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0952
