Bug 1930224

Summary:	The etcdInsufficientMembers alert fires incorrectly when any instance is down and not when quorum is lost
Product:	OpenShift Container Platform	Reporter:	Clayton Coleman <ccoleman>
Component:	Etcd	Assignee:	Clayton Coleman <ccoleman>
Status:	CLOSED DUPLICATE	QA Contact:	ge liu <geliu>
Severity:	urgent	Docs Contact:
Priority:	urgent
Version:	4.7	CC:	geliu, sbatsche, travi, wking
Target Milestone:	---
Target Release:	4.7.z
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:	1929944
Clones:	1930226 1937907		Environment:
Last Closed:	2021-03-11 16:56:38 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1929944
Bug Blocks:	1930226

Description Clayton Coleman 2021-02-18 14:28:23 UTC

+++ This bug was initially created as a clone of Bug #1929944 +++

etcdInsufficientMembers is supposed to fire when quorum is potentially lost. However, upstream expects etcd to be configured so that the instance label is unique, while OpenShift runs etcd in pods, so both the instance and pod labels are unique. Because of this difference, the alert fires spuriously during upgrades, whenever any single instance is down.

The alert, once corrected, should fire only if we have reason to believe quorum is lost, i.e. a majority of instances are down (at least (N+1)/2 instances must be up), where N is inferred from the expected number of scrape targets. The alert should also have a better description and suggest possible areas to investigate, namely down control plane nodes or broken networking.
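
To make the intended condition concrete, here is a minimal sketch of the threshold check in Python. It is not the alert rule itself: the function name quorum_at_risk and its parameters are hypothetical, and it assumes the count of up instances and the expected count N have already been obtained from the scrape targets.

```python
def quorum_at_risk(up_instances: int, expected_instances: int) -> bool:
    """Return True when fewer than (N+1)/2 of the N expected instances are up.

    Hypothetical helper for illustration only; expected_instances stands in
    for N, the expected number of scrape targets.
    """
    if expected_instances <= 0:
        raise ValueError("expected_instances must be positive")
    if up_instances < 0:
        raise ValueError("up_instances must not be negative")
    # Real-valued comparison, so even member counts are handled too:
    # with N=4 the threshold is 2.5, meaning 3 instances must be up.
    return up_instances < (expected_instances + 1) / 2


if __name__ == "__main__":
    # Three-member cluster: one instance down (e.g. during an upgrade)
    # should not trigger the condition; two down should.
    for up in (3, 2, 1, 0):
        print(f"up={up} of 3 -> at risk: {quorum_at_risk(up, 3)}")
```

With three expected members, a single instance being down leaves 2 up, which meets the threshold of (3+1)/2 = 2, so the condition stays false. It becomes true only once two or more instances are down.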

A runbook addition will come later.

The fix will be backported to 4.7 and 4.6.

Comment 1 Sam Batschelet 2021-03-11 16:56:38 UTC


*** This bug has been marked as a duplicate of bug 1930876 ***