Bug 2071114

Summary:	divergent etcd revisions go undetected
Product:	OpenShift Container Platform	Reporter:	W. Trevor King <wking>
Component:	Etcd	Assignee:	W. Trevor King <wking>
Status:	CLOSED DEFERRED	QA Contact:	ge liu <geliu>
Severity:	urgent	Docs Contact:
Priority:	unspecified
Version:	4.9	CC:	agogala, Alexandros.Phinikarides, alray, asalvati, ccornejo, david.karlsen, deads, dpathak, dwest, geliu, Holger.Wolf, iheim, lmohanty, mifiedle, moddi, musman, nsu, oarribas, palonsor, pawankum, qguo, rdiazgav, rh-container, sbelmasg, sburke, seunlee, shzhou, skrenger, tjungblu, travi, wking, wlewis, yuokada
Hardware:	Unspecified
OS:	Unspecified
Clone Of:	2068601	Environment:
Last Closed:	2022-09-08 14:07:30 UTC	Type:	---

Description W. Trevor King 2022-04-01 20:55:21 UTC

This series tracks alerting on divergent etcd revisions. Alerting complements bug 2068601's init-time corruption detection, but the two differ:

* The initial corruption check keeps the corrupted member from coming up, preventing split-braining. Alerting just lets you know if something bad is happening; it doesn't block anything.
* The current initial corruption check seems to ignore large divergence [1]. Alerting will complain about any divergence, regardless of size.
* Adding a new alert PrometheusRule to existing clusters is an easier mitigating patch than adjusting the etcd-launching configuration.
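
To make the "any divergence, regardless of size" point concrete, here is a minimal sketch of the comparison such an alert could perform. It is an illustration only, not the rule shipped for this bug: the function names, the member names, and the `for_samples` parameter (a rough stand-in for a Prometheus `for:` duration, so that a member that is briefly behind does not fire the alert) are all assumptions.

```python
from typing import Mapping, Sequence


def revision_spread(revisions: Mapping[str, int]) -> int:
    """Gap between the highest and lowest revision reported by the members."""
    if not revisions:
        return 0
    values = revisions.values()
    return max(values) - min(values)


def should_alert(samples: Sequence[Mapping[str, int]], for_samples: int = 1) -> bool:
    """Fire when the last `for_samples` samples all show a non-zero spread.

    `samples` is a hypothetical time-ordered list of per-member revision
    readings. There is no upper bound on the spread, so a large divergence
    is reported just like a small one.
    """
    if for_samples < 1 or len(samples) < for_samples:
        return False
    recent = samples[-for_samples:]
    return all(revision_spread(sample) > 0 for sample in recent)
```

For example, `should_alert([{"etcd-0": 1, "etcd-1": 500000}, {"etcd-0": 2, "etcd-1": 500001}], for_samples=2)` returns `True` even though the gap is very large, while a spread that disappears in the latest sample does not fire.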

[1]: https://github.com/etcd-io/etcd/issues/13766#issuecomment-1083033017