Bug 2071114

Summary: divergent etcd revisions go undetected
Product: OpenShift Container Platform Reporter: W. Trevor King <wking>
Component: Etcd Assignee: W. Trevor King <wking>
Status: CLOSED DEFERRED QA Contact: ge liu <geliu>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 4.9 CC: agogala, Alexandros.Phinikarides, alray, asalvati, ccornejo, david.karlsen, deads, dpathak, dwest, geliu, Holger.Wolf, iheim, lmohanty, mifiedle, moddi, musman, nsu, oarribas, palonsor, pawankum, qguo, rdiazgav, rh-container, sbelmasg, sburke, seunlee, shzhou, skrenger, tjungblu, travi, wking, wlewis, yuokada
Clone Of: 2068601 Environment:
Last Closed: 2022-09-08 14:07:30 UTC Type: ---

Description W. Trevor King 2022-04-01 20:55:21 UTC
This series tracks alerting, which complements bug 2068601's init-time corruption detection but differs from it in three ways:

* The initial corruption check keeps the corrupted member from coming up, preventing split-braining. Alerting just lets you know if something bad is happening; it doesn't block anything.
* The current initial corruption check seems to ignore large divergence [1]. Alerting will complain about any divergence, regardless of size.
* Adding a new alert PrometheusRule to existing clusters is an easier mitigating patch than adjusting the etcd-launching configuration.
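
To illustrate the "any divergence, regardless of size" behavior described above, here is a minimal sketch in Python. It is not the actual alert rule: the function names and the member-to-revision mapping are hypothetical, and it assumes the revisions were sampled from all members at the same moment. A real alert would probably also need to tolerate brief replication lag before firing.

```python
from typing import Dict


def revision_spread(revisions: Dict[str, int]) -> int:
    """Return the gap between the highest and lowest reported revision.

    `revisions` maps a (hypothetical) member name to the revision that
    member reported. An empty mapping is treated as "no divergence".
    """
    if not revisions:
        return 0
    values = revisions.values()
    return max(values) - min(values)


def should_alert(revisions: Dict[str, int]) -> bool:
    """Complain about any divergence at all, with no upper or lower bound."""
    return revision_spread(revisions) > 0


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    sample = {"etcd-0": 1200, "etcd-1": 1200, "etcd-2": 1185}
    print(revision_spread(sample), should_alert(sample))
```

The point of the sketch is the lack of a threshold: a spread of one revision and a spread of a million are both reported, and nothing is blocked as a result.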

[1]: https://github.com/etcd-io/etcd/issues/13766#issuecomment-1083033017