etcdInsufficientMembers is supposed to fire when quorum is potentially lost. However, upstream expects etcd to be configured so that the instance label alone is unique, whereas OpenShift runs etcd in pods, so both the instance and pod labels are unique. Because of this difference, the alert fires spuriously during upgrades.
The alert, once corrected, should fire only if we have reason to believe quorum is lost, i.e. a majority of instances is no longer up (at least (N+1)/2 instances must be up), where N is inferred from the expected number of scrape targets. The alert should also have a better description that suggests possible areas to investigate, namely down control plane nodes or broken networking.
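To make the threshold concrete, here is a minimal sketch of the quorum check described above, written in Python rather than as the actual alert expression. The function name `quorum_lost` and the member names are hypothetical, and treating an empty target set as "no alert" is an assumption, not something stated in this report.

```python
def quorum_lost(target_up: dict[str, bool]) -> bool:
    """Return True when fewer than (N+1)/2 of the expected targets are up.

    target_up maps each expected etcd scrape target (hypothetical names)
    to whether its last scrape succeeded.
    """
    expected = len(target_up)  # N, inferred from the expected scrape targets
    if expected == 0:
        # Assumption: with no expected targets there is nothing to evaluate,
        # so do not report quorum loss.
        return False
    healthy = sum(1 for is_up in target_up.values() if is_up)
    # Real-valued comparison, so even member counts are handled too:
    # N=4 gives a threshold of 2.5, meaning 3 members must be up.
    return healthy < (expected + 1) / 2


# One member restarting during a rolling upgrade: quorum is still held.
upgrade = {"etcd-0": True, "etcd-1": False, "etcd-2": True}
# Two of three members down: quorum is lost.
outage = {"etcd-0": True, "etcd-1": False, "etcd-2": False}

print(quorum_lost(upgrade), quorum_lost(outage))
```

With three members, a single member being down during an upgrade should therefore not trigger the alert; only the loss of a second member should.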
A runbook addition will come later.
This will be backported to 4.7 and 4.6.
Need to make sure we account for this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1880759
Also, I suggest renaming this alert to EtcdQuorumLost to more clearly demonstrate the impact. InsufficientMembers doesn't have much context.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.