Bug 1880759 - Lost etcd quorum if removed member comes back
Summary: Lost etcd quorum if removed member comes back
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Documentation
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Andrea Hoffer
QA Contact: ge liu
Docs Contact: Vikram Goyal
URL:
Whiteboard: UpcomingSprint LifecycleReset
Duplicates: 1892413
Depends On:
Blocks:
 
Reported: 2020-09-19 18:02 UTC by Michael Gugino
Modified: 2021-05-24 15:46 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-24 15:46:05 UTC
Target Upstream Version:
Embargoed:



Description Michael Gugino 2020-09-19 18:02:07 UTC
Description of problem:

Following the procedure here: https://docs.openshift.com/container-platform/4.5/backup_and_restore/replacing-unhealthy-etcd-member.html#restore-replace-stopped-etcd-member_replacing-unhealthy-etcd-member

I wanted to remove an etcd member from the cluster to test out some things.  I followed the command to remove the etcd member.  That completed as expected.  A short time later, that etcd member was re-added without user intervention.  (Corresponding user scenario: etcd is having problems due to overutilization and I need to replace this master.  etcd might be crashing at the moment, or might not be, but will be restarted in a few minutes.)
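
For reference, the removal step in that documented procedure boils down to roughly the following (pod name, node name, and member ID are placeholders; the linked doc has the full steps):

  # from a workstation with cluster-admin access:
  $ oc rsh -n openshift-etcd etcd-<healthy-master-node>
  # then, inside the etcd pod:
  $ etcdctl member list -w table       # note the ID of the member being removed
  $ etcdctl member remove <member-id>  # remove it from the cluster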

I created a new master machine, and it joined the etcd cluster automatically (as I found out later).

Afterwards, I deleted the original master machine (the one whose etcd member I had removed) via the machine-api.  That worked as expected.

Later, I deleted the new master machine via the machine-api (corresponding user scenario: I attempted to use a bigger instance, but I decided to go even bigger).

Unbeknownst to me at this point, etcd had 4 members, 3 of which were healthy.  When I deleted the newest master, only 2 of the 4 members were healthy, and quorum was lost (a 4-member cluster needs 3 healthy members for quorum).

Version-Release number of selected component (if applicable):
4.5

How reproducible:
TBD

Steps to Reproduce:
1. Remove etcd member from healthy master following product docs.  This simulates a member that might have been unhealthy and then became healthy (e.g., a temporary network condition or some other issue)
2. Verify that etcd member is re-added to quorum even though user removed it.
3. Join new master to cluster
4. Verify there are now 4/4 etcd members (see the verification sketch after this list)
5. Delete original master where we removed etcd member via machine-api
6. Verify there are 3/4 healthy etcd members
7. Pretend I don't have enough quota to create an additional master before I delete the one I just created
8. Forget to remove the etcd member, or don't forget (TBD).
9. Delete the new master machine via the machine-api
10. API becomes unavailable due to quorum loss.
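
To make steps 4 and 6 concrete, one way to check the member count and health at each point is (pod name is a placeholder):

  $ oc rsh -n openshift-etcd etcd-<any-healthy-master-node>
  # inside the pod:
  $ etcdctl member list -w table        # how many members the cluster thinks it has
  $ etcdctl endpoint health --cluster   # how many of them are actually healthy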

Actual results:
API unavailable due to quorum loss.

Expected results:
1. The cluster should never have 4 etcd members.
2. When an admin removes an etcd member via the established procedure, it should never add itself back.
3. etcd-quorum-guard is only useful if it has the same number of desired replicas as etcd members (see the check sketched after this list).
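
A rough way to compare those two numbers on a live cluster (the etcd-quorum-guard deployment's namespace differs between releases, so it is located with a cluster-wide search here):

  $ oc get deployment --all-namespaces | grep etcd-quorum-guard   # desired/ready replica count
  # compare with the member count from 'etcdctl member list' inside an etcd pod, as shown above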

Additional info:
Now, one could argue that there are alerts around this kind of thing; I'm unsure what alerts may have been firing at the time, as I did this pretty quickly from the terminal.  While alerts are certainly useful, expecting users to check current alerts before running a particular set of commands is not great (I certainly failed to do so).  In a number of scenarios, if I'm deleting/adding master machines, there are probably alerts going off the entire time, so they're not likely to have a high signal-to-noise ratio during this process.

Comment 2 Sam Batschelet 2020-10-02 19:10:33 UTC
I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 3 Michal Fojtik 2020-10-21 19:12:07 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 6 Michal Fojtik 2020-11-20 20:12:08 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 7 Michal Fojtik 2021-01-22 16:20:40 UTC
The LifecycleStale keyword was removed because the needinfo? flag was reset.
The bug assignee was notified.

Comment 8 Sam Batschelet 2021-01-23 12:14:28 UTC
memberFinalizer

In order to manage scaling correctly, we need a way to conclude that the member has been removed from the cluster. We are able to read the WAL logs during init and conclude whether we (our member ID) have been removed from the cluster. If we observe this condition, we need to remove the old etcd state. We are not going to be able to get to this in the 4.7 time frame, but it should be a prerequisite for the 4.9 scaling epics.

Another option is checking the member list, but we still must ensure that the cluster ID of the etcd cluster whose membership we are querying is as expected. Otherwise, we could remove etcd state based on observations of the wrong cluster.
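
As a rough sketch of that second option: the member list response carries the cluster ID in its header, so a check could compare it against the cluster ID this member expects before trusting the membership it reports (endpoint and certificate paths are placeholders):

  $ etcdctl --endpoints=https://<local-member>:2379 \
      --cacert=<ca.crt> --cert=<client.crt> --key=<client.key> \
      member list -w json
  # The JSON output includes "header":{"cluster_id":...} alongside the member entries;
  # only conclude "our member ID is gone, remove local etcd state" if that cluster_id
  # matches the cluster this member believes it belongs to.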

Comment 9 Sam Batschelet 2021-01-23 15:57:55 UTC
*** Bug 1892413 has been marked as a duplicate of this bug. ***

Comment 10 Michal Fojtik 2021-02-22 16:48:55 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 11 Michal Fojtik 2021-05-14 17:16:51 UTC
The LifecycleStale keyword was removed because the needinfo? flag was reset.
The bug assignee was notified.

Comment 14 ge liu 2021-05-18 03:16:37 UTC
Andrea, LGTM, thanks. I can't comment on GitHub because there has been a problem with my two-factor authentication in recent days; I only have review rights.

Comment 15 Andrea Hoffer 2021-05-18 15:04:21 UTC
No worries, thanks @Ge Liu!

Comment 16 Suresh Kolichala 2021-05-18 23:41:43 UTC
Created an RFE for a future enhancement to etcd-operator to avoid re-adding a recently deleted member.

https://issues.redhat.com/browse/RFE-1870

Comment 17 Andrea Hoffer 2021-05-20 15:49:08 UTC
PR has been merged; moving to RELEASE_PENDING.

