Bug 1880759
Summary: | Lost etcd quorum if removed member comes back | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Michael Gugino <mgugino> |
Component: | Documentation | Assignee: | Andrea Hoffer <ahoffer> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | ge liu <geliu> |
Severity: | medium | Docs Contact: | Vikram Goyal <vigoyal> |
Priority: | medium | ||
Version: | 4.5 | CC: | aos-bugs, jokerman, kboumedh, sbatsche, skolicha, wlewis |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | UpcomingSprint LifecycleReset | ||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2021-05-24 15:46:05 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Michael Gugino
2020-09-19 18:02:07 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

The LifecycleStale keyword was removed because the needinfo? flag was reset. The bug assignee was notified.
memberFinalizer

In order to manage scaling correctly, we need a way to conclude that the member has been removed from the cluster. During init we are able to read the WAL and determine whether we (our member ID) have been removed from the cluster. If we observe this condition, we need to remove the old etcd state.

We are not going to be able to get to this in the 4.7 time frame, but it should be a prerequisite for the 4.9 scaling epics.

Another option is to check the member list, but we must still ensure that the cluster ID of the etcd we are asking about membership is the expected one. Otherwise, we could remove etcd state based on observations of the wrong cluster.

*** Bug 1892413 has been marked as a duplicate of this bug. ***
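The member-list option from the comment above (confirm the cluster ID first, and only then look for our own member ID) could be sketched roughly as follows. This is only an illustration of the decision logic, not the operator's implementation: `MemberListView`, `should_remove_local_state`, and their fields are hypothetical names, and how the member list is actually obtained is left out.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class MemberListView:
    """Hypothetical snapshot of a member-list response (not a real etcd type)."""
    cluster_id: int              # cluster ID reported by the etcd we asked
    member_ids: FrozenSet[int]   # IDs of the members it currently lists


def should_remove_local_state(view: MemberListView,
                              expected_cluster_id: int,
                              local_member_id: int) -> bool:
    """Return True only if the expected cluster no longer lists our member."""
    # Never act on observations of the wrong cluster: if the cluster ID
    # does not match, the answer says nothing about our membership.
    if view.cluster_id != expected_cluster_id:
        return False
    # Same cluster: our old state is stale only if our member ID is absent.
    return local_member_id not in view.member_ids
```

Returning False on a cluster ID mismatch is the conservative choice here: keeping stale state can be cleaned up later, while deleting state because of an answer from the wrong cluster cannot be undone.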
PR to add a verification step that there are exactly 3 etcd members: https://github.com/openshift/openshift-docs/pull/32579

Preview: https://deploy-preview-32579--osdocs.netlify.app/openshift-enterprise/latest/backup_and_restore/replacing-unhealthy-etcd-member.html#restore-replace-stopped-etcd-member_replacing-unhealthy-etcd-member

Andrea, LGTM, thanks. I can't comment on GitHub because there has been a problem with my two-factor authentication in recent days; I only have review rights.

No worries, thanks @Ge Liu!

Created an RFE for a future enhancement for etcd-operator to avoid re-adding a recently deleted member: https://issues.redhat.com/browse/RFE-1870

PR has been merged; moving to RELEASE_PENDING.
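For readers who want to script the "exactly 3 etcd members" verification mentioned above, a minimal sketch is shown here. It assumes the member list is available as JSON with a top-level `members` array (the shape `etcdctl member list -w json` is expected to produce); the function names are hypothetical, and the documented procedure in the linked PR remains the reference.

```python
import json


def count_members(member_list_json: str) -> int:
    """Count entries in the top-level "members" array (assumed JSON shape)."""
    data = json.loads(member_list_json)
    # A missing "members" key is treated as zero members.
    return len(data.get("members", []))


def has_exactly_three_members(member_list_json: str) -> bool:
    """Check the condition the documentation step asks the user to verify."""
    return count_members(member_list_json) == 3
```

If the count is greater than three, a previously removed member may have been added back, which is the situation this bug describes.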