Bug 1944386 - [single-node] etcd: discover-etcd-initial-cluster graceful termination race.
Summary: [single-node] etcd: discover-etcd-initial-cluster graceful termination race.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.z
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Depends On: 1931652
Blocks: 1951823
Reported: 2021-03-29 20:17 UTC by OpenShift BugZilla Robot
Modified: 2021-04-20 22:32 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-04-20 18:52:39 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift etcd pull 74 (open): [openshift-4.7] ETCD-178: Bug 1944386: openshift-tools: fix on off flow and add unit tests — last updated 2021-03-31 01:00:52 UTC
Red Hat Product Errata RHBA-2021:1149 — last updated 2021-04-20 18:52:59 UTC

Description OpenShift BugZilla Robot 2021-03-29 20:17:24 UTC
+++ This bug was initially created as a clone of Bug #1931652 +++

Description of problem: discover-etcd-initial-cluster [1] is used as a gating mechanism to ensure that the etcd container starting in the etcd static pod is part of the cluster before etcd itself starts. To validate this, the command performs a MemberList client RPC, using the ALL_ETCD_ENDPOINTS env var to populate the endpoints for the client request.

But because a single-node cluster has only one etcd instance (itself), this command races against the previous etcd process, which is in graceful termination. If the command makes the request after etcd has stopped, it fails, resulting in a catastrophic error that cannot be recovered from automatically.

>  C | etcdmain: error setting up initial cluster: URL scheme must be http, https, unix, or unixs:

[1] https://github.com/openshift/cluster-etcd-operator/blob/release-4.8/bindata/etcd/pod.yaml#L122


Version-Release number of selected component (if applicable):


How reproducible: unknown; estimated 10-15% of the time


Steps to Reproduce:
1.install single node
2.
3.

Actual results: the race causes etcd startup to fail randomly


Expected results: the race does not exist; etcd starts reliably


Additional info:

--- Additional comment from geliu on 2021-02-24 07:58:28 UTC ---

Hello Sam, I also set up an SNO cluster today with a pre-merge build, and I see many error messages in the operator log about "quorum of 1 is not tolerant". I suspect the operator does not distinguish an SNO cluster from a multi-master cluster. Is that correct?

E0224 02:11:12.210571       1 health.go:215] etcd cluster has quorum of 1 which is not fault tolerant: [{Member:ID:14020120526093556915 name:"ip-10-0-140-142.us-west-2.compute.internal" peerURLs:"https://10.0.140.142:2380" clientURLs:"https://10.0.140.142:2379"  Healthy:true Took:709.719µs Error:<nil>}]

Comment 1 ge liu 2021-03-31 10:12:59 UTC
Verified in pre-merge test with 4.7.0-0.ci.test-2021-03-31-085511-ci-ln-wr15tlb.

Comment 5 errata-xmlrpc 2021-04-20 18:52:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.7 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1149

