Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1944386

Summary: [single-node] etcd: discover-etcd-initial-cluster graceful termination race.
Product: OpenShift Container Platform
Reporter: OpenShift BugZilla Robot <openshift-bugzilla-robot>
Component: Etcd
Assignee: Sam Batschelet <sbatsche>
Status: CLOSED ERRATA
QA Contact: ge liu <geliu>
Severity: high
Priority: high
Version: 4.8
Target Milestone: ---
Target Release: 4.7.z
Hardware: Unspecified
OS: Unspecified
Last Closed: 2021-04-20 18:52:39 UTC
Bug Depends On: 1931652
Bug Blocks: 1951823

Description OpenShift BugZilla Robot 2021-03-29 20:17:24 UTC
+++ This bug was initially created as a clone of Bug #1931652 +++

Description of problem: discover-etcd-initial-cluster [1] is used as a gating mechanism to ensure that the etcd container starting in the etcd static pod is part of the cluster before etcd itself starts. To validate this, the command performs a MemberList client RPC, using the ALL_ETCD_ENDPOINTS environment variable to populate the endpoints for the request.

But because a single-node cluster has only one etcd instance (itself), this command races against the previous etcd process, which is still in graceful termination. If the command makes the request after etcd has stopped, it fails, resulting in a catastrophic error that cannot be recovered from automatically.

>  C | etcdmain: error setting up initial cluster: URL scheme must be http, https, unix, or unixs:

[1] https://github.com/openshift/cluster-etcd-operator/blob/release-4.8/bindata/etcd/pod.yaml#L122


Version-Release number of selected component (if applicable):


How reproducible: unknown; I would estimate 10-15%.


Steps to Reproduce:
1. Install a single-node cluster.

Actual results: the race fails randomly.


Expected results: the race does not exist.


Additional info:

--- Additional comment from geliu on 2021-02-24 07:58:28 UTC ---

Hello Sam, I also set up an SNO cluster today with a pre-merge build, and I see many error messages in the operator log saying "quorum of 1 is not tolerant". I suppose the operator has not yet distinguished an SNO cluster from a multi-master cluster. Is that correct?

E0224 02:11:12.210571       1 health.go:215] etcd cluster has quorum of 1 which is not fault tolerant: [{Member:ID:14020120526093556915 name:"ip-10-0-140-142.us-west-2.compute.internal" peerURLs:"https://10.0.140.142:2380" clientURLs:"https://10.0.140.142:2379"  Healthy:true Took:709.719µs Error:<nil>}]
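For context on that log line: etcd needs a majority (quorum) of members to stay available, so an n-member cluster tolerates n minus quorum failures; with a single member that is zero, which is why the health check flags it. A minimal sketch of the arithmetic, assuming nothing beyond the standard majority rule (the function names are mine, not etcd's):

```go
package main

import "fmt"

// quorum returns the majority size for an n-member etcd cluster.
func quorum(n int) int { return n/2 + 1 }

// faultTolerance returns how many members can fail while the
// cluster still retains quorum.
func faultTolerance(n int) int { return n - quorum(n) }

func main() {
	for _, n := range []int{1, 3, 5} {
		fmt.Printf("members=%d quorum=%d tolerates=%d failures\n", n, quorum(n), faultTolerance(n))
	}
	// members=1 quorum=1 tolerates=0 failures: the condition geliu's
	// operator log reports on a single-node (SNO) cluster.
}
```

On SNO this condition is expected rather than a misconfiguration, which is the point of geliu's question about the operator treating SNO differently.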

Comment 1 ge liu 2021-03-31 10:12:59 UTC
Verified in pre-merge test with 4.7.0-0.ci.test-2021-03-31-085511-ci-ln-wr15tlb.

Comment 5 errata-xmlrpc 2021-04-20 18:52:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.7 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1149