Bug 1931652 - [single-node] etcd: discover-etcd-initial-cluster graceful termination race.
Summary: [single-node] etcd: discover-etcd-initial-cluster graceful termination race.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks: 1944386
Reported: 2021-02-22 21:09 UTC by Sam Batschelet
Modified: 2021-07-27 22:48 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:47:38 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift etcd pull 73 0 None open Bug 1931652: openshift-tools: fix on off flow and add unit tests 2021-03-18 18:29:26 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:48:13 UTC

Description Sam Batschelet 2021-02-22 21:09:25 UTC
Description of problem: discover-etcd-initial-cluster[1] is used as a gating mechanism to ensure that the etcd container starting in the etcd static pod is a member of the cluster before it starts. To validate this, the command issues a client MemberList RPC, using the ALL_ETCD_ENDPOINTS environment variable to populate the endpoints for the request.

But because a single-node cluster has only one etcd instance (self), this command races against the previous etcd process, which is still in graceful termination. If the command makes the request after the previous etcd has stopped, the request fails, resulting in a catastrophic error that cannot be recovered from automatically.

>  C | etcdmain: error setting up initial cluster: URL scheme must be http, https, unix, or unixs:

[1] https://github.com/openshift/cluster-etcd-operator/blob/release-4.8/bindata/etcd/pod.yaml#L122


Version-Release number of selected component (if applicable):


How reproducible: unknown; I would guess 10-15%


Steps to Reproduce:
1. Install a single-node cluster.

Actual results: etcd startup fails intermittently when the race is lost.


Expected results: the race does not exist.


Additional info:

Comment 1 ge liu 2021-02-24 07:58:28 UTC
Hello Sam, I also set up an SNO cluster today with a pre-merge build, and I see many error messages in the operator log saying "quorum of 1 which is not fault tolerant". I suppose the operator does not yet distinguish an SNO cluster from a multi-master cluster. Is that right?

E0224 02:11:12.210571       1 health.go:215] etcd cluster has quorum of 1 which is not fault tolerant: [{Member:ID:14020120526093556915 name:"ip-10-0-140-142.us-west-2.compute.internal" peerURLs:"https://10.0.140.142:2380" clientURLs:"https://10.0.140.142:2379"  Healthy:true Took:709.719µs Error:<nil>}]
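
For reference, the arithmetic behind a message like this can be sketched as follows, assuming the standard majority rule (quorum is n/2 + 1 for n voting members). The quorum and faultTolerance helpers are hypothetical names for this illustration and are not taken from the operator's health.go.

```go
package main

import "fmt"

// quorum returns the majority size for a cluster of n voting members.
func quorum(n int) int { return n/2 + 1 }

// faultTolerance returns how many members can fail while a majority remains.
func faultTolerance(n int) int { return n - quorum(n) }

func main() {
	for _, n := range []int{1, 3, 5} {
		fmt.Printf("members=%d quorum=%d tolerated failures=%d\n",
			n, quorum(n), faultTolerance(n))
	}
}
```

Under this rule a one-member cluster tolerates zero failures, so the warning is expected on SNO unless the check treats single-node clusters as a special case.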

Comment 5 errata-xmlrpc 2021-07-27 22:47:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

