1931652 – [single-node] etcd: discover-etcd-initial-cluster graceful termination race.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Etcd
Sub Component:
Version:	4.8
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Sam Batschelet
QA Contact:	ge liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1944386

Reported:	2021-02-22 21:09 UTC by Sam Batschelet
Modified:	2021-07-27 22:48 UTC (History)
CC List:	0 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-07-27 22:47:38 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift etcd pull 73	0	None	open	Bug 1931652: openshift-tools: fix on off flow and add unit tests	2021-03-18 18:29:26 UTC
Red Hat Product Errata	RHSA-2021:2438	0	None	None	None	2021-07-27 22:48:13 UTC

Description Sam Batschelet 2021-02-22 21:09:25 UTC

Description of problem: discover-etcd-initial-cluster[1] is used as a gating mechanism to ensure that the etcd container starting in the etcd static pod is a member of the cluster before it starts. To validate this, the command issues a MemberList client RPC, using the ALL_ETCD_ENDPOINTS environment variable to populate the endpoints for the client request.

With a single node, however, there is only one etcd instance (self), so the command races against the previous etcd process, which is still in graceful termination. If the command makes the request after that etcd process has stopped, the request fails, resulting in a catastrophic error that cannot be recovered from automatically.

>  C | etcdmain: error setting up initial cluster: URL scheme must be http, https, unix, or unixs:

[1] https://github.com/openshift/cluster-etcd-operator/blob/release-4.8/bindata/etcd/pod.yaml#L122
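
The race described above can be illustrated with a small model. This is only a sketch under stated assumptions, not the code of discover-etcd-initial-cluster and not the change made in the linked pull request: `gate`, `GateResult`, and the `list_members` callable are hypothetical names standing in for the MemberList request. The sketch shows one conceivable way to tell "the only endpoint is this node and it is not serving" (the single-node case) apart from a general connectivity failure, so that a caller could avoid treating the former as fatal.

```python
from enum import Enum
from typing import Callable, List, Sequence


class GateResult(Enum):
    """Hypothetical outcomes of the membership gate."""
    MEMBER = "member"                  # MemberList answered and lists this node
    NOT_MEMBER = "not_member"          # MemberList answered, node is not listed
    ONLY_SELF_DOWN = "only_self_down"  # sole endpoint is this node, not serving
    UNREACHABLE = "unreachable"        # other endpoints exist, none answered


def gate(
    list_members: Callable[[Sequence[str]], List[str]],
    endpoints: Sequence[str],
    self_endpoint: str,
    self_name: str,
) -> GateResult:
    """Classify the result of a MemberList-style request.

    `list_members` is an assumed stand-in for the client RPC: it returns
    member names, or raises ConnectionError when no endpoint answers.
    """
    try:
        members = list_members(endpoints)
    except ConnectionError:
        # On a single node the endpoint list contains only this node. If the
        # previous etcd process has already stopped, nothing can answer.
        if set(endpoints) == {self_endpoint}:
            return GateResult.ONLY_SELF_DOWN
        return GateResult.UNREACHABLE
    return GateResult.MEMBER if self_name in members else GateResult.NOT_MEMBER


if __name__ == "__main__":
    def stopped(_endpoints: Sequence[str]) -> List[str]:
        # Models the request arriving after the old process has exited.
        raise ConnectionError("connection refused")

    ep = "https://10.0.140.142:2379"
    print(gate(stopped, [ep], ep, "node-a"))            # lost the race
    print(gate(lambda e: ["node-a"], [ep], ep, "node-a"))  # won the race
```

In this model the two calls at the bottom correspond to the two sides of the race: the request arriving after the previous process has stopped, and the request arriving while it is still serving.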


Version-Release number of selected component (if applicable):


How reproducible: Unknown; I would guess 10-15% of the time.


Steps to Reproduce:
1. Install a single-node cluster.

Actual results: The race randomly causes a failure.


Expected results: The race does not exist.


Additional info:

Comment 1 ge liu 2021-02-24 07:58:28 UTC

Hello Sam, I also set up an SNO cluster today with a pre-merge build, and I found many error messages in the operator log about "quorum of 1 is not tolerant". I suppose the operator does not yet distinguish an SNO cluster from a multi-master cluster. Is that correct?

E0224 02:11:12.210571       1 health.go:215] etcd cluster has quorum of 1 which is not fault tolerant: [{Member:ID:14020120526093556915 name:"ip-10-0-140-142.us-west-2.compute.internal" peerURLs:"https://10.0.140.142:2380" clientURLs:"https://10.0.140.142:2379"  Healthy:true Took:709.719µs Error:<nil>}]
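
For context on the log message above, the arithmetic behind "quorum of 1 which is not fault tolerant" can be sketched as follows. This is not the operator's health.go code; the function names `quorum` and `fault_tolerance` are hypothetical. It relies only on the standard majority rule for a cluster of n voting members: quorum is floor(n/2) + 1, and the number of member failures the cluster can survive is n minus quorum.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """Number of members that can fail while a majority remains."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(f"members={n} quorum={quorum(n)} tolerance={fault_tolerance(n)}")
```

With a single member the quorum is 1 and the tolerance is 0, which matches the condition the log line reports for a single-node cluster.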

Comment 5 errata-xmlrpc 2021-07-27 22:47:38 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
