Description of problem:

discover-etcd-initial-cluster [1] is used as a gating mechanism to ensure that the etcd container starting in the etcd static pod is part of the cluster before it starts. To validate this, the command performs a client RPC to MemberList, using the ALL_ETCD_ENDPOINTS env var to populate the endpoints for the client request. But because a single-node cluster has only one etcd instance (itself), this command races against the previous etcd process, which is in graceful termination. If the command makes the request after etcd has stopped, it will fail, resulting in a catastrophic error that cannot be recovered from automatically.

> C | etcdmain: error setting up initial cluster: URL scheme must be http, https, unix, or unixs:

[1] https://github.com/openshift/cluster-etcd-operator/blob/release-4.8/bindata/etcd/pod.yaml#L122

Version-Release number of selected component (if applicable):

How reproducible:
Unknown; I would guess 10-15%.

Steps to Reproduce:
1. Install single node
2.
3.

Actual results:
The race will fail randomly.

Expected results:
The race does not exist.

Additional info:
Hello Sam, I also set up an SNO cluster today with a pre-merge build, and I see many error messages in the operator log saying "quorum of 1 which is not fault tolerant". I suppose the operator has not yet distinguished an SNO cluster from a multi-master cluster. Is that expected?

E0224 02:11:12.210571 1 health.go:215] etcd cluster has quorum of 1 which is not fault tolerant: [{Member:ID:14020120526093556915 name:"ip-10-0-140-142.us-west-2.compute.internal" peerURLs:"https://10.0.140.142:2380" clientURLs:"https://10.0.140.142:2379" Healthy:true Took:709.719µs Error:<nil>}]
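The arithmetic behind that warning is straightforward: an etcd cluster of n members needs a quorum of floor(n/2)+1, so it tolerates n minus quorum failures, which is zero for a single member. A small sketch (the helper name is mine, not from the operator):

```go
package main

import "fmt"

// faultTolerance returns how many members an etcd cluster of the given size
// can lose while still keeping quorum (quorum = n/2 + 1 in integer math).
func faultTolerance(members int) int {
	quorum := members/2 + 1
	return members - quorum
}

func main() {
	for _, n := range []int{1, 3, 5} {
		fmt.Printf("members=%d fault tolerance=%d\n", n, faultTolerance(n))
	}
	// prints:
	// members=1 fault tolerance=0
	// members=3 fault tolerance=1
	// members=5 fault tolerance=2
}
```

So on SNO the message is technically true for any one-member cluster; the question is whether the operator should log it at error level when single-node is the intended topology.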
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438