Description of problem:

Today etcd waits for its ports[1] to be released by the previous process before it starts again. While this seems logical, it takes a very long time. SO_REUSEADDR allows the port to be reused by multiple processes, and support has been available since golang 1.11.

# example of port wait logging from etcd container
> Waiting for ports 2379, 2380 and 9978 to be released.............................................................ETCD_PORT_2379_TCP_PORT=2379

Each dot here represents a 1-second sleep, so the wait is ~62 seconds.

kube-apiserver already has this ability[2].

Consideration and care should be taken to reduce exposure to etcd attempting to read an already flocked data file. While we will block and wait, bugs have existed here in the past, and we can be smarter during init.

# logging of etcd blocking start if another process is holding lock on data file.
> {"level":"info","ts":"2021-02-11T20:02:06.532Z","caller":"etcdserver/backend.go:86","msg":"db file is flocked by another process, or taking too long","path":"/var/lib/etcd/member/snap/db","took":"10.000115249s"}

[1] https://github.com/openshift/cluster-etcd-operator/blob/release-4.7/bindata/etcd/pod.yaml#L135
[2] https://github.com/kubernetes/kubernetes/pull/88893

Version-Release number of selected component (if applicable):

How reproducible: 100%

Steps to Reproduce:
1. Kill the etcd process (oc -n openshift-etcd rsh -c etcd -T $pod kill 1) and tail the logs of the new etcd process.
2.
3.

Actual results: etcd waits a considerable amount of time before it can start a new process.

Expected results: the etcd process should recover quickly, including in the case of quorum loss (kill a majority of etcd processes): "<5s and p95"

Additional info:
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438