Description of problem:
During installation of 4.10 fc builds, the agentclusterinstall times out waiting for the cluster version to become available. The cluster version gets stuck at a certain percentage with no further progress, and some pods on the spoke cluster are in a bad state, including etcd.

Version-Release number of selected component (if applicable):
4.10 fc builds (tried 4.10.0-fc.1/2/3)

How reproducible:
100% reproducible on a specific server (cnfocto2); the same server has no issue installing 4.9 z-stream builds. The problem was also seen once on a different server.

Steps to Reproduce:
1. Trigger the DU node install via ZTP. The siteconfig is here: http://registry.kni-qe-0.lab.eng.rdu2.redhat.com:3000/kni-qe/ztp-site-configs/src/b280ce1d9bf0c203af26ebe18fe382eecb65bcf4/siteconfig/cnfocto2.yaml
2. Wait for the install to begin by monitoring the agentclusterinstall (a sketch of the monitoring commands is at the end of this report).
3. Wait for the install to complete by monitoring the agentclusterinstall.

Actual results:

# agentclusterinstall timed out waiting for the cluster version:
  state: error
  stateInfo: 'Timeout while waiting for cluster version to be available: timed out'
  machineNetwork:
  - cidr: 10.16.231.0/24

Expected results:
OCP installed successfully.

Additional info:

# clusterversion on the spoke got stuck for hours with no further progress (in previous installs it eventually failed):
  - lastTransitionTime: "2022-01-27T15:22:53Z"
    message: 'Working towards 4.10.0-fc.3: 370 of 769 done (48% complete)'
    status: "True"
    type: Progressing

# some pods in a bad state, including etcd:
NAMESPACE                                NAME                                                           READY   STATUS              RESTARTS         AGE
openshift-cloud-credential-operator      cloud-credential-operator-69b7f7d54f-jf49f                     0/2     ContainerCreating   0                133m
openshift-cluster-machine-approver       machine-approver-755948d859-ccdqv                              0/2     ContainerCreating   0                133m
openshift-cluster-node-tuning-operator   cluster-node-tuning-operator-555f4f9b5f-5nksx                  0/1     ContainerCreating   0                133m
openshift-cluster-storage-operator       csi-snapshot-webhook-c948dc568-ppbrp                           0/1     ContainerCreating   0                129m
openshift-cluster-version                cluster-version-operator-655b45f98b-lzd2z                      0/1     ContainerCreating   0                133m
openshift-dns-operator                   dns-operator-f85999bff-fgj2s                                   0/2     ContainerCreating   0                133m
openshift-etcd                           etcd-cnfocto2.ptp.lab.eng.bos.redhat.com                       3/4     CrashLoopBackOff    31 (106s ago)    124m
openshift-ingress-operator               ingress-operator-5b78cd4b5b-7r65r                              0/2     ContainerCreating   0                133m
openshift-kube-controller-manager        kube-controller-manager-cnfocto2.ptp.lab.eng.bos.redhat.com   3/4     CrashLoopBackOff    26 (4m30s ago)   112m
openshift-machine-config-operator        machine-config-daemon-2gr8r                                    0/2     ContainerCreating   0                130m
openshift-marketplace                    marketplace-operator-97478fb5c-rxk4z                           0/1     ContainerCreating   0                133m
openshift-monitoring                     cluster-monitoring-operator-7b87d99dd7-lshpg                   0/2     ContainerCreating   0                133m
openshift-multus                         ip-reconciler-27388425-ldjvp                                   0/1     Pending             0                111s
openshift-multus                         multus-admission-controller-crhhm                              0/2     ContainerCreating   0                130m
openshift-multus                         network-metrics-daemon-smn2s                                   0/2     ContainerCreating   0                132m
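# (note) equivalent spoke-side output can be gathered with standard oc commands; a minimal sketch, assuming a kubeconfig pointed at the spoke cluster (the grep filter is illustrative, not from the original run):
  oc get clusterversion version -o yaml                          # Progressing condition shown above
  oc get pods -A -o wide | grep -Ev 'Running|Completed'          # pods stuck in ContainerCreating/CrashLoopBackOff
  oc -n openshift-etcd get pod etcd-cnfocto2.ptp.lab.eng.bos.redhat.com -o yaml   # etcd-health-monitor container status below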
"8s.io/apimachinery.0/pkg/util/wait/wait.go:167 +0x13b\nk8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0, 0x12a05f200, 0x0, 0x5, 0xc0001ecfd0)\n\tk8s.io/apimachinery.0/pkg/util/wait/wait.go:133 +0x89\nk8s.io/apimachinery/pkg/util/wait.Until(...)\n\tk8s.io/apimachinery.0/pkg/util/wait/wait.go:90\nk8s.io/apimachinery/pkg/util/wait.Forever(0x0, 0xc00026b140)\n\tk8s.io/apimachinery.0/pkg/util/wait/wait.go:81 +0x28\ncreated by k8s.io/component-base/logs.InitLogs\n\tk8s.io/component-base.0/logs/logs.go:179 +0x85\n\ngoroutine 87 [select]:\ngithub.com/openshift/cluster-etcd-operator/pkg/cmd/monitor.(*monitorOpts).Run.func1()\n\tgithub.com/openshift/cluster-etcd-operator/pkg/cmd/monitor/monitor.go:139 +0x85\ncreated by github.com/openshift/cluster-etcd-operator/pkg/cmd/monitor.(*monitorOpts).Run\n\tgithub.com/openshift/cluster-etcd-operator/pkg/cmd/monitor/monitor.go:138 +0x18f\n\ngoroutine 88 [IO wait]:\ninternal/poll.runtime_pollWait(0x7f3c9c9b4798, 0x72)\n\truntime/netpoll.go:229 +0x89\ninternal/poll.(*pollDesc).wait(0xc000768000, 0x4, 0x0)\n\tinternal/poll/fd_poll_runtime.go:84 +0x32\ninternal/poll.(*pollDesc).waitRead(...)\n\tinternal/poll/fd_poll_runtime.go:89\ninternal/poll.(*FD).Accept(0xc000768000)\n\tinternal/poll/fd_unix.go:402 +0x22c\nnet.(*netFD).accept(0xc000768000)\n\tnet/fd_unix.go:173 +0x35\nnet.(*TCPListener).accept(0xc000522000)\n\tnet/tcpsock_posix.go:140 +0x28\nnet.(*TCPListener).Accept(0xc000522000)\n\tnet/tcpsock.go:262 +0x3d\nnet/http.(*Server).Serve(0xc0001420e0, {0x296f1a0, 0xc000522000})\n\tnet/http/server.go:3001 +0x394\nnet/http.(*Server).ListenAndServe(0xc0001420e0)\n\tnet/http/server.go:2930 +0x7d\nnet/http.ListenAndServe(...)\n\tnet/http/server.go:3184\ngithub.com/openshift/library-go/pkg/serviceability.StartProfiler.func1()\n" reason: Error startedAt: "2022-01-27T16:50:28Z" name: etcd-health-monitor ready: false restartCount: 21 started: false state: waiting: message: back-off 5m0s restarting failed container=etcd-health-monitor pod=etcd-cnfocto2.ptp.lab.eng.bos.redhat.com_openshift-etcd(c31234d31c7c177b8475cd874811e26a) reason: CrashLoopBackOff # pod describe: Warning Unhealthy 178m kubelet Readiness probe failed: + unset ETCDCTL_ENDPOINTS + /usr/bin/etcdctl --command-timeout=2s --dial-timeout=2s --endpoints=unixs://10.16.231.12:0 endpoint health -w json + grep '"health":true' {"level":"warn","ts":1643298097.4124234,"logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00035ac40/#initially=[unixs://10.16.231.12:0]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial unix 10.16.231.12:0: connect: no such file or directory\""} Error: unhealthy cluster