Cluster installation fails during bootstrap. 3 masters are up but fail to form a cluster.

Kubelet logs on all masters report the same error:
----------------------------------------------
Nov 10 15:02:28 openshift-master-0 hyperkube[6242]: E1110 15:02:28.220220 6242 kubelet_node_status.go:95] "Unable to register node with API server" err="Post \"https://api-int.kni-qe-4.lab.eng.rdu2.redhat.com:6443/api/v1/nodes\": dial tcp 10.1.208.10:6443: connect: no route to host" node="openshift-master-0"
Nov 10 15:02:28 openshift-master-0 hyperkube[6242]: I1110 15:02:28.220281 6242 csi_plugin.go:1057] Failed to contact API server when waiting for CSINode publishing: Get "https://api-int.kni-qe-4.lab.eng.rdu2.redhat.com:6443/apis/storage.k8s.io/v1/csinodes/openshift-master-0": dial tcp 10.1.208.10:6443: connect: no route to host
Nov 10 15:02:28 openshift-master-0 hyperkube[6242]: E1110 15:02:28.220291 6242 controller.go:144] failed to ensure lease exists, will retry in 7s, error: Get "https://api-int.kni-qe-4.lab.eng.rdu2.redhat.com:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/openshift-master-0?timeout=10s": dial tcp 10.1.208.10:6443: connect: no route to host

Logs from a few containers running on a node report the following error:
-------------------------------------------------------------
time="2021-11-10T15:00:54Z" level=info msg="An error occurred while trying to read master nodes details from api-vip:kube-apiserver: invalid configuration: [unable to read client-cert /var/lib/kubelet/pki/kubelet-client-current.pem for default-auth due to open /var/lib/kubelet/pki/kubelet-client-current.pem: no such file or directory, unable to read client-key /var/lib/kubelet/pki/kubelet-client-current.pem for default-auth due to open /var/lib/kubelet/pki/kubelet-client-current.pem: no such file or directory]"
time="2021-11-10T15:00:54Z" level=info msg="Trying to read master nodes details from localhost:kube-apiserver"
time="2021-11-10T15:00:54Z" level=info msg="Failed to get client config" err="invalid configuration: [unable to read client-cert /var/lib/kubelet/pki/kubelet-client-current.pem for default-auth due to open /var/lib/kubelet/pki/kubelet-client-current.pem: no such file or directory, unable to read client-key /var/lib/kubelet/pki/kubelet-client-current.pem for default-auth due to open /var/lib/kubelet/pki/kubelet-client-current.pem: no such file or directory]"
time="2021-11-10T15:00:54Z" level=error msg="Failed to retrieve API members information" kubeconfigPath=/var/lib/kubelet/kubeconfig
time="2021-11-10T15:00:54Z" level=warning msg="Could not retrieve LB config: invalid configuration: [unable to read client-cert /var/lib/kubelet/pki/kubelet-client-current.pem for default-auth due to open /var/lib/kubelet/pki/kubelet-client-current.pem: no such file or directory, unable to read client-key /var/lib/kubelet/pki/kubelet-client-current.pem for default-auth due to open /var/lib/kubelet/pki/kubelet-client-current.pem: no such file or directory]"

Version:
--------
$ ./openshift-baremetal-install version
./openshift-baremetal-install 4.9.6
built from commit 1c538b8949f3a0e5b993e1ae33b9cd799806fa93
release image quay.io/openshift-release-dev/ocp-release@sha256:c9f58ccb8a9085df4eeb23e21ca201d4c7d39bc434786d58a55381e13215a199
release architecture amd64

Platform:
---------
This is a disconnected setup with the baremetal network using IPv4 and the provisioning network using IPv6.
Hmm, this is weird. It looks like the problem is that we're sending simultaneous stop and reload messages to keepalived, and apparently the stop is winning even though the reload is sent second:

monitor logs:
time="2021-11-10T13:57:25Z" level=info msg="Command message successfully sent to Keepalived container control socket: stop\n"
time="2021-11-10T13:57:25Z" level=info msg="Command message successfully sent to Keepalived container control socket: reload\n"

keepalived logs:
The client sent: stop
The client sent: reload
Wed Nov 10 13:57:25 2021: Stopping

This causes the VIP to be unavailable even though the API appears to have come back up.

My current theory is that something in the main event loop of the monitor is taking an unusually long time, so the handleBootstrapStopKeepalived messages stack up and are sent in close proximity to each other instead of with a gap between them. The stop and reload functions on the keepalived side then get called essentially simultaneously, and the reload executes before the stop has finished killing the process. As a result, the reload sends a signal to a process that's about to die instead of restarting it like we want.

I think we can fix this by just adding a short delay between the stop and reload messages. I'll have a patch up shortly.
*** Bug 2049892 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056