Bug 2022050

Summary: [BM][IPI] Failed during bootstrap - unable to read client-key /var/lib/kubelet/pki/kubelet-client-current.pem
Product: OpenShift Container Platform
Reporter: Yurii Prokulevych <yprokule>
Component: Networking
Sub component: runtime-cfg
Assignee: Ben Nemec <bnemec>
QA Contact: Victor Voronkov <vvoronko>
Status: CLOSED ERRATA
Severity: urgent
Priority: unspecified
CC: achernet, aos-bugs, bfournie, bnemec, lshilin, mcornea, ncarboni, shardy, stbenjam, vvoronko
Version: 4.9
Keywords: Regression
Target Milestone: ---
Target Release: 4.10.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Doc Text:
    Cause: Timing issue with bootstrap keepalived process management
    Consequence: Keepalived stopped when it should have kept running
    Fix: Prevented multiple keepalived commands from being sent within a short time period, avoiding the timing issue
    Result: Keepalived runs and stops only when it should
Type: Bug
Last Closed: 2022-03-10 16:26:42 UTC
Bug Blocks: 2049903

Description Yurii Prokulevych 2021-11-10 16:11:17 UTC
Cluster installation fails during the bootstrap phase.
The 3 masters are up but fail to form a cluster.

Kubelet logs on all masters report the same error:
----------------------------------------------
Nov 10 15:02:28 openshift-master-0 hyperkube[6242]: E1110 15:02:28.220220    6242 kubelet_node_status.go:95] "Unable to register node with API server" err="Post \"https://api-int.kni-qe-4.lab.eng.rdu2.redhat.com:6443/api/v1/nodes\": dial tcp 10.1.208.10:6443: connect: no route to host" node="openshift-master-0"
Nov 10 15:02:28 openshift-master-0 hyperkube[6242]: I1110 15:02:28.220281    6242 csi_plugin.go:1057] Failed to contact API server when waiting for CSINode publishing: Get "https://api-int.kni-qe-4.lab.eng.rdu2.redhat.com:6443/apis/storage.k8s.io/v1/csinodes/openshift-master-0": dial tcp 10.1.208.10:6443: connect: no route to host
Nov 10 15:02:28 openshift-master-0 hyperkube[6242]: E1110 15:02:28.220291    6242 controller.go:144] failed to ensure lease exists, will retry in 7s, error: Get "https://api-int.kni-qe-4.lab.eng.rdu2.redhat.com:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/openshift-master-0?timeout=10s": dial tcp 10.1.208.10:6443: connect: no route to host

Logs from a few containers running on the node report the following error:
-------------------------------------------------------------
time="2021-11-10T15:00:54Z" level=info msg="An error occurred while trying to read master nodes details from api-vip:kube-apiserver: invalid configuration: [unable to read client-cert /var/lib/kubelet/pki/kubelet-client-current.pem for default-auth due to open /var/lib/kubelet/pki/kubelet-client-current.pem: no such file or directory, unable to read client-key /var/lib/kubelet/pki/kubelet-client-current.pem for default-auth due to open /var/lib/kubelet/pki/kubelet-client-current.pem: no such file or directory]"
time="2021-11-10T15:00:54Z" level=info msg="Trying to read master nodes details from localhost:kube-apiserver"
time="2021-11-10T15:00:54Z" level=info msg="Failed to get client config" err="invalid configuration: [unable to read client-cert /var/lib/kubelet/pki/kubelet-client-current.pem for default-auth due to open /var/lib/kubelet/pki/kubelet-client-current.pem: no such file or directory, unable to read client-key /var/lib/kubelet/pki/kubelet-client-current.pem for default-auth due to open /var/lib/kubelet/pki/kubelet-client-current.pem: no such file or directory]"
time="2021-11-10T15:00:54Z" level=error msg="Failed to retrieve API members information" kubeconfigPath=/var/lib/kubelet/kubeconfig
time="2021-11-10T15:00:54Z" level=warning msg="Could not retrieve LB config: invalid configuration: [unable to read client-cert /var/lib/kubelet/pki/kubelet-client-current.pem for default-auth due to open /var/lib/kubelet/pki/kubelet-client-current.pem: no such file or directory, unable to read client-key /var/lib/kubelet/pki/kubelet-client-current.pem for default-auth due to open /var/lib/kubelet/pki/kubelet-client-current.pem: no such file or directory]"

Version:
--------
$ ./openshift-baremetal-install version
./openshift-baremetal-install 4.9.6
built from commit 1c538b8949f3a0e5b993e1ae33b9cd799806fa93
release image quay.io/openshift-release-dev/ocp-release@sha256:c9f58ccb8a9085df4eeb23e21ca201d4c7d39bc434786d58a55381e13215a199
release architecture amd64
Platform:



This is a disconnected setup with the baremetal network using IPv4 and the provisioning network using IPv6.

Comment 3 Ben Nemec 2021-11-16 21:33:02 UTC
Hmm, this is weird. It looks like the problem is that we're sending simultaneous stop and reload messages to keepalived, and apparently the stop is winning even though reload is sent second:

monitor logs:
time="2021-11-10T13:57:25Z" level=info msg="Command message successfully sent to Keepalived container control socket: stop\n"
time="2021-11-10T13:57:25Z" level=info msg="Command message successfully sent to Keepalived container control socket: reload\n"

keepalived logs:
The client sent: stop
The client sent: reload
Wed Nov 10 13:57:25 2021: Stopping

This causes the VIP to be unavailable even though it seems the API has come back up.

My current theory is that something in the main eventloop of the monitor is taking an unusually long time and the handleBootstrapStopKeepalived messages are stacking up and being sent in close proximity to each other instead of with a gap. The stop and reload functions on the keepalived side get called essentially simultaneously and the reload executes before the stop has finished killing the process. As a result, reload sends a signal to a process that's about to die instead of restarting it like we want.
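
For context, each of those "Command message successfully sent" lines is just a short write to keepalived's container control socket, so two sends in the same instant give keepalived no time to finish handling the first. A minimal sketch of that shape, assuming a Unix control socket (the socket path and function name are illustrative, not the actual monitor code):
-------------------------------------------------------------
package main

import (
	"fmt"
	"net"
)

// sendKeepalivedCommand writes one command line, such as "stop" or "reload",
// to the keepalived container's control socket.
func sendKeepalivedCommand(socketPath, cmd string) error {
	conn, err := net.Dial("unix", socketPath)
	if err != nil {
		return err
	}
	defer conn.Close()
	_, err = conn.Write([]byte(cmd + "\n"))
	return err
}

func main() {
	sock := "/var/run/keepalived/keepalived.sock" // assumed path for illustration
	// Sending two commands back to back reproduces the race described above:
	// keepalived reads both almost at once, and "reload" can hit a process
	// that "stop" is already tearing down.
	for _, cmd := range []string{"stop", "reload"} {
		if err := sendKeepalivedCommand(sock, cmd); err != nil {
			fmt.Println("send failed:", err)
		}
	}
}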

I think we can fix this by just adding a short delay between stop and reload messages. I'll have a patch up shortly.
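
A minimal sketch of that kind of guard, assuming the monitor tracks when it last wrote to the control socket (the names and the one-second floor are illustrative, not the actual patch):
-------------------------------------------------------------
package main

import (
	"net"
	"sync"
	"time"
)

// commandThrottle enforces a minimum gap between consecutive keepalived
// commands so that a "reload" cannot arrive while a previous "stop" is
// still tearing the process down.
type commandThrottle struct {
	mu     sync.Mutex
	last   time.Time
	minGap time.Duration
}

func (t *commandThrottle) send(socketPath, cmd string) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	// Space commands out instead of letting them land simultaneously.
	if wait := t.minGap - time.Since(t.last); wait > 0 {
		time.Sleep(wait)
	}
	t.last = time.Now()

	conn, err := net.Dial("unix", socketPath)
	if err != nil {
		return err
	}
	defer conn.Close()
	_, err = conn.Write([]byte(cmd + "\n"))
	return err
}

func main() {
	t := &commandThrottle{minGap: time.Second} // assumed gap for illustration
	_ = t.send("/var/run/keepalived/keepalived.sock", "stop")
	_ = t.send("/var/run/keepalived/keepalived.sock", "reload") // waits for the gap first
}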

Comment 10 Ben Nemec 2022-02-02 21:37:05 UTC
*** Bug 2049892 has been marked as a duplicate of this bug. ***

Comment 12 errata-xmlrpc 2022-03-10 16:26:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056