Bug 2022050 - [BM][IPI] Failed during bootstrap - unable to read client-key /var/lib/kubelet/pki/kubelet-client-current.pem
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.10.0
Assignee: Ben Nemec
QA Contact: Victor Voronkov
URL:
Whiteboard:
Duplicates: 2049892
Depends On:
Blocks: 2049903
 
Reported: 2021-11-10 16:11 UTC by Yurii Prokulevych
Modified: 2022-03-10 16:27 UTC
CC List: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Timing issue with bootstrap keepalived process management.
Consequence: Keepalived stopped when it should be running.
Fix: Prevented multiple keepalived commands from being sent within a short time period, which avoids the timing issue.
Result: Keepalived runs and stops only when it should.
Clone Of:
Environment:
Last Closed: 2022-03-10 16:26:42 UTC
Target Upstream Version:
Embargoed:




Links
System ID | Private | Priority | Status | Summary | Last Updated
Github openshift baremetal-runtimecfg pull 158 | 0 | None | open | Bug 2022050: Add delay after sending bootstrap stop and start messages | 2021-11-16 21:44:48 UTC
Red Hat Product Errata RHSA-2022:0056 | 0 | None | None | None | 2022-03-10 16:27:16 UTC

Description Yurii Prokulevych 2021-11-10 16:11:17 UTC
Cluster installation fails during the bootstrap phase.
3 masters are up but fail to form a cluster.

Kubelet logs on all masters report the same error:
----------------------------------------------
Nov 10 15:02:28 openshift-master-0 hyperkube[6242]: E1110 15:02:28.220220    6242 kubelet_node_status.go:95] "Unable to register node with API server" err="Post \"https://api-int.kni-qe-4.lab.eng.rdu2.redhat.com:6443/api/v1/nodes\": dial tcp 10.1.208.10:6443: connect: no route to host" node="openshift-master-0"
Nov 10 15:02:28 openshift-master-0 hyperkube[6242]: I1110 15:02:28.220281    6242 csi_plugin.go:1057] Failed to contact API server when waiting for CSINode publishing: Get "https://api-int.kni-qe-4.lab.eng.rdu2.redhat.com:6443/apis/storage.k8s.io/v1/csinodes/openshift-master-0": dial tcp 10.1.208.10:6443: connect: no route to host
Nov 10 15:02:28 openshift-master-0 hyperkube[6242]: E1110 15:02:28.220291    6242 controller.go:144] failed to ensure lease exists, will retry in 7s, error: Get "https://api-int.kni-qe-4.lab.eng.rdu2.redhat.com:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/openshift-master-0?timeout=10s": dial tcp 10.1.208.10:6443: connect: no route to host

Logs from a few containers running on a node report the following error:
-------------------------------------------------------------
time="2021-11-10T15:00:54Z" level=info msg="An error occurred while trying to read master nodes details from api-vip:kube-apiserver: invalid configuration: [unable to read client-cert /var/lib/kubelet/pki/kubelet-client-current.pem for default-auth due to open /var/lib/kubelet/pki/kubelet-client-current.pem: no such file or directory, unable to read client-key /var/lib/kubelet/pki/kubelet-client-current.pem for default-auth due to open /var/lib/kubelet/pki/kubelet-client-current.pem: no such file or directory]"
time="2021-11-10T15:00:54Z" level=info msg="Trying to read master nodes details from localhost:kube-apiserver"
time="2021-11-10T15:00:54Z" level=info msg="Failed to get client config" err="invalid configuration: [unable to read client-cert /var/lib/kubelet/pki/kubelet-client-current.pem for default-auth due to open /var/lib/kubelet/pki/kubelet-client-current.pem: no such file or directory, unable to read client-key /var/lib/kubelet/pki/kubelet-client-current.pem for default-auth due to open /var/lib/kubelet/pki/kubelet-client-current.pem: no such file or directory]"
time="2021-11-10T15:00:54Z" level=error msg="Failed to retrieve API members information" kubeconfigPath=/var/lib/kubelet/kubeconfig
time="2021-11-10T15:00:54Z" level=warning msg="Could not retrieve LB config: invalid configuration: [unable to read client-cert /var/lib/kubelet/pki/kubelet-client-current.pem for default-auth due to open /var/lib/kubelet/pki/kubelet-client-current.pem: no such file or directory, unable to read client-key /var/lib/kubelet/pki/kubelet-client-current.pem for default-auth due to open /var/lib/kubelet/pki/kubelet-client-current.pem: no such file or directory]"

Version:
--------
$ ./openshift-baremetal-install version
./openshift-baremetal-install 4.9.6
built from commit 1c538b8949f3a0e5b993e1ae33b9cd799806fa93
release image quay.io/openshift-release-dev/ocp-release@sha256:c9f58ccb8a9085df4eeb23e21ca201d4c7d39bc434786d58a55381e13215a199
release architecture amd64
Platform:



This is a disconnected setup with an IPv4 baremetal network and an IPv6 provisioning network.

Comment 3 Ben Nemec 2021-11-16 21:33:02 UTC
Hmm, this is weird. It looks like the problem is that we're sending simultaneous stop and reload messages to keepalived, and apparently the stop is winning even though reload is sent second:

monitor logs:
time="2021-11-10T13:57:25Z" level=info msg="Command message successfully sent to Keepalived container control socket: stop\n"
time="2021-11-10T13:57:25Z" level=info msg="Command message successfully sent to Keepalived container control socket: reload\n"

keepalived logs:
The client sent: stop
The client sent: reload
Wed Nov 10 13:57:25 2021: Stopping

This causes the VIP to be unavailable even though it seems the API has come back up.

My current theory is that something in the main eventloop of the monitor is taking an unusually long time and the handleBootstrapStopKeepalived messages are stacking up and being sent in close proximity to each other instead of with a gap. The stop and reload functions on the keepalived side get called essentially simultaneously and the reload executes before the stop has finished killing the process. As a result, reload sends a signal to a process that's about to die instead of restarting it like we want.

I think we can fix this by just adding a short delay between stop and reload messages. I'll have a patch up shortly.
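For illustration, here is a minimal sketch of that approach in Go (the language of baremetal-runtimecfg): it writes a single command to the keepalived container control socket and then sleeps briefly before returning, so that a following "stop" or "reload" cannot land in the same instant. The socket path, delay value, and function names are assumptions made for this sketch, not the actual code; see pull 158 linked above for the real change.

package main

import (
    "fmt"
    "net"
    "time"
)

// Assumed values for this sketch; the real socket path and delay live in
// baremetal-runtimecfg and may differ.
const (
    keepalivedControlSock = "/var/run/keepalived/keepalived.sock"
    postSendDelay         = 1 * time.Second
)

// sendKeepalivedCommand writes one command ("stop", "reload", ...) to the
// control socket, then pauses so keepalived can act on it before the caller
// sends the next command.
func sendKeepalivedCommand(cmd string) error {
    conn, err := net.Dial("unix", keepalivedControlSock)
    if err != nil {
        return fmt.Errorf("connecting to keepalived control socket: %w", err)
    }
    defer conn.Close()

    if _, err := conn.Write([]byte(cmd + "\n")); err != nil {
        return fmt.Errorf("sending %q: %w", cmd, err)
    }

    // The delay is the point of the fix: without it, "stop" and "reload"
    // written back-to-back are handled almost simultaneously, and the reload
    // ends up signalling a process that is already shutting down.
    time.Sleep(postSendDelay)
    return nil
}

func main() {
    for _, cmd := range []string{"stop", "reload"} {
        if err := sendKeepalivedCommand(cmd); err != nil {
            fmt.Println(err)
        }
    }
}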

Comment 10 Ben Nemec 2022-02-02 21:37:05 UTC
*** Bug 2049892 has been marked as a duplicate of this bug. ***

Comment 12 errata-xmlrpc 2022-03-10 16:26:42 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
