+++ This bug was initially created as a clone of Bug #1817382 +++

During the rebase to Kubernetes 1.18, kubelet crashes here:
https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/csi/csi_plugin.go#L285

The reason is that TLS bootstrap[1] can take longer than 60 seconds to establish communication with the API server, so kubelet is not able to publish CSINode.

1: https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet-tls-bootstrapping/

The CSI volume plugin should start waiting for CSINode *after* communication with the API server is fully established.

--- Additional comment from Jan Safranek on 2020-03-26 09:39:08 UTC ---

As part of the fix, we should revert this temporary hack: https://github.com/marun/origin/pull/4

--- Additional comment from Maru Newby on 2020-03-27 05:51:38 UTC ---

(In reply to Jan Safranek from comment #0)
>
> CSI volume plugin should start waiting for CSINode *after* communication to
> API server is fully established.

An upstream PR proposes to wait for a discovery call to the apiserver to succeed: https://github.com/kubernetes/kubernetes/pull/88000

This does not address the reported issue, since the bootstrap configuration is already sufficient for a discovery request to succeed. A proper fix would likely be that when TLS bootstrapping is enabled but not yet complete (i.e. the client CSR has yet to be approved), the CSI plugin should treat permission errors resulting from API calls as recoverable rather than fatal.

--- Additional comment from Jan Safranek on 2020-03-27 15:40:43 UTC ---

This could fix the issue: https://github.com/kubernetes/kubernetes/pull/88000/
The upstream PR is merged.
*** Bug 1816732 has been marked as a duplicate of this bug. ***
*** Bug 1820508 has been marked as a duplicate of this bug. ***
Blocks RHEL node scale-up on GCP, OSP, and Bare Metal.
Verification failed. Installed an IPI-on-OSP OCP 4.4 cluster with payload image 4.4.0-0.nightly-2020-04-09-220855. One of the master nodes is in "NotReady" status; the kubelet.service log shows the following errors:

Apr 13 02:10:30 piqin-0413-wzkkx-master-0 hyperkube[1441]: E0413 02:10:30.998201 1441 csi_plugin.go:273] Failed to initialize CSINode: error updating CSINode annotation: timed out waiting for the condition; caused by: nodes "piqin-0413-wzkkx-master-0" not found
Apr 13 02:10:31 piqin-0413-wzkkx-master-0 hyperkube[1441]: E0413 02:10:31.355272 1441 csi_plugin.go:273] Failed to initialize CSINode: error updating CSINode annotation: timed out waiting for the condition; caused by: nodes "piqin-0413-wzkkx-master-0" not found
Apr 13 02:10:31 piqin-0413-wzkkx-master-0 hyperkube[1441]: E0413 02:10:31.795893 1441 csi_plugin.go:273] Failed to initialize CSINode: error updating CSINode annotation: timed out waiting for the condition; caused by: nodes "piqin-0413-wzkkx-master-0" not found
Apr 13 02:10:32 piqin-0413-wzkkx-master-0 hyperkube[1441]: E0413 02:10:32.713936 1441 csi_plugin.go:273] Failed to initialize CSINode: error updating CSINode annotation: timed out waiting for the condition; caused by: nodes "piqin-0413-wzkkx-master-0" not found

$ oc get nodes
NAME                            STATUS     ROLES    AGE    VERSION
piqin-0413-wzkkx-master-0       NotReady   master   163m   v1.17.1
piqin-0413-wzkkx-master-1       Ready      master   163m   v1.17.1
piqin-0413-wzkkx-master-2       Ready      master   163m   v1.17.1
piqin-0413-wzkkx-worker-2rcb4   Ready      worker   150m   v1.17.1
piqin-0413-wzkkx-worker-7hd7k   Ready      worker   152m   v1.17.1
piqin-0413-wzkkx-worker-b4btb   Ready      worker   150m   v1.17.1
Jan, thanks for the check. I rebuilt a cluster to verify this:

1. Disabled kube-apiserver and kube-controller-manager.
2. Restarted the kubelet service on one master node.
3. After about 20 minutes, re-enabled kube-apiserver and kube-controller-manager.
4. The kubelet service restarted successfully, and the master node status became Ready.

So I'll mark this bug as verified.

Verified version: 4.4.0-0.nightly-2020-04-09-220855
Also verified that RHEL worker scale-up works well for OCP on GCP/OSP/Bare Metal.

Verified version: openshift-ansible-4.4.0-202004040654.git.0.880a763.el7.noarch.rpm
Cluster version: 4.4.0-rc.8
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581