Bug 1818961

Summary: [4.4] CSI volume plugin crashes when waiting for CSINode annotation
Product: OpenShift Container Platform
Reporter: Ryan Phillips <rphillips>
Component: Storage
Assignee: Jan Safranek <jsafrane>
Status: CLOSED ERRATA
QA Contact: Qin Ping <piqin>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.5
CC: aos-bugs, gpei, jsafrane, lbednar, lxia, mnewby, piqin, scuppett, tnozicka, xtian
Target Milestone: ---
Keywords: TestBlocker
Target Release: 4.4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1817382
Environment:
Last Closed: 2020-05-13 22:01:40 UTC
Bug Depends On: 1817382    
Bug Blocks: 1816959    

Description Ryan Phillips 2020-03-30 19:33:39 UTC
+++ This bug was initially created as a clone of Bug #1817382 +++

During rebase to Kubernetes 1.18, kubelet crashes here:
https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/csi/csi_plugin.go#L285


The reason is that TLS bootstrap [1] can take longer than 60 seconds to establish communication with the API server, so the kubelet is not able to publish its CSINode object in time.

1: https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet-tls-bootstrapping/

The CSI volume plugin should start waiting for CSINode *after* communication with the API server is fully established.

--- Additional comment from Jan Safranek on 2020-03-26 09:39:08 UTC ---

As part of the fix, we should revert this temporary hack: https://github.com/marun/origin/pull/4

--- Additional comment from Maru Newby on 2020-03-27 05:51:38 UTC ---

(In reply to Jan Safranek from comment #0)
> 
> CSI volume plugin should start waiting for CSINode *after* communication to
> API server is fully established.

An upstream PR proposes to wait for a discovery call to the apiserver to succeed:  

https://github.com/kubernetes/kubernetes/pull/88000

This does not address the reported issue since the bootstrap configuration 
will be sufficient for a discovery request to succeed. 

A proper fix would likely be that when TLS bootstrapping is enabled but not yet 
complete (i.e. the client CSR has yet to be approved), the CSI plugin should treat  
permission errors resulting from API calls as recoverable rather than fatal.

--- Additional comment from Jan Safranek on 2020-03-27 15:40:43 UTC ---

This could fix the issue: https://github.com/kubernetes/kubernetes/pull/88000/

Comment 1 Jan Safranek 2020-04-06 13:59:14 UTC
The upstream PR is merged.

Comment 2 Brenton Leanhardt 2020-04-06 14:41:59 UTC
*** Bug 1816732 has been marked as a duplicate of this bug. ***

Comment 3 Ryan Phillips 2020-04-07 18:48:32 UTC
*** Bug 1820508 has been marked as a duplicate of this bug. ***

Comment 4 Xiaoli Tian 2020-04-08 09:11:37 UTC
This blocks scaling up RHEL workers on GCP, OSP, and bare metal.

Comment 10 Qin Ping 2020-04-13 04:54:38 UTC
Verification failed.

Installed an IPI-on-OSP OCP 4.4 cluster with payload image: 4.4.0-0.nightly-2020-04-09-220855

One of the master nodes is in "NotReady" status. I checked the kubelet.service log and got the following error message:
Apr 13 02:10:30 piqin-0413-wzkkx-master-0 hyperkube[1441]: E0413 02:10:30.998201    1441 csi_plugin.go:273] Failed to initialize CSINode: error updating CSINode annotation: timed out waiting for the condition; caused by: nodes "piqin-0413-wzkkx-master-0" not found
Apr 13 02:10:31 piqin-0413-wzkkx-master-0 hyperkube[1441]: E0413 02:10:31.355272    1441 csi_plugin.go:273] Failed to initialize CSINode: error updating CSINode annotation: timed out waiting for the condition; caused by: nodes "piqin-0413-wzkkx-master-0" not found
Apr 13 02:10:31 piqin-0413-wzkkx-master-0 hyperkube[1441]: E0413 02:10:31.795893    1441 csi_plugin.go:273] Failed to initialize CSINode: error updating CSINode annotation: timed out waiting for the condition; caused by: nodes "piqin-0413-wzkkx-master-0" not found
Apr 13 02:10:32 piqin-0413-wzkkx-master-0 hyperkube[1441]: E0413 02:10:32.713936    1441 csi_plugin.go:273] Failed to initialize CSINode: error updating CSINode annotation: timed out waiting for the condition; caused by: nodes "piqin-0413-wzkkx-master-0" not found


$ oc get nodes
NAME                            STATUS     ROLES    AGE    VERSION
piqin-0413-wzkkx-master-0       NotReady   master   163m   v1.17.1
piqin-0413-wzkkx-master-1       Ready      master   163m   v1.17.1
piqin-0413-wzkkx-master-2       Ready      master   163m   v1.17.1
piqin-0413-wzkkx-worker-2rcb4   Ready      worker   150m   v1.17.1
piqin-0413-wzkkx-worker-7hd7k   Ready      worker   152m   v1.17.1
piqin-0413-wzkkx-worker-b4btb   Ready      worker   150m   v1.17.1

Comment 18 Qin Ping 2020-04-15 06:52:45 UTC
Jan,

Thanks for your check. I rebuilt a cluster to verify this:

1. Disabled kube-apiserver and kube-controller-manager
2. Restarted the kubelet service on one master node
3. After about 20 minutes, enabled kube-apiserver and kube-controller-manager
4. The kubelet service restarted successfully, and the master node became Ready.

So, I'll mark this bug as verified.

Verified version: 4.4.0-0.nightly-2020-04-09-220855

Comment 19 Gaoyun Pei 2020-04-17 09:36:47 UTC
Also verified that RHEL worker scale-up works well for OCP on GCP/OSP/bare metal.

Verified version:
openshift-ansible-4.4.0-202004040654.git.0.880a763.el7.noarch.rpm
Cluster version : 4.4.0-rc.8

Comment 21 errata-xmlrpc 2020-05-13 22:01:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581