1818961 – [4.4] CSI volume plugin crashes when waiting for CSINode annotation

Summary: [4.4] CSI volume plugin crashes when waiting for CSINode annotation

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Storage
Sub Component:
Version:	4.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	4.4.0
Assignee:	Jan Safranek
QA Contact:	Qin Ping
Docs Contact:
URL:
Whiteboard:
Duplicates (2):	1816732 1820508
Depends On:	1817382
Blocks:	1816959

Reported:	2020-03-30 19:33 UTC by Ryan Phillips
Modified:	2020-05-13 22:01 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1817382
Environment:
Last Closed:	2020-05-13 22:01:40 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift origin pull 24801	0	None	closed	4.4: Bug 1818961: UPSTREAM: 89589: Wait for APIServer 'ok' forever during CSINode initialization during Kubelet init	2020-12-15 15:50:11 UTC
Red Hat Product Errata	RHBA-2020:0581	0	None	None	None	2020-05-13 22:01:44 UTC

Description Ryan Phillips 2020-03-30 19:33:39 UTC

+++ This bug was initially created as a clone of Bug #1817382 +++

During rebase to Kubernetes 1.18, kubelet crashes here:
https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/csi/csi_plugin.go#L285


The reason is that TLS bootstrap[1] can take longer than 60 seconds to establish communication with the API server, and during that time the kubelet is not able to publish CSINode.

1: https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet-tls-bootstrapping/

CSI volume plugin should start waiting for CSINode *after* communication to API server is fully established.

--- Additional comment from Jan Safranek on 2020-03-26 09:39:08 UTC ---

As part of the fix, we should revert this temporary hack: https://github.com/marun/origin/pull/4

--- Additional comment from Maru Newby on 2020-03-27 05:51:38 UTC ---

(In reply to Jan Safranek from comment #0)
> 
> CSI volume plugin should start waiting for CSINode *after* communication to
> API server is fully established.

An upstream PR proposes to wait for a discovery call to the apiserver to succeed:  

https://github.com/kubernetes/kubernetes/pull/88000

This does not address the reported issue since the bootstrap configuration 
will be sufficient for a discovery request to succeed. 

A proper fix would likely be that when TLS bootstrapping is enabled but not yet 
complete (i.e. the client CSR has yet to be approved), the CSI plugin should treat  
permission errors resulting from API calls as recoverable rather than fatal.

--- Additional comment from Jan Safranek on 2020-03-27 15:40:43 UTC ---

This could fix the issue: https://github.com/kubernetes/kubernetes/pull/88000/

Comment 1 Jan Safranek 2020-04-06 13:59:14 UTC

The upstream PR is merged.

Comment 2 Brenton Leanhardt 2020-04-06 14:41:59 UTC

*** Bug 1816732 has been marked as a duplicate of this bug. ***

Comment 3 Ryan Phillips 2020-04-07 18:48:32 UTC

*** Bug 1820508 has been marked as a duplicate of this bug. ***

Comment 4 Xiaoli Tian 2020-04-08 09:11:37 UTC

Blocks RHEL scale-up on GCP, OSP, and bare metal.

Comment 10 Qin Ping 2020-04-13 04:54:38 UTC

Verification failed.

Installed an OCP 4.4 cluster (IPI on OSP) with payload image: 4.4.0-0.nightly-2020-04-09-220855

One of the master nodes is in "NotReady" status. I checked the kubelet.service log and found the following error messages:
Apr 13 02:10:30 piqin-0413-wzkkx-master-0 hyperkube[1441]: E0413 02:10:30.998201    1441 csi_plugin.go:273] Failed to initialize CSINode: error updating CSINode annotation: timed out waiting for the condition; caused by: nodes "piqin-0413-wzkkx-master-0" not found
Apr 13 02:10:31 piqin-0413-wzkkx-master-0 hyperkube[1441]: E0413 02:10:31.355272    1441 csi_plugin.go:273] Failed to initialize CSINode: error updating CSINode annotation: timed out waiting for the condition; caused by: nodes "piqin-0413-wzkkx-master-0" not found
Apr 13 02:10:31 piqin-0413-wzkkx-master-0 hyperkube[1441]: E0413 02:10:31.795893    1441 csi_plugin.go:273] Failed to initialize CSINode: error updating CSINode annotation: timed out waiting for the condition; caused by: nodes "piqin-0413-wzkkx-master-0" not found
Apr 13 02:10:32 piqin-0413-wzkkx-master-0 hyperkube[1441]: E0413 02:10:32.713936    1441 csi_plugin.go:273] Failed to initialize CSINode: error updating CSINode annotation: timed out waiting for the condition; caused by: nodes "piqin-0413-wzkkx-master-0" not found


$ oc get nodes
NAME                            STATUS     ROLES    AGE    VERSION
piqin-0413-wzkkx-master-0       NotReady   master   163m   v1.17.1
piqin-0413-wzkkx-master-1       Ready      master   163m   v1.17.1
piqin-0413-wzkkx-master-2       Ready      master   163m   v1.17.1
piqin-0413-wzkkx-worker-2rcb4   Ready      worker   150m   v1.17.1
piqin-0413-wzkkx-worker-7hd7k   Ready      worker   152m   v1.17.1
piqin-0413-wzkkx-worker-b4btb   Ready      worker   150m   v1.17.1

Comment 18 Qin Ping 2020-04-15 06:52:45 UTC

Jan,

Thanks for checking. I rebuilt a cluster to verify this.

1. Disabled kube-apiserver and kube-controller-manager
2. Restarted the kubelet service on one master node
3. After about 20 minutes, enabled kube-apiserver and kube-controller-manager
4. The kubelet service restarted successfully, and the master node status became Ready.

So, I'll mark this bug as verified.

Verified version: 4.4.0-0.nightly-2020-04-09-220855

Comment 19 Gaoyun Pei 2020-04-17 09:36:47 UTC

Also verified that RHEL worker scale-up works well for OCP on GCP/OSP/bare metal.

Verified version:
openshift-ansible-4.4.0-202004040654.git.0.880a763.el7.noarch.rpm
Cluster version : 4.4.0-rc.8

Comment 21 errata-xmlrpc 2020-05-13 22:01:40 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581
