Bug 1817382

Summary: CSI volume plugin crashes when waiting for CSINode annotation
Product: OpenShift Container Platform
Reporter: Jan Safranek <jsafrane>
Component: Storage
Assignee: Jan Safranek <jsafrane>
Status: CLOSED ERRATA
QA Contact: Qin Ping <piqin>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.5
CC: aos-bugs, krapohl, mnewby, piqin, yanyang
Target Milestone: ---
Target Release: 4.5.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 1818961
Environment:
Last Closed: 2020-07-13 17:23:43 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1812787, 1818961

Description Jan Safranek 2020-03-26 09:21:42 UTC
During rebase to Kubernetes 1.18, kubelet crashes here:
https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/csi/csi_plugin.go#L285


The reason is that TLS bootstrap[1] can take longer than 60 seconds to establish communication with the API server, and until it completes kubelet is not able to publish CSINode.

1: https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet-tls-bootstrapping/

CSI volume plugin should start waiting for CSINode *after* communication to API server is fully established.

Comment 1 Jan Safranek 2020-03-26 09:39:08 UTC
As part of the fix, we should revert this temporary hack: https://github.com/marun/origin/pull/4

Comment 2 Maru Newby 2020-03-27 05:51:38 UTC
(In reply to Jan Safranek from comment #0)
> 
> CSI volume plugin should start waiting for CSINode *after* communication to
> API server is fully established.

An upstream PR proposes to wait for a discovery call to the apiserver to succeed:  

https://github.com/kubernetes/kubernetes/pull/88000

This does not address the reported issue since the bootstrap configuration 
will be sufficient for a discovery request to succeed. 

A proper fix would likely be that when TLS bootstrapping is enabled but not yet 
complete (i.e. the client CSR has yet to be approved), the CSI plugin should treat  
permission errors resulting from API calls as recoverable rather than fatal.

Comment 3 Jan Safranek 2020-03-27 15:40:43 UTC
This could fix the issue: https://github.com/kubernetes/kubernetes/pull/88000/

Comment 4 Ryan Phillips 2020-03-31 18:14:03 UTC
*** Bug 1811221 has been marked as a duplicate of this bug. ***

Comment 5 Ryan Phillips 2020-03-31 18:14:37 UTC
*** Bug 1812787 has been marked as a duplicate of this bug. ***

Comment 8 Qin Ping 2020-04-01 07:27:14 UTC
Hi Jan,

There are still 2 failed volume unit test cases[1], could you help take a look?

[1] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/24719/pull-ci-openshift-origin-master-unit/12644

Comment 9 Jan Safranek 2020-04-01 07:36:26 UTC
> (In reply to Qin Ping from comment #8)
> There are still 2 failed volume unit test cases[1], could you help take a
> look?
> 
> [1]
> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/24719/
> pull-ci-openshift-origin-master-unit/12644

These unit test failures are not related to this bug. They should be fixed by https://github.com/openshift/origin/pull/24719/commits/2ef0fca658d07601d60454548a8d422c3eb8965c in the 4.5 rebase PR.

Comment 10 Qin Ping 2020-04-01 09:56:21 UTC
1. Deleted the kube-apiserver and kube-controller-manager images from the master nodes and killed the containers.
2. Restarted the kubelet service.
3. Repeated step 1 for about 120s; during this time the kube-apiserver could not be accessed.
4. After 120s, kube-apiserver and kube-controller-manager recovered and all the kubelet services restarted successfully.

So, marked this bug as verified.

Comment 11 Qin Ping 2020-04-01 09:59:23 UTC
Verification version: 4.5.0-0.nightly-2020-04-01-045338

Comment 12 Scott Dodson 2020-05-13 12:08:33 UTC
*** Bug 1815010 has been marked as a duplicate of this bug. ***

Comment 14 errata-xmlrpc 2020-07-13 17:23:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409