Bug 1817382 - CSI volume plugin crashes when waiting for CSINode annotation
Summary: CSI volume plugin crashes when waiting for CSINode annotation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Jan Safranek
QA Contact: Qin Ping
URL:
Whiteboard:
Duplicates: 1812787 1815010
Depends On:
Blocks: 1812787 1818961
 
Reported: 2020-03-26 09:21 UTC by Jan Safranek
Modified: 2020-07-13 17:24 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1818961
Environment:
Last Closed: 2020-07-13 17:23:43 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift origin pull 24798 0 None closed Bug 1817382: UPSTREAM: 89589: Wait for APIServer 'ok' forever during CSINode initialization during Kubelet init 2021-01-12 05:08:36 UTC
Red Hat Product Errata RHBA-2020:2409 0 None None None 2020-07-13 17:23:59 UTC

Description Jan Safranek 2020-03-26 09:21:42 UTC
During rebase to Kubernetes 1.18, kubelet crashes here:
https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/csi/csi_plugin.go#L285


The reason is that TLS bootstrap[1] can take longer than 60 seconds to establish communication with the API server, and during that time kubelet is not able to publish CSINode.

1: https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet-tls-bootstrapping/

The CSI volume plugin should start waiting for CSINode *after* communication with the API server is fully established.

Comment 1 Jan Safranek 2020-03-26 09:39:08 UTC
As part of the fix, we should revert this temporary hack: https://github.com/marun/origin/pull/4

Comment 2 Maru Newby 2020-03-27 05:51:38 UTC
(In reply to Jan Safranek from comment #0)
> 
> CSI volume plugin should start waiting for CSINode *after* communication to
> API server is fully established.

An upstream PR proposes to wait for a discovery call to the apiserver to succeed:  

https://github.com/kubernetes/kubernetes/pull/88000

This does not address the reported issue since the bootstrap configuration 
will be sufficient for a discovery request to succeed. 

A proper fix would likely be that when TLS bootstrapping is enabled but not yet 
complete (i.e. the client CSR has yet to be approved), the CSI plugin should treat  
permission errors resulting from API calls as recoverable rather than fatal.

Comment 3 Jan Safranek 2020-03-27 15:40:43 UTC
This could fix the issue: https://github.com/kubernetes/kubernetes/pull/88000/

Comment 4 Ryan Phillips 2020-03-31 18:14:03 UTC
*** Bug 1811221 has been marked as a duplicate of this bug. ***

Comment 5 Ryan Phillips 2020-03-31 18:14:37 UTC
*** Bug 1812787 has been marked as a duplicate of this bug. ***

Comment 8 Qin Ping 2020-04-01 07:27:14 UTC
Hi Jan,

There are still 2 failed volume unit test cases[1], could you help take a look?

[1] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/24719/pull-ci-openshift-origin-master-unit/12644

Comment 9 Jan Safranek 2020-04-01 07:36:26 UTC
> (In reply to Qin Ping from comment #8)
> There are still 2 failed volume unit test cases[1], could you help take a
> look?
> 
> [1]
> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/24719/
> pull-ci-openshift-origin-master-unit/12644

These unit test failures are not related to this bug. They should be fixed by https://github.com/openshift/origin/pull/24719/commits/2ef0fca658d07601d60454548a8d422c3eb8965c in the 4.5 rebase PR.

Comment 10 Qin Ping 2020-04-01 09:56:21 UTC
1. Deleted the kube-apiserver and kube-controller-manager images from the master nodes and killed their containers.
2. Restarted the kubelet service.
3. Repeated step 1 for about 120s; during this time kube-apiserver was not accessible.
4. After 120s, kube-apiserver and kube-controller-manager recovered and all the kubelet services restarted successfully.

So, marking this bug as verified.

Comment 11 Qin Ping 2020-04-01 09:59:23 UTC
Verification version: 4.5.0-0.nightly-2020-04-01-045338

Comment 12 Scott Dodson 2020-05-13 12:08:33 UTC
*** Bug 1815010 has been marked as a duplicate of this bug. ***

Comment 14 errata-xmlrpc 2020-07-13 17:23:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

