During the rebase to Kubernetes 1.18, kubelet crashes here:
https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/csi/csi_plugin.go#L285

The reason is that TLS bootstrap[1] can take longer than 60 seconds to establish communication with the API server, so kubelet is not able to publish CSINode.

1: https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet-tls-bootstrapping/

The CSI volume plugin should start waiting for CSINode *after* communication to the API server is fully established.
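A minimal sketch of that idea (not the actual kubelet code): delay CSINode publishing until a lightweight API call succeeds, instead of giving up after a fixed 60-second wait. The package name, function name, and the use of a discovery call as the readiness probe are assumptions for illustration only.

```go
package csiwait

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForAPIServer blocks until a lightweight request to the API server
// succeeds, so that later CSINode publishing does not race with TLS bootstrap.
func waitForAPIServer(client kubernetes.Interface, timeout time.Duration) error {
	err := wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		// Any cheap call works as a probe; ServerVersion hits /version.
		if _, err := client.Discovery().ServerVersion(); err != nil {
			return false, nil // not reachable yet, keep polling
		}
		return true, nil
	})
	if err != nil {
		return fmt.Errorf("API server not reachable: %v", err)
	}
	return nil
}
```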
As part of the fix, we should revert this temporary hack: https://github.com/marun/origin/pull/4
(In reply to Jan Safranek from comment #0)
> CSI volume plugin should start waiting for CSINode *after* communication to
> API server is fully established.

An upstream PR proposes to wait for a discovery call to the apiserver to succeed: https://github.com/kubernetes/kubernetes/pull/88000

This does not address the reported issue, since the bootstrap configuration will be sufficient for a discovery request to succeed. A proper fix would likely be that when TLS bootstrapping is enabled but not yet complete (i.e. the client CSR has yet to be approved), the CSI plugin should treat permission errors resulting from API calls as recoverable rather than fatal.
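A hedged sketch of that approach (illustrative only, not the merged upstream change): while the kubelet's client certificate has not yet been issued, Unauthorized/Forbidden errors from the CSINode calls would be retried instead of aborting plugin registration. The function names and the bootstrapInProgress hook are placeholder assumptions.

```go
package csiwait

import (
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/wait"
)

// publishCSINodeWithRetry retries the CSINode update while TLS bootstrap may
// still be pending, treating permission errors as transient in that window.
func publishCSINodeWithRetry(publish func() error, bootstrapInProgress func() bool, timeout time.Duration) error {
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		err := publish()
		if err == nil {
			return true, nil
		}
		// While the client CSR is unapproved, the bootstrap credentials may
		// lack RBAC permissions; retry instead of failing CSI plugin startup.
		if bootstrapInProgress() && (apierrors.IsUnauthorized(err) || apierrors.IsForbidden(err)) {
			return false, nil
		}
		return false, err // any other error is fatal
	})
}
```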
This could fix the issue: https://github.com/kubernetes/kubernetes/pull/88000/
*** Bug 1811221 has been marked as a duplicate of this bug. ***
*** Bug 1812787 has been marked as a duplicate of this bug. ***
Hi Jan,

There are still 2 failed volume unit test cases[1], could you help take a look?

[1] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/24719/pull-ci-openshift-origin-master-unit/12644
(In reply to Qin Ping from comment #8)
> There are still 2 failed volume unit test cases[1], could you help take a
> look?
>
> [1] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/24719/pull-ci-openshift-origin-master-unit/12644

These unit test failures are not related to this bug. They should be fixed by https://github.com/openshift/origin/pull/24719/commits/2ef0fca658d07601d60454548a8d422c3eb8965c in the 4.5 rebase PR.
1. Deleted the images of kube-apiserver and kube-controller-manager from the master nodes, and killed the containers.
2. Restarted the kubelet service.
3. Repeated step 1 for about 120s; during this time the kube-apiserver could not be accessed.
4. After 120s, kube-apiserver and kube-controller-manager recovered and all the kubelet services restarted successfully.

So, marking this bug as verified.
Verification version: 4.5.0-0.nightly-2020-04-01-045338
*** Bug 1815010 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409