Bug 1902307
| Summary: | [vSphere] cloud labels management via cloud provider makes nodes not ready | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Pietro Bertera <pbertera> |
| Component: | Cloud Compute | Assignee: | dmoiseev |
| Cloud Compute sub component: | Cloud Controller Manager | QA Contact: | Huali Liu <huliu> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | medium | CC: | aos-bugs, dmoiseev, mfedosin, mimccune, rkant |
| Version: | 4.6 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.11.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: |
Cause:
Determining a node's zone and region labels requires contacting vCenter to obtain the label values. Because the kubelet tries to do this very early in its initialization, it cannot yet read the vCenter credentials from the secret.
Consequence:
If the vCenter credentials are stored in a secret and region/zone parameters are present in cloud.conf, the kubelet cannot start, because it lacks the vCenter credentials needed to obtain the zone/region label values.
Fix:
For the vSphere platform with secret-based credentials, population of the region and zone labels was moved out of the kubelet initialization sequence into the kube-controller-manager part of the cloud provider code.
Result:
Region and zone labels now work properly and no longer cause the kubelet to hang when the credentials are stored in a secret.
| Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-08-10 10:35:34 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
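For context on the cause described in the Doc Text above, here is a minimal sketch of the configuration shape that triggers the problem. It is an illustration, not taken from this cluster: the configMap data key and the secret-name/secret-namespace values are assumptions about a typical secret-based in-tree vSphere provider setup.

~~~
# Sketch only: inspect the in-tree vSphere provider config (data key "config" is assumed).
oc get cm cloud-provider-config -n openshift-config -o jsonpath='{.data.config}'

# Illustrative output shape (placeholder values):
# [Global]
# secret-name      = "vsphere-creds"
# secret-namespace = "kube-system"
# ...
# [Labels]
# region = k8s-region
# zone   = k8s-zone
~~~

With this combination, the kubelet would need the secret at startup to resolve the zone/region labels, which is exactly what the fix avoids by moving label population into the kube-controller-manager.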
There has been some motion on the upstream issue recently; it looks like a fix may be in the pipeline. I suggest we wait for the moment to see if anything happens there.

We believe that this issue should be resolved as part of the out-of-tree cloud provider migration. We are currently aiming for a technical preview for vSphere in 4.10. Until then, we will try to mitigate the issue as much as possible via the proposed upstream patch; this won't fully resolve the issue, however.

We need to find someone upstream from the vSphere community to review the upstream PR. Nothing will be happening downstream with this for now.

@Denis, when you are back, could you please take a look at the upstream PR? There was some feedback from cheftako that hasn't been addressed. Perhaps if we can get those comments addressed we can make some progress on this for the next release.

No new feedback/comments there. The upstream PR is still waiting for some meaningful reviews.

*** Bug 2009037 has been marked as a duplicate of this bug. ***

The upstream PR has merged; this will be included in the cloud provider code once a rebase to 1.24 happens. Nothing we can do with this bug until the rebase occurs in a couple of sprints.

This is now waiting on the rebase to merge.

We need to set up RBAC for the fix within KCMO.

Verified on 4.11.0-0.nightly-2022-06-21-151125
Steps:
1. install an OCP cluster on vSphere
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-06-21-151125 True False 19m Cluster version is 4.11.0-0.nightly-2022-06-21-151125
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME STATUS ROLES AGE VERSION
huliu-vs411-d5vqm-master-0 Ready master 41m v1.24.0+284d62a
huliu-vs411-d5vqm-master-1 Ready master 41m v1.24.0+284d62a
huliu-vs411-d5vqm-master-2 Ready master 41m v1.24.0+284d62a
huliu-vs411-d5vqm-worker-gmwww Ready worker 29m v1.24.0+284d62a
huliu-vs411-d5vqm-worker-zfn9p Ready worker 29m v1.24.0+284d62a
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME PHASE TYPE REGION ZONE AGE
huliu-vs411-d5vqm-master-0 Running 42m
huliu-vs411-d5vqm-master-1 Running 42m
huliu-vs411-d5vqm-master-2 Running 42m
huliu-vs411-d5vqm-worker-gmwww Running 39m
huliu-vs411-d5vqm-worker-zfn9p Running 39m
2. edit the configMap cloud-provider-config, adding the [Labels] section
liuhuali@Lius-MacBook-Pro huali-test % oc edit cm cloud-provider-config -n openshift-config
configmap/cloud-provider-config edited
...
[Labels]
region = k8s-region
zone = k8s-zone
...
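Editing cloud-provider-config causes the Machine Config Operator to roll the updated cloud.conf out to the nodes, which is why they reboot one at a time (the Ready,SchedulingDisabled state in the next step). As a hedged aside to the transcript, the rollout can be watched like this:

~~~
# Not part of the recorded transcript; a sketch for watching the rollout.
oc get machineconfigpool     # UPDATED/UPDATING columns show per-pool progress
oc get nodes -w              # nodes cycle through SchedulingDisabled while rebooting
~~~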
3. wait for all nodes to restart and become Ready again.
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME STATUS ROLES AGE VERSION
huliu-vs411-d5vqm-master-0 Ready master 117m v1.24.0+284d62a
huliu-vs411-d5vqm-master-1 Ready master 117m v1.24.0+284d62a
huliu-vs411-d5vqm-master-2 Ready,SchedulingDisabled master 117m v1.24.0+284d62a
huliu-vs411-d5vqm-worker-gmwww Ready worker 105m v1.24.0+284d62a
huliu-vs411-d5vqm-worker-zfn9p Ready worker 105m v1.24.0+284d62a
liuhuali@Lius-MacBook-Pro huali-test %
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME STATUS ROLES AGE VERSION
huliu-vs411-d5vqm-master-0 Ready master 125m v1.24.0+284d62a
huliu-vs411-d5vqm-master-1 Ready master 125m v1.24.0+284d62a
huliu-vs411-d5vqm-master-2 Ready master 125m v1.24.0+284d62a
huliu-vs411-d5vqm-worker-gmwww Ready worker 113m v1.24.0+284d62a
huliu-vs411-d5vqm-worker-zfn9p Ready worker 113m v1.24.0+284d62a
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME PHASE TYPE REGION ZONE AGE
huliu-vs411-d5vqm-master-0 Running 126m
huliu-vs411-d5vqm-master-1 Running 126m
huliu-vs411-d5vqm-master-2 Running 126m
huliu-vs411-d5vqm-worker-gmwww Running 123m
huliu-vs411-d5vqm-worker-zfn9p Running 123m
4. attach tags to the VMs in the vSphere UI
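The tags can also be created and attached from the command line with govc instead of the vSphere UI. The category names mirror the [Labels] values above, while the tag names and inventory paths below are placeholders, not taken from this environment:

~~~
# Hypothetical govc equivalent of step 4 (assumes GOVC_URL/GOVC_USERNAME/GOVC_PASSWORD are set).
govc tags.category.create k8s-region
govc tags.category.create k8s-zone
govc tags.create -c k8s-region tagregion
govc tags.create -c k8s-zone tagzone
# Attach the region tag to a datacenter/cluster and the zone tag to the VMs (paths are examples):
govc tags.attach tagregion /DC0
govc tags.attach tagzone /DC0/vm/huliu-vs411-d5vqm-worker-gmwww
~~~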
5. check that the machines' zone and region are populated
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME PHASE TYPE REGION ZONE AGE
huliu-vs411-d5vqm-master-0 Running tagregion tagzone 4h19m
huliu-vs411-d5vqm-master-1 Running tagregion tagzone 4h19m
huliu-vs411-d5vqm-master-2 Running tagregion tagzone 4h19m
huliu-vs411-d5vqm-worker-gmwww Running tagregion tagzone 4h15m
huliu-vs411-d5vqm-worker-zfn9p Running tagregion tagzone 4h15m
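The transcript checks the Machine objects; since the original symptom was NotReady nodes, it is also worth confirming that the topology labels landed on the nodes themselves. This check is not part of the recorded output, just a sketch using the standard topology label keys:

~~~
# Sketch: confirm the region/zone labels were applied to the nodes.
oc get nodes -L topology.kubernetes.io/region -L topology.kubernetes.io/zone
~~~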
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069
Description of problem:

Adding the [Labels] section to the vSphere cloud provider configuration makes the nodes NotReady.

Reproduction steps:

1) install an OCP cluster on vSphere (IPI or UPI)

2) edit the configMap cloud-provider-config, adding the [Labels] section at the end

~~~
oc edit cm cloud-provider-config -n openshift-config
[...]
[Labels]
region = k8s-region
zone = k8s-zone
[...]
~~~

3) wait some minutes; a new MachineConfig is deployed to apply the new cloud.conf

4) the first node reboots and becomes NotReady; kubelet logs show:

~~~
failed connecting to vcServer "xxxxxxx" with error ServerFaultCode: Cannot complete login due to an incorrect username or password.
~~~

Actual results: Nodes become NotReady, the cloud provider is broken.

Expected results: Nodes should come up properly with the right labels reflecting the vSphere tags.

Additional info:
- This seems to be related to an upstream Kubernetes issue [1]: if the cloud.conf file uses a secret to keep the vCenter credentials, the labels cannot be retrieved and the cloud provider fails.
- Using an IPI installation, the Machines are properly labelled but the issue persists.

[1] https://github.com/kubernetes/kubernetes/issues/75175
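For anyone reproducing this, the credentials path involved looks roughly like the following. The secret name, namespace, and key format are assumptions based on a typical secret-based in-tree provider setup and may differ per cluster:

~~~
# Hedged sketch: the secret that cloud.conf's secret-name/secret-namespace would point at.
oc get secret vsphere-creds -n kube-system
# Expected data keys for the in-tree provider: <vcenter-host>.username and <vcenter-host>.password.
# The kubelet cannot read this secret during early initialization, hence the login failure above.
~~~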