Description of problem: adding the [Labels] section to the vSphere cloud provider configuration makes the nodes NotReady.

Reproduction steps:
1) install an OCP cluster on vSphere (IPI or UPI)
2) edit the configMap cloud-provider-config, adding the [Labels] section at the end:
~~~
oc edit cm cloud-provider-config -n openshift-config
[...]
[Labels]
region = k8s-region
zone = k8s-zone
[...]
~~~
3) wait some minutes; a new MachineSet is deployed to apply the new cloud.conf
4) the first node reboots and becomes NotReady; kubelet logs show:
~~~
failed connecting to vcServer "xxxxxxx" with error ServerFaultCode: Cannot complete login due to an incorrect username or password.
~~~

Actual results: nodes become NotReady, the cloud provider is broken.

Expected results: nodes should come up properly with the right labels reflecting the vSphere tags.

Additional info:
- This seems to be related to an upstream Kubernetes issue [1]: if the cloud.conf file uses a secret to keep the vCenter credentials, the labels cannot be retrieved and the cloud provider fails.
- With an IPI installation the Machines are properly labelled, but the issue persists.

[1] https://github.com/kubernetes/kubernetes/issues/75175
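For context, a cloud.conf of roughly the following shape reproduces the failure mode described above. This is an illustrative sketch only: the server name, datacenter, datastore, and secret names are placeholders, not values from this report. The point is that when [Global] references a credentials secret instead of inline username/password, the tag lookup triggered by [Labels] cannot authenticate to vCenter.

~~~
# Illustrative sketch; real values are cluster-specific.
[Global]
secret-name      = "vsphere-creds"   # assumed secret holding the vCenter credentials
secret-namespace = "kube-system"
insecure-flag    = "1"

[Workspace]
server            = "vcenter.example.com"   # placeholder vCenter
datacenter        = "DC1"
default-datastore = "DS1"

[VirtualCenter "vcenter.example.com"]
datacenters = "DC1"

# The section added in the reproduction steps above:
[Labels]
region = k8s-region
zone = k8s-zone
~~~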
There has been some motion on the upstream issue recently; it looks like a fix may be in the pipeline. I suggest we wait for the moment to see if anything happens there.
https://github.com/kubernetes/kubernetes/pull/101028
We believe that this issue should be resolved as part of the out-of-tree cloud provider migration. We are currently aiming for a technical preview for vSphere in 4.10. Until then, we will try to mitigate the issue as much as possible via the proposed upstream patch; however, this won't fully resolve the issue.
We need to find someone upstream from the vSphere community to review the upstream PR. Nothing will be happening downstream with this for now.
@Denis, when you are back, could you please take a look at the upstream PR? There was some feedback from cheftako that hasn't been addressed. Perhaps if we can get those comments addressed, we can make some progress on this for the next release.
No new feedback/comments there. The upstream PR is still waiting for some meaningful reviews.
*** Bug 2009037 has been marked as a duplicate of this bug. ***
The upstream PR has merged; this will be included in the cloud provider code once a rebase to 1.24 happens. There is nothing we can do with this bug until the rebase occurs in a couple of sprints.
This is now waiting on the rebase to merge
We need to set up RBAC for the fix within KCMO (the kube-controller-manager operator).
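As a rough sketch of the kind of RBAC in question (the names, namespace, and subject below are assumptions for illustration; the actual manifest is carried by KCMO and may differ): the controller-manager needs read access to the vCenter credentials secret so the cloud provider can authenticate and resolve the tag-based zone/region labels.

~~~
# Illustrative only: not the actual KCMO manifest.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: vsphere-cloud-provider-secret-reader   # hypothetical name
  namespace: kube-system
rules:
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: vsphere-cloud-provider-secret-reader   # hypothetical name
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: vsphere-cloud-provider-secret-reader
subjects:
- kind: ServiceAccount
  name: kube-controller-manager                # assumed subject
  namespace: kube-system
~~~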
Verified on 4.11.0-0.nightly-2022-06-21-151125

Steps:

1. install an OCP cluster on vSphere
~~~
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-21-151125   True        False         19m     Cluster version is 4.11.0-0.nightly-2022-06-21-151125
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME                             STATUS   ROLES    AGE   VERSION
huliu-vs411-d5vqm-master-0       Ready    master   41m   v1.24.0+284d62a
huliu-vs411-d5vqm-master-1       Ready    master   41m   v1.24.0+284d62a
huliu-vs411-d5vqm-master-2       Ready    master   41m   v1.24.0+284d62a
huliu-vs411-d5vqm-worker-gmwww   Ready    worker   29m   v1.24.0+284d62a
huliu-vs411-d5vqm-worker-zfn9p   Ready    worker   29m   v1.24.0+284d62a
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                             PHASE     TYPE   REGION   ZONE   AGE
huliu-vs411-d5vqm-master-0       Running                         42m
huliu-vs411-d5vqm-master-1       Running                         42m
huliu-vs411-d5vqm-master-2       Running                         42m
huliu-vs411-d5vqm-worker-gmwww   Running                         39m
huliu-vs411-d5vqm-worker-zfn9p   Running                         39m
~~~

2. edit the configMap cloud-provider-config, adding the [Labels] section
~~~
liuhuali@Lius-MacBook-Pro huali-test % oc edit cm cloud-provider-config -n openshift-config
configmap/cloud-provider-config edited
...
[Labels]
region = k8s-region
zone = k8s-zone
...
~~~

3. wait until all nodes restart and get Ready again
~~~
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME                             STATUS                     ROLES    AGE    VERSION
huliu-vs411-d5vqm-master-0       Ready                      master   117m   v1.24.0+284d62a
huliu-vs411-d5vqm-master-1       Ready                      master   117m   v1.24.0+284d62a
huliu-vs411-d5vqm-master-2       Ready,SchedulingDisabled   master   117m   v1.24.0+284d62a
huliu-vs411-d5vqm-worker-gmwww   Ready                      worker   105m   v1.24.0+284d62a
huliu-vs411-d5vqm-worker-zfn9p   Ready                      worker   105m   v1.24.0+284d62a
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME                             STATUS   ROLES    AGE    VERSION
huliu-vs411-d5vqm-master-0       Ready    master   125m   v1.24.0+284d62a
huliu-vs411-d5vqm-master-1       Ready    master   125m   v1.24.0+284d62a
huliu-vs411-d5vqm-master-2       Ready    master   125m   v1.24.0+284d62a
huliu-vs411-d5vqm-worker-gmwww   Ready    worker   113m   v1.24.0+284d62a
huliu-vs411-d5vqm-worker-zfn9p   Ready    worker   113m   v1.24.0+284d62a
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                             PHASE     TYPE   REGION   ZONE   AGE
huliu-vs411-d5vqm-master-0       Running                         126m
huliu-vs411-d5vqm-master-1       Running                         126m
huliu-vs411-d5vqm-master-2       Running                         126m
huliu-vs411-d5vqm-worker-gmwww   Running                         123m
huliu-vs411-d5vqm-worker-zfn9p   Running                         123m
~~~

4. attach tags to the VMs in the vSphere UI

5. check that zone and region are attached to the machines
~~~
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                             PHASE     TYPE   REGION      ZONE      AGE
huliu-vs411-d5vqm-master-0       Running          tagregion   tagzone   4h19m
huliu-vs411-d5vqm-master-1       Running          tagregion   tagzone   4h19m
huliu-vs411-d5vqm-master-2       Running          tagregion   tagzone   4h19m
huliu-vs411-d5vqm-worker-gmwww   Running          tagregion   tagzone   4h15m
huliu-vs411-d5vqm-worker-zfn9p   Running          tagregion   tagzone   4h15m
~~~
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069