Bug 1902307

Summary: [vSphere] cloud labels management via cloud provider makes nodes not ready
Product: OpenShift Container Platform Reporter: Pietro Bertera <pbertera>
Component: Cloud ComputeAssignee: dmoiseev
Cloud Compute sub component: Cloud Controller Manager QA Contact: Huali Liu <huliu>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: aos-bugs, dmoiseev, mfedosin, mimccune, rkant
Version: 4.6   
Target Milestone: ---   
Target Release: 4.11.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: Populating a node's zone and region labels requires contacting vCenter to read the label values. Because the kubelet attempts this at a very early initialisation step, it cannot yet read the vCenter credentials from the secret. Consequence: If the vCenter credentials are stored in a secret and region/zone parameters are present in cloud.conf, the kubelet cannot start, since it lacks the vCenter credentials needed to obtain the zone/region label values. Fix: On the vSphere platform with secret-based credentials, region and zone label population was moved out of the kubelet initialisation sequence into the kube-controller-manager part of the cloud provider code. Result: Region and zone labels now work properly and no longer cause the kubelet to hang when the credentials are stored in a secret.
Last Closed: 2022-08-10 10:35:34 UTC
Type: Bug

Description Pietro Bertera 2020-11-27 16:37:43 UTC
Description of problem:

Adding the [Labels] section to the vSphere cloud provider configuration makes the nodes NotReady.

Reproduction steps:

1) install an OCP cluster on vSphere (IPI or UPI)

2) edit the configMap cloud-provider-config, adding the [Labels] section at the end:

~~~
oc edit cm cloud-provider-config -n openshift-config
[...]
    [Labels]
    region = k8s-region
    zone = k8s-zone
[...]
~~~
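
For reference, a hedged sketch of the resulting cloud.conf in this scenario (server, datacenter, datastore, and folder values are placeholders); the key point is that the vCenter credentials live in a secret rather than in the file itself:

~~~
[Global]
secret-name      = "vsphere-creds"
secret-namespace = "kube-system"
insecure-flag    = "1"

[Workspace]
server            = "vcenter.example.com"
datacenter        = "MyDC"
default-datastore = "MyDatastore"
folder            = "/MyDC/vm/mycluster"

[VirtualCenter "vcenter.example.com"]
datacenters = "MyDC"

[Labels]
region = k8s-region
zone   = k8s-zone
~~~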

3) wait a few minutes; a new MachineConfig is rolled out to apply the updated cloud.conf
4) the first node reboots and becomes NotReady; the kubelet logs show:

~~~
failed connecting to vcServer "xxxxxxx" with error ServerFaultCode: Cannot complete login due to an incorrect username or password.
~~~
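
(The message above comes from the kubelet journal on the affected node; it can typically be retrieved with `oc adm node-logs <node-name> -u kubelet`, or with `journalctl -u kubelet` on the host if the node is unreachable through the API.)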

Actual results:

Nodes become NotReady and the cloud provider is broken.

Expected results:

Nodes should come up properly with the right labels reflecting the vSphere tags.

Additional info:

- This seems to be related to an upstream Kubernetes issue [1]: if cloud.conf uses a secret to store the vCenter credentials, the labels cannot be retrieved and the cloud provider fails.

- With an IPI installation the Machines are properly labelled, but the issue persists.

[1] https://github.com/kubernetes/kubernetes/issues/75175
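
For context on the mechanism, here is a minimal sketch written against the real k8s.io/cloud-provider interfaces; the package and helper function are illustrative, not actual kubelet code. During early node initialisation the kubelet asks the cloud provider for the node's zone and region via the Zones interface, and for the in-tree vSphere provider answering that call needs a live vCenter session, which is exactly what the kubelet cannot establish when the credentials sit in a secret:

~~~
package sketch

import (
	"context"
	"fmt"

	cloudprovider "k8s.io/cloud-provider"
)

// topologyLabels mimics the kubelet-side label population that the eventual
// fix moves into the kube-controller-manager, which *can* read the secret.
func topologyLabels(cloud cloudprovider.Interface) (map[string]string, error) {
	zones, ok := cloud.Zones()
	if !ok {
		// Provider does not implement the Zones interface; nothing to label.
		return nil, fmt.Errorf("cloud provider does not support zones")
	}
	// For vSphere this call logs in to vCenter to read the tags named by the
	// [Labels] section; with credentials in a secret it fails here with
	// "Cannot complete login due to an incorrect username or password."
	z, err := zones.GetZone(context.TODO())
	if err != nil {
		return nil, err
	}
	return map[string]string{
		"topology.kubernetes.io/region": z.Region,
		"topology.kubernetes.io/zone":   z.FailureDomain,
	}, nil
}
~~~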

Comment 4 Joel Speed 2021-02-08 10:16:56 UTC
There has been some motion on the upstream issue recently; it looks like a fix may be in the pipeline. I suggest we wait for the moment to see if anything happens there.

Comment 7 Joel Speed 2021-05-19 14:21:09 UTC
We believe that this issue should be resolved as part of the out of tree cloud provider migration.
We are currently aiming for a technical preview for vSphere in 4.10.

Until then, we will try to mitigate the issue as much as possible via the proposed upstream patch; this won't fully resolve the issue, however.

Comment 8 Joel Speed 2021-06-09 10:52:15 UTC
We need to find someone upstream from the vSphere community to review the upstream PR. Nothing will be happening downstream with this for now.

Comment 9 Joel Speed 2021-08-19 10:21:14 UTC
@Denis, when you are back, could you please take a look at the upstream PR? There was some feedback from cheftako that hasn't been addressed. Perhaps if we can get those comments addressed we can make some progress on this for the next release.

Comment 10 dmoiseev 2021-08-24 10:09:58 UTC
No new feedback/comments there. The upstream PR is still waiting for some meaningful reviews.

Comment 11 Joel Speed 2021-10-12 11:47:51 UTC
*** Bug 2009037 has been marked as a duplicate of this bug. ***

Comment 15 Joel Speed 2022-03-09 12:52:00 UTC
The upstream PR has merged; this will be included in the cloud provider code once a rebase to 1.24 happens. Nothing we can do with this bug until the rebase occurs in a couple of sprints.

Comment 16 Joel Speed 2022-05-24 09:57:40 UTC
This is now waiting on the rebase to merge.

Comment 17 Joel Speed 2022-05-26 13:55:30 UTC
We need to set up RBAC for the fix within KCMO (the kube-controller-manager-operator).
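
A hedged sketch of the kind of RBAC this implies (the Role and RoleBinding names are illustrative, not the actual manifests KCMO ships): the cloud provider code now runs inside kube-controller-manager, so KCM needs read access to the vCenter credentials secret that cloud.conf points at.

~~~
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: vsphere-cloud-provider-secret-reader  # illustrative name
  namespace: kube-system
rules:
- apiGroups: [""]
  resources: ["secrets"]
  resourceNames: ["vsphere-creds"]            # the secret cloud.conf references
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: vsphere-cloud-provider-secret-reader  # illustrative name
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: vsphere-cloud-provider-secret-reader
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: system:kube-controller-manager
~~~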

Comment 20 Huali Liu 2022-06-22 05:53:50 UTC
Verified on 4.11.0-0.nightly-2022-06-21-151125

Steps:
1. install an OCP cluster on vSphere
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-21-151125   True        False         19m     Cluster version is 4.11.0-0.nightly-2022-06-21-151125
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME                             STATUS   ROLES    AGE   VERSION
huliu-vs411-d5vqm-master-0       Ready    master   41m   v1.24.0+284d62a
huliu-vs411-d5vqm-master-1       Ready    master   41m   v1.24.0+284d62a
huliu-vs411-d5vqm-master-2       Ready    master   41m   v1.24.0+284d62a
huliu-vs411-d5vqm-worker-gmwww   Ready    worker   29m   v1.24.0+284d62a
huliu-vs411-d5vqm-worker-zfn9p   Ready    worker   29m   v1.24.0+284d62a
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                             PHASE     TYPE   REGION   ZONE   AGE
huliu-vs411-d5vqm-master-0       Running                          42m
huliu-vs411-d5vqm-master-1       Running                          42m
huliu-vs411-d5vqm-master-2       Running                          42m
huliu-vs411-d5vqm-worker-gmwww   Running                          39m
huliu-vs411-d5vqm-worker-zfn9p   Running                          39m

2. edit the configMap cloud-provider-config, adding the [Labels] section
liuhuali@Lius-MacBook-Pro huali-test % oc edit cm cloud-provider-config -n openshift-config
configmap/cloud-provider-config edited
...
    [Labels]
    region = k8s-region
    zone = k8s-zone 
...

3. wait for all nodes to restart and become Ready again.
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME                             STATUS                     ROLES    AGE    VERSION
huliu-vs411-d5vqm-master-0       Ready                      master   117m   v1.24.0+284d62a
huliu-vs411-d5vqm-master-1       Ready                      master   117m   v1.24.0+284d62a
huliu-vs411-d5vqm-master-2       Ready,SchedulingDisabled   master   117m   v1.24.0+284d62a
huliu-vs411-d5vqm-worker-gmwww   Ready                      worker   105m   v1.24.0+284d62a
huliu-vs411-d5vqm-worker-zfn9p   Ready                      worker   105m   v1.24.0+284d62a
liuhuali@Lius-MacBook-Pro huali-test % 
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME                             STATUS   ROLES    AGE    VERSION
huliu-vs411-d5vqm-master-0       Ready    master   125m   v1.24.0+284d62a
huliu-vs411-d5vqm-master-1       Ready    master   125m   v1.24.0+284d62a
huliu-vs411-d5vqm-master-2       Ready    master   125m   v1.24.0+284d62a
huliu-vs411-d5vqm-worker-gmwww   Ready    worker   113m   v1.24.0+284d62a
huliu-vs411-d5vqm-worker-zfn9p   Ready    worker   113m   v1.24.0+284d62a
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                             PHASE     TYPE   REGION   ZONE   AGE
huliu-vs411-d5vqm-master-0       Running                          126m
huliu-vs411-d5vqm-master-1       Running                          126m
huliu-vs411-d5vqm-master-2       Running                          126m
huliu-vs411-d5vqm-worker-gmwww   Running                          123m
huliu-vs411-d5vqm-worker-zfn9p   Running                          123m

4. attach tags to the VMs in the vSphere UI
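
The tagging can also be scripted with govc instead of the vSphere UI (a hedged sketch: the category names must match the [Labels] section of cloud.conf, and the inventory paths are placeholders):

~~~
# Create the categories named in cloud.conf, then the tags, then attach them;
# the region tag is typically attached to the datacenter and the zone tag to
# the cluster or host.
govc tags.category.create k8s-region
govc tags.category.create k8s-zone
govc tags.create -c k8s-region tagregion
govc tags.create -c k8s-zone tagzone
govc tags.attach -c k8s-region tagregion /MyDC
govc tags.attach -c k8s-zone tagzone /MyDC/host/MyCluster
~~~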

5. check that the machines' zone and region get populated
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                             PHASE     TYPE   REGION      ZONE      AGE
huliu-vs411-d5vqm-master-0       Running          tagregion   tagzone   4h19m
huliu-vs411-d5vqm-master-1       Running          tagregion   tagzone   4h19m
huliu-vs411-d5vqm-master-2       Running          tagregion   tagzone   4h19m
huliu-vs411-d5vqm-worker-gmwww   Running          tagregion   tagzone   4h15m
huliu-vs411-d5vqm-worker-zfn9p   Running          tagregion   tagzone   4h15m
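
As a further check (a suggestion beyond the recorded steps), the labels the cloud provider applied can be read directly off the nodes; depending on the Kubernetes version, the beta failure-domain labels may appear instead of the topology ones:

~~~
oc get nodes -L topology.kubernetes.io/region -L topology.kubernetes.io/zone
~~~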

Comment 22 errata-xmlrpc 2022-08-10 10:35:34 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069