Bug 2102945

Summary: Windows nodes' kubelet cannot run with --cloud-provider=external after migrating to CCM
Product: OpenShift Container Platform
Reporter: Huali Liu <huliu>
Component: Windows Containers
Assignee: Team Windows Containers <team-winc-bot>
Status: CLOSED DEFERRED
QA Contact: Ronnie Rasouli <rrasouli>
Severity: high
Docs Contact:
Priority: medium
Version: 4.11
CC: mankulka, mohashai
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-03-09 01:23:03 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Huali Liu 2022-07-01 05:03:55 UTC
Description of problem:
After installing a fresh cluster, adding Windows workers, and then enabling CCM, the kubelet on the Windows nodes does not run with --cloud-provider=external.
However, when a fresh cluster is installed with CCM already enabled and Windows workers are added afterwards, the kubelet on the Windows nodes runs with --cloud-provider=external as expected.

Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-06-30-005428

How reproducible:
Always

Steps to Reproduce:
1. Install a fresh cluster and add Windows workers
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-30-005428   True        False         41m     Cluster version is 4.11.0-0.nightly-2022-06-30-005428
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME                                                STATUS   ROLES    AGE   VERSION
huliu-azure71a-4wmh2-master-0                       Ready    master   71m   v1.24.0+9ddc8b1
huliu-azure71a-4wmh2-master-1                       Ready    master   71m   v1.24.0+9ddc8b1
huliu-azure71a-4wmh2-master-2                       Ready    master   71m   v1.24.0+9ddc8b1
huliu-azure71a-4wmh2-worker-southcentralus1-9d6lb   Ready    worker   57m   v1.24.0+9ddc8b1
huliu-azure71a-4wmh2-worker-southcentralus2-mqpqq   Ready    worker   54m   v1.24.0+9ddc8b1
huliu-azure71a-4wmh2-worker-southcentralus3-md7pc   Ready    worker   57m   v1.24.0+9ddc8b1
windows-bgpxw                                       Ready    worker   27m   v1.24.0-2323+01aa0f3f6052c9
windows-dz85l                                       Ready    worker   21m   v1.24.0-2323+01aa0f3f6052c9

2. Enable CCM
liuhuali@Lius-MacBook-Pro huali-test % oc edit featuregate
featuregate.config.openshift.io/cluster edited
liuhuali@Lius-MacBook-Pro huali-test % oc get deploy -n openshift-cloud-controller-manager         
NAME                             READY   UP-TO-DATE   AVAILABLE   AGE
azure-cloud-controller-manager   2/2     2            2           10m
liuhuali@Lius-MacBook-Pro huali-test % oc get pod -n openshift-cloud-controller-manager         
NAME                                              READY   STATUS    RESTARTS   AGE
azure-cloud-controller-manager-5946ff4bb9-6hc5k   1/1     Running   0          10m
azure-cloud-controller-manager-5946ff4bb9-qdsb7   1/1     Running   0          10m
azure-cloud-node-manager-62srs                    1/1     Running   0          9m44s
azure-cloud-node-manager-72gjm                    1/1     Running   0          10m
azure-cloud-node-manager-k4bdb                    1/1     Running   0          6m50s
azure-cloud-node-manager-tlk4c                    1/1     Running   0          10m
azure-cloud-node-manager-tpwhc                    1/1     Running   0          10m
azure-cloud-node-manager-vpgw6                    1/1     Running   0          10m
liuhuali@Lius-MacBook-Pro huali-test % oc get node                                                 
NAME                                                STATUS   ROLES    AGE   VERSION
huliu-azure71a-4wmh2-master-0                       Ready    master   99m   v1.24.0+9ddc8b1
huliu-azure71a-4wmh2-master-1                       Ready    master   98m   v1.24.0+9ddc8b1
huliu-azure71a-4wmh2-master-2                       Ready    master   99m   v1.24.0+9ddc8b1
huliu-azure71a-4wmh2-worker-southcentralus1-9d6lb   Ready    worker   84m   v1.24.0+9ddc8b1
huliu-azure71a-4wmh2-worker-southcentralus2-mqpqq   Ready    worker   81m   v1.24.0+9ddc8b1
huliu-azure71a-4wmh2-worker-southcentralus3-md7pc   Ready    worker   84m   v1.24.0+9ddc8b1
windows-bgpxw                                       Ready    worker   54m   v1.24.0-2323+01aa0f3f6052c9
windows-dz85l                                       Ready    worker   48m   v1.24.0-2323+01aa0f3f6052c9

3. SSH to a Windows node and check the kubelet and cloud-node-manager services
liuhuali@Lius-MacBook-Pro huali-test % oc debug node/huliu-azure71a-4wmh2-master-0              
W0701 11:40:06.697694   61245 warnings.go:70] would violate PodSecurity "restricted:v1.24": host namespaces (hostNetwork=true, hostPID=true), privileged (container "container-00" must not set securityContext.privileged=true), allowPrivilegeEscalation != false (container "container-00" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "container-00" must set securityContext.capabilities.drop=["ALL"]), restricted volume types (volume "host" uses restricted volume type "hostPath"), runAsNonRoot != true (pod or container "container-00" must set securityContext.runAsNonRoot=true), runAsUser=0 (container "container-00" must not set runAsUser=0), seccompProfile (pod or container "container-00" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
Starting pod/huliu-azure71a-4wmh2-master-0-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.0.7
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# cd ~
sh-4.4# ssh -i /tmp/openshift-qe.pem capi.128.7 powershell
Windows PowerShell 
Copyright (C) Microsoft Corporation. All rights reserved.

PS C:\Users\capi> Get-Item -path HKLM:HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\kubelet
Get-Item -path HKLM:HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\kubelet


    Hive: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services


Name                           Property                                                                                
----                           --------                                                                                
kubelet                        Type            : 16                                                                    
                               Start           : 2                                                                     
                               ErrorControl    : 1                                                                     
                               ImagePath       : c:\k\kubelet.exe --config=c:\k\kubelet.conf                           
                               --bootstrap-kubeconfig=c:\k\bootstrap-kubeconfig                                        
                                                 --kubeconfig=c:\k\kubeconfig --cert-dir=c:\var\lib\kubelet\pki\       
                               --windows-service                                                                       
                                                 --logtostderr=false --log-file=C:\var\log\kubelet\kubelet.log         
                                                 --register-with-taints=os=Windows:NoSchedule                          
                               --node-labels=node.openshift.io/os_id=Windows                                           
                                                 --container-runtime=remote                                            
                               --container-runtime-endpoint=npipe://./pipe/containerd-containerd                       
                                                 --resolv-conf= --cloud-provider=azure --v=3                           
                               --cloud-config=c:\k\cloud.conf                                                          
                               DependOnService : {containerd}                                                          
                               ObjectName      : LocalSystem                                                           
                               Description     : OpenShift managed kubelet                                             
                               FailureActions  : {88, 2, 0, 0...}                                                      


PS C:\Users\capi> Get-Service cloud-node-manager
Get-Service cloud-node-manager
Get-Service : Cannot find any service with service name 'cloud-node-manager'.
At line:1 char:1
+ Get-Service cloud-node-manager
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (cloud-node-manager:String) [Get-Service], ServiceCommandException
    + FullyQualifiedErrorId : NoServiceFoundForGivenName,Microsoft.PowerShell.Commands.GetServiceCommand
 
PS C:\Users\capi> 


Actual results:
The kubelet runs with --cloud-provider=azure, and there is no cloud-node-manager service.

Expected results:
The kubelet runs with --cloud-provider=external, and a cloud-node-manager service exists.
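
The actual/expected comparison above can be automated with a small check. The following is only a sketch: the two command lines are abbreviated from the ImagePath values in this report, and the helper names (cloud_provider, runs_external) are hypothetical, not part of any OpenShift or WMCO tooling.

```python
from typing import Optional

# Abbreviated kubelet command lines based on the ImagePath values in this report.
# MIGRATED: cluster where CCM was enabled after the Windows workers were added.
# FRESH_CCM: cluster installed with CCM before the Windows workers were added.
MIGRATED = (
    r"c:\k\kubelet.exe --config=c:\k\kubelet.conf --resolv-conf= "
    r"--cloud-provider=azure --v=3 --cloud-config=c:\k\cloud.conf"
)
FRESH_CCM = (
    r"c:\k\kubelet.exe --config=c:\k\kubelet.conf --resolv-conf= "
    r"--cloud-provider=external --v=3"
)


def cloud_provider(image_path: str) -> Optional[str]:
    """Return the --cloud-provider value from a kubelet command line, if present."""
    for token in image_path.split():
        if token.startswith("--cloud-provider="):
            return token.split("=", 1)[1]
    return None


def runs_external(image_path: str) -> bool:
    """True when the kubelet is configured for an external cloud provider."""
    return cloud_provider(image_path) == "external"


if __name__ == "__main__":
    for label, path in (("CCM enabled after install", MIGRATED),
                        ("installed with CCM", FRESH_CCM)):
        print(f"{label}: cloud-provider={cloud_provider(path)}, "
              f"external={runs_external(path)}")
```

The registry output wraps ImagePath across several lines, so the wrapped lines would need to be joined into a single string before being passed to these helpers.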

Additional info:
Checked on AWS (4.11.0-0.nightly-2022-06-30-005428), vSphere (4.11.0-0.nightly-2022-06-30-005428), and Azure (4.10.0-fc.0, 4.10.0-0.nightly-2022-06-08-150219, 4.11.0-0.nightly-2022-06-30-005428); the issue reproduces on all of them.

Also checked on AWS (4.11.0-0.nightly-2022-06-30-005428), vSphere (4.11.0-0.nightly-2022-06-30-005428), and Azure (4.11.0-0.nightly-2022-06-30-005428): after installing a fresh cluster with CCM and then adding Windows workers, the kubelet on the Windows nodes runs with --cloud-provider=external as expected, as shown in the following output.

PS C:\Users\capi> Get-Service cloud-node-manager
Get-Service cloud-node-manager

Status   Name               DisplayName                           
------   ----               -----------                           
Running  cloud-node-manager cloud-node-manager                    


PS C:\Users\capi> Get-Item -path HKLM:HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\kubelet
Get-Item -path HKLM:HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\kubelet


    Hive: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services


Name                           Property                                                                                
----                           --------                                                                                
kubelet                        Type            : 16                                                                    
                               Start           : 2                                                                     
                               ErrorControl    : 1                                                                     
                               ImagePath       : c:\k\kubelet.exe --config=c:\k\kubelet.conf                           
                               --bootstrap-kubeconfig=c:\k\bootstrap-kubeconfig                                        
                                                 --kubeconfig=c:\k\kubeconfig --cert-dir=c:\var\lib\kubelet\pki\       
                               --windows-service                                                                       
                                                 --logtostderr=false --log-file=C:\var\log\kubelet\kubelet.log         
                                                 --register-with-taints=os=Windows:NoSchedule                          
                               --node-labels=node.openshift.io/os_id=Windows                                           
                                                 --container-runtime=remote                                            
                               --container-runtime-endpoint=npipe://./pipe/containerd-containerd                       
                                                 --resolv-conf= --cloud-provider=external --v=3                        
                               DependOnService : {containerd}                                                          
                               ObjectName      : LocalSystem                                                           
                               Description     : OpenShift managed kubelet                                             
                               FailureActions  : {88, 2, 0, 0...}                                                      


PS C:\Users\capi> 


Must-gather: 
Azure (fresh cluster, Windows workers added, then CCM enabled) - https://drive.google.com/file/d/1N2InQFe_mDIqayfUCqMyP-U8OE-2wss2/view?usp=sharing
Azure (fresh cluster installed with CCM, then Windows workers added) - https://drive.google.com/file/d/1iHR9LzQuCmwtRxsVz6oBMGCCJrJFwTYC/view?usp=sharing

Comment 1 Joel Speed 2022-07-04 15:11:16 UTC
After discussion with Mikhail, this seems to be a limitation of WMCO: it applies the configuration on creation but does not update it afterwards. Could the WMCO team please confirm that this is a limitation and let us know if they need help resolving it? We have an interest in seeing this working by the end of 4.12.
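
The difference between applying configuration only on creation and reconciling it later can be illustrated with a toy model. This is not WMCO code: the Node class, desired_cloud_provider, configure_on_create, and reconcile are all hypothetical names, and the rule that the in-tree provider name equals the platform name is an assumption based on the azure value seen in this report.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    kubelet_cloud_provider: str  # value currently configured on the node


def desired_cloud_provider(ccm_enabled: bool, platform: str) -> str:
    """Assumed rule: 'external' once CCM is enabled, else the in-tree platform name."""
    return "external" if ccm_enabled else platform


def configure_on_create(name: str, ccm_enabled: bool, platform: str) -> Node:
    """Create-time configuration: the value is computed once and never revisited."""
    return Node(name, desired_cloud_provider(ccm_enabled, platform))


def reconcile(node: Node, ccm_enabled: bool, platform: str) -> bool:
    """Bring an existing node in line with the current cluster state.

    Returns True when the node was changed (a real operator would also have to
    restart the kubelet service), False when it was already up to date.
    """
    desired = desired_cloud_provider(ccm_enabled, platform)
    if node.kubelet_cloud_provider == desired:
        return False
    node.kubelet_cloud_provider = desired
    return True


if __name__ == "__main__":
    # The node is created before CCM is enabled, as in the reproduction steps.
    node = configure_on_create("windows-bgpxw", ccm_enabled=False, platform="azure")
    print("after create:", node.kubelet_cloud_provider)
    # Without a reconcile step the value stays stale after CCM is enabled.
    changed = reconcile(node, ccm_enabled=True, platform="azure")
    print("after reconcile:", node.kubelet_cloud_provider, "changed:", changed)
```

With only configure_on_create, a node created before CCM is enabled keeps the in-tree value, which matches the behavior described in this bug; the reconcile step models the missing update path.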

Comment 2 Mohammad Saif Shaikh 2022-09-08 20:12:18 UTC
The team discussed this, and it is indeed a limitation of WMCO. This work has not yet been prioritized.

Comment 3 Shiftzilla 2023-03-09 01:23:03 UTC
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira.

https://issues.redhat.com/browse/OCPBUGS-9356