Bug 2091642
| Summary: | [WINC] Windows machineset can't scale up when publicIP is set to false on Azure | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jose Luis Franco <jfrancoa> |
| Component: | Windows Containers | Assignee: | Team Windows Containers <team-winc-bot> |
| Status: | CLOSED DEFERRED | QA Contact: | Jose Luis Franco <jfrancoa> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.10 | CC: | jvaldes, mburke, mohashai |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Known Issue |
| Doc Text: | Cause: not known. Consequence: Windows machineset can't scale up. Workaround (if any): set publicIP: true in the machineset. Result: | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-03-09 01:20:21 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
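The workaround recorded in the Doc Text (`publicIP: true`) corresponds to a field in the Azure providerSpec of the Windows MachineSet. A minimal sketch of the relevant fragment follows; the MachineSet name and surrounding fields are illustrative, not taken from this bug:

```yaml
# Hypothetical excerpt of a Windows MachineSet on Azure; only the
# publicIP field is relevant to this bug's workaround.
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: windows            # illustrative name
  namespace: openshift-machine-api
spec:
  template:
    spec:
      providerSpec:
        value:
          # Workaround from the Doc Text: with publicIP: false the
          # scaled-up Windows machine hangs in Provisioned; setting
          # it to true lets the scale-up complete.
          publicIP: true
```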
Description
Jose Luis Franco
2022-05-30 15:04:32 UTC
Looking at the provided must-gathers, there are three machines in the machineset which have publicIP: false: windows-r67jq, windows-tw5pg, and windows-txxcf. Of those, only windows-tw5pg isn't running. According to the Machine status, the instance was created and is running in the cloud provider, so it looks to me like the issue is somewhere on the software side, i.e. kubelet hasn't requested a CSR for a client certificate, meaning it probably never started. As the other Machines are Running, I would suggest this was some transient failure, though I know this happens a lot on Linux too. This either needs to be investigated by the team who owns WMCO or by the teams who own the MCO/kubelet. I'm sending this to WMCO for now since, given this is a Windows node, they should probably try to debug it first. In the meantime, is there any way to gather logs from the Machine that never became Running?

During a debugging session with Sebastian, we observed a highly probable issue with the user data in the scaled-up machines. The machine hanging in the Provisioned state didn't even have sshd configured:
> sc.exe query sshd
[SC] EnumQueryServicesStatus:OpenService FAILED 1060:
The specified service does not exist as an installed service.
Using the Azure `az` CLI, we can see that the scaled-up nodes have the userData field set to null, but that is also the case for the nodes which provisioned successfully upon machineset creation:
"name": "windows-hpq54_OSDisk",
"osType": "Windows",
"vhd": null,
"writeAcceleratorEnabled": null
}
},
"tags": {
"kubernetes.io-cluster-jfrancoa-0106-azure-4dpxz": "owned"
},
"timeCreated": "2022-06-01T13:29:23.441806+00:00",
"type": "Microsoft.Compute/virtualMachines",
"userData": null,
"virtualMachineScaleSet": null,
"vmId": "95d97d25-74d3-4368-b19e-e9366c5613b6",
"zones": null
}
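Given JSON shaped like the excerpt above (as returned by the `az` CLI when showing a VM), the "userData is null" observation can be checked with a short script. This is only a sketch over inline sample data, not tooling from this bug; in practice you would load the real CLI output:

```python
import json

# Sample shaped like the VM JSON excerpt above (fields abbreviated).
vm_json = '''
{
  "name": "windows-hpq54",
  "timeCreated": "2022-06-01T13:29:23.441806+00:00",
  "type": "Microsoft.Compute/virtualMachines",
  "userData": null,
  "vmId": "95d97d25-74d3-4368-b19e-e9366c5613b6"
}
'''

vm = json.loads(vm_json)

# A VM whose userData was never populated reports None here.
if vm.get("userData") is None:
    print(f"{vm['name']}: userData is null")
```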
Something else we observed is that a public IP from a load balancer is assigned to every single node created by the machineset, except the one that fails and stays in the Provisioned state. When checking the backend pool rules, the host that stays in Provisioned is missing from the rules:

* Node stuck in Provisioned during our debugging session -> windows-hpq54 Provisioned Standard_D2s_v3 westus 125m
[jfrancoa@localhost 108317]$ oc get machine -n openshift-machine-api
NAME PHASE TYPE REGION ZONE AGE
jfrancoa-0106-azure-4dpxz-master-0 Running Standard_D8s_v3 westus 4h38m
jfrancoa-0106-azure-4dpxz-master-1 Running Standard_D8s_v3 westus 4h38m
jfrancoa-0106-azure-4dpxz-master-2 Running Standard_D8s_v3 westus 4h38m
jfrancoa-0106-azure-4dpxz-worker-westus-bjs26 Running Standard_D4s_v3 westus 4h28m
jfrancoa-0106-azure-4dpxz-worker-westus-kkd76 Running Standard_D4s_v3 westus 4h28m
jfrancoa-0106-azure-4dpxz-worker-westus-vvwht Running Standard_D4s_v3 westus 4h28m
win-5cdvl Running Standard_D2s_v3 westus 94m
win-87m6b Running Standard_D2s_v3 westus 94m
win-rd8nt Running Standard_D2s_v3 westus 68m
win-rnkvx Running Standard_D2s_v3 westus 94m
windows-4z84l Running Standard_D2s_v3 westus 3h59m
windows-b58vv Running Standard_D2s_v3 westus 3h59m
windows-hpq54 Provisioned Standard_D2s_v3 westus 125m
* Load balancer backend pool rules (note that windows-hpq54 is absent):

| Backend pool | Virtual machine | State | Private IP | NIC | Rules |
|---|---|---|---|---|---|
| jfrancoa-0106-azure-4dpxz | jfrancoa-0106-azure-4dpxz-master-1 | Running | 10.0.0.6 | jfrancoa-0106-azure-4dpxz-master-1-nic | 6 |
| jfrancoa-0106-azure-4dpxz | jfrancoa-0106-azure-4dpxz-master-0 | Running | 10.0.0.8 | jfrancoa-0106-azure-4dpxz-master-0-nic | 6 |
| jfrancoa-0106-azure-4dpxz | jfrancoa-0106-azure-4dpxz-master-2 | Running | 10.0.0.7 | jfrancoa-0106-azure-4dpxz-master-2-nic | 6 |
| jfrancoa-0106-azure-4dpxz | jfrancoa-0106-azure-4dpxz-worker-westus-vvwht | Running | 10.0.128.4 | jfrancoa-0106-azure-4dpxz-worker-westus-vvwht-nic | 6 |
| jfrancoa-0106-azure-4dpxz | jfrancoa-0106-azure-4dpxz-worker-westus-bjs26 | Running | 10.0.128.5 | jfrancoa-0106-azure-4dpxz-worker-westus-bjs26-nic | 6 |
| jfrancoa-0106-azure-4dpxz | jfrancoa-0106-azure-4dpxz-worker-westus-kkd76 | Running | 10.0.128.6 | jfrancoa-0106-azure-4dpxz-worker-westus-kkd76-nic | 6 |
| jfrancoa-0106-azure-4dpxz | windows-b58vv | Running | 10.0.128.7 | windows-b58vv-nic | 6 |
| jfrancoa-0106-azure-4dpxz | windows-4z84l | Running | 10.0.128.8 | windows-4z84l-nic | 6 |
| jfrancoa-0106-azure-4dpxz | win-rnkvx | Running | 10.0.128.10 | win-rnkvx-nic | 6 |
| jfrancoa-0106-azure-4dpxz | win-5cdvl | Running | 10.0.128.11 | win-5cdvl-nic | 6 |
| jfrancoa-0106-azure-4dpxz | win-87m6b | Running | 10.0.128.12 | win-87m6b-nic | 6 |
| jfrancoa-0106-azure-4dpxz | win-rd8nt | Running | 10.0.128.13 | win-rd8nt-nic | 6 |
I have attached to the bug the output of the `az` CLI load balancer show command.
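The pattern above (every machine except the stuck one appears in the backend pool) can be spotted programmatically by diffing the machine list against the pool membership. A hedged sketch using the Windows machine names from this bug as inline sample data:

```python
# Windows machines reported by `oc get machine` in this bug.
machines = {"windows-4z84l", "windows-b58vv", "windows-hpq54"}

# VMs present in the load balancer backend pool rules above.
backend_pool = {"windows-4z84l", "windows-b58vv"}

# Machines provisioned in Azure but missing from the backend pool --
# these are the ones that hang in the Provisioned phase.
missing = sorted(machines - backend_pool)
print("missing from backend pool:", missing)
```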
I could confirm the behavior stated in comment #5. In OCP 4.11 this issue does not occur: the machine scaled up from the machineset gets a backend rule added to the Azure load balancer and can therefore be configured by WMCO. In OCP 4.10, the scaled-up machine is not included in the load balancer rules and can't be SSH'd into. The open question is whether it can't be SSH'd into because of the lack of a public IP (from the load balancer), or because the user data injection failed (as it uses the load balancer's IP to push the user data instead of the private IP) and therefore neither sshd nor the key is present on the Windows node.

This is a known issue and will be fixed in a future release.

Still relevant; will be planned in a future sprint.

> scaled up nodes have the userData field set to null
The WinC team reviewed this and is increasingly convinced it is not an issue within WMCO's responsibility.
@jfrancoa does this issue occur with a Linux MachineSet also?

No, it does not happen with the Linux MachineSet, only with the Windows one:

[cloud-user@preserve-jfrancoa openshift-tests-private]$ oc get machineset -n openshift-machine-api
NAME                              DESIRED   CURRENT   READY   AVAILABLE   AGE
jfrancoa-73-wjlkc-worker-westus   3         3         3       3           129m
windows                           2         2         2       2           94m

[cloud-user@preserve-jfrancoa openshift-tests-private]$ oc get nodes
NAME                                    STATUS   ROLES                  AGE    VERSION
jfrancoa-73-wjlkc-master-0              Ready    control-plane,master   126m   v1.26.1+8cfbab7
jfrancoa-73-wjlkc-master-1              Ready    control-plane,master   126m   v1.26.1+8cfbab7
jfrancoa-73-wjlkc-master-2              Ready    control-plane,master   126m   v1.26.1+8cfbab7
jfrancoa-73-wjlkc-worker-westus-6b52f   Ready    worker                 112m   v1.26.1+8cfbab7
jfrancoa-73-wjlkc-worker-westus-cgrnf   Ready    worker                 111m   v1.26.1+8cfbab7
jfrancoa-73-wjlkc-worker-westus-jff9g   Ready    worker                 112m   v1.26.1+8cfbab7
windows-6tkqn                           Ready    worker                 57m    v1.26.0+f854081
windows-z848c                           Ready    worker                 61m    v1.26.0+f854081

[cloud-user@preserve-jfrancoa openshift-tests-private]$ oc scale machineset jfrancoa-73-wjlkc-worker-westus -n openshift-machine-api --replicas=4
machineset.machine.openshift.io/jfrancoa-73-wjlkc-worker-westus scaled

[cloud-user@preserve-jfrancoa openshift-tests-private]$ oc scale machineset windows -n openshift-machine-api --replicas=3
machineset.machine.openshift.io/windows scaled

#### WAIT MORE THAN 30 MINUTES ####

[cloud-user@preserve-jfrancoa openshift-tests-private]$ oc get nodes
NAME                                    STATUS   ROLES                  AGE    VERSION
jfrancoa-73-wjlkc-master-0              Ready    control-plane,master   3h     v1.26.1+8cfbab7
jfrancoa-73-wjlkc-master-1              Ready    control-plane,master   3h     v1.26.1+8cfbab7
jfrancoa-73-wjlkc-master-2              Ready    control-plane,master   3h     v1.26.1+8cfbab7
jfrancoa-73-wjlkc-worker-westus-6b52f   Ready    worker                 167m   v1.26.1+8cfbab7
jfrancoa-73-wjlkc-worker-westus-cgrnf   Ready    worker                 166m   v1.26.1+8cfbab7
jfrancoa-73-wjlkc-worker-westus-jff9g   Ready    worker                 166m   v1.26.1+8cfbab7
jfrancoa-73-wjlkc-worker-westus-mhm7x   Ready    worker                 48m    v1.26.1+8cfbab7
windows-6tkqn                           Ready    worker                 111m   v1.26.0+f854081
windows-z848c                           Ready    worker                 115m   v1.26.0+f854081

[cloud-user@preserve-jfrancoa openshift-tests-private]$ oc get machines -n openshift-machine-api
NAME                                    PHASE         TYPE              REGION   ZONE   AGE
jfrancoa-73-wjlkc-master-0              Running       Standard_D8s_v3   westus          3h5m
jfrancoa-73-wjlkc-master-1              Running       Standard_D8s_v3   westus          3h5m
jfrancoa-73-wjlkc-master-2              Running       Standard_D8s_v3   westus          3h5m
jfrancoa-73-wjlkc-worker-westus-6b52f   Running       Standard_D4s_v3   westus          176m
jfrancoa-73-wjlkc-worker-westus-cgrnf   Running       Standard_D4s_v3   westus          176m
jfrancoa-73-wjlkc-worker-westus-jff9g   Running       Standard_D4s_v3   westus          176m
jfrancoa-73-wjlkc-worker-westus-mhm7x   Running       Standard_D4s_v3   westus          54m
windows-6tkqn                           Running       Standard_D2s_v3   westus          124m
windows-nsgtf                           Provisioned   Standard_D2s_v3   westus          54m
windows-z848c                           Running       Standard_D2s_v3   westus          124m

OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira.

https://issues.redhat.com/browse/OCPBUGS-9292