Must gather logs:

1. Issue:
Windows machines can't scale up when the publicIP machineset field is set to false on Azure. When the machineset is created, the machines are created successfully. However, when trying to scale up that very same machineset, the newly provisioned machine hangs in the Provisioned state.

[jfrancoa@localhost 107382]$ oc get machine -n openshift-machine-api
NAME                                            PHASE         TYPE              REGION   ZONE   AGE
jfrancoa-3005-azure-qf9hb-master-0              Running       Standard_D8s_v3   westus          9h
jfrancoa-3005-azure-qf9hb-master-1              Running       Standard_D8s_v3   westus          9h
jfrancoa-3005-azure-qf9hb-master-2              Running       Standard_D8s_v3   westus          9h
jfrancoa-3005-azure-qf9hb-worker-westus-m6sxs   Running       Standard_D4s_v3   westus          9h
jfrancoa-3005-azure-qf9hb-worker-westus-tpnw4   Running       Standard_D4s_v3   westus          9h
jfrancoa-3005-azure-qf9hb-worker-westus-xhvzc   Running       Standard_D4s_v3   westus          9h
win-d6k7v                                       Running       Standard_D2s_v3   westus          54m
win-hm2l2                                       Running       Standard_D2s_v3   westus          14m
win-xxl8p                                       Running       Standard_D2s_v3   westus          72m
windows-r67jd                                   Running       Standard_D2s_v3   westus          8h
windows-tw5pg                                   Provisioned   Standard_D2s_v3   westus          8h
windows-txxcf                                   Running       Standard_D2s_v3   westus          8h

2. WMCO & OpenShift Version:

[jfrancoa@localhost 107382]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-05-26-102501   True        False         8h      Cluster version is 4.10.0-0.nightly-2022-05-26-102501

[jfrancoa@localhost 107382]$ oc get csv -n openshift-windows-machine-config-operator
NAME                                     DISPLAY                            VERSION   REPLACES                                 PHASE
elasticsearch-operator.5.4.2             OpenShift Elasticsearch Operator   5.4.2                                              Succeeded
windows-machine-config-operator.v5.1.0   Windows Machine Config Operator    5.1.0     windows-machine-config-operator.v5.0.0   Succeeded

3. Platform: Azure

5. Is it a new test case or an old test case? If it is an old test case, is it a regression or first-time tested? Is it platform-specific or consistent across all platforms?
It impacts an old test case; however, I believe this was not observed before.

6. Steps to Reproduce (a consolidated command sketch follows the log list in item 9):
* Create an OCP 4.10 cluster, install WMCO 5.1.0 and create a machineset following the docs.
* Make sure the machines got properly created.
* Scale up the number of machines by one: oc scale --replicas=n+1 machineset <name> -n openshift-machine-api
* Wait for the machine to reach the Running state (which does not happen even after waiting for hours).

7. Actual Result and Expected Result:
Actual: the newly scaled-up Windows machine stays in the Provisioned state indefinitely. Expected: the Windows machines from the machineset can be scaled up properly.

8. Has a possible workaround been tried? Is there a way to recover from the issue?
Setting the publicIP field to true does solve the issue. I realized that even though the docs suggest creating machinesets with publicIP set to false (https://docs.openshift.com/container-platform/4.10/windows_containers/creating_windows_machinesets/creating-windows-machineset-azure.html#windows-machineset-azure_creating-windows-machineset-azure), all the machines created during the machineset creation did have a publicIP on Azure. However, the scaled-up nodes which never got to the Running state were missing the publicIP. Therefore, I created a second machineset with the publicIP field set to true and tried to scale it up, with no issues at all:

[jfrancoa@localhost 107382]$ oc get machineset -n openshift-machine-api
NAME                                      DESIRED   CURRENT   READY   AVAILABLE   AGE
jfrancoa-3005-azure-qf9hb-worker-westus   3         3         3       3           9h
win                                       3         3         3       3           89m
windows                                   3         3         2       2           8h

win -> publicIP = true
windows -> publicIP = false
9. Logs:
* Must-gather Windows node logs (https://github.com/openshift/must-gather/blob/master/collection-scripts/gather_windows_node_logs#L24)
* oc get network.operator cluster -o yaml
* oc logs -f deployment/windows-machine-config-operator -n openshift-windows-machine-config-operator
* Windows MachineSet yaml or windows-instances ConfigMap:
  oc get machineset <windows_machineSet_name> -n openshift-machine-api -o yaml
  oc get configmaps <windows_configmap_name> -n <namespace_name> -o yaml
* Optional logs: anything that can be useful to debug the issue.
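For convenience, here is a minimal sketch of the reproduction and verification steps from item 6. It assumes the affected machineset is the one named "windows" with 2 replicas; the name and replica count are only examples and will differ per cluster:

  # Scale the Windows machineset up by one replica.
  oc scale machineset windows -n openshift-machine-api --replicas=3

  # Watch the machines; on an affected cluster the new one never leaves "Provisioned".
  oc get machine -n openshift-machine-api -w

  # Check whether the new instance ever registers as a node.
  oc get nodes -l kubernetes.io/os=windows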
Looking at the provided must-gathers, there are three machines in the machineset which have publicIP: false; these are windows-r67jq, windows-tw5pg and windows-txxcf. Of those, only windows-tw5pg isn't running. According to the Machine status, the instance was created and is running in the cloud provider, so it looks to me like the issue is somewhere on the software side, i.e. kubelet hasn't requested a CSR for a client certificate, meaning it probably never started. As the other Machines are Running, I would suggest this was some transient failure, though I know this happens a lot on Linux too. This either needs to be investigated by the team who owns WMCO or the teams who own MCO/kubelet. Gonna send this to the WMCO team for now as I suspect, given this is a Windows node, they should probably try to debug it first. In the meantime, is there any way to try and gather logs from the Machine that never became Running?
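For reference, a quick way to check the "kubelet never requested a CSR" theory from the cluster side; this is only a sketch, using the stuck machine's name from the report above:

  # A node that reaches bootstrap normally shows a Pending node-bootstrapper CSR;
  # its absence suggests kubelet (or the bootstrap user data) never ran on the instance.
  oc get csr

  # Inspect the stuck Machine's status and conditions as reported by the machine-api.
  oc get machine windows-tw5pg -n openshift-machine-api -o yaml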
During a debugging session with Sebastian, we observed a highly probable issue with the user_data in those scaled-up machines. The machine hanging in the Provisioned state didn't even have sshd configured:

> sc.exe query sshd
[SC] EnumQueryServicesStatus:OpenService FAILED 1060:
The specified service does not exist as an installed service.

Using the Azure az CLI, we can see that the scaled-up nodes have the userData field set to null, but that is also the case for the nodes which provisioned successfully upon machineset creation:

        "name": "windows-hpq54_OSDisk",
        "osType": "Windows",
        "vhd": null,
        "writeAcceleratorEnabled": null
      }
    },
    "tags": {
      "kubernetes.io-cluster-jfrancoa-0106-azure-4dpxz": "owned"
    },
    "timeCreated": "2022-06-01T13:29:23.441806+00:00",
    "type": "Microsoft.Compute/virtualMachines",
    "userData": null,
    "virtualMachineScaleSet": null,
    "vmId": "95d97d25-74d3-4368-b19e-e9366c5613b6",
    "zones": null
  }

Something else we observed is that a public IP from a load balancer is assigned to every single node created by the machineset (except the one that fails and stays in the Provisioned state). And when checking the backend pool rules, the host which stays in the Provisioned state is missing from the rules:

* Node in Provisioned state during our debugging session -> windows-hpq54 Provisioned Standard_D2s_v3 westus 125m

[jfrancoa@localhost 108317]$ oc get machine -n openshift-machine-api
NAME                                            PHASE         TYPE              REGION   ZONE   AGE
jfrancoa-0106-azure-4dpxz-master-0              Running       Standard_D8s_v3   westus          4h38m
jfrancoa-0106-azure-4dpxz-master-1              Running       Standard_D8s_v3   westus          4h38m
jfrancoa-0106-azure-4dpxz-master-2              Running       Standard_D8s_v3   westus          4h38m
jfrancoa-0106-azure-4dpxz-worker-westus-bjs26   Running       Standard_D4s_v3   westus          4h28m
jfrancoa-0106-azure-4dpxz-worker-westus-kkd76   Running       Standard_D4s_v3   westus          4h28m
jfrancoa-0106-azure-4dpxz-worker-westus-vvwht   Running       Standard_D4s_v3   westus          4h28m
win-5cdvl                                       Running       Standard_D2s_v3   westus          94m
win-87m6b                                       Running       Standard_D2s_v3   westus          94m
win-rd8nt                                       Running       Standard_D2s_v3   westus          68m
win-rnkvx                                       Running       Standard_D2s_v3   westus          94m
windows-4z84l                                   Running       Standard_D2s_v3   westus          3h59m
windows-b58vv                                   Running       Standard_D2s_v3   westus          3h59m
windows-hpq54                                   Provisioned   Standard_D2s_v3   westus          125m

* Load balancer rules:
jfrancoa-0106-azure-4dpxz   jfrancoa-0106-azure-4dpxz-master-1              Running   10.0.0.6      jfrancoa-0106-azure-4dpxz-master-1-nic              6
jfrancoa-0106-azure-4dpxz   jfrancoa-0106-azure-4dpxz-master-0              Running   10.0.0.8      jfrancoa-0106-azure-4dpxz-master-0-nic              6
jfrancoa-0106-azure-4dpxz   jfrancoa-0106-azure-4dpxz-master-2              Running   10.0.0.7      jfrancoa-0106-azure-4dpxz-master-2-nic              6
jfrancoa-0106-azure-4dpxz   jfrancoa-0106-azure-4dpxz-worker-westus-vvwht   Running   10.0.128.4    jfrancoa-0106-azure-4dpxz-worker-westus-vvwht-nic   6
jfrancoa-0106-azure-4dpxz   jfrancoa-0106-azure-4dpxz-worker-westus-bjs26   Running   10.0.128.5    jfrancoa-0106-azure-4dpxz-worker-westus-bjs26-nic   6
jfrancoa-0106-azure-4dpxz   jfrancoa-0106-azure-4dpxz-worker-westus-kkd76   Running   10.0.128.6    jfrancoa-0106-azure-4dpxz-worker-westus-kkd76-nic   6
jfrancoa-0106-azure-4dpxz   windows-b58vv                                   Running   10.0.128.7    windows-b58vv-nic                                   6
jfrancoa-0106-azure-4dpxz   windows-4z84l                                   Running   10.0.128.8    windows-4z84l-nic                                   6
jfrancoa-0106-azure-4dpxz   win-rnkvx                                       Running   10.0.128.10   win-rnkvx-nic                                       6
jfrancoa-0106-azure-4dpxz   win-5cdvl                                       Running   10.0.128.11   win-5cdvl-nic                                       6
jfrancoa-0106-azure-4dpxz   win-87m6b                                       Running   10.0.128.12   win-87m6b-nic                                       6
jfrancoa-0106-azure-4dpxz   win-rd8nt                                       Running   10.0.128.13   win-rd8nt-nic                                       6

I have appended to the bug the output from the load balancer show az CLI command.
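For reference, a sketch of the kind of Azure CLI queries that can confirm the two observations above (VM userData and backend pool membership). The resource group name below is a placeholder in the usual <infra-id>-rg pattern and may not match this cluster:

  # Inspect the stuck VM as provisioned by the machine-api (userData shows as null either way).
  az vm show -g jfrancoa-0106-azure-4dpxz-rg -n windows-hpq54

  # Dump the cluster load balancer, including its backend address pools.
  az network lb show -g jfrancoa-0106-azure-4dpxz-rg -n jfrancoa-0106-azure-4dpxz

  # List the backend pool members to see whether the stuck node's NIC is present.
  az network lb address-pool list -g jfrancoa-0106-azure-4dpxz-rg --lb-name jfrancoa-0106-azure-4dpxz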
I could confirm the behavior stated in comment #5. In OCP 4.11 this issue does not occur: the scaled-up machine from the machineset gets a backend rule added to the Azure load balancer and can therefore be configured by WMCO. In OCP 4.10, the scaled-up machine is not included in the load balancer rules and can't be SSH'd into. The open question is whether it can't be SSH'd because it lacks a public IP (from the load balancer), or because the user data injection failed (as it uses the load balancer's IP to push the user_data instead of the private IP) and therefore neither sshd nor the key are present on the Windows node to allow access.
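One way to tell the two failure modes apart without SSH access would be to probe the stuck VM through the Azure guest agent. This is only a sketch of an assumed approach, not anything WMCO provides; the resource group, VM name, and authorized_keys path are assumptions:

  # Run a PowerShell probe on the stuck Windows VM via the Azure run-command agent,
  # checking whether sshd and the key file were ever set up by the user data / WMCO.
  az vm run-command invoke \
    -g jfrancoa-0106-azure-4dpxz-rg \
    -n windows-hpq54 \
    --command-id RunPowerShellScript \
    --scripts "Get-Service sshd; Test-Path 'C:\ProgramData\ssh\administrators_authorized_keys'"

If sshd is missing entirely (as the sc.exe output in comment #5 suggests), that points at the user data injection rather than plain network reachability.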
This is a known issue and will be fixed in a future release.
Still relevant and will be planned in a future sprint.
> scaled up nodes have the userData field set to null

The WinC team reviewed this and is increasingly convinced it is not an issue within WMCO's area of responsibility.
@jfrancoa does this issue occur with a Linux MachineSet also?
No, it does not happen with the Linux MachineSet, only with the Windows one:

[cloud-user@preserve-jfrancoa openshift-tests-private]$ oc get machineset -n openshift-machine-api
NAME                              DESIRED   CURRENT   READY   AVAILABLE   AGE
jfrancoa-73-wjlkc-worker-westus   3         3         3       3           129m
windows                           2         2         2       2           94m

[cloud-user@preserve-jfrancoa openshift-tests-private]$ oc get nodes
NAME                                    STATUS   ROLES                  AGE    VERSION
jfrancoa-73-wjlkc-master-0              Ready    control-plane,master   126m   v1.26.1+8cfbab7
jfrancoa-73-wjlkc-master-1              Ready    control-plane,master   126m   v1.26.1+8cfbab7
jfrancoa-73-wjlkc-master-2              Ready    control-plane,master   126m   v1.26.1+8cfbab7
jfrancoa-73-wjlkc-worker-westus-6b52f   Ready    worker                 112m   v1.26.1+8cfbab7
jfrancoa-73-wjlkc-worker-westus-cgrnf   Ready    worker                 111m   v1.26.1+8cfbab7
jfrancoa-73-wjlkc-worker-westus-jff9g   Ready    worker                 112m   v1.26.1+8cfbab7
windows-6tkqn                           Ready    worker                 57m    v1.26.0+f854081
windows-z848c                           Ready    worker                 61m    v1.26.0+f854081

[cloud-user@preserve-jfrancoa openshift-tests-private]$ oc scale machineset jfrancoa-73-wjlkc-worker-westus -n openshift-machine-api --replicas=4
machineset.machine.openshift.io/jfrancoa-73-wjlkc-worker-westus scaled

[cloud-user@preserve-jfrancoa openshift-tests-private]$ oc scale machineset windows -n openshift-machine-api --replicas=3
machineset.machine.openshift.io/windows scaled

#### WAIT MORE THAN 30 MINUTES ######

[cloud-user@preserve-jfrancoa openshift-tests-private]$ oc get nodes
NAME                                    STATUS   ROLES                  AGE    VERSION
jfrancoa-73-wjlkc-master-0              Ready    control-plane,master   3h     v1.26.1+8cfbab7
jfrancoa-73-wjlkc-master-1              Ready    control-plane,master   3h     v1.26.1+8cfbab7
jfrancoa-73-wjlkc-master-2              Ready    control-plane,master   3h     v1.26.1+8cfbab7
jfrancoa-73-wjlkc-worker-westus-6b52f   Ready    worker                 167m   v1.26.1+8cfbab7
jfrancoa-73-wjlkc-worker-westus-cgrnf   Ready    worker                 166m   v1.26.1+8cfbab7
jfrancoa-73-wjlkc-worker-westus-jff9g   Ready    worker                 166m   v1.26.1+8cfbab7
jfrancoa-73-wjlkc-worker-westus-mhm7x   Ready    worker                 48m    v1.26.1+8cfbab7
windows-6tkqn                           Ready    worker                 111m   v1.26.0+f854081
windows-z848c                           Ready    worker                 115m   v1.26.0+f854081

[cloud-user@preserve-jfrancoa openshift-tests-private]$ oc get machines -n openshift-machine-api
NAME                                    PHASE         TYPE              REGION   ZONE   AGE
jfrancoa-73-wjlkc-master-0              Running       Standard_D8s_v3   westus          3h5m
jfrancoa-73-wjlkc-master-1              Running       Standard_D8s_v3   westus          3h5m
jfrancoa-73-wjlkc-master-2              Running       Standard_D8s_v3   westus          3h5m
jfrancoa-73-wjlkc-worker-westus-6b52f   Running       Standard_D4s_v3   westus          176m
jfrancoa-73-wjlkc-worker-westus-cgrnf   Running       Standard_D4s_v3   westus          176m
jfrancoa-73-wjlkc-worker-westus-jff9g   Running       Standard_D4s_v3   westus          176m
jfrancoa-73-wjlkc-worker-westus-mhm7x   Running       Standard_D4s_v3   westus          54m
windows-6tkqn                           Running       Standard_D2s_v3   westus          124m
windows-nsgtf                           Provisioned   Standard_D2s_v3   westus          54m
windows-z848c                           Running       Standard_D2s_v3   westus          124m
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira. https://issues.redhat.com/browse/OCPBUGS-9292