Bug 2091642
| Summary: | [WINC] Windows machineset can't scale up when publicIP is set to false on Azure | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jose Luis Franco <jfrancoa> |
| Component: | Windows Containers | Assignee: | Team Windows Containers <team-winc-bot> |
| Status: | CLOSED DEFERRED | QA Contact: | Jose Luis Franco <jfrancoa> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.10 | CC: | jvaldes, mburke, mohashai |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Known Issue |
| Doc Text: | Cause: not known. Consequence: Windows machineset can't scale up. Workaround (if any): set publicIP: true in the machineset. Result: | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-03-09 01:20:21 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
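The workaround recorded in the Doc Text (`publicIP: true`) corresponds to a field in the Azure providerSpec of the Windows MachineSet. A minimal sketch of the relevant fragment follows; the MachineSet name and surrounding fields are illustrative, not taken from this bug:

```yaml
# Hypothetical excerpt of a Windows MachineSet on Azure; only the
# publicIP field is relevant to this bug's workaround.
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: windows            # illustrative name
  namespace: openshift-machine-api
spec:
  template:
    spec:
      providerSpec:
        value:
          # Workaround from the Doc Text: with publicIP: false the
          # scaled-up Windows machine hangs in Provisioned; setting
          # it to true lets the scale-up complete.
          publicIP: true
```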
Description
Jose Luis Franco
2022-05-30 15:04:32 UTC
Looking at the provided must-gathers, there are three machines in the machineset which have publicIP: false: windows-r67jq, windows-tw5pg, and windows-txxcf. Of those, only windows-tw5pg isn't running. According to the Machine status, the instance was created and is running in the cloud provider, so it looks to me like the issue is somewhere on the software side, i.e. kubelet hasn't requested a CSR for a client certificate, meaning it probably never started. As the other Machines are Running, I would suggest this was some transient failure, though I know this happens a lot on Linux too. This either needs to be investigated by the team who owns WMCO or by the teams who own the MCO/kubelet. I'm sending this to WMCO for now since, given this is a Windows node, they should probably try to debug it first. In the meantime, is there any way to gather logs from the Machine that never became Running?

During a debugging session with Sebastian, we observed a highly probable issue with the user data in the scaled-up machines. The machine hanging in the Provisioned state didn't even have sshd configured:
> sc.exe query sshd
[SC] EnumQueryServicesStatus:OpenService FAILED 1060:
The specified service does not exist as an installed service.
Using the Azure `az` CLI, we can see that the scaled-up nodes have the userData field set to null, but that is also the case for the nodes which provisioned successfully upon machineset creation:
"name": "windows-hpq54_OSDisk",
"osType": "Windows",
"vhd": null,
"writeAcceleratorEnabled": null
}
},
"tags": {
"kubernetes.io-cluster-jfrancoa-0106-azure-4dpxz": "owned"
},
"timeCreated": "2022-06-01T13:29:23.441806+00:00",
"type": "Microsoft.Compute/virtualMachines",
"userData": null,
"virtualMachineScaleSet": null,
"vmId": "95d97d25-74d3-4368-b19e-e9366c5613b6",
"zones": null
}
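Given JSON shaped like the excerpt above (as returned by the `az` CLI when showing a VM), the "userData is null" observation can be checked with a short script. This is only a sketch over inline sample data, not tooling from this bug; in practice you would load the real CLI output:

```python
import json

# Sample shaped like the VM JSON excerpt above (fields abbreviated).
vm_json = '''
{
  "name": "windows-hpq54",
  "timeCreated": "2022-06-01T13:29:23.441806+00:00",
  "type": "Microsoft.Compute/virtualMachines",
  "userData": null,
  "vmId": "95d97d25-74d3-4368-b19e-e9366c5613b6"
}
'''

vm = json.loads(vm_json)

# A VM whose userData was never populated reports None here.
if vm.get("userData") is None:
    print(f"{vm['name']}: userData is null")
```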
Something else we observed is that a public IP from a load balancer is assigned to every single node created by the machineset, except the one that fails and stays in the Provisioned state. When checking the backend pool rules, the host that stays in Provisioned is missing from the rules:

* Node stuck in Provisioned during our debugging session -> windows-hpq54 Provisioned Standard_D2s_v3 westus 125m
[jfrancoa@localhost 108317]$ oc get machine -n openshift-machine-api
NAME PHASE TYPE REGION ZONE AGE
jfrancoa-0106-azure-4dpxz-master-0 Running Standard_D8s_v3 westus 4h38m
jfrancoa-0106-azure-4dpxz-master-1 Running Standard_D8s_v3 westus 4h38m
jfrancoa-0106-azure-4dpxz-master-2 Running Standard_D8s_v3 westus 4h38m
jfrancoa-0106-azure-4dpxz-worker-westus-bjs26 Running Standard_D4s_v3 westus 4h28m
jfrancoa-0106-azure-4dpxz-worker-westus-kkd76 Running Standard_D4s_v3 westus 4h28m
jfrancoa-0106-azure-4dpxz-worker-westus-vvwht Running Standard_D4s_v3 westus 4h28m
win-5cdvl Running Standard_D2s_v3 westus 94m
win-87m6b Running Standard_D2s_v3 westus 94m
win-rd8nt Running Standard_D2s_v3 westus 68m
win-rnkvx Running Standard_D2s_v3 westus 94m
windows-4z84l Running Standard_D2s_v3 westus 3h59m
windows-b58vv Running Standard_D2s_v3 westus 3h59m
windows-hpq54 Provisioned Standard_D2s_v3 westus 125m
* Load balancer backend pool rules (note that windows-hpq54 is absent):

| Backend pool | Virtual machine | State | Private IP | NIC | Rules |
|---|---|---|---|---|---|
| jfrancoa-0106-azure-4dpxz | jfrancoa-0106-azure-4dpxz-master-1 | Running | 10.0.0.6 | jfrancoa-0106-azure-4dpxz-master-1-nic | 6 |
| jfrancoa-0106-azure-4dpxz | jfrancoa-0106-azure-4dpxz-master-0 | Running | 10.0.0.8 | jfrancoa-0106-azure-4dpxz-master-0-nic | 6 |
| jfrancoa-0106-azure-4dpxz | jfrancoa-0106-azure-4dpxz-master-2 | Running | 10.0.0.7 | jfrancoa-0106-azure-4dpxz-master-2-nic | 6 |
| jfrancoa-0106-azure-4dpxz | jfrancoa-0106-azure-4dpxz-worker-westus-vvwht | Running | 10.0.128.4 | jfrancoa-0106-azure-4dpxz-worker-westus-vvwht-nic | 6 |
| jfrancoa-0106-azure-4dpxz | jfrancoa-0106-azure-4dpxz-worker-westus-bjs26 | Running | 10.0.128.5 | jfrancoa-0106-azure-4dpxz-worker-westus-bjs26-nic | 6 |
| jfrancoa-0106-azure-4dpxz | jfrancoa-0106-azure-4dpxz-worker-westus-kkd76 | Running | 10.0.128.6 | jfrancoa-0106-azure-4dpxz-worker-westus-kkd76-nic | 6 |
| jfrancoa-0106-azure-4dpxz | windows-b58vv | Running | 10.0.128.7 | windows-b58vv-nic | 6 |
| jfrancoa-0106-azure-4dpxz | windows-4z84l | Running | 10.0.128.8 | windows-4z84l-nic | 6 |
| jfrancoa-0106-azure-4dpxz | win-rnkvx | Running | 10.0.128.10 | win-rnkvx-nic | 6 |
| jfrancoa-0106-azure-4dpxz | win-5cdvl | Running | 10.0.128.11 | win-5cdvl-nic | 6 |
| jfrancoa-0106-azure-4dpxz | win-87m6b | Running | 10.0.128.12 | win-87m6b-nic | 6 |
| jfrancoa-0106-azure-4dpxz | win-rd8nt | Running | 10.0.128.13 | win-rd8nt-nic | 6 |
I have attached to the bug the output of the `az` CLI load balancer show command.
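The pattern above (every machine except the stuck one appears in the backend pool) can be spotted programmatically by diffing the machine list against the pool membership. A hedged sketch using the Windows machine names from this bug as inline sample data:

```python
# Windows machines reported by `oc get machine` in this bug.
machines = {"windows-4z84l", "windows-b58vv", "windows-hpq54"}

# VMs present in the load balancer backend pool rules above.
backend_pool = {"windows-4z84l", "windows-b58vv"}

# Machines provisioned in Azure but missing from the backend pool --
# these are the ones that hang in the Provisioned phase.
missing = sorted(machines - backend_pool)
print("missing from backend pool:", missing)
```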
I could confirm the behavior stated in comment #5. In OCP 4.11 this issue does not occur: the machine scaled up from the machineset gets a backend rule added to the Azure load balancer and can therefore be configured by WMCO. In OCP 4.10, the scaled-up machine is not included in the load balancer rules and can't be SSH'd into. The open question is whether it can't be SSH'd into because of the lack of a public IP (from the load balancer), or because the user data injection failed (as it uses the load balancer's IP to push the user data instead of the private IP) and therefore neither sshd nor the key is present on the Windows node.

This is a known issue and will be fixed in a future release.

Still relevant; will be planned in a future sprint.

> scaled up nodes have the userData field set to null
The WinC team reviewed this and is increasingly convinced it is not an issue within WMCO's responsibility.
@jfrancoa does this issue occur with a Linux MachineSet also?

No, it does not happen with the Linux MachineSet, only with the Windows one:

[cloud-user@preserve-jfrancoa openshift-tests-private]$ oc get machineset -n openshift-machine-api
NAME                              DESIRED   CURRENT   READY   AVAILABLE   AGE
jfrancoa-73-wjlkc-worker-westus   3         3         3       3           129m
windows                           2         2         2       2           94m

[cloud-user@preserve-jfrancoa openshift-tests-private]$ oc get nodes
NAME                                    STATUS   ROLES                  AGE    VERSION
jfrancoa-73-wjlkc-master-0              Ready    control-plane,master   126m   v1.26.1+8cfbab7
jfrancoa-73-wjlkc-master-1              Ready    control-plane,master   126m   v1.26.1+8cfbab7
jfrancoa-73-wjlkc-master-2              Ready    control-plane,master   126m   v1.26.1+8cfbab7
jfrancoa-73-wjlkc-worker-westus-6b52f   Ready    worker                 112m   v1.26.1+8cfbab7
jfrancoa-73-wjlkc-worker-westus-cgrnf   Ready    worker                 111m   v1.26.1+8cfbab7
jfrancoa-73-wjlkc-worker-westus-jff9g   Ready    worker                 112m   v1.26.1+8cfbab7
windows-6tkqn                           Ready    worker                 57m    v1.26.0+f854081
windows-z848c                           Ready    worker                 61m    v1.26.0+f854081

[cloud-user@preserve-jfrancoa openshift-tests-private]$ oc scale machineset jfrancoa-73-wjlkc-worker-westus -n openshift-machine-api --replicas=4
machineset.machine.openshift.io/jfrancoa-73-wjlkc-worker-westus scaled

[cloud-user@preserve-jfrancoa openshift-tests-private]$ oc scale machineset windows -n openshift-machine-api --replicas=3
machineset.machine.openshift.io/windows scaled

#### WAIT MORE THAN 30 MINUTES ####

[cloud-user@preserve-jfrancoa openshift-tests-private]$ oc get nodes
NAME                                    STATUS   ROLES                  AGE    VERSION
jfrancoa-73-wjlkc-master-0              Ready    control-plane,master   3h     v1.26.1+8cfbab7
jfrancoa-73-wjlkc-master-1              Ready    control-plane,master   3h     v1.26.1+8cfbab7
jfrancoa-73-wjlkc-master-2              Ready    control-plane,master   3h     v1.26.1+8cfbab7
jfrancoa-73-wjlkc-worker-westus-6b52f   Ready    worker                 167m   v1.26.1+8cfbab7
jfrancoa-73-wjlkc-worker-westus-cgrnf   Ready    worker                 166m   v1.26.1+8cfbab7
jfrancoa-73-wjlkc-worker-westus-jff9g   Ready    worker                 166m   v1.26.1+8cfbab7
jfrancoa-73-wjlkc-worker-westus-mhm7x   Ready    worker                 48m    v1.26.1+8cfbab7
windows-6tkqn                           Ready    worker                 111m   v1.26.0+f854081
windows-z848c                           Ready    worker                 115m   v1.26.0+f854081

[cloud-user@preserve-jfrancoa openshift-tests-private]$ oc get machines -n openshift-machine-api
NAME                                    PHASE         TYPE              REGION   ZONE   AGE
jfrancoa-73-wjlkc-master-0              Running       Standard_D8s_v3   westus          3h5m
jfrancoa-73-wjlkc-master-1              Running       Standard_D8s_v3   westus          3h5m
jfrancoa-73-wjlkc-master-2              Running       Standard_D8s_v3   westus          3h5m
jfrancoa-73-wjlkc-worker-westus-6b52f   Running       Standard_D4s_v3   westus          176m
jfrancoa-73-wjlkc-worker-westus-cgrnf   Running       Standard_D4s_v3   westus          176m
jfrancoa-73-wjlkc-worker-westus-jff9g   Running       Standard_D4s_v3   westus          176m
jfrancoa-73-wjlkc-worker-westus-mhm7x   Running       Standard_D4s_v3   westus          54m
windows-6tkqn                           Running       Standard_D2s_v3   westus          124m
windows-nsgtf                           Provisioned   Standard_D2s_v3   westus          54m
windows-z848c                           Running       Standard_D2s_v3   westus          124m

OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira.

https://issues.redhat.com/browse/OCPBUGS-9292