Description of problem:

Sometimes we see VMs fail to power on when they land on a host that does not have enough resources. The current power-on path does not retry or leverage DRS to power on the node on a suitable host.

https://github.com/vmware/govmomi/issues/1026

Our code still calls PowerOnVM_Task which, according to the vSphere docs, is deprecated; we should use PowerOnMultiVM_Task instead.

PowerOnVM_Task does not return a DRS ClusterRecommendation, no vmotion nor host power operations will be done as part of a DRS-facilitated power on. To have DRS consider such operations use PowerOnMultiVM_Task.

https://vdc-download.vmware.com/vmwb-repository/dcr-public/b50dcbbf-051d-4204-a3e7-e1b618c1e384/538cf2ec-b34f-4bae-a332-3820ef9e7773/vim.VirtualMachine.html#powerOn:

As of vSphere API 5.1, use of this method with vCenter Server is deprecated; use PowerOnMultiVM_Task instead.

Version-Release number of selected component (if applicable):
4.8.x

How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:
Power-on sometimes fails, requiring manual intervention.

Expected results:
PowerOn should use DRS to ensure it's always successful.

Additional info:
After coming across this issue I started to research what the terraform provider currently does. It also only uses "PowerOnVM_Task".

The concern with PowerOnMultiVM_Task is a DRS cluster that is set to manual. As stated in the SDK doc:

> If any virtual machine in the list is manually managed by DRS, or DRS has to migrate any manually managed virtual machine or power on any manually managed host in order to power on these virtual machines, a DRS recommendation will be generated, and
> *** the users need to manually apply the recommendation for actually powering on these virtual machines. ***

In other words, the end user would have to act in the vCenter UI before that virtual machine is powered on.

There are options that you can pass to PowerOnMultiVM_Task to override the current cluster DRS configuration. The only example of this I found was here:
https://github.com/vmware/vic/blob/6c70cfedabb689d0b97bfa32ba3bc92119ec4860/pkg/vsphere/vm/vm.go#L966-L1004
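To make the manual-DRS concern concrete, here is a minimal sketch in plain Go with no vSphere dependency. The type and function names are hypothetical. Only the string values come from the vSphere API enums: "OverrideAutomationLevel" from ClusterPowerOnVmOption, and "manual", "partiallyAutomated", "fullyAutomated" from DrsBehavior. It assumes the override simply replaces the cluster-level automation setting for that one call, and it simplifies the rule quoted above to "a manual level means the user must apply the recommendation".

```go
package main

import "fmt"

// DrsBehavior mirrors the string values of the vSphere DrsBehavior enum.
type DrsBehavior string

const (
	Manual             DrsBehavior = "manual"
	PartiallyAutomated DrsBehavior = "partiallyAutomated"
	FullyAutomated     DrsBehavior = "fullyAutomated"
)

// overrideKey is the ClusterPowerOnVmOption key that overrides the
// automation level for a single power-on request.
const overrideKey = "OverrideAutomationLevel"

// effectiveBehavior returns the automation level assumed to apply to one
// power-on call: the override if the caller supplied one, otherwise the
// cluster's configured level. (Hypothetical helper, simplified model.)
func effectiveBehavior(cluster DrsBehavior, options map[string]string) DrsBehavior {
	if v, ok := options[overrideKey]; ok {
		return DrsBehavior(v)
	}
	return cluster
}

// needsManualApply reports whether, under this simplified model, a user
// would still have to apply the DRS recommendation by hand.
func needsManualApply(cluster DrsBehavior, options map[string]string) bool {
	return effectiveBehavior(cluster, options) == Manual
}

func main() {
	noOverride := map[string]string{}
	withOverride := map[string]string{overrideKey: string(FullyAutomated)}

	fmt.Println("manual cluster, no override:", needsManualApply(Manual, noOverride))
	fmt.Println("manual cluster, override:   ", needsManualApply(Manual, withOverride))
}
```

In a real client the same key/value pair would be sent as an option value on the PowerOnMultiVM_Task request, as the linked vic code does; whether that is acceptable for clusters an administrator deliberately set to manual is a separate policy question.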
Unless I am mistaken, it doesn't look like MAO uses either this method:

    reconciler.go: func (vm *virtualMachine) powerOnVM() (string, error) {

or PowerOnVM_Task. Instead, the virtual machines are powered on as part of the cloning operation; see the spec:

    spec := types.VirtualMachineCloneSpec{
        Config: &types.VirtualMachineConfigSpec{
            Annotation: s.machine.GetName(),
            // Assign the clone's InstanceUUID the value of the Kubernetes Machine
            // object's UID. This allows lookup of the cloned VM prior to knowing
            // the VM's UUID.
            InstanceUuid:      string(s.machine.UID),
            Flags:             newVMFlagInfo(),
            ExtraConfig:       extraConfig,
            DeviceChange:      deviceSpecs,
            NumCPUs:           numCPUs,
            NumCoresPerSocket: numCoresPerSocket,
            MemoryMB:          s.providerSpec.MemoryMiB,
        },
        Location: types.VirtualMachineRelocateSpec{
            Datastore:    types.NewReference(datastore.Reference()),
            Folder:       types.NewReference(folder.Reference()),
            Pool:         types.NewReference(resourcepool.Reference()),
            DiskMoveType: diskMoveType,
        },
        PowerOn:  true, // <--- the VM is powered on here, by the clone itself
        Snapshot: snapshotRef,
    }

I do have an open question out to the VMware team regarding PowerOnVM_Task vs. PowerOnMultiVM_Task. Based on testing, it looks like DRS does move the guest before power-on, which is reported via events.

Started a branch for the terraform provider change. Will wait for VMware's response before submitting.
https://github.com/jcpowermac/terraform-provider-vsphere/tree/use_poweronmultivm_task
Denis is going to explore the idea of moving to a clone with PowerOn: false and then powering on later using the new PowerOnMultiVM_Task instead. Expect an update within the next few weeks.
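A rough sketch of that two-step flow, in plain Go with no vSphere dependency so it can run as-is. Everything here is hypothetical: the vmClient interface, powerOnResult, and createAndPowerOn are illustrative names, not MAO or govmomi code. The result shape (attempted and not-attempted VMs) loosely follows what the vSphere API documents for the PowerOnMultiVM_Task result, and the fake client exists only to exercise the flow.

```go
package main

import (
	"errors"
	"fmt"
)

// powerOnResult is a simplified, hypothetical stand-in for the result of a
// multi-VM power-on: which VMs were attempted and which were not (and why).
type powerOnResult struct {
	Attempted    []string
	NotAttempted map[string]string // VM ID -> reason
}

// vmClient is a hypothetical abstraction over the two vSphere operations.
type vmClient interface {
	Clone(name string, powerOn bool) (vmID string, err error)
	PowerOnMulti(vmIDs []string) (powerOnResult, error)
}

// createAndPowerOn clones with powerOn=false, then powers the VM on through
// the multi-VM call so that DRS can pick a suitable host.
func createAndPowerOn(c vmClient, name string) (string, error) {
	id, err := c.Clone(name, false)
	if err != nil {
		return "", fmt.Errorf("clone %q: %w", name, err)
	}
	res, err := c.PowerOnMulti([]string{id})
	if err != nil {
		return id, fmt.Errorf("power on %s: %w", id, err)
	}
	if reason, ok := res.NotAttempted[id]; ok {
		return id, fmt.Errorf("power on of %s not attempted: %s", id, reason)
	}
	for _, attempted := range res.Attempted {
		if attempted == id {
			return id, nil
		}
	}
	return id, errors.New("power-on result did not mention " + id)
}

// fakeClient is an in-memory test double, not a real vSphere client.
type fakeClient struct {
	refuse      bool // when true, power-on is reported as not attempted
	lastPowerOn bool // records the powerOn flag passed to Clone
}

func (f *fakeClient) Clone(name string, powerOn bool) (string, error) {
	f.lastPowerOn = powerOn
	return "vm-" + name, nil
}

func (f *fakeClient) PowerOnMulti(ids []string) (powerOnResult, error) {
	res := powerOnResult{NotAttempted: map[string]string{}}
	for _, id := range ids {
		if f.refuse {
			res.NotAttempted[id] = "insufficient resources"
		} else {
			res.Attempted = append(res.Attempted, id)
		}
	}
	return res, nil
}

func main() {
	id, err := createAndPowerOn(&fakeClient{}, "worker-a")
	fmt.Println(id, err)

	_, err = createAndPowerOn(&fakeClient{refuse: true}, "worker-b")
	fmt.Println(err)
}
```

Splitting clone and power-on this way means a failed power-on leaves a powered-off VM behind, so the caller needs to be able to retry the power-on step alone rather than re-cloning.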
We are scheduling time to look into this next sprint.
Setting the target to 4.12 so that the fix can be merged. Let's consider a backport after this bug has been verified.
Tried several times: created new machines by scaling a machineset, and all machines reached Running and powered on successfully. Moving this to Verified.

liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-08-05-000006   True        False         78m     Cluster version is 4.12.0-0.nightly-2022-08-05-000006

liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                             PHASE     TYPE   REGION   ZONE   AGE
huliu-vs12a-tq5sp-master-0       Running                          106m
huliu-vs12a-tq5sp-master-1       Running                          106m
huliu-vs12a-tq5sp-master-2       Running                          106m
huliu-vs12a-tq5sp-worker-545k8   Running                          16m
huliu-vs12a-tq5sp-worker-7jvb6   Running                          16m
huliu-vs12a-tq5sp-worker-88mwq   Running                          16m
huliu-vs12a-tq5sp-worker-8pg7p   Running                          16m
huliu-vs12a-tq5sp-worker-926g5   Running                          16m
huliu-vs12a-tq5sp-worker-9h5hl   Running                          16m
huliu-vs12a-tq5sp-worker-cjlwl   Running                          16m
huliu-vs12a-tq5sp-worker-g7q5r   Running                          16m
huliu-vs12a-tq5sp-worker-h7nbb   Running                          16m
huliu-vs12a-tq5sp-worker-j7s5v   Running                          16m
huliu-vs12a-tq5sp-worker-jshlr   Running                          16m
huliu-vs12a-tq5sp-worker-mc8c2   Running                          16m
huliu-vs12a-tq5sp-worker-ncckc   Running                          16m
huliu-vs12a-tq5sp-worker-rbbck   Running                          50m
huliu-vs12a-tq5sp-worker-ss2lz   Running                          16m
huliu-vs12a-tq5sp-worker-t9nlk   Running                          16m
huliu-vs12a-tq5sp-worker-twkd7   Running                          16m
huliu-vs12a-tq5sp-worker-wzbvb   Running                          16m
huliu-vs12a-tq5sp-worker-x7mfc   Running                          60m
huliu-vs12a-tq5sp-worker-xtm8r   Running                          16m

liuhuali@Lius-MacBook-Pro huali-test % oc get machine -o yaml |grep instanceState
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399