Bug 2087981 - PowerOnVM_Task is deprecated use PowerOnMultiVM_Task for DRS ClusterRecommendation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.8
Hardware: x86_64
OS: All
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.12.0
Assignee: dmoiseev
QA Contact: Huali Liu
Docs Contact: Jeana Routh
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-05-18 15:24 UTC by Matthew Robson
Modified: 2023-01-17 19:49 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
* Previously, the Machine API vSphere machine controller set the `PowerOn` flag when cloning a VM. This created a `PowerOn` task that the machine controller was not aware of. If that `PowerOn` task failed, machines were stuck in the `Provisioned` phase but never powered on. With this release, the cloning sequence is altered to avoid the issue. Additionally, the machine controller now retries powering on the VM in case of failure and reports failures properly. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2087981[*BZ#2087981*], link:https://issues.redhat.com/browse/OCPBUGS-954[*OCPBUGS-954*])
Clone Of:
Environment:
Last Closed: 2023-01-17 19:48:59 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-api-operator pull 1047 0 None open Bug 2087981: Change "create" sequence with powering on the vm after clone 2022-07-27 14:49:04 UTC
Red Hat Product Errata RHSA-2022:7399 0 None None None 2023-01-17 19:49:15 UTC

Description Matthew Robson 2022-05-18 15:24:50 UTC
Description of problem:

Sometimes we see VMs fail to power on when they land on a host that does not have enough resources. The current power-on does not retry or leverage DRS to power on the node on a suitable host.

https://github.com/vmware/govmomi/issues/1026

Our code still calls PowerOnVM_Task, which, according to the vSphere docs, is deprecated; we should use PowerOnMultiVM_Task instead.

PowerOnVM_Task does not return a DRS ClusterRecommendation, no vmotion nor host power operations will be done as part of a DRS-facilitated power on. To have DRS consider such operations use PowerOnMultiVM_Task.

https://vdc-download.vmware.com/vmwb-repository/dcr-public/b50dcbbf-051d-4204-a3e7-e1b618c1e384/538cf2ec-b34f-4bae-a332-3820ef9e7773/vim.VirtualMachine.html#powerOn:
As of vSphere API 5.1, use of this method with vCenter Server is deprecated; use PowerOnMultiVM_Task instead. 


Version-Release number of selected component (if applicable):
4.8.x

How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:
Power-on sometimes fails, requiring manual intervention.

Expected results:
PowerOn should use DRS to ensure it's always successful.

Additional info:

Comment 1 Joseph Callen 2022-05-31 18:23:36 UTC
After coming across this issue I started to research what the terraform provider currently does. It also only uses "PowerOnVM_Task". 
The concern with PowerOnMultiVM_Task would be a DRS cluster that is set to manual. As stated in the SDK doc:

> If any virtual machine in the list is manually managed by DRS, or DRS has to migrate any manually managed virtual machine or power on any manually managed host in order to power on these virtual machines, a DRS recommendation will be generated, and 
> *** the users need to manually apply the recommendation for actually powering on these virtual machines. ***

^ via vCenter UI the end user would need to interact for that virtual machine to be powered on.

There are options that you can provide to PowerOnMultiVM_Task to override the current cluster DRS configuration. The only example of this I found was here:

https://github.com/vmware/vic/blob/6c70cfedabb689d0b97bfa32ba3bc92119ec4860/pkg/vsphere/vm/vm.go#L966-L1004
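As a rough illustration of what such an override might look like, the sketch below builds the option list with local stand-in types rather than the govmomi ones. The key `OverrideAutomationLevel` and the value `fullyAutomated` come from the vSphere API's ClusterPowerOnVmOption and DrsBehavior enums; the `optionValue` struct and the `drsOverrideOptions` helper are hypothetical names that only mirror the shape of the options passed alongside PowerOnMultiVM_Task.

```go
package main

import "fmt"

// optionValue is a local stand-in for the key/value option type that
// PowerOnMultiVM_Task accepts. It is NOT the govmomi type.
type optionValue struct {
	Key   string
	Value string
}

// drsOverrideOptions (hypothetical helper) returns options asking DRS to
// handle this power-on as fully automated, whatever automation level the
// cluster itself is configured with.
func drsOverrideOptions() []optionValue {
	return []optionValue{
		{Key: "OverrideAutomationLevel", Value: "fullyAutomated"},
	}
}

func main() {
	for _, o := range drsOverrideOptions() {
		fmt.Printf("%s=%s\n", o.Key, o.Value)
	}
}
```

Whether overriding a cluster that an administrator deliberately set to manual is acceptable is a policy question this sketch does not answer.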

Comment 2 Joseph Callen 2022-06-01 19:40:51 UTC
Unless I am mistaken it doesn't look like MAO uses the method:

reconciler.go:func (vm *virtualMachine) powerOnVM() (string, error) {
or
PowerOnVM_Task

The virtual machines instead are powered on with the cloning operation, see the spec.

    spec := types.VirtualMachineCloneSpec{
        Config: &types.VirtualMachineConfigSpec{
            Annotation: s.machine.GetName(),
            // Assign the clone's InstanceUUID the value of the Kubernetes Machine
            // object's UID. This allows lookup of the cloned VM prior to knowing
            // the VM's UUID.
            InstanceUuid:      string(s.machine.UID),
            Flags:             newVMFlagInfo(),
            ExtraConfig:       extraConfig,
            DeviceChange:      deviceSpecs,
            NumCPUs:           numCPUs,
            NumCoresPerSocket: numCoresPerSocket,
            MemoryMB:          s.providerSpec.MemoryMiB,
        },
        Location: types.VirtualMachineRelocateSpec{
            Datastore:    types.NewReference(datastore.Reference()),
            Folder:       types.NewReference(folder.Reference()),
            Pool:         types.NewReference(resourcepool.Reference()),
            DiskMoveType: diskMoveType,
        },
        PowerOn:  true, // <-- the VM is powered on as part of the clone
        Snapshot: snapshotRef,
    }

I do have an open question out for the VMware team in regards to PowerOnVM_Task vs. PowerOnMultiVM_Task.
Based on testing, it looks like DRS does move the guest before power-on, which is reported via events.

Started a branch for the terraform provider change. Will wait for VMware's response before submitting.
https://github.com/jcpowermac/terraform-provider-vsphere/tree/use_poweronmultivm_task

Comment 3 Joel Speed 2022-06-16 13:40:32 UTC
Denis is going to explore the idea of moving to a clone with PowerOn: false and then powering on later using the new PowerOnMultiVM_Task instead. Expect an update within the next few weeks.
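A minimal sketch of that sequence, under stated assumptions: `vmBackend`, `createAndPowerOn`, and `flakyBackend` are hypothetical names, and the interface stands in for the govmomi calls the real controller makes. The in-process retry loop is only for illustration; an actual controller would more likely retry across reconcile passes.

```go
package main

import (
	"errors"
	"fmt"
)

// vmBackend is a hypothetical abstraction over the vSphere calls.
type vmBackend interface {
	Clone(name string, powerOn bool) error
	PowerOn(name string) error
}

// createAndPowerOn clones with powerOn=false, then powers the VM on as a
// separate step so that failures are visible to the caller and can be retried.
func createAndPowerOn(b vmBackend, name string, maxAttempts int) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	if err := b.Clone(name, false); err != nil {
		return fmt.Errorf("clone %s: %w", name, err)
	}
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if lastErr = b.PowerOn(name); lastErr == nil {
			return nil
		}
	}
	return fmt.Errorf("power on %s after %d attempts: %w", name, maxAttempts, lastErr)
}

// flakyBackend is a fake that fails the first `failures` power-on calls.
type flakyBackend struct {
	failures     int
	powerOnCalls int
	clonedOn     bool
}

func (f *flakyBackend) Clone(name string, powerOn bool) error {
	f.clonedOn = powerOn // record whether the clone asked for power-on
	return nil
}

func (f *flakyBackend) PowerOn(name string) error {
	f.powerOnCalls++
	if f.powerOnCalls <= f.failures {
		return errors.New("insufficient resources on host")
	}
	return nil
}

func main() {
	b := &flakyBackend{failures: 2}
	err := createAndPowerOn(b, "worker-0", 3)
	fmt.Println("error:", err, "power-on calls:", b.powerOnCalls)
}
```

Separating the two steps means a failed power-on surfaces as an error the caller can act on, instead of leaving behind a task nobody is watching.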

Comment 4 Joel Speed 2022-07-14 15:28:50 UTC
We are scheduling time to look into this next sprint

Comment 5 dmoiseev 2022-07-27 14:48:38 UTC
Setting the target to 4.12 so that the fix can be merged. Let's consider a backport after this bug is verified.

Comment 7 Huali Liu 2022-08-05 08:33:42 UTC
Tried several times: created new machines by scaling a machineset, and all machines reached Running and powered on successfully. Moving this to Verified.

liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-08-05-000006   True        False         78m     Cluster version is 4.12.0-0.nightly-2022-08-05-000006
liuhuali@Lius-MacBook-Pro huali-test % oc get machine       
NAME                             PHASE     TYPE   REGION   ZONE   AGE
huliu-vs12a-tq5sp-master-0       Running                          106m
huliu-vs12a-tq5sp-master-1       Running                          106m
huliu-vs12a-tq5sp-master-2       Running                          106m
huliu-vs12a-tq5sp-worker-545k8   Running                          16m
huliu-vs12a-tq5sp-worker-7jvb6   Running                          16m
huliu-vs12a-tq5sp-worker-88mwq   Running                          16m
huliu-vs12a-tq5sp-worker-8pg7p   Running                          16m
huliu-vs12a-tq5sp-worker-926g5   Running                          16m
huliu-vs12a-tq5sp-worker-9h5hl   Running                          16m
huliu-vs12a-tq5sp-worker-cjlwl   Running                          16m
huliu-vs12a-tq5sp-worker-g7q5r   Running                          16m
huliu-vs12a-tq5sp-worker-h7nbb   Running                          16m
huliu-vs12a-tq5sp-worker-j7s5v   Running                          16m
huliu-vs12a-tq5sp-worker-jshlr   Running                          16m
huliu-vs12a-tq5sp-worker-mc8c2   Running                          16m
huliu-vs12a-tq5sp-worker-ncckc   Running                          16m
huliu-vs12a-tq5sp-worker-rbbck   Running                          50m
huliu-vs12a-tq5sp-worker-ss2lz   Running                          16m
huliu-vs12a-tq5sp-worker-t9nlk   Running                          16m
huliu-vs12a-tq5sp-worker-twkd7   Running                          16m
huliu-vs12a-tq5sp-worker-wzbvb   Running                          16m
huliu-vs12a-tq5sp-worker-x7mfc   Running                          60m
huliu-vs12a-tq5sp-worker-xtm8r   Running                          16m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine -o yaml |grep instanceState
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
      instanceState: poweredOn
liuhuali@Lius-MacBook-Pro huali-test %
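
The same check could be scripted instead of read by eye. The sketch below is an assumption-laden helper, not part of any tooling mentioned in this bug: `notPoweredOn` is a hypothetical function that scans text shaped like the `grep instanceState` output above and returns every state other than `poweredOn`.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// notPoweredOn scans lines such as "instanceState: poweredOn" and returns
// every state value other than poweredOn, in order of appearance.
// Lines that do not start with the instanceState key are ignored.
func notPoweredOn(output string) []string {
	const key = "instanceState:"
	var bad []string
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, key) {
			continue
		}
		state := strings.TrimSpace(strings.TrimPrefix(line, key))
		if state != "poweredOn" {
			bad = append(bad, state)
		}
	}
	return bad
}

func main() {
	sample := "      instanceState: poweredOn\n      instanceState: poweredOff\n"
	fmt.Println(notPoweredOn(sample))
}
```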

Comment 12 errata-xmlrpc 2023-01-17 19:48:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

