Bug 1952739

Summary: [vsphere] error powering on vm: blockUntilReadyForMethod: timed out while waiting for "PowerOnVM_Task" to become available for vm
Product: OpenShift Container Platform Reporter: jima
Component: InstallerAssignee: Brent Barbachem <bbarbach>
Installer sub component: openshift-installer QA Contact: jima
Status: CLOSED WONTFIX Docs Contact:
Severity: low    
Priority: medium CC: bbarbach, jcallen, knarra, mbukatov, padillon, rvanderp, sanchezl, vavuthu, zhsun
Version: 4.8   
Target Milestone: ---   
Target Release: 4.12.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-07-21 14:49:10 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1981941    
Bug Blocks:    

Description jima 2021-04-23 02:48:59 UTC
Description of problem:
When installing ocp on VMC, sometimes detect that vm could not be powered on with below error:

09:40:12  Error: error powering on virtual machine: blockUntilReadyForMethod: timed out while waiting for "PowerOnVM_Task" to become available for vm vm-272369
09:40:12  
09:40:12    on upi_on_vsphere-terraform-scripts/vm/main.tf line 1, in resource "vsphere_virtual_machine" "vm":
09:40:12     1: resource "vsphere_virtual_machine" "vm" {

On vCenter GUI, this vm is powered off,  "power on" button is gray, unable to power on it manually, but can be booted up through govc command line.

$ govc vm.power -on=true -vm.uuid=422c3ce3-4bea-5de2-818c-81726f3b4b77
Powering on VirtualMachine:vm-272369... OK

Above error can also be searched in upstream CI job.
https://search.ci.openshift.org/?search=Error%3A+error+powering+on+virtual+machine%3A+blockUntilReadyForMethod%3A+timed+out+while+waiting+for+%22PowerOnVM_Task%22+to+become+available+for+vm&maxAge=336h&context=0&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Maybe related with this commit.
https://github.com/hashicorp/terraform-provider-vsphere/commit/235d3d3090cc6f6a153df2b281a6d31bc25920b7

Version-Release number of selected component (if applicable):
hit the issue on ocp 4.6/4.7/4.8 nightly build recently

How reproducible:
sometimes

Steps to Reproduce:
1. Install ocp on VMC
2.
3.

Actual results:
one node could not be powered on and installation is failed.

Expected results:
All nodes should be booted up.

Additional info:

Comment 1 Joseph Callen 2021-05-21 20:09:50 UTC
We moved to using the internal ipv4 address of vCenter in CI via DNS. I am wondering if that helped resolve this.
I have seen this error as well but not very often.

Either way I would think this is a performance problem of the underlying infra and nothing that can be fixed in the installer (unless additional retries could be added but since a customer hasn't hit probably not necessary).

Comment 2 Russell Teague 2021-07-12 17:50:44 UTC
Is this issue still being encountered or does the suggestion above help?

We will review this bug again and close if it is no longer an issue.

Comment 5 sunzhaohua 2021-07-30 01:34:30 UTC
Met this again when setup a 4.8.2 cluster
https://mastern-jenkins-csb-openshift-qe.apps.ocp4.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/32969/console

07-30 08:10:48.824  Error: error powering on virtual machine: blockUntilReadyForMethod: timed out while waiting for "PowerOnVM_Task" to become available for vm vm-444830
07-30 08:10:48.824  
07-30 08:10:48.824    on upi_on_vsphere-terraform-scripts/vm/main.tf line 1, in resource "vsphere_virtual_machine" "vm":
07-30 08:10:48.824     1: resource "vsphere_virtual_machine" "vm" {

Comment 6 Russell Teague 2021-08-02 17:35:48 UTC
Need to address, keeping same priority.

Comment 7 Russell Teague 2021-08-24 17:31:41 UTC
Waiting for terraform upgrade.

Comment 9 jima 2022-03-01 01:46:51 UTC
This issue is detected again in QE CI recently

03-01 03:40:57.982  level=error msg=Error: error powering on virtual machine: blockUntilReadyForMethod: timed out while waiting for "PowerOnVM_Task" to become available for vm vm-744331
03-01 03:40:57.983  level=error
03-01 03:40:57.983  level=error msg=  on ../../../../../../../../tmp/openshift-install-master-1900381675/main.tf line 12, in resource "vsphere_virtual_machine" "vm_master":
03-01 03:40:57.983  level=error msg=  12: resource "vsphere_virtual_machine" "vm_master" {
03-01 03:40:57.983  level=error
03-01 03:40:57.983  level=error
03-01 03:40:57.983  level=fatal msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply Terraform: failed to complete the change
03-01 03:40:57.983  [ERROR] Installation failed with error code '1'. Aborting execution.

Comment 12 jima 2022-03-02 01:52:24 UTC
met failures twice on 4.10 from QE side.
- 4.10.0-0.nightly-2022-02-24-034852
- 4.10.0-rc.6-x86_64

Searched in CI, failures are most on 4.10 and 4.11.
https://search.ci.openshift.org/?search=timed+out+while+waiting+for+%22PowerOnVM_Task%22+to+become+available+for+vm&maxAge=336h&context=0&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 13 Patrick Dillon 2022-04-12 15:50:45 UTC
vSphere Terraform Provider was updated in https://github.com/openshift/installer/pull/5694 which merged on March 16. 

This appears to still be failing:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/5636/pull-ci-openshift-installer-master-e2e-vsphere/1511639662169427968

Comment 20 RamaKasturi 2022-06-27 09:58:02 UTC
Met this issue today again on QE CI.

06-25 09:13:14.937  level=error msg=Error: error powering on virtual machine: blockUntilReadyForMethod: timed out while waiting for "PowerOnVM_Task" to become available for vm vm-872924
06-25 09:13:14.937  level=error
06-25 09:13:14.937  level=error msg=  with vsphere_virtual_machine.vm_bootstrap,
06-25 09:13:14.937  level=error msg=  on main.tf line 12, in resource "vsphere_virtual_machine" "vm_bootstrap":
06-25 09:13:14.937  level=error msg=  12: resource "vsphere_virtual_machine" "vm_bootstrap" {
06-25 09:13:14.937  level=error
06-25 09:13:14.937  level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failure applying terraform for "bootstrap" stage: failed to create cluster: failed to apply Terraform: exit status 1
06-25 09:13:14.937  level=error
06-25 09:13:14.937  level=error msg=Error: error powering on virtual machine: blockUntilReadyForMethod: timed out while waiting for "PowerOnVM_Task" to become available for vm vm-872924
06-25 09:13:14.937  level=error
06-25 09:13:14.937  level=error msg=  with vsphere_virtual_machine.vm_bootstrap,
06-25 09:13:14.937  level=error msg=  on main.tf line 12, in resource "vsphere_virtual_machine" "vm_bootstrap":
06-25 09:13:14.938  level=error msg=  12: resource "vsphere_virtual_machine" "vm_bootstrap" {
06-25 09:13:14.938  level=error
06-25 09:13:14.938  level=error
06-25 09:13:14.938  [ERROR] Installation failed with error code '4'. Aborting execution.
06-25 09:13:14.938  tools/launch_instance.rb:624:in `installation_task': shell command failed execution, see logs (RuntimeError)
06-25 09:13:14.938  	from tools/launch_instance.rb:763:in `block in launch_template'
06-25 09:13:14.938  	from tools/launch_instance.rb:762:in `each'
06-25 09:13:14.938  	from tools/launch_instance.rb:762:in `launch_template'
06-25 09:13:14.938  	from tools/launch_instance.rb:55:in `block (2 levels) in run'
06-25 09:13:14.938  	from /home/jenkins/.gem/bundler/ruby/2.7.0/gems/commander-4.6.0/lib/commander/command.rb:187:in `call'
06-25 09:13:14.938  	from /home/jenkins/.gem/bundler/ruby/2.7.0/gems/commander-4.6.0/lib/commander/command.rb:157:in `run'
06-25 09:13:14.938  	from /home/jenkins/.gem/bundler/ruby/2.7.0/gems/commander-4.6.0/lib/commander/runner.rb:444:in `run_active_command'
06-25 09:13:14.938  	from /home/jenkins/.gem/bundler/ruby/2.7.0/gems/commander-4.6.0/lib/commander/runner.rb:58:in `run!'
06-25 09:13:14.938  	from /home/jenkins/.gem/bundler/ruby/2.7.0/gems/commander-4.6.0/lib/commander/delegates.rb:18:in `run!'
06-25 09:13:14.938  	from tools/launch_instance.rb:92:in `run'
06-25 09:13:14.938  	from tools/launch_instance.rb:904:in `<main>'
06-25 09:13:14.938  waiting for operation up to 36000 seconds..

Comment 21 Joseph Callen 2022-06-27 14:16:39 UTC
** Everyone that has experienced this issue please read below **

1.) This is a very random and rare issue
2.) Providing log snip-its of the installer tells us nothing new

3.) Logs required:

- Which vSphere environment is being used for this install?
- vCenter/ESXi logs see https://kb.vmware.com/s/article/653, vcenter and the esxi host where the virtual machine resides
- The virtual machine that failed to start - logs (directory where virtual machine exists on datastore)
- Full .openshift-install.log
- install-config.yaml

Without the above information it will be very difficult to determine the vcenter/esxi specific cause and a resolution.

Comment 23 Patrick Dillon 2022-07-21 14:49:10 UTC
Closing this as WONTFIX. There is not much the installer can do if the vSphere environment's VMs will not power on. 

We considered adding a bump to the poweron timer, but the default is already 5 minutes, which is essentially a lifetime. We have seen this failure where VMs don't power on for hours, so we don't want to obscure the underlying issue.

The error message from terraform is already relatively clear.