1952739 – [vsphere] error powering on vm: blockUntilReadyForMethod: timed out while waiting for "PowerOnVM_Task" to become available for vm

Bug 1952739 - [vsphere] error powering on vm: blockUntilReadyForMethod: timed out while waiting for "PowerOnVM_Task" to become available for vm

Summary: [vsphere] error powering on vm: blockUntilReadyForMethod: timed out while wai...

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Installer
Sub Component:
Version:	4.8
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	low
Target Milestone:	---
Target Release:	4.12.0
Assignee:	Brent Barbachem
QA Contact:	jima
Docs Contact:
URL:
Whiteboard:
Depends On:	1981941
Blocks:
TreeView+	depends on / blocked

Reported:	2021-04-23 02:48 UTC by jima
Modified:	2022-11-14 19:19 UTC (History)
CC List:	9 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-07-21 14:49:10 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Description jima 2021-04-23 02:48:59 UTC

Description of problem:
When installing ocp on VMC, sometimes detect that vm could not be powered on with below error:

09:40:12  Error: error powering on virtual machine: blockUntilReadyForMethod: timed out while waiting for "PowerOnVM_Task" to become available for vm vm-272369
09:40:12  
09:40:12    on upi_on_vsphere-terraform-scripts/vm/main.tf line 1, in resource "vsphere_virtual_machine" "vm":
09:40:12     1: resource "vsphere_virtual_machine" "vm" {

On vCenter GUI, this vm is powered off,  "power on" button is gray, unable to power on it manually, but can be booted up through govc command line.

$ govc vm.power -on=true -vm.uuid=422c3ce3-4bea-5de2-818c-81726f3b4b77
Powering on VirtualMachine:vm-272369... OK

Above error can also be searched in upstream CI job.
https://search.ci.openshift.org/?search=Error%3A+error+powering+on+virtual+machine%3A+blockUntilReadyForMethod%3A+timed+out+while+waiting+for+%22PowerOnVM_Task%22+to+become+available+for+vm&maxAge=336h&context=0&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Maybe related with this commit.
https://github.com/hashicorp/terraform-provider-vsphere/commit/235d3d3090cc6f6a153df2b281a6d31bc25920b7

Version-Release number of selected component (if applicable):
hit the issue on ocp 4.6/4.7/4.8 nightly build recently

How reproducible:
sometimes

Steps to Reproduce:
1. Install ocp on VMC
2.
3.

Actual results:
one node could not be powered on and installation is failed.

Expected results:
All nodes should be booted up.

Additional info:

Comment 1 Joseph Callen 2021-05-21 20:09:50 UTC

We moved to using the internal ipv4 address of vCenter in CI via DNS. I am wondering if that helped resolve this.
I have seen this error as well but not very often.

Either way I would think this is a performance problem of the underlying infra and nothing that can be fixed in the installer (unless additional retries could be added but since a customer hasn't hit probably not necessary).

Comment 2 Russell Teague 2021-07-12 17:50:44 UTC

Is this issue still being encountered or does the suggestion above help?

We will review this bug again and close if it is no longer an issue.

Comment 5 sunzhaohua 2021-07-30 01:34:30 UTC

Met this again when setup a 4.8.2 cluster
https://mastern-jenkins-csb-openshift-qe.apps.ocp4.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/32969/console

07-30 08:10:48.824  Error: error powering on virtual machine: blockUntilReadyForMethod: timed out while waiting for "PowerOnVM_Task" to become available for vm vm-444830
07-30 08:10:48.824  
07-30 08:10:48.824    on upi_on_vsphere-terraform-scripts/vm/main.tf line 1, in resource "vsphere_virtual_machine" "vm":
07-30 08:10:48.824     1: resource "vsphere_virtual_machine" "vm" {

Comment 6 Russell Teague 2021-08-02 17:35:48 UTC

Need to address, keeping same priority.

Comment 7 Russell Teague 2021-08-24 17:31:41 UTC

Waiting for terraform upgrade.

Comment 9 jima 2022-03-01 01:46:51 UTC

This issue is detected again in QE CI recently

03-01 03:40:57.982  level=error msg=Error: error powering on virtual machine: blockUntilReadyForMethod: timed out while waiting for "PowerOnVM_Task" to become available for vm vm-744331
03-01 03:40:57.983  level=error
03-01 03:40:57.983  level=error msg=  on ../../../../../../../../tmp/openshift-install-master-1900381675/main.tf line 12, in resource "vsphere_virtual_machine" "vm_master":
03-01 03:40:57.983  level=error msg=  12: resource "vsphere_virtual_machine" "vm_master" {
03-01 03:40:57.983  level=error
03-01 03:40:57.983  level=error
03-01 03:40:57.983  level=fatal msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply Terraform: failed to complete the change
03-01 03:40:57.983  [ERROR] Installation failed with error code '1'. Aborting execution.

Comment 12 jima 2022-03-02 01:52:24 UTC

met failures twice on 4.10 from QE side.
- 4.10.0-0.nightly-2022-02-24-034852
- 4.10.0-rc.6-x86_64

Searched in CI, failures are most on 4.10 and 4.11.
https://search.ci.openshift.org/?search=timed+out+while+waiting+for+%22PowerOnVM_Task%22+to+become+available+for+vm&maxAge=336h&context=0&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 13 Patrick Dillon 2022-04-12 15:50:45 UTC

vSphere Terraform Provider was updated in https://github.com/openshift/installer/pull/5694 which merged on March 16. 

This appears to still be failing:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/5636/pull-ci-openshift-installer-master-e2e-vsphere/1511639662169427968

Comment 20 RamaKasturi 2022-06-27 09:58:02 UTC

Met this issue today again on QE CI.

06-25 09:13:14.937  level=error msg=Error: error powering on virtual machine: blockUntilReadyForMethod: timed out while waiting for "PowerOnVM_Task" to become available for vm vm-872924
06-25 09:13:14.937  level=error
06-25 09:13:14.937  level=error msg=  with vsphere_virtual_machine.vm_bootstrap,
06-25 09:13:14.937  level=error msg=  on main.tf line 12, in resource "vsphere_virtual_machine" "vm_bootstrap":
06-25 09:13:14.937  level=error msg=  12: resource "vsphere_virtual_machine" "vm_bootstrap" {
06-25 09:13:14.937  level=error
06-25 09:13:14.937  level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failure applying terraform for "bootstrap" stage: failed to create cluster: failed to apply Terraform: exit status 1
06-25 09:13:14.937  level=error
06-25 09:13:14.937  level=error msg=Error: error powering on virtual machine: blockUntilReadyForMethod: timed out while waiting for "PowerOnVM_Task" to become available for vm vm-872924
06-25 09:13:14.937  level=error
06-25 09:13:14.937  level=error msg=  with vsphere_virtual_machine.vm_bootstrap,
06-25 09:13:14.937  level=error msg=  on main.tf line 12, in resource "vsphere_virtual_machine" "vm_bootstrap":
06-25 09:13:14.938  level=error msg=  12: resource "vsphere_virtual_machine" "vm_bootstrap" {
06-25 09:13:14.938  level=error
06-25 09:13:14.938  level=error
06-25 09:13:14.938  [ERROR] Installation failed with error code '4'. Aborting execution.
06-25 09:13:14.938  tools/launch_instance.rb:624:in `installation_task': shell command failed execution, see logs (RuntimeError)
06-25 09:13:14.938  	from tools/launch_instance.rb:763:in `block in launch_template'
06-25 09:13:14.938  	from tools/launch_instance.rb:762:in `each'
06-25 09:13:14.938  	from tools/launch_instance.rb:762:in `launch_template'
06-25 09:13:14.938  	from tools/launch_instance.rb:55:in `block (2 levels) in run'
06-25 09:13:14.938  	from /home/jenkins/.gem/bundler/ruby/2.7.0/gems/commander-4.6.0/lib/commander/command.rb:187:in `call'
06-25 09:13:14.938  	from /home/jenkins/.gem/bundler/ruby/2.7.0/gems/commander-4.6.0/lib/commander/command.rb:157:in `run'
06-25 09:13:14.938  	from /home/jenkins/.gem/bundler/ruby/2.7.0/gems/commander-4.6.0/lib/commander/runner.rb:444:in `run_active_command'
06-25 09:13:14.938  	from /home/jenkins/.gem/bundler/ruby/2.7.0/gems/commander-4.6.0/lib/commander/runner.rb:58:in `run!'
06-25 09:13:14.938  	from /home/jenkins/.gem/bundler/ruby/2.7.0/gems/commander-4.6.0/lib/commander/delegates.rb:18:in `run!'
06-25 09:13:14.938  	from tools/launch_instance.rb:92:in `run'
06-25 09:13:14.938  	from tools/launch_instance.rb:904:in `<main>'
06-25 09:13:14.938  waiting for operation up to 36000 seconds..

Comment 21 Joseph Callen 2022-06-27 14:16:39 UTC

** Everyone that has experienced this issue please read below **

1.) This is a very random and rare issue
2.) Providing log snip-its of the installer tells us nothing new

3.) Logs required:

- Which vSphere environment is being used for this install?
- vCenter/ESXi logs see https://kb.vmware.com/s/article/653, vcenter and the esxi host where the virtual machine resides
- The virtual machine that failed to start - logs (directory where virtual machine exists on datastore)
- Full .openshift-install.log
- install-config.yaml

Without the above information it will be very difficult to determine the vcenter/esxi specific cause and a resolution.

Comment 23 Patrick Dillon 2022-07-21 14:49:10 UTC

Closing this as WONTFIX. There is not much the installer can do if the vSphere environment's VMs will not power on. 

We considered adding a bump to the poweron timer, but the default is already 5 minutes, which is essentially a lifetime. We have seen this failure where VMs don't power on for hours, so we don't want to obscure the underlying issue.

The error message from terraform is already relatively clear.

Note You need to log in before you can comment on or make changes to this bug.