Description of problem: When installing ocp on VMC, sometimes detect that vm could not be powered on with below error: 09:40:12 Error: error powering on virtual machine: blockUntilReadyForMethod: timed out while waiting for "PowerOnVM_Task" to become available for vm vm-272369 09:40:12 09:40:12 on upi_on_vsphere-terraform-scripts/vm/main.tf line 1, in resource "vsphere_virtual_machine" "vm": 09:40:12 1: resource "vsphere_virtual_machine" "vm" { On vCenter GUI, this vm is powered off, "power on" button is gray, unable to power on it manually, but can be booted up through govc command line. $ govc vm.power -on=true -vm.uuid=422c3ce3-4bea-5de2-818c-81726f3b4b77 Powering on VirtualMachine:vm-272369... OK Above error can also be searched in upstream CI job. https://search.ci.openshift.org/?search=Error%3A+error+powering+on+virtual+machine%3A+blockUntilReadyForMethod%3A+timed+out+while+waiting+for+%22PowerOnVM_Task%22+to+become+available+for+vm&maxAge=336h&context=0&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job Maybe related with this commit. https://github.com/hashicorp/terraform-provider-vsphere/commit/235d3d3090cc6f6a153df2b281a6d31bc25920b7 Version-Release number of selected component (if applicable): hit the issue on ocp 4.6/4.7/4.8 nightly build recently How reproducible: sometimes Steps to Reproduce: 1. Install ocp on VMC 2. 3. Actual results: one node could not be powered on and installation is failed. Expected results: All nodes should be booted up. Additional info:
We moved to using the internal ipv4 address of vCenter in CI via DNS. I am wondering if that helped resolve this. I have seen this error as well but not very often. Either way I would think this is a performance problem of the underlying infra and nothing that can be fixed in the installer (unless additional retries could be added but since a customer hasn't hit probably not necessary).
Is this issue still being encountered or does the suggestion above help? We will review this bug again and close if it is no longer an issue.
Met this again when setup a 4.8.2 cluster https://mastern-jenkins-csb-openshift-qe.apps.ocp4.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/32969/console 07-30 08:10:48.824 Error: error powering on virtual machine: blockUntilReadyForMethod: timed out while waiting for "PowerOnVM_Task" to become available for vm vm-444830 07-30 08:10:48.824 07-30 08:10:48.824 on upi_on_vsphere-terraform-scripts/vm/main.tf line 1, in resource "vsphere_virtual_machine" "vm": 07-30 08:10:48.824 1: resource "vsphere_virtual_machine" "vm" {
Need to address, keeping same priority.
Waiting for terraform upgrade.
This issue is detected again in QE CI recently 03-01 03:40:57.982 level=error msg=Error: error powering on virtual machine: blockUntilReadyForMethod: timed out while waiting for "PowerOnVM_Task" to become available for vm vm-744331 03-01 03:40:57.983 level=error 03-01 03:40:57.983 level=error msg= on ../../../../../../../../tmp/openshift-install-master-1900381675/main.tf line 12, in resource "vsphere_virtual_machine" "vm_master": 03-01 03:40:57.983 level=error msg= 12: resource "vsphere_virtual_machine" "vm_master" { 03-01 03:40:57.983 level=error 03-01 03:40:57.983 level=error 03-01 03:40:57.983 level=fatal msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply Terraform: failed to complete the change 03-01 03:40:57.983 [ERROR] Installation failed with error code '1'. Aborting execution.
met failures twice on 4.10 from QE side. - 4.10.0-0.nightly-2022-02-24-034852 - 4.10.0-rc.6-x86_64 Searched in CI, failures are most on 4.10 and 4.11. https://search.ci.openshift.org/?search=timed+out+while+waiting+for+%22PowerOnVM_Task%22+to+become+available+for+vm&maxAge=336h&context=0&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
vSphere Terraform Provider was updated in https://github.com/openshift/installer/pull/5694 which merged on March 16. This appears to still be failing: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/5636/pull-ci-openshift-installer-master-e2e-vsphere/1511639662169427968
Met this issue today again on QE CI. 06-25 09:13:14.937 level=error msg=Error: error powering on virtual machine: blockUntilReadyForMethod: timed out while waiting for "PowerOnVM_Task" to become available for vm vm-872924 06-25 09:13:14.937 level=error 06-25 09:13:14.937 level=error msg= with vsphere_virtual_machine.vm_bootstrap, 06-25 09:13:14.937 level=error msg= on main.tf line 12, in resource "vsphere_virtual_machine" "vm_bootstrap": 06-25 09:13:14.937 level=error msg= 12: resource "vsphere_virtual_machine" "vm_bootstrap" { 06-25 09:13:14.937 level=error 06-25 09:13:14.937 level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failure applying terraform for "bootstrap" stage: failed to create cluster: failed to apply Terraform: exit status 1 06-25 09:13:14.937 level=error 06-25 09:13:14.937 level=error msg=Error: error powering on virtual machine: blockUntilReadyForMethod: timed out while waiting for "PowerOnVM_Task" to become available for vm vm-872924 06-25 09:13:14.937 level=error 06-25 09:13:14.937 level=error msg= with vsphere_virtual_machine.vm_bootstrap, 06-25 09:13:14.937 level=error msg= on main.tf line 12, in resource "vsphere_virtual_machine" "vm_bootstrap": 06-25 09:13:14.938 level=error msg= 12: resource "vsphere_virtual_machine" "vm_bootstrap" { 06-25 09:13:14.938 level=error 06-25 09:13:14.938 level=error 06-25 09:13:14.938 [ERROR] Installation failed with error code '4'. Aborting execution. 06-25 09:13:14.938 tools/launch_instance.rb:624:in `installation_task': shell command failed execution, see logs (RuntimeError) 06-25 09:13:14.938 from tools/launch_instance.rb:763:in `block in launch_template' 06-25 09:13:14.938 from tools/launch_instance.rb:762:in `each' 06-25 09:13:14.938 from tools/launch_instance.rb:762:in `launch_template' 06-25 09:13:14.938 from tools/launch_instance.rb:55:in `block (2 levels) in run' 06-25 09:13:14.938 from /home/jenkins/.gem/bundler/ruby/2.7.0/gems/commander-4.6.0/lib/commander/command.rb:187:in `call' 06-25 09:13:14.938 from /home/jenkins/.gem/bundler/ruby/2.7.0/gems/commander-4.6.0/lib/commander/command.rb:157:in `run' 06-25 09:13:14.938 from /home/jenkins/.gem/bundler/ruby/2.7.0/gems/commander-4.6.0/lib/commander/runner.rb:444:in `run_active_command' 06-25 09:13:14.938 from /home/jenkins/.gem/bundler/ruby/2.7.0/gems/commander-4.6.0/lib/commander/runner.rb:58:in `run!' 06-25 09:13:14.938 from /home/jenkins/.gem/bundler/ruby/2.7.0/gems/commander-4.6.0/lib/commander/delegates.rb:18:in `run!' 06-25 09:13:14.938 from tools/launch_instance.rb:92:in `run' 06-25 09:13:14.938 from tools/launch_instance.rb:904:in `<main>' 06-25 09:13:14.938 waiting for operation up to 36000 seconds..
** Everyone that has experienced this issue please read below ** 1.) This is a very random and rare issue 2.) Providing log snip-its of the installer tells us nothing new 3.) Logs required: - Which vSphere environment is being used for this install? - vCenter/ESXi logs see https://kb.vmware.com/s/article/653, vcenter and the esxi host where the virtual machine resides - The virtual machine that failed to start - logs (directory where virtual machine exists on datastore) - Full .openshift-install.log - install-config.yaml Without the above information it will be very difficult to determine the vcenter/esxi specific cause and a resolution.
Closing this as WONTFIX. There is not much the installer can do if the vSphere environment's VMs will not power on. We considered adding a bump to the poweron timer, but the default is already 5 minutes, which is essentially a lifetime. We have seen this failure where VMs don't power on for hours, so we don't want to obscure the underlying issue. The error message from terraform is already relatively clear.