Description of the problem:
Issue described in https://bugzilla.redhat.com/show_bug.cgi?id=1862290 appears to affect ACM's installer on ACM 2.6 and OCP 4.11. vSphere provider for terraform fails to delete a vm when it was cloned from a thick provisioned template on a Nutanix storage backend.

Release version: 2.6.3
Operator snapshot version: 2.6.3
OCP version: 4.11.18
Browser Info: Firefox 108

Steps to reproduce:
1. Deploy a new cluster via ACM using the vsphere provider against a cluster with nutanix backed vmware storage.
2. Installation deploys cluster fine, but the job fails when it can't remove the bootstrap node.

Actual results: Installation failure on bootstrap cleanup.
Expected results: Installation success and import into acm management.

Additional info:
Logs from the hive container:

time="2023-01-18T17:24:29Z" level=debug msg="Bootstrap status: complete"
time="2023-01-18T17:24:29Z" level=info msg="Destroying the bootstrap resources..."
time="2023-01-18T17:24:29Z" level=debug msg="creating /output/terraform/bin/terraform file"
time="2023-01-18T17:24:29Z" level=debug msg="creating /output/terraform/plugins/openshift/local/vsphere directory"
time="2023-01-18T17:24:29Z" level=debug msg="creating /output/terraform/plugins/openshift/local/vsphere/terraform-provider-vsphere_1.0.0_linux_amd64.zip file"
time="2023-01-18T17:24:29Z" level=debug msg="creating /output/terraform/plugins/openshift/local/vsphereprivate directory"
time="2023-01-18T17:24:29Z" level=debug msg="creating /output/terraform/plugins/openshift/local/vsphereprivate/terraform-provider-vsphereprivate_1.0.0_linux_amd64.zip file"
time="2023-01-18T17:24:29Z" level=debug msg="[INFO] running Terraform command: /output/terraform/bin/terraform version -json"
time="2023-01-18T17:24:29Z" level=debug msg="{"
time="2023-01-18T17:24:29Z" level=debug msg=" \"terraform_version\": \"1.0.11\","
time="2023-01-18T17:24:29Z" level=debug msg=" \"platform\": \"linux_amd64\","
time="2023-01-18T17:24:29Z" level=debug msg=" \"provider_selections\": {},"
time="2023-01-18T17:24:29Z" level=debug msg=" \"terraform_outdated\": true"
time="2023-01-18T17:24:29Z" level=debug msg="}"
time="2023-01-18T17:24:29Z" level=debug msg="[INFO] running Terraform command: /output/terraform/bin/terraform init -no-color -force-copy -input=false -backend=true -get=true -upgrade=false -plugin-dir=/output/terraform/plugins"
time="2023-01-18T17:24:29Z" level=debug
time="2023-01-18T17:24:29Z" level=debug msg="Initializing the backend..."
time="2023-01-18T17:24:29Z" level=debug
time="2023-01-18T17:24:29Z" level=debug msg="Initializing provider plugins..."
time="2023-01-18T17:24:29Z" level=debug msg="- Finding latest version of openshift/local/vsphere..."
time="2023-01-18T17:24:29Z" level=debug msg="- Installing openshift/local/vsphere v1.0.0..."
time="2023-01-18T17:24:29Z" level=debug msg="- Installed openshift/local/vsphere v1.0.0 (unauthenticated)"
time="2023-01-18T17:24:29Z" level=debug
time="2023-01-18T17:24:29Z" level=debug msg="Terraform has created a lock file .terraform.lock.hcl to record the provider"
time="2023-01-18T17:24:29Z" level=debug msg="selections it made above. Include this file in your version control repository"
time="2023-01-18T17:24:29Z" level=debug msg="so that Terraform can guarantee to make the same selections by default when"
time="2023-01-18T17:24:29Z" level=debug msg="you run \"terraform init\" in the future."
time="2023-01-18T17:24:29Z" level=debug
time="2023-01-18T17:24:29Z" level=debug msg="Terraform has been successfully initialized!"
time="2023-01-18T17:24:29Z" level=debug msg="[INFO] running Terraform command: /output/terraform/bin/terraform destroy -no-color -auto-approve -input=false -lock-timeout=0s -var-file=/tmp/openshift-install-bootstrap-2782656941/terraform.tfvars.json -var-file=/tmp/openshift-install-bootstrap-2782656941/terraform.platform.auto.tfvars.json -var-file=/tmp/openshift-install-bootstrap-2782656941/pre-bootstrap.tfvars.json -var-file=/tmp/openshift-install-bootstrap-2782656941/bootstrap.tfvars.json -var-file=/tmp/openshift-install-bootstrap-2782656941/master.tfvars.json -lock=true -parallelism=10 -refresh=true"
time="2023-01-18T17:24:31Z" level=debug msg="vsphere_virtual_machine.vm_bootstrap: Refreshing state... [id=421986c6-c60c-6c6f-024c-a41506351310]"
time="2023-01-18T17:24:31Z" level=error
time="2023-01-18T17:24:31Z" level=error msg="Error: disk.0: virtual disk \"disk0\": cannot change the value of \"thin_provisioned\" - (old: true newValue: false)"
time="2023-01-18T17:24:31Z" level=error
time="2023-01-18T17:24:31Z" level=error msg=" with vsphere_virtual_machine.vm_bootstrap,"
time="2023-01-18T17:24:31Z" level=error msg=" on main.tf line 12, in resource \"vsphere_virtual_machine\" \"vm_bootstrap\":"
time="2023-01-18T17:24:31Z" level=error msg=" 12: resource \"vsphere_virtual_machine\" \"vm_bootstrap\" {"
time="2023-01-18T17:24:31Z" level=error
time="2023-01-18T17:24:31Z" level=fatal msg="terraform destroy: failed doing terraform destroy: exit status 1\n\nError: disk.0: virtual disk \"disk0\": cannot change the value of \"thin_provisioned\" - (old: true newValue: false)\n\n with vsphere_virtual_machine.vm_bootstrap,\n on main.tf line 12, in resource \"vsphere_virtual_machine\" \"vm_bootstrap\":\n 12: resource \"vsphere_virtual_machine\" \"vm_bootstrap\" {\n\n"
time="2023-01-18T17:24:32Z" level=error msg="error after waiting for command completion" error="exit status 1" installID=mpwn4n4j
time="2023-01-18T17:24:32Z" level=error msg="error provisioning cluster" error="exit status 1" installID=mpwn4n4j
time="2023-01-18T17:24:32Z" level=error msg="error running openshift-install, running deprovision to clean up" error="exit status 1" installID=mpwn4n4j
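For anyone triaging a similar hive log, the error and fatal entries can be pulled out of the key=value lines with a few lines of Python. This is only a sketch: the helper names `parse_logfmt` and `error_messages` are made up for illustration, and it assumes every entry sits on its own line in the `time=... level=... msg="..."` shape shown in the log above.

```python
import shlex


def parse_logfmt(line):
    """Split one key=value log line (values may be double-quoted) into a dict."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # ignore bare tokens that carry no '='
            fields[key] = value
    return fields


def error_messages(lines):
    """Return the non-empty msg fields of entries logged at error or fatal level."""
    messages = []
    for line in lines:
        entry = parse_logfmt(line)
        msg = entry.get("msg", "").strip()
        if entry.get("level") in ("error", "fatal") and msg:
            messages.append(msg)
    return messages


# Three lines copied from the hive container log in this report.
sample = [
    r'time="2023-01-18T17:24:29Z" level=info msg="Destroying the bootstrap resources..."',
    r'time="2023-01-18T17:24:31Z" level=error',
    r'time="2023-01-18T17:24:31Z" level=error msg="Error: disk.0: virtual disk \"disk0\": cannot change the value of \"thin_provisioned\" - (old: true newValue: false)"',
]

for message in error_messages(sample):
    print(message)
```

`shlex.split` is used here only because it already understands double-quoted values with escaped quotes; a dedicated logfmt parser would be a reasonable alternative.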
Retagging to cluster-lifecycle as it relates to cluster provisioning
@efried Could you help to take a look?
This falls squarely in the installer's wheelhouse. @padillon could you find a pair of eyes for this please?
@tyler.bevan I have created a discussion about this issue in the openshift-installer channel: https://redhat-internal.slack.com/archives/CH06KMDRV/p1675213341255269 Could you try the workaround suggested there? "Change install-config diskType to thin" "Making the assumption there is a storage policy that is forcing thin underneath terraform. Which it does not support well" Note: In an ACM environment, to update the install-config you need to update the secret named <Cluster Namespace>/<Cluster Namespace>-install-config
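A rough sketch of the decode and re-encode step for that secret, in Python. It assumes the secret has been exported as JSON and loaded into a dict, and that the install-config is stored base64-encoded under the data key `install-config.yaml`; that key name, the helper names, and the sample values are assumptions for illustration, so check the actual secret before relying on them.

```python
import base64

# Assumption: the Secret keeps the install-config under this data key.
INSTALL_CONFIG_KEY = "install-config.yaml"


def read_install_config(secret):
    """Decode the install-config text from a Secret manifest given as a dict."""
    encoded = secret["data"][INSTALL_CONFIG_KEY]
    return base64.b64decode(encoded).decode("utf-8")


def write_install_config(secret, text):
    """Return a copy of the Secret with the install-config replaced by text."""
    updated = dict(secret)
    updated["data"] = dict(secret["data"])
    updated["data"][INSTALL_CONFIG_KEY] = base64.b64encode(
        text.encode("utf-8")
    ).decode("ascii")
    return updated


# Hypothetical secret contents, standing in for the exported manifest.
original_text = "platform:\n  vsphere:\n    cluster: example\n"
secret = {
    "metadata": {"name": "ocp-lab-install-config", "namespace": "ocp-lab"},
    "data": {
        INSTALL_CONFIG_KEY: base64.b64encode(original_text.encode("utf-8")).decode("ascii")
    },
}

# Append the platform-level setting and re-encode it into a new manifest.
patched = write_install_config(secret, read_install_config(secret) + "    diskType: thin\n")
print(read_install_config(patched))
```

The patched dict would still need to be applied back to the cluster by whatever means is normally used to edit secrets; that part is left out on purpose.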
An install attempt using "diskType: thin" on both machine pools did not resolve the issue. The error message was identical to the original log. I don't see any way in the install-config documentation to customize the bootstrap VM's specifications, which seems to be where the failure is. As a side note, the documentation at https://github.com/openshift/installer/blob/master/docs/user/vsphere/customization.md#machine-pools is a bit unclear on whether the key is disk_type or diskType. I did test it both ways to be sure. Tyler
For context here is the sanitized install-config:
---
apiVersion: v1
metadata:
  name: ocp-lab
baseDomain: xxxxxxxxxx
controlPlane:
  name: master
  architecture: amd64
  hyperthreading: Enabled
  replicas: 3
  platform:
    vsphere:
      cpus: 8
      coresPerSocket: 8
      memoryMB: 16384
      osDisk:
        diskSizeGB: 120
        disk_type: thin
compute:
- name: worker
  hyperthreading: Enabled
  architecture: amd64
  replicas: 0
  platform:
    vsphere:
      cpus: 4
      coresPerSocket: 2
      memoryMB: 16384
      osDisk:
        diskSizeGB: 120
        disk_type: thin
platform:
  vsphere:
    vCenter: xxxxxxxxxxxx
    username: xxxxxxxxxx
    password: xxxxxxxxxxxxxxxxxxx
    datacenter: xxxxxxxxxxxxxxx
    defaultDatastore: xxxxxxxxxxxx
    folder: /xxxxxxxx/vm/Openshift/ocp-lab
    cluster: xxxxxxxx
    apiVIP: 10.10.x.x
    ingressVIP: 10.10.x.x
    network: xxxxxxxxxxxxxxxxx
@jcallen Any more suggestions?
Sorry, I didn't see that the reporter was not an RH employee; repeating what I already stated in a private comment...
The install-config was incorrect: diskType is not set at the machine pool, it is in the platform spec:
https://github.com/openshift/installer/blob/master/pkg/types/vsphere/platform.go#L92
e.g.
platform:
  vsphere:
    vCenter: xxxxxxxxxxxx
    username: xxxxxxxxxx
    password: xxxxxxxxxxxxxxxxxxx
    datacenter: xxxxxxxxxxxxxxx
    defaultDatastore: xxxxxxxxxxxx
    folder: /xxxxxxxx/vm/Openshift/ocp-lab
    cluster: xxxxxxxx
    apiVIP: 10.10.x.x
    ingressVIP: 10.10.x.x
    network: xxxxxxxxxxxxxxxxx
    diskType: thin
If this still fails we will need a bug created in Jira and assigned to the installer. Please link the bug here. We previously had issues with the vsphere terraform provider, but I had thought those were resolved. This is also probably related to: https://kb.vmware.com/s/article/68107 Changing the state of objects that terraform created is a good way of having a failure.
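The placement rule above (diskType under platform.vsphere, not inside a machine pool) is easy to get wrong, so here is a small Python sketch that flags a misplaced key. It works on an install-config that has already been loaded into a dict; the function name `find_disk_type_problems` and the sample values are hypothetical, and this is not how the installer itself validates the file.

```python
def find_disk_type_problems(install_config):
    """List places where diskType/disk_type is missing or set in a machine pool."""
    problems = []

    # The cluster-wide setting is expected under platform.vsphere.
    platform = (install_config.get("platform") or {}).get("vsphere") or {}
    if "diskType" not in platform:
        problems.append("platform.vsphere.diskType is not set")

    # Collect the control plane pool and every compute pool.
    pools = [("controlPlane", install_config.get("controlPlane") or {})]
    for index, pool in enumerate(install_config.get("compute") or []):
        pools.append((f"compute[{index}]", pool))

    # Look for either spelling, directly under vsphere or under osDisk.
    for name, pool in pools:
        vsphere = (pool.get("platform") or {}).get("vsphere") or {}
        os_disk = vsphere.get("osDisk") or {}
        for key in ("diskType", "disk_type"):
            if key in vsphere or key in os_disk:
                problems.append(
                    f"{name}: {key} belongs under platform.vsphere, not in the machine pool"
                )
    return problems


# Hypothetical config: the control plane pool carries a misplaced key.
config = {
    "controlPlane": {
        "name": "master",
        "platform": {"vsphere": {"osDisk": {"diskSizeGB": 120}, "disk_type": "thin"}},
    },
    "compute": [
        {"name": "worker", "platform": {"vsphere": {"osDisk": {"diskSizeGB": 120}}}}
    ],
    "platform": {"vsphere": {"diskType": "thin"}},
}

for problem in find_disk_type_problems(config):
    print(problem)
```

An empty result only means the key sits where this comment says it should; it says nothing about whether the storage backend will accept the setting.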
@tyler.bevan Could you try again, following the suggestion in https://bugzilla.redhat.com/show_bug.cgi?id=2162095#c10
@daliu That change does appear to fix the problem, as the provision finished as expected. So, should we just presume that, if you're on VMware with a Nutanix storage backend, adding diskType: thin to the platform spec is mandatory?
I can put a bug in to fix it (it will be changed for 4.13 once the PR below is merged). It is just a single-line change for terraform to ignore the disk type state issue. https://github.com/openshift/installer/pull/6770/files#diff-b4dbed356c5acdaefe0d1716089c2fa5efacfa5cb6ca4ad2000e1b5a5ddb7194R55
I'm not the owner of this BZ, but I believe it can be closed. I will work on this one in Jira: https://issues.redhat.com/browse/OCPBUGS-7551