Bug 2162095

Summary: [vSphere on Nutanix] cannot change the value of "thin_provisioned" - (old: true newValue: false)
Product: Red Hat Advanced Cluster Management for Kubernetes
Reporter: Tyler Bevan <tyler.bevan>
Component: Cluster Lifecycle
Assignee: Le Yang <leyan>
Status: NEW ---
QA Contact: Hui Chen <huichen>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhacm-2.6
CC: daliu, dhuynh, jcallen, padillon
Target Milestone: ---
Flags: padillon: needinfo-, tyler.bevan: needinfo? (daliu)
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Tyler Bevan 2023-01-18 18:57:02 UTC
Description of the problem:
The issue described in https://bugzilla.redhat.com/show_bug.cgi?id=1862290 appears to affect ACM's installer on ACM 2.6 and OCP 4.11.
The vSphere provider for Terraform fails to delete a VM that was cloned from a thick-provisioned template on a Nutanix storage backend.

Release version: 2.6.3

Operator snapshot version: 2.6.3

OCP version: 4.11.18

Browser Info: Firefox 108

Steps to reproduce:
1. Deploy a new cluster via ACM using the vSphere provider against a vSphere cluster with Nutanix-backed VMware storage.
2. The installation deploys the cluster successfully, but the job fails when it cannot remove the bootstrap node.

Actual results: Installation failure on bootstrap cleanup.

Expected results: Installation success and import into ACM management.

Additional info:
Logs from the hive container:

time="2023-01-18T17:24:29Z" level=debug msg="Bootstrap status: complete"
time="2023-01-18T17:24:29Z" level=info msg="Destroying the bootstrap resources..."
time="2023-01-18T17:24:29Z" level=debug msg="creating /output/terraform/bin/terraform file"
time="2023-01-18T17:24:29Z" level=debug msg="creating /output/terraform/plugins/openshift/local/vsphere directory"
time="2023-01-18T17:24:29Z" level=debug msg="creating /output/terraform/plugins/openshift/local/vsphere/terraform-provider-vsphere_1.0.0_linux_amd64.zip file"
time="2023-01-18T17:24:29Z" level=debug msg="creating /output/terraform/plugins/openshift/local/vsphereprivate directory"
time="2023-01-18T17:24:29Z" level=debug msg="creating /output/terraform/plugins/openshift/local/vsphereprivate/terraform-provider-vsphereprivate_1.0.0_linux_amd64.zip file"
time="2023-01-18T17:24:29Z" level=debug msg="[INFO] running Terraform command: /output/terraform/bin/terraform version -json"
time="2023-01-18T17:24:29Z" level=debug msg="{"
time="2023-01-18T17:24:29Z" level=debug msg="  \"terraform_version\": \"1.0.11\","
time="2023-01-18T17:24:29Z" level=debug msg="  \"platform\": \"linux_amd64\","
time="2023-01-18T17:24:29Z" level=debug msg="  \"provider_selections\": {},"
time="2023-01-18T17:24:29Z" level=debug msg="  \"terraform_outdated\": true"
time="2023-01-18T17:24:29Z" level=debug msg="}"
time="2023-01-18T17:24:29Z" level=debug msg="[INFO] running Terraform command: /output/terraform/bin/terraform init -no-color -force-copy -input=false -backend=true -get=true -upgrade=false -plugin-dir=/output/terraform/plugins"
time="2023-01-18T17:24:29Z" level=debug
time="2023-01-18T17:24:29Z" level=debug msg="Initializing the backend..."
time="2023-01-18T17:24:29Z" level=debug
time="2023-01-18T17:24:29Z" level=debug msg="Initializing provider plugins..."
time="2023-01-18T17:24:29Z" level=debug msg="- Finding latest version of openshift/local/vsphere..."
time="2023-01-18T17:24:29Z" level=debug msg="- Installing openshift/local/vsphere v1.0.0..."
time="2023-01-18T17:24:29Z" level=debug msg="- Installed openshift/local/vsphere v1.0.0 (unauthenticated)"
time="2023-01-18T17:24:29Z" level=debug
time="2023-01-18T17:24:29Z" level=debug msg="Terraform has created a lock file .terraform.lock.hcl to record the provider"
time="2023-01-18T17:24:29Z" level=debug msg="selections it made above. Include this file in your version control repository"
time="2023-01-18T17:24:29Z" level=debug msg="so that Terraform can guarantee to make the same selections by default when"
time="2023-01-18T17:24:29Z" level=debug msg="you run \"terraform init\" in the future."
time="2023-01-18T17:24:29Z" level=debug
time="2023-01-18T17:24:29Z" level=debug msg="Terraform has been successfully initialized!"
time="2023-01-18T17:24:29Z" level=debug msg="[INFO] running Terraform command: /output/terraform/bin/terraform destroy -no-color -auto-approve -input=false -lock-timeout=0s -var-file=/tmp/openshift-install-bootstrap-2782656941/terraform.tfvars.json -var-file=/tmp/openshift-install-bootstrap-2782656941/terraform.platform.auto.tfvars.json -var-file=/tmp/openshift-install-bootstrap-2782656941/pre-bootstrap.tfvars.json -var-file=/tmp/openshift-install-bootstrap-2782656941/bootstrap.tfvars.json -var-file=/tmp/openshift-install-bootstrap-2782656941/master.tfvars.json -lock=true -parallelism=10 -refresh=true"
time="2023-01-18T17:24:31Z" level=debug msg="vsphere_virtual_machine.vm_bootstrap: Refreshing state... [id=421986c6-c60c-6c6f-024c-a41506351310]"
time="2023-01-18T17:24:31Z" level=error
time="2023-01-18T17:24:31Z" level=error msg="Error: disk.0: virtual disk \"disk0\": cannot change the value of \"thin_provisioned\" - (old: true newValue: false)"
time="2023-01-18T17:24:31Z" level=error
time="2023-01-18T17:24:31Z" level=error msg="  with vsphere_virtual_machine.vm_bootstrap,"
time="2023-01-18T17:24:31Z" level=error msg="  on main.tf line 12, in resource \"vsphere_virtual_machine\" \"vm_bootstrap\":"
time="2023-01-18T17:24:31Z" level=error msg="  12: resource \"vsphere_virtual_machine\" \"vm_bootstrap\" {"
time="2023-01-18T17:24:31Z" level=error
time="2023-01-18T17:24:31Z" level=fatal msg="terraform destroy: failed doing terraform destroy: exit status 1\n\nError: disk.0: virtual disk \"disk0\": cannot change the value of \"thin_provisioned\" - (old: true newValue: false)\n\n  with vsphere_virtual_machine.vm_bootstrap,\n  on main.tf line 12, in resource \"vsphere_virtual_machine\" \"vm_bootstrap\":\n  12: resource \"vsphere_virtual_machine\" \"vm_bootstrap\" {\n\n"
time="2023-01-18T17:24:32Z" level=error msg="error after waiting for command completion" error="exit status 1" installID=mpwn4n4j
time="2023-01-18T17:24:32Z" level=error msg="error provisioning cluster" error="exit status 1" installID=mpwn4n4j
time="2023-01-18T17:24:32Z" level=error msg="error running openshift-install, running deprovision to clean up" error="exit status 1" installID=mpwn4n4j
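
For anyone triaging similar hive logs, the following is a minimal sketch of how the error and fatal lines could be pulled out of output in the format shown above. The function name `failures` is hypothetical, and the regular expression assumes every line begins with `time="..." level=...` as in this log; it is not part of hive or the installer.

```python
import re

# Assumed line shape, based on the log above:
#   time="<timestamp>" level=<level> [msg="<message with \" escapes>"] ...
LINE_RE = re.compile(
    r'^time="(?P<time>[^"]+)" level=(?P<level>\w+)'
    r'(?: msg="(?P<msg>(?:[^"\\]|\\.)*)")?'
)


def failures(log_text):
    """Return (time, level, msg) for error/fatal lines that carry a message."""
    found = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m or m.group("level") not in ("error", "fatal"):
            continue
        msg = m.group("msg")
        if msg:  # skip the blank spacer lines such as 'level=error' with no msg
            found.append((m.group("time"), m.group("level"), msg.replace('\\"', '"')))
    return found
```

Run against the log above, this would be expected to surface the `cannot change the value of "thin_provisioned"` message and skip the debug output from `terraform init`.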

Comment 1 Jakob 2023-01-24 16:24:09 UTC
Retagging to cluster-lifecycle as it relates to cluster provisioning

Comment 2 daliu 2023-01-29 02:39:47 UTC
@efried Could you help to take a look?

Comment 3 Eric Fried 2023-01-30 19:03:02 UTC
This falls squarely in the installer's wheelhouse. @padillon could you find a pair of eyes for this please?

Comment 4 daliu 2023-02-01 09:17:27 UTC
@tyler.bevan 
I have created a discussion in the openshift-installer channel about this issue: https://redhat-internal.slack.com/archives/CH06KMDRV/p1675213341255269
Could you also try the following workaround?

"Change install-config diskType to thin"
"Making the assumption there is a storage policy that is forcing thin underneath terraform. Which it does not support well"

Note: In an ACM environment, to update the install-config you need to update the secret named <Cluster Namespace>/<Cluster Namespace>-install-config
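
Because Secret data is base64-encoded, editing the install-config this way means decoding it, changing the text, and encoding it again. The sketch below illustrates that round trip on a Secret exported as JSON. The data key `install-config.yaml` and both function names are assumptions made for illustration; check the actual key in your Secret before relying on it.

```python
import base64
import json

# Assumed name of the data key that holds the install-config inside the Secret.
KEY = "install-config.yaml"


def read_install_config(secret_json):
    """Decode the install-config text from a Secret given as a JSON string."""
    secret = json.loads(secret_json)
    return base64.b64decode(secret["data"][KEY]).decode("utf-8")


def write_install_config(secret_json, new_text):
    """Return the Secret JSON with the install-config replaced by new_text."""
    secret = json.loads(secret_json)
    encoded = base64.b64encode(new_text.encode("utf-8")).decode("ascii")
    secret["data"][KEY] = encoded
    return json.dumps(secret, indent=2)
```

The input JSON could come, for example, from `oc get secret <name> -n <namespace> -o json`; how the modified Secret is written back to the cluster is left to the reader.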

Comment 5 Tyler Bevan 2023-02-01 19:43:31 UTC
An install attempt using "diskType: thin" on both machine pools did not resolve the issue. The error message was identical to the original log.
I don't see any way to customize the bootstrap vm's specifications in the documentation for install-config, which seems to be where the failure is.

As a side note, the documentation at https://github.com/openshift/installer/blob/master/docs/user/vsphere/customization.md#machine-pools is a bit unclear on whether the key is disk_type or diskType. I tested it both ways to be sure.

Tyler

Comment 6 Tyler Bevan 2023-02-01 20:11:28 UTC
For context here is the sanitized install-config:

---
apiVersion: v1
metadata:
  name: ocp-lab
baseDomain: xxxxxxxxxx
controlPlane:
  name: master
  architecture: amd64
  hyperthreading: Enabled
  replicas: 3
  platform:
    vsphere:
      cpus: 8
      coresPerSocket: 8
      memoryMB: 16384
      osDisk:
        diskSizeGB: 120
      disk_type: thin
compute:
  - name: worker
    hyperthreading: Enabled
    architecture: amd64
    replicas: 0
    platform:
      vsphere:
        cpus: 4
        coresPerSocket: 2
        memoryMB: 16384
        osDisk:
          diskSizeGB: 120
        disk_type: thin
platform:
  vsphere:
    vCenter: xxxxxxxxxxxx
    username: xxxxxxxxxx
    password: xxxxxxxxxxxxxxxxxxx
    datacenter: xxxxxxxxxxxxxxx
    defaultDatastore: xxxxxxxxxxxx
    folder: /xxxxxxxx/vm/Openshift/ocp-lab
    cluster: xxxxxxxx
    apiVIP: 10.10.x.x
    ingressVIP: 10.10.x.x
    network: xxxxxxxxxxxxxxxxx

Comment 7 daliu 2023-02-02 02:38:03 UTC
@jcallen Any more suggestions?

Comment 10 Joseph Callen 2023-02-10 18:27:12 UTC
Sorry, I didn't see that the reporter was not an RH employee, so I am repeating what I already stated in a private comment...

The install-config was incorrect: diskType is not set on the machine pool; it belongs in the platform spec.

https://github.com/openshift/installer/blob/master/pkg/types/vsphere/platform.go#L92

e.g.
platform:
  vsphere:
    vCenter: xxxxxxxxxxxx
    username: xxxxxxxxxx
    password: xxxxxxxxxxxxxxxxxxx
    datacenter: xxxxxxxxxxxxxxx
    defaultDatastore: xxxxxxxxxxxx
    folder: /xxxxxxxx/vm/Openshift/ocp-lab
    cluster: xxxxxxxx
    apiVIP: 10.10.x.x
    ingressVIP: 10.10.x.x
    network: xxxxxxxxxxxxxxxxx
    diskType: thin
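
To catch this kind of misplacement before an install attempt, a small check along the following lines might help. It is only a sketch: it assumes a simple block-style install-config (no flow syntax or multi-line scalars), uses a hand-rolled indentation scan rather than a real YAML parser, and the names `key_paths` and `disk_type_report` are hypothetical. The expected location `platform.vsphere.diskType` is taken from the example above.

```python
def key_paths(yaml_text):
    """Yield (path, value) for each 'key: value' line of simple block-style YAML."""
    stack = []  # (indent, key) for each enclosing mapping level
    for raw in yaml_text.splitlines():
        line = raw.rstrip()
        stripped = line.lstrip()
        if not stripped or stripped.startswith(("#", "---")) or ":" not in stripped:
            continue
        # Treat a leading "- " list marker as part of the indentation.
        while stripped.startswith("- "):
            stripped = stripped[2:].lstrip()
        indent = len(line) - len(stripped)
        key, _, value = stripped.partition(":")
        # Drop siblings and deeper levels, then record this key.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        stack.append((indent, key.strip()))
        yield tuple(k for _, k in stack), value.strip()


def disk_type_report(yaml_text):
    """Split diskType-like keys into correctly placed and misplaced paths."""
    expected = ("platform", "vsphere", "diskType")
    good, misplaced = [], []
    for path, _ in key_paths(yaml_text):
        if path[-1] in ("diskType", "disk_type"):
            (good if path == expected else misplaced).append(path)
    return good, misplaced
```

Applied to the install-config from comment 6, this would be expected to report both `disk_type` entries under the machine pools as misplaced and nothing at the platform level.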

If this still fails we will need a bug created in Jira assigned to installer. Please link the bug here.
We previously had issues with the vsphere terraform provider but I had thought those were resolved.


This is also probably related to: https://kb.vmware.com/s/article/68107
Changing the state of objects that Terraform created is a good way to cause a failure.

Comment 11 daliu 2023-02-13 01:18:36 UTC
@tyler.bevan Could you try again, following https://bugzilla.redhat.com/show_bug.cgi?id=2162095#c10 ?

Comment 12 Tyler Bevan 2023-02-15 16:36:23 UTC
@daliu That change does appear to fix the problem, as the provision finished as expected.
So, should we just presume that if you're on VMware with a Nutanix storage backend, adding diskType: thin to the platform spec is mandatory?

Comment 13 Joseph Callen 2023-02-15 16:48:53 UTC
I can put a bug in to fix it (this will be changed for 4.13 once the PR below is merged).

It is just a single-line change for Terraform to ignore the disk type state issue.

https://github.com/openshift/installer/pull/6770/files#diff-b4dbed356c5acdaefe0d1716089c2fa5efacfa5cb6ca4ad2000e1b5a5ddb7194R55

Comment 14 Joseph Callen 2023-02-15 18:18:59 UTC
I am not the owner of this BZ, but I believe it can be closed. I will work this one in Jira:
https://issues.redhat.com/browse/OCPBUGS-7551