Description of problem:

Cloning a VM with a Direct LUN results in a cloned VM that has a stray disk device in the cloned VM's vm_device table.

1. Create a VM with a normal disk from a Storage Domain (OS disk)
2. Attach a Direct LUN to it
3. Deactivate the Direct LUN, but keep it attached

The state at this point is like this:

                vm_id                 |              device_id               | device | type |                  alias                  | is_plugged | is_managed
--------------------------------------+--------------------------------------+--------+------+-----------------------------------------+------------+------------
 8d508df2-8f7d-45b7-919a-765ed0b3f4c6 | 276e4308-6439-4e62-b5e5-c27d98f40456 | disk   | disk | ua-276e4308-6439-4e62-b5e5-c27d98f40456 | f          | t
 8d508df2-8f7d-45b7-919a-765ed0b3f4c6 | faf6935b-c401-4b7f-8c4b-f30b1feb9177 | disk   | disk |                                         | t          | t

4. Clone this VM; the new VM UUID is 6d2d31f9-0aea-4ca2-9546-da9065b4d7b0

5. Look at the database: there is a new disk '96d7f2e2-f754-4af4-acd7-0121ff5ce39b' with the same alias as the old Direct LUN.

engine=# select vm_id,device_id,device,type,alias,is_plugged,is_managed from vm_device where vm_id = '6d2d31f9-0aea-4ca2-9546-da9065b4d7b0' and device='disk';
                vm_id                 |              device_id               | device | type |                  alias                  | is_plugged | is_managed
--------------------------------------+--------------------------------------+--------+------+-----------------------------------------+------------+------------
 6d2d31f9-0aea-4ca2-9546-da9065b4d7b0 | 96d7f2e2-f754-4af4-acd7-0121ff5ce39b | disk   | disk | ua-276e4308-6439-4e62-b5e5-c27d98f40456 | f          | t
 6d2d31f9-0aea-4ca2-9546-da9065b4d7b0 | 560118d0-7432-4629-9e42-0a316d583279 | disk   | disk |                                         | t          | t

6. This new disk 96d7f2e2-f754-4af4-acd7-0121ff5ce39b does not exist: it is not in base_disks, and it is not a mapping to a Direct LUN.
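The consistency check in step 6 can be sketched as a standalone program (hypothetical in-memory stand-ins for the tables, not the engine's actual DAO layer): a device_id listed in vm_device that resolves to a disk in neither base_disks nor disk_lun_map is a ghost device.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class GhostDeviceCheck {

    // A device is a "ghost" if no table can resolve its device_id to a disk
    static List<String> findGhosts(List<String> vmDeviceIds,
                                   Set<String> baseDisks,
                                   Set<String> directLunMap) {
        return vmDeviceIds.stream()
                .filter(id -> !baseDisks.contains(id) && !directLunMap.contains(id))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical snapshot of the cloned VM's vm_device rows (device='disk')
        List<String> vmDeviceIds = List.of(
                "96d7f2e2-f754-4af4-acd7-0121ff5ce39b",  // the stray device
                "560118d0-7432-4629-9e42-0a316d583279"); // the real image disk

        // disk_ids known to the rest of the schema (disk_lun_map is empty for the clone)
        Set<String> baseDisks = Set.of("560118d0-7432-4629-9e42-0a316d583279");
        Set<String> directLunMap = Set.of();

        System.out.println(findGhosts(vmDeviceIds, baseDisks, directLunMap));
        // prints [96d7f2e2-f754-4af4-acd7-0121ff5ce39b]
    }
}
```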
engine=# select count(*) from base_disks where disk_id = '96d7f2e2-f754-4af4-acd7-0121ff5ce39b';
 count
-------
     0

engine=# select count(*) from disk_lun_map where disk_id = '96d7f2e2-f754-4af4-acd7-0121ff5ce39b';
 count
-------
     0

engine=# select count(*) from disk_vm_element where disk_id = '96d7f2e2-f754-4af4-acd7-0121ff5ce39b';
 count
-------
     0

engine=# select count(*) from images where image_group_id = '96d7f2e2-f754-4af4-acd7-0121ff5ce39b';
 count
-------
     0

7. Create a snapshot on the cloned VM

8. Preview the snapshot, which will fail.

8.1 For the customer, on 4.2.8, this seems to trigger an NPE here, as diskDao.get() likely returns null for that device:

470     private boolean deviceCanBeRemoved(VmDevice vmDevice) {
471         if (!vmDevice.getDevice().equals(VmDeviceType.DISK.getName()) || !vmDevice.isManaged()) {
472             return true;
473         }
474
475         return vmDevice.getSnapshotId() == null && diskDao.get(vmDevice.getDeviceId()).isAllowSnapshot();
476     }

8.2 On RHV 4.4 I got a locked disk and a task that never finishes.

Version-Release number of selected component (if applicable):
rhvm-4.4.2.6-0.2.el8ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Create a VM
2. Attach a Direct LUN to it
3. Deactivate the Direct LUN
4. Clone the VM -> ghost disk device in vm_device for the cloned VM
5. Snapshot the cloned VM
6. Preview the snapshot -> seems to hang (4.4) or NPE (4.2 - customer)

Actual results:
vm_device entry pointing at a device that does not exist

Expected results:
Cloned VM with no entry in vm_device for a Direct LUN?
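The suspected NPE in 8.1 follows directly from the ghost row: diskDao.get() finds no disk for the stale device_id, returns null, and the unconditional .isAllowSnapshot() call dereferences it. A minimal simulation of that failure mode, with a Map-backed stand-in for the DAO (this is an illustrative sketch, not the engine's actual code):

```java
import java.util.Map;

public class NpeSketch {
    // Stand-in for the Disk entity
    record Disk(boolean allowSnapshot) {}

    // Map-backed stand-in for diskDao: get() returns null for unknown ids,
    // just like the real DAO when vm_device points at a disk that does not exist
    static final Map<String, Disk> DISK_DAO = Map.of(
            "560118d0-7432-4629-9e42-0a316d583279", new Disk(true));

    static boolean deviceCanBeRemoved(String deviceId, String snapshotId) {
        Disk disk = DISK_DAO.get(deviceId);
        // The quoted 4.2.8 code calls disk.isAllowSnapshot() without this null
        // check, so a ghost device (disk == null) throws NullPointerException
        return snapshotId == null && disk != null && disk.allowSnapshot();
    }

    public static void main(String[] args) {
        // real disk: behaves as before
        System.out.println(deviceCanBeRemoved("560118d0-7432-4629-9e42-0a316d583279", null));
        // ghost disk: returns false instead of throwing an NPE
        System.out.println(deviceCanBeRemoved("96d7f2e2-f754-4af4-acd7-0121ff5ce39b", null));
    }
}
```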
Two more notes:

1) I had to trigger the clone via the REST API, see BZ1892525.

# cat /tmp/upload.xml
<action><vm><name>cloned_vm2</name></vm></action>

# curl -vvv -k -u "admin@internal:redhat" -H "Content-type: application/xml" -T /tmp/upload.xml -X POST https://engine.kvm/ovirt-engine/api/vms/8d508df2-8f7d-45b7-919a-765ed0b3f4c6/clone

2) If the Direct LUN is active, the issue still happens; the difference is that the disk is "plugged" on the cloned VM, and the VM is not image_locked after the snapshot preview, which is still stuck too.

                vm_id                 |              device_id               | device | type |                  alias                  | is_plugged | is_managed
--------------------------------------+--------------------------------------+--------+------+-----------------------------------------+------------+------------
 a5cd6f2d-f9bc-4418-a05e-d71f546c9919 | b55481c3-1f81-426f-be56-2f536b324713 | disk   | disk | ua-276e4308-6439-4e62-b5e5-c27d98f40456 | t          | t

So deactivating the Direct LUN is not necessary to reproduce.
Assuming the direct LUN is shareable (read-only)..
(In reply to Arik from comment #2)
> Assuming the direct LUN is shareable (read-only)..

It wasn't necessary to have it shared to reproduce the ghost disk in the vm_device table.
(In reply to Germano Veit Michel from comment #3)
> (In reply to Arik from comment #2)
> > Assuming the direct LUN is shareable (read-only)..
>
> It wasn't necessary to have it shared to reproduce the ghost disk in
> vm_device table.

Right, I wrote comment 2 during a meeting in which we discussed possible solutions, as a reminder to follow up on that, and didn't get to it, sorry.

So how about: fail the clone operation (if the user can drop the LUN disk from the clone-VM dialog/REST API), or filter out the LUN disk automatically during clone-VM when the LUN is not shareable, and add a proper device for the LUN disk when it is shareable.

Would that make sense? (I'm not that familiar with direct LUNs)
Thanks Arik!

(In reply to Arik from comment #4)
> So how about:
> Fail the clone operation (if the user can drop the LUN disk from the
> clone-VM dialog/rest-api)

Wouldn't this risk breaking cloning from snapshots too? And if it doesn't break it, the LUN from the vm_configuration would still need some logic like below to prevent the problem, right?

> or filter out the LUN disk automatically during
> clone-VM when the LUN is not shareable.
> Add a proper device for the LUN disk when it's shareable.
> Would that make sense? (I'm not that familiar with direct LUNs)

I think these make a lot more sense. However, currently the LUN device is not cloned to the new VM, so I'd say just filter it out to avoid a behaviour change, as customers may already have automation or procedures in place to attach the shared LUN, or a new LUN, to the cloned VM.

So IMHO I'd just filter it out in all scenarios.
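The "filter it out in all scenarios" option could look roughly like this: drop Direct LUN disks from the list of disks considered for cloning before any vm_device rows are created for the new VM. This is a sketch over hypothetical model types, not the engine's actual CloneVmCommand:

```java
import java.util.List;
import java.util.stream.Collectors;

public class CloneFilterSketch {
    enum DiskStorageType { IMAGE, LUN }

    record VmDisk(String deviceId, DiskStorageType storageType) {}

    // Drop Direct LUN disks from the set of disks copied to the cloned VM,
    // so no vm_device row is created for a LUN that was never actually cloned
    static List<VmDisk> disksToClone(List<VmDisk> sourceDisks) {
        return sourceDisks.stream()
                .filter(d -> d.storageType() != DiskStorageType.LUN)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<VmDisk> source = List.of(
                new VmDisk("faf6935b-c401-4b7f-8c4b-f30b1feb9177", DiskStorageType.IMAGE),
                new VmDisk("276e4308-6439-4e62-b5e5-c27d98f40456", DiskStorageType.LUN));

        // Only the image disk survives; the Direct LUN is filtered out
        System.out.println(disksToClone(source));
    }
}
```

Filtering unconditionally (rather than only for non-shareable LUNs) matches the comment above: it preserves the current behaviour where the LUN device is not cloned, so existing customer automation keeps working.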
I don't think it could break clone-from-snapshot, because LUN disks are never part of snapshots [1].

Which makes me think there's a chance this is already solved in 4.4.3 - we changed clone-VM to first make a snapshot and then clone the disks from the snapshot, so LUNs may already be filtered out. But this needs to be checked.

[1] https://github.com/oVirt/ovirt-engine/blob/ovirt-engine-4.4.2/backend/manager/modules/common/src/main/java/org/ovirt/engine/core/common/businessentities/storage/LunDisk.java#L35-L38
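For reference, the linked LunDisk code overrides the snapshot-eligibility check so LUN disks always report that they cannot be snapshotted, which is why they never end up in snapshot configurations. A simplified sketch of that hierarchy (class and method names modeled on the linked source, with the rest of the engine's Disk hierarchy omitted):

```java
public class SnapshotEligibilitySketch {
    static abstract class Disk {
        // Most disk types can participate in snapshots
        boolean isAllowSnapshot() { return true; }
    }

    static class DiskImage extends Disk {}

    // Mirrors the linked LunDisk override: Direct LUN disks are never snapshotted
    static class LunDisk extends Disk {
        @Override
        boolean isAllowSnapshot() { return false; }
    }

    public static void main(String[] args) {
        System.out.println(new DiskImage().isAllowSnapshot()); // true
        System.out.println(new LunDisk().isAllowSnapshot());   // false
    }
}
```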
(In reply to Arik from comment #6)
> I don't think it could break clone-from-snapshot because LUN disks are never
> part of snapshots [1].
> Which makes me think that there's a chance this is solved in 4.4.3 already -
> We've changed clone-vm to first make a snapshot and then clone the disks
> from the snapshot so LUNs may already be filtered out.
> But needs to check this.
>
> [1]
> https://github.com/oVirt/ovirt-engine/blob/ovirt-engine-4.4.2/backend/
> manager/modules/common/src/main/java/org/ovirt/engine/core/common/
> businessentities/storage/LunDisk.java#L35-L38

Thanks. I thought they were part of the snapshot's vm_configuration, which would be read to produce the new VM's vm_devices. But if it's not there, then great :)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (RHV Engine and Host Common Packages 4.4.z [ovirt-4.4.4]), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0312