Bug 1892519 - Clone of VM with Direct LUN creates new VM with ghost disk which fails to preview snapshot later.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.4.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.4.4
Target Release: ---
Assignee: Shmuel Melamud
QA Contact: Tamir
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-10-29 02:29 UTC by Germano Veit Michel
Modified: 2023-12-15 19:56 UTC

Fixed In Version: ovirt-engine-4.4.4.3
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-02 13:58:29 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:
tamir: testing_plan_complete+


Links
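 System                            | ID             | Private | Priority | Status | Summary | Last Updated
-----------------------------------+----------------+---------+----------+--------+---------+-------------------------
 Red Hat Knowledge Base (Solution) | 5526811        | 0       | None     | None   | None    | 2020-10-29 04:57:11 UTC
 Red Hat Product Errata            | RHBA-2021:0312 | 0       | None     | None   | None    | 2021-02-02 13:58:44 UTC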

Description Germano Veit Michel 2020-10-29 02:29:36 UTC
Description of problem:

Cloning a VM with a Direct LUN leaves a disk device in the cloned VM's vm_device table that does not correspond to any existing disk.

1. Create a VM with a normal disk from Storage Domain (OS Disk)
2. Attach a Direct LUN to it
3. Deactivate the Direct LUN, but keep it attached

At this point, the disk devices of the VM in vm_device look like this:

                vm_id                 |              device_id               | device | type |                  alias                  | is_plugged | is_managed 
--------------------------------------+--------------------------------------+--------+------+-----------------------------------------+------------+------------
 8d508df2-8f7d-45b7-919a-765ed0b3f4c6 | 276e4308-6439-4e62-b5e5-c27d98f40456 | disk   | disk | ua-276e4308-6439-4e62-b5e5-c27d98f40456 | f          | t
 8d508df2-8f7d-45b7-919a-765ed0b3f4c6 | faf6935b-c401-4b7f-8c4b-f30b1feb9177 | disk   | disk |                                         | t          | t

4. Clone this VM. The new VM UUID is 6d2d31f9-0aea-4ca2-9546-da9065b4d7b0.

5. Look at the database. There is a new disk '96d7f2e2-f754-4af4-acd7-0121ff5ce39b' with the same alias as the original Direct LUN.


engine=# select vm_id,device_id,device,type,alias,is_plugged,is_managed from vm_device where vm_id = '6d2d31f9-0aea-4ca2-9546-da9065b4d7b0' and device='disk';
                vm_id                 |              device_id               | device | type |                  alias                  | is_plugged | is_managed 
--------------------------------------+--------------------------------------+--------+------+-----------------------------------------+------------+------------
 6d2d31f9-0aea-4ca2-9546-da9065b4d7b0 | 96d7f2e2-f754-4af4-acd7-0121ff5ce39b | disk   | disk | ua-276e4308-6439-4e62-b5e5-c27d98f40456 | f          | t
 6d2d31f9-0aea-4ca2-9546-da9065b4d7b0 | 560118d0-7432-4629-9e42-0a316d583279 | disk   | disk |                                         | t          | t

6. This new disk 96d7f2e2-f754-4af4-acd7-0121ff5ce39b does not exist: it's not in base_disks, it's not mapped to a Direct LUN, and it has no entry in disk_vm_element or images.

engine=# select count(*) from base_disks where disk_id = '96d7f2e2-f754-4af4-acd7-0121ff5ce39b';
 count 
-------
     0

engine=# select count(*) from disk_lun_map where disk_id = '96d7f2e2-f754-4af4-acd7-0121ff5ce39b';
 count 
-------
     0

engine=# select count(*) from disk_vm_element where disk_id = '96d7f2e2-f754-4af4-acd7-0121ff5ce39b';
 count 
-------
     0

engine=# select count(*) from images where image_group_id = '96d7f2e2-f754-4af4-acd7-0121ff5ce39b';
 count 
-------
     0

7. Create a snapshot on the cloned VM

8. Preview the snapshot, which will fail.

8.1 For the customer, in 4.2.8, this seems to trigger an NPE here as diskDao.get() likely returns NULL for that device.

    private boolean deviceCanBeRemoved(VmDevice vmDevice) {
        if (!vmDevice.getDevice().equals(VmDeviceType.DISK.getName()) || !vmDevice.isManaged()) {
            return true;
        }

        return vmDevice.getSnapshotId() == null && diskDao.get(vmDevice.getDeviceId()).isAllowSnapshot();
    }

8.2 In RHV 4.4, I got a locked disk and a task that never finishes.
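
To illustrate the suspected NPE in step 8.1, here is a minimal, self-contained sketch of the same check with a null guard. It is not the engine code and not the fix that was shipped: `GhostDiskCheck`, its nested `Disk` and `VmDevice` classes, and the `disks` map (standing in for `diskDao`) are hypothetical simplifications. Treating a device with no backing disk as removable is also an assumption made only for this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical, simplified stand-ins for the engine classes; not the real ovirt-engine API.
final class GhostDiskCheck {

    static final class Disk {
        private final boolean allowSnapshot;

        Disk(boolean allowSnapshot) {
            this.allowSnapshot = allowSnapshot;
        }

        boolean isAllowSnapshot() {
            return allowSnapshot;
        }
    }

    static final class VmDevice {
        final UUID deviceId;
        final String device;
        final boolean managed;
        final UUID snapshotId;

        VmDevice(UUID deviceId, String device, boolean managed, UUID snapshotId) {
            this.deviceId = deviceId;
            this.device = device;
            this.managed = managed;
            this.snapshotId = snapshotId;
        }
    }

    // Plays the role of diskDao: Map.get returns null for an unknown id.
    private final Map<UUID, Disk> disks = new HashMap<>();

    void addDisk(UUID id, Disk disk) {
        disks.put(id, disk);
    }

    boolean deviceCanBeRemoved(VmDevice vmDevice) {
        // Non-disk or unmanaged devices are always removable, as in the quoted method.
        if (!"disk".equals(vmDevice.device) || !vmDevice.managed) {
            return true;
        }
        if (vmDevice.snapshotId != null) {
            return false;
        }
        Disk disk = disks.get(vmDevice.deviceId);
        // Assumption: a device with no backing disk (a ghost) is safe to remove.
        return disk == null || disk.isAllowSnapshot();
    }

    public static void main(String[] args) {
        GhostDiskCheck check = new GhostDiskCheck();
        VmDevice ghost = new VmDevice(
                UUID.fromString("96d7f2e2-f754-4af4-acd7-0121ff5ce39b"), "disk", true, null);
        System.out.println("ghost device can be removed: " + check.deviceCanBeRemoved(ghost));
    }
}
```

Without the `disk == null` guard, the lookup for the ghost device id would be dereferenced directly, which matches the NPE suspected above.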

Version-Release number of selected component (if applicable):
rhvm-4.4.2.6-0.2.el8ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Create VM
2. Attach Direct LUN to it
3. Deactivate Direct LUN
4. Clone VM
   -> Ghost disk device in vm_device for the cloned VM
5. Snapshot VM
6. Preview Snapshot
   -> Seems to hang (4.4) or NPE (4.2 - customer)

Actual results:
vm_device contains a disk device that does not exist

Expected results:
Cloned VM with no entry in vm_device for a Direct LUN?

Comment 1 Germano Veit Michel 2020-10-29 03:56:58 UTC
Two more notes:

1) I had to trigger the clone via the REST API, see BZ1892525.
# cat /tmp/upload.xml 
<action><vm><name>cloned_vm2</name></vm></action>
# curl -vvv -k -u "admin@internal:redhat" -H "Content-type: application/xml" -T /tmp/upload.xml -X POST https://engine.kvm/ovirt-engine/api/vms/8d508df2-8f7d-45b7-919a-765ed0b3f4c6/clone

2) If the Direct LUN is active, the issue still happens. The differences are that the disk is "plugged" on the cloned VM and the VM is not image_locked after the snapshot preview, which still gets stuck.

                vm_id                 |              device_id               | device | type |                  alias                  | is_plugged | is_managed 
--------------------------------------+--------------------------------------+--------+------+-----------------------------------------+------------+------------
 a5cd6f2d-f9bc-4418-a05e-d71f546c9919 | b55481c3-1f81-426f-be56-2f536b324713 | disk   | disk | ua-276e4308-6439-4e62-b5e5-c27d98f40456 | t          | t

So deactivating the Direct LUN should not be necessary to reproduce.

Comment 2 Arik 2020-11-02 12:21:15 UTC
Assuming the direct LUN is shareable (read-only)..

Comment 3 Germano Veit Michel 2020-11-02 21:35:09 UTC
(In reply to Arik from comment #2)
> Assuming the direct LUN is shareable (read-only)..

It wasn't necessary to have it shared to reproduce the ghost disk in vm_device table.

Comment 4 Arik 2020-11-03 10:38:16 UTC
(In reply to Germano Veit Michel from comment #3)
> (In reply to Arik from comment #2)
> > Assuming the direct LUN is shareable (read-only)..
> 
> It wasn't necessary to have it shared to reproduce the ghost disk in
> vm_device table.

Right, I wrote comment 2 during a meeting in which we discussed possible solutions as a reminder to follow up on that and didn't get to it, sorry.
So how about:
Fail the clone operation (if the user can drop the LUN disk from the clone-VM dialog/rest-api) or filter out the LUN disk automatically during clone-VM when the LUN is not shareable.
Add a proper device for the LUN disk when it's shareable.
Would that make sense? (I'm not that familiar with direct LUNs)

Comment 5 Germano Veit Michel 2020-11-03 21:37:36 UTC
Thanks Arik!

(In reply to Arik from comment #4)
> So how about:
> Fail the clone operation (if the user can drop the LUN disk from the
> clone-VM dialog/rest-api)

Wouldn't this risk breaking cloning from snapshots too? And if it doesn't break it,
the LUN from the vm_configuration would still need some logic like below to prevent the problem.
Right?

> or filter out the LUN disk automatically during
> clone-VM when the LUN is not shareable.
> Add a proper device for the LUN disk when it's shareable.
> Would that make sense? (I'm not that familiar with direct LUNs)

I think these make a lot more sense. However currently the LUN device
is not cloned to the new VM so I'd say just filter out to avoid
a behaviour change, as customers may have automation or procedures
already set to attach the shared LUN, or a new LUN to the cloned VM.

So IMHO I'd say just filter it out in all scenarios.
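
The "filter out" option discussed in comments 4 and 5 could look roughly like the sketch below. This is only an illustration under stated assumptions, not the change that shipped in ovirt-engine-4.4.4.3: `CloneDeviceFilter`, its nested `Device` class, and the `lunDiskIds` parameter are hypothetical names, and the sketch assumes the caller already knows which disk ids are Direct LUNs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.UUID;

// Hypothetical sketch; the class and method names are not part of ovirt-engine.
final class CloneDeviceFilter {

    static final class Device {
        final UUID deviceId;
        final String device;

        Device(UUID deviceId, String device) {
            this.deviceId = deviceId;
            this.device = device;
        }
    }

    // Returns the source devices to copy to the clone, skipping disk devices
    // whose id is in lunDiskIds, so no vm_device row is created for a Direct LUN.
    static List<Device> devicesToClone(List<Device> source, Set<UUID> lunDiskIds) {
        List<Device> result = new ArrayList<>();
        for (Device d : source) {
            boolean isLunDisk = "disk".equals(d.device) && lunDiskIds.contains(d.deviceId);
            if (!isLunDisk) {
                result.add(d);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Device ids taken from the source VM in the description.
        UUID lun = UUID.fromString("276e4308-6439-4e62-b5e5-c27d98f40456");
        UUID osDisk = UUID.fromString("faf6935b-c401-4b7f-8c4b-f30b1feb9177");

        List<Device> source = Arrays.asList(new Device(lun, "disk"), new Device(osDisk, "disk"));
        Set<UUID> lunDiskIds = new HashSet<>(Arrays.asList(lun));

        for (Device d : devicesToClone(source, lunDiskIds)) {
            System.out.println("cloning device " + d.deviceId);
        }
    }
}
```

Filtering in every case, shared or not, keeps the current behaviour in which the LUN is not attached to the cloned VM.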

Comment 6 Arik 2020-11-08 20:02:31 UTC
I don't think it could break clone-from-snapshot because LUN disks are never part of snapshots [1].
Which makes me think that there's a chance this is solved in 4.4.3 already -
We've changed clone-vm to first make a snapshot and then clone the disks from the snapshot so LUNs may already be filtered out.
But needs to check this.

[1] https://github.com/oVirt/ovirt-engine/blob/ovirt-engine-4.4.2/backend/manager/modules/common/src/main/java/org/ovirt/engine/core/common/businessentities/storage/LunDisk.java#L35-L38

Comment 7 Germano Veit Michel 2020-11-08 21:47:30 UTC
(In reply to Arik from comment #6)
> I don't think it could break clone-from-snapshot because LUN disks are never
> part of snapshots [1].
> Which makes me think that there's a chance this is solved in 4.4.3 already -
> We've changed clone-vm to first make a snapshot and then clone the disks
> from the snapshot so LUNs may already be filtered out.
> But needs to check this.
> 
> [1]
> https://github.com/oVirt/ovirt-engine/blob/ovirt-engine-4.4.2/backend/
> manager/modules/common/src/main/java/org/ovirt/engine/core/common/
> businessentities/storage/LunDisk.java#L35-L38

Thanks. I thought they were part of the vm_configuration of the snapshot,
which would be read to produce the new VM's vm_devices. But if it's not
there, then great :)

Comment 14 errata-xmlrpc 2021-02-02 13:58:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHV Engine and Host Common Packages 4.4.z [ovirt-4.4.4]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0312

