Description of problem:
When moving disks between storage domains while those disks are attached to a highly available VM with a VM lease, the VM pauses during the live storage migration (LSM), deletion of the auto-generated LSM snapshot fails, and the whole LSM process fails.

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.0-0.4.master.20170105161132.gitf4e2c11.el7.centos.noarch
vdsm-4.19.1-18.git79e5ea5.el7.centos.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create a VM with a disk, and create a lease for this VM on one of the storage domains.
2. Start the VM.
3. While the VM is up, move the disk to another storage domain.

Actual results:
The VM becomes paused, the auto-generated snapshot for LSM fails to be deleted, and the disk we wanted to move stays in the same storage domain.

Expected results:
The VM should stay up during the whole process, the disk should move to the target storage domain, and the auto-generated snapshot should be deleted when the LSM is over.

Additional info:
I've tested this with a regular VM (without the lease) and the LSM completed successfully.

engine.log - LSM started:

2017-01-08 15:46:02,590+02 INFO  [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (DefaultQuartzScheduler10) [1f14fe28] Running command: LiveMigrateDiskCommand internal: true. Entities affected : ID: e6c3210e-040c-46cf-91eb-997fadfd86f8 Type: DiskAction group CONFIGURE_DISK_STORAGE with role type USER, ID: 4e25089f-65eb-4934-ad19-2ee672499b8a Type: StorageAction group CREATE_DISK with role type USER

2017-01-08 15:47:13,244+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler3) [69c98e68] Correlation ID: 1f14fe28, Job ID: ddfed620-0485-4cd2-82a8-92fd9999a026, Call Stack: null, Custom Event ID: -1, Message: User admin@internal-authz have failed to move disk lsm-test1_Disk1 to domain data_nfs1.
2017-01-08 15:48:56,638+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler3) [1b1e3031] Correlation ID: 535ce1d2, Job ID: c9e91118-3a3c-4bc8-ae50-9ba2a4e12921, Call Stack: null, Custom Event ID: -1, Message: Failed to delete snapshot 'Auto-generated for Live Storage Migration' for VM 'lsm-test1'.
Created attachment 1238393 [details]
engine and vdsm logs
Lilach, I assume the problematic area is not in Engine but in VDSM:

2017-01-08 15:15:49,213 INFO  (libvirt/events) [virt.vm] (vmId='e5477f2b-6c09-4528-a0d4-1fe3358d7b47') CPU stopped: onSuspend (vm:4865)
2017-01-08 15:15:49,214 ERROR (jsonrpc/0) [virt.vm] (vmId='e5477f2b-6c09-4528-a0d4-1fe3358d7b47') Unable to take snapshot (vm:3506)
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 3503, in snapshot
    self._dom.snapshotCreateXML(snapxml, snapFlags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 69, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 123, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 941, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 2737, in snapshotCreateXML
    if ret is None: raise libvirtError('virDomainSnapshotCreateXML() failed', dom=self)
libvirtError: Failed to acquire lock: File exists
This requires libvirt-2.0.0-10.el7_3.4. VDSM does not require that version yet, since it has not been released yet. Making this bug depend on libvirt bug 1403691.
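For reference, the packaging-level fix here is a minimum-version RPM dependency (VDSM requiring libvirt >= 2.0.0-10.el7_3.4). The version comparison RPM performs on such a dependency can be approximated in pure Python; this is a simplified sketch, not RPM's actual rpmvercmp (which handles more corner cases such as tilde ordering), and the helper names are my own:

```python
import re

def split_evr(evr):
    """Split 'epoch:version-release' into its three parts; epoch defaults to '0'."""
    epoch, sep, rest = evr.partition(':')
    if not sep:
        epoch, rest = '0', evr
    version, _, release = rest.partition('-')
    return epoch, version, release

def vercmp(a, b):
    """Simplified rpmvercmp: compare runs of digits/letters segment by segment."""
    sa = re.findall(r'\d+|[a-zA-Z]+', a)
    sb = re.findall(r'\d+|[a-zA-Z]+', b)
    for x, y in zip(sa, sb):
        if x.isdigit() and y.isdigit():
            if int(x) != int(y):
                return 1 if int(x) > int(y) else -1
        elif x.isdigit() != y.isdigit():
            # Numeric segments sort as newer than alphabetic ones.
            return 1 if x.isdigit() else -1
        elif x != y:
            return 1 if x > y else -1
    # The EVR with more trailing segments sorts as newer.
    return (len(sa) > len(sb)) - (len(sa) < len(sb))

def meets_minimum(installed, minimum):
    """True if the installed epoch:version-release satisfies '>= minimum'."""
    for x, y in zip(split_evr(installed), split_evr(minimum)):
        c = vercmp(x, y)
        if c != 0:
            return c > 0
    return True

# The version that fixes this bug satisfies its own minimum; older z-streams do not.
print(meets_minimum('2.0.0-10.el7_3.4', '2.0.0-10.el7_3.4'))  # True
print(meets_minimum('2.0.0-10.el7_3.3', '2.0.0-10.el7_3.4'))  # False
```

Epoch is compared first, then version, then release, which is why bumping only the release field (el7_3.3 vs el7_3.4) is enough to express the dependency on the fixed build.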
Since the relevant libvirt fix was released on z-stream, can you re-test?
(In reply to Yaniv Kaul from comment #4)
> Since the relevant libvirt fix was released on z-stream, can you re-test?

When testing on 4.1.1, the VM stays active during the process and the LSM completes.
(In reply to Lilach Zitnitski from comment #5)
> (In reply to Yaniv Kaul from comment #4)
> > Since the relevant libvirt fix was released on z-stream, can you re-test?
>
> When testing on 4.1.1 - the vm stays active during the process and the lsm
> completes.

VDSM 4.1.1 already requires a libvirt version that fixes this issue (2.0.0-10.el7_3.4) since 4.19.3. Moving to ON_QA.

Lilach - up to you whether your comment 5 counts as verification or whether you want to perform any other tests.
Tested with rhevm 4.1.1 and vdsm 4.19.5 and got the expected results - moving to VERIFIED.