Bug 1411118

Summary: HA VM's disks live storage migration fails (depends on libvirt bug 1406765)
Product: [oVirt] vdsm Reporter: Lilach Zitnitski <lzitnits>
Component: Core    Assignee: Nir Soffer <nsoffer>
Status: CLOSED CURRENTRELEASE QA Contact: Lilach Zitnitski <lzitnits>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.19.1    CC: amureini, bugs, lzitnits, nsoffer, tnisan
Target Milestone: ovirt-4.1.1    Flags: rule-engine: ovirt-4.1+
rule-engine: planning_ack+
amureini: devel_ack+
ratamir: testing_ack+
Target Release: ---   
Hardware: x86_64   
OS: Unspecified   
Whiteboard:
Fixed In Version: 4.19.3 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-04-21 09:38:23 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1406765, 1415488    
Bug Blocks:    
Attachments: logs

Description Lilach Zitnitski 2017-01-08 14:21:27 UTC
Description of problem:
When trying to move disks between storage domains while those disks are attached to a highly available VM with a VM lease, the VM pauses during the live storage migration, deletion of the auto-generated LSM snapshot fails, and the whole LSM process fails.

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.0-0.4.master.20170105161132.gitf4e2c11.el7.centos.noarch
vdsm-4.19.1-18.git79e5ea5.el7.centos.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create a VM with a disk and create a lease for this VM on one of the storage domains.
2. Start the VM.
3. While the VM is up, move the disk to another storage domain (a hedged SDK sketch of these steps follows below).
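
For reference, a rough sketch of these steps scripted against the oVirt Python SDK v4 (ovirtsdk4). This is only an illustration of the flow, not the exact reproduction environment: the engine URL, credentials, cluster name and the 'data_nfs0'/'data_nfs1' storage domain names are placeholders, and the VM is assumed to already have a disk attached.

    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    connection = sdk.Connection(
        url='https://engine.example.com/ovirt-engine/api',  # placeholder
        username='admin@internal',
        password='secret',
        insecure=True,  # test environment only
    )
    system = connection.system_service()
    vms_service = system.vms_service()

    # Step 1: a VM whose HA lease lives on one of the storage domains.
    vm = vms_service.add(types.Vm(
        name='lsm-test1',
        cluster=types.Cluster(name='Default'),
        template=types.Template(name='Blank'),
        high_availability=types.HighAvailability(enabled=True),
        lease=types.StorageDomainLease(
            storage_domain=types.StorageDomain(name='data_nfs0')),
    ))
    vm_service = vms_service.vm_service(vm.id)

    # Step 2: start the VM (a disk is assumed to be attached already).
    vm_service.start()

    # Step 3: while the VM is up, move its disk to another storage domain (LSM).
    attachment = vm_service.disk_attachments_service().list()[0]
    disk_service = system.disks_service().disk_service(attachment.disk.id)
    disk_service.move(storage_domain=types.StorageDomain(name='data_nfs1'))

    connection.close()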

Actual results:
The VM becomes paused, the auto-generated LSM snapshot fails to be deleted, and the disk we wanted to move stays in the same storage domain.

Expected results:
The VM should stay up during the whole process, the disk should move to the target storage domain, and the auto-generated snapshot should be deleted when the LSM is over.

Additional info:
I've tested this with a regular VM (without a lease) and the LSM completed successfully.

engine.log

LSM started on:
2017-01-08 15:46:02,590+02 INFO  [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (DefaultQuartzScheduler10) [1f14fe28] Running command: LiveMigrateDiskCommand internal: true. Entities affected :  ID: e6c3210e-040c-46cf-91eb-997fadfd86f8 Type: DiskAction group CONFIGURE_DISK_STORAGE with role type USER,  ID: 4e25089f-65eb-4934-ad19-2ee672499b8a Type: StorageAction group CREATE_DISK with role type USER

2017-01-08 15:47:13,244+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler3) [69c98e68] Correlation ID: 1f14fe28, Job ID: ddfed620-0485-4cd2-82a8-92fd9999a026, Call Stack: null, Custom Event ID: -1, Message: User admin@internal-authz have failed to move disk lsm-test1_Disk1 to domain data_nfs1.

2017-01-08 15:48:56,638+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler3) [1b1e3031] Correlation ID: 535ce1d2, Job ID: c9e91118-3a3c-4bc8-ae50-9ba2a4e12921, Call Stack: null, Custom Event ID: -1, Message: Failed to delete snapshot 'Auto-generated for Live Storage Migration' for VM 'lsm-test1'.

Comment 1 Lilach Zitnitski 2017-01-08 14:22:17 UTC
Created attachment 1238393 [details]
logs

engine and vdsm

Comment 2 Yaniv Kaul 2017-01-09 07:16:07 UTC
Lilach, I assume the problematic area is not in Engine but in VDSM:
2017-01-08 15:15:49,213 INFO  (libvirt/events) [virt.vm] (vmId='e5477f2b-6c09-4528-a0d4-1fe3358d7b47') CPU stopped: onSuspend (vm:4865)
2017-01-08 15:15:49,214 ERROR (jsonrpc/0) [virt.vm] (vmId='e5477f2b-6c09-4528-a0d4-1fe3358d7b47') Unable to take snapshot (vm:3506)
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 3503, in snapshot
    self._dom.snapshotCreateXML(snapxml, snapFlags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 69, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 123, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 941, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 2737, in snapshotCreateXML
    if ret is None:raise libvirtError('virDomainSnapshotCreateXML() failed', dom=self)
libvirtError: Failed to acquire lock: File exists
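
For context, a minimal sketch of the kind of call that fails above, written directly against the libvirt Python bindings. This is not vdsm's actual code: the snapshot XML, the disk path and the flag combination are illustrative placeholders, since the traceback does not show the snapxml/snapFlags vdsm built; it only shows that virDomainSnapshotCreateXML() fails with "Failed to acquire lock: File exists" while the VM has a lease.

    import libvirt

    # Placeholder snapshot XML for an external, disk-only snapshot.
    SNAP_XML = """
    <domainsnapshot>
      <disks>
        <disk name='vda' snapshot='external'>
          <source file='/rhev/data-center/.../new-top-volume'/>
        </disk>
      </disks>
    </domainsnapshot>
    """

    # Illustrative flags; the exact flags vdsm passes are not shown above.
    flags = (libvirt.VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY |
             libvirt.VIR_DOMAIN_SNAPSHOT_CREATE_REUSE_EXT |
             libvirt.VIR_DOMAIN_SNAPSHOT_CREATE_NO_METADATA)

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByUUIDString('e5477f2b-6c09-4528-a0d4-1fe3358d7b47')
    try:
        dom.snapshotCreateXML(SNAP_XML, flags)
    except libvirt.libvirtError as e:
        # With the unfixed libvirt, the snapshot of a leased VM fails and
        # surfaces here as "Failed to acquire lock: File exists".
        print('snapshot failed: %s' % e)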

Comment 3 Nir Soffer 2017-01-09 10:55:46 UTC
This requires libvirt-2.0.0-10.el7_3.4. Vdsm does not require it yet, since
that libvirt build has not been released yet.

Making this bug depend on the libvirt bug 1403691.
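
(A quick, illustrative way to check which libvirt a host actually runs, assuming libvirt-python and rpm are available on it; getLibVersion() only exposes the upstream x.y.z version, so the downstream -10.el7_3.4 release has to be read from the package itself.)

    import subprocess
    import libvirt

    conn = libvirt.open('qemu:///system')
    ver = conn.getLibVersion()  # e.g. 2000000 for libvirt 2.0.0
    print('libvirt %d.%d.%d' % (ver // 1000000, (ver % 1000000) // 1000, ver % 1000))

    # Full version-release of the installed package, e.g. "2.0.0-10.el7_3.4".
    out = subprocess.check_output(
        ['rpm', '-q', '--qf', '%{VERSION}-%{RELEASE}\n', 'libvirt-daemon'])
    print('libvirt-daemon %s' % out.decode().strip())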

Comment 4 Yaniv Kaul 2017-02-12 14:01:00 UTC
Since the relevant libvirt fix was released on z-stream, can you re-test?

Comment 5 Lilach Zitnitski 2017-02-13 09:48:20 UTC
(In reply to Yaniv Kaul from comment #4)
> Since the relevant libvirt fix was released on z-stream, can you re-test?

When testing on 4.1.1 - the vm stays active during the process and the lsm completes.

Comment 6 Allon Mureinik 2017-02-13 10:28:40 UTC
(In reply to Yaniv Kaul from comment #4)
> Since the relevant libvirt fix was released on z-stream, can you re-test?

(In reply to Lilach Zitnitski from comment #5)
> (In reply to Yaniv Kaul from comment #4)
> > Since the relevant libvirt fix was released on z-stream, can you re-test?
> 
> When testing on 4.1.1 - the vm stays active during the process and the lsm
> completes.

VDSM for oVirt 4.1.1 already requires a libvirt version that fixes this issue (2.0.0-10.el7_3.4), since vdsm 4.19.3.

Moving to ON_QA.

Lilach - up to you whether your comment 5 counts as verification or whether you want to perform any other tests.

Comment 7 Lilach Zitnitski 2017-02-13 12:31:35 UTC
Tested with rhevm 4.1.1 and vdsm 4.19.5 and got the expected results - moving to VERIFIED.