Bug 1411118 - HA vm's disks live storage migration fails (depends on libvirt bug 1406765 )
Summary: HA vm's disks live storage migration fails (depends on libvirt bug 1406765 )
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.19.1
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.1.1
Assignee: Nir Soffer
QA Contact: Lilach Zitnitski
URL:
Whiteboard:
Depends On: 1406765 1415488
Blocks:
 
Reported: 2017-01-08 14:21 UTC by Lilach Zitnitski
Modified: 2017-04-21 09:38 UTC
CC: 5 users

Fixed In Version: 4.19.3
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-04-21 09:38:23 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.1+
rule-engine: planning_ack+
amureini: devel_ack+
ratamir: testing_ack+


Attachments
logs (204.77 KB, application/zip)
2017-01-08 14:22 UTC, Lilach Zitnitski

Description Lilach Zitnitski 2017-01-08 14:21:27 UTC
Description of problem:
When moving disks between storage domains while those disks are attached to a highly available VM that has a VM lease, the VM pauses during the live storage migration, deletion of the auto-generated LSM snapshot fails, and the whole LSM process fails.

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.0-0.4.master.20170105161132.gitf4e2c11.el7.centos.noarch
vdsm-4.19.1-18.git79e5ea5.el7.centos.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create a VM with a disk and create a lease for this VM on one of the storage domains.
2. Start the VM.
3. While the VM is up, move the disk to another storage domain (a hedged SDK sketch of these steps follows below).
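For reference, the steps above could be scripted against the engine API. The following is only a hedged sketch using the oVirt Python SDK (ovirtsdk4); the engine URL, credentials, cluster, storage domain names ('source_sd', 'target_sd') and disk size are hypothetical placeholders, not values taken from this bug.

import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Connect to the engine API (hypothetical URL/credentials).
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,  # lab setup only
)
system = connection.system_service()

# Resolve the storage domain holding the lease and the migration target.
sds_service = system.storage_domains_service()
lease_sd = sds_service.list(search='name=source_sd')[0]
target_sd = sds_service.list(search='name=target_sd')[0]

# Step 1: create a highly available VM with a VM lease on one storage domain.
vms_service = system.vms_service()
vm = vms_service.add(
    types.Vm(
        name='lsm-test1',
        cluster=types.Cluster(name='Default'),
        template=types.Template(name='Blank'),
        high_availability=types.HighAvailability(enabled=True),
        lease=types.StorageDomainLease(
            storage_domain=types.StorageDomain(id=lease_sd.id),
        ),
    ),
)
vm_service = vms_service.vm_service(vm.id)

# Attach a small thin-provisioned disk on the same storage domain.
attachment = vm_service.disk_attachments_service().add(
    types.DiskAttachment(
        disk=types.Disk(
            name='lsm-test1_Disk1',
            format=types.DiskFormat.COW,
            provisioned_size=1 * 2**30,
            storage_domains=[types.StorageDomain(id=lease_sd.id)],
        ),
        interface=types.DiskInterface.VIRTIO,
        bootable=True,
        active=True,
    ),
)
disk_service = system.disks_service().disk_service(attachment.disk.id)
while disk_service.get().status != types.DiskStatus.OK:
    time.sleep(5)

# Step 2: start the VM and wait until it is up.
vm_service.start()
while vm_service.get().status != types.VmStatus.UP:
    time.sleep(5)

# Step 3: while the VM is up, live-migrate the disk to the other domain.
disk_service.move(storage_domain=types.StorageDomain(id=target_sd.id))

connection.close()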

Actual results:
The VM becomes paused, the auto-generated snapshot for LSM fails to be deleted, and the disk we wanted to move stays in the same storage domain.

Expected results:
The VM should stay up during the whole process, the disk should move to the target storage domain, and the auto-generated snapshot should be deleted when the LSM is over.

Additional info:
I've tested this with a regular VM (without the lease) and the LSM completed successfully.

engine.log

LSM started:

2017-01-08 15:46:02,590+02 INFO  [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (DefaultQuartzScheduler10) [1f14fe28] Running command: LiveMigrateDiskCommand internal: true. Entities affected :  ID: e6c3210e-040c-46cf-91eb-997fadfd86f8 Type: DiskAction group CONFIGURE_DISK_STORAGE with role type USER,  ID: 4e25089f-65eb-4934-ad19-2ee672499b8a Type: StorageAction group CREATE_DISK with role type USER

2017-01-08 15:47:13,244+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler3) [69c98e68] Correlation ID: 1f14fe28, Job ID: ddfed620-0485-4cd2-82a8-92fd9999a026, Call Stack: null, Custom Event ID: -1, Message: User admin@internal-authz have failed to move disk lsm-test1_Disk1 to domain data_nfs1.

2017-01-08 15:48:56,638+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler3) [1b1e3031] Correlation ID: 535ce1d2, Job ID: c9e91118-3a3c-4bc8-ae50-9ba2a4e12921, Call Stack: null, Custom Event ID: -1, Message: Failed to delete snapshot 'Auto-generated for Live Storage Migration' for VM 'lsm-test1'.

Comment 1 Lilach Zitnitski 2017-01-08 14:22:17 UTC
Created attachment 1238393 [details]
logs

engine and vdsm

Comment 2 Yaniv Kaul 2017-01-09 07:16:07 UTC
Lilach, I assume the problematic area is not in Engine but in VDSM:
2017-01-08 15:15:49,213 INFO  (libvirt/events) [virt.vm] (vmId='e5477f2b-6c09-4528-a0d4-1fe3358d7b47') CPU stopped: onSuspend (vm:4865)
2017-01-08 15:15:49,214 ERROR (jsonrpc/0) [virt.vm] (vmId='e5477f2b-6c09-4528-a0d4-1fe3358d7b47') Unable to take snapshot (vm:3506)
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 3503, in snapshot
    self._dom.snapshotCreateXML(snapxml, snapFlags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 69, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 123, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 941, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 2737, in snapshotCreateXML
    if ret is None:raise libvirtError('virDomainSnapshotCreateXML() failed', dom=self)
libvirtError: Failed to acquire lock: File exists
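The failing call is virDomainSnapshotCreateXML(), which vdsm uses to take the disk-only external snapshot that live storage migration needs. As an illustration only (this is not vdsm's actual snapshot code; the domain name and target path below are made up), the shape of that call through the libvirt Python bindings is roughly:

import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('lsm-test1')  # hypothetical domain name

# Disk-only external snapshot into a pre-created volume (hypothetical path).
snap_xml = """
<domainsnapshot>
  <disks>
    <disk name='vda' snapshot='external'>
      <source file='/rhev/data-center/example/new-volume.qcow2'/>
    </disk>
  </disks>
</domainsnapshot>
"""

flags = (libvirt.VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY |
         libvirt.VIR_DOMAIN_SNAPSHOT_CREATE_REUSE_EXT |
         libvirt.VIR_DOMAIN_SNAPSHOT_CREATE_NO_METADATA)

try:
    # When the domain holds a sanlock-managed VM lease, a libvirt without the
    # fix tracked in the dependent bug fails this call with
    # "Failed to acquire lock: File exists", as seen in the traceback above.
    dom.snapshotCreateXML(snap_xml, flags)
except libvirt.libvirtError as e:
    print('snapshot failed: %s' % e)
finally:
    conn.close()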

Comment 3 Nir Soffer 2017-01-09 10:55:46 UTC
This requires libvirt-2.0.0-10.el7_3.4. Vdsm does not require that version yet, since it has not been released yet.

Making this bug depend on the libvirt bug 1403691.
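For illustration only (this helper is not part of vdsm), one way to check whether a host already carries at least that build is to compare the installed package's epoch-version-release with the required one via the rpm Python bindings, since the z-stream suffix (el7_3.4) is not visible through the libvirt API version number:

import subprocess

import rpm  # rpm-python on RHEL/CentOS 7

REQUIRED = ('0', '2.0.0', '10.el7_3.4')  # (epoch, version, release)

def installed_libvirt_evr():
    out = subprocess.check_output(
        ['rpm', '-q', '--qf', '%{EPOCHNUM}:%{VERSION}:%{RELEASE}', 'libvirt'])
    epoch, version, release = out.decode().strip().split(':')
    return (epoch, version, release)

def libvirt_has_fix():
    # rpm.labelCompare() returns -1, 0 or 1, like cmp().
    return rpm.labelCompare(installed_libvirt_evr(), REQUIRED) >= 0

if __name__ == '__main__':
    print('installed libvirt carries the fix: %s' % libvirt_has_fix())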

Comment 4 Yaniv Kaul 2017-02-12 14:01:00 UTC
Since the relevant libvirt fix was released on z-stream, can you re-test?

Comment 5 Lilach Zitnitski 2017-02-13 09:48:20 UTC
(In reply to Yaniv Kaul from comment #4)
> Since the relevant libvirt fix was released on z-stream, can you re-test?

When testing on 4.1.1 - the vm stays active during the process and the lsm completes.

Comment 6 Allon Mureinik 2017-02-13 10:28:40 UTC
(In reply to Yaniv Kaul from comment #4)
> Since the relevant libvirt fix was released on z-stream, can you re-test?

(In reply to Lilach Zitnitski from comment #5)
> (In reply to Yaniv Kaul from comment #4)
> > Since the relevant libvirt fix was released on z-stream, can you re-test?
> 
> When testing on 4.1.1 - the vm stays active during the process and the lsm
> completes.

VDSM in oVirt 4.1.1 already requires a libvirt version that fixes this issue (2.0.0-10.el7_3.4), as of vdsm 4.19.3.

Moving to ON_QA.

Lilach - up to you whether your comment 5 counts as verification or whether you want to perform any other tests.

Comment 7 Lilach Zitnitski 2017-02-13 12:31:35 UTC
Tested with rhevm 4.1.1 and vdsm 4.19.5, and got the expected results - moving to VERIFIED.

