Bug 1161261

Summary: VmReplicateDiskFinishVDSCommand is not executed when Live Storage Migration (LSM) is initiated, leaving unfinished job
Product: Red Hat Enterprise Virtualization Manager
Reporter: Bimal Chollera <bcholler>
Component: ovirt-engine
Assignee: Daniel Erez <derez>
Status: CLOSED CURRENTRELEASE
QA Contact: lkuchlan <lkuchlan>
Severity: high
Docs Contact:
Priority: high
Version: 3.4.3
CC: amureini, ecohen, gklein, gwatson, iheim, lpeer, lsurette, mkalinin, rbalakri, Rhev-m-bugs, scohen, tnisan, yeylon
Target Milestone: ---
Target Release: 3.5.0
Hardware: All
OS: Linux
Whiteboard: storage
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-02-16 19:09:01 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  Description: images
  Flags: none

Description Bimal Chollera 2014-11-06 19:12:36 UTC
Description of problem:

After a Live Storage Migration (LSM), a disk remains in "locked" status and the LSM sequence doesn't complete on the engine side. VmReplicateDiskFinishVDSCommand and DeleteImageGroupVDSCommand are never executed.

The associated job table entry in the RHEV database remains "STARTED".

 correlation_id |                job_id                |   action_type   |                     description                     | status  
----------------+--------------------------------------+-----------------+-----------------------------------------------------+---------
 edf0edf        | de4e97cf-e747-4657-bd07-e60aa278c0f1 | LiveMigrateDisk | Migrating Disk lsm-vm_Disk1 from LSM_GFW to NFS_GFW | STARTED
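
As a rough illustration of how such leftover entries could be spotted, the following Python sketch parses psql-style output like the table above and lists LiveMigrateDisk jobs still in "STARTED". The helper names (parse_psql_table, unfinished_lsm_jobs) are hypothetical and not part of RHEV. Note that a migration that is still running also shows "STARTED", so this check only means something once the LSM should already have finished.

```python
# Sketch only: helper names are hypothetical; the sample text is the
# job table row quoted in this report.
PSQL_OUTPUT = """\
 correlation_id |                job_id                |   action_type   |                     description                     | status
----------------+--------------------------------------+-----------------+-----------------------------------------------------+---------
 edf0edf        | de4e97cf-e747-4657-bd07-e60aa278c0f1 | LiveMigrateDisk | Migrating Disk lsm-vm_Disk1 from LSM_GFW to NFS_GFW | STARTED
"""


def parse_psql_table(text):
    """Turn aligned psql output (header, separator, rows) into dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = [cell.strip() for cell in lines[0].split("|")]
    rows = []
    for ln in lines[2:]:  # lines[1] is the ---+--- separator
        cells = [cell.strip() for cell in ln.split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


def unfinished_lsm_jobs(rows):
    """Return LiveMigrateDisk jobs whose status is still STARTED."""
    return [
        row for row in rows
        if row["action_type"] == "LiveMigrateDisk" and row["status"] == "STARTED"
    ]


stuck = unfinished_lsm_jobs(parse_psql_table(PSQL_OUTPUT))
for job in stuck:
    print(job["job_id"], job["description"])
```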

In the Admin Portal, the Disks menu will report the disk attached to the VM in "locked" status.

The engine sequence below completes but VmReplicateDiskFinishVDSCommand and DeleteImageGroupVDSCommand are never executed after the SyncImageGroupDataVDSCommand.

CloneImageGroupStructureVDSCommand
VmReplicateDiskStartVDSCommand
SyncImageGroupDataVDSCommand
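
The gap can be expressed as a small check. This is a minimal Python sketch that compares the commands observed for one LSM against the expected sequence. The five-step order is assumed from this description (the three commands listed above, followed by the two that were never run), and the function name missing_lsm_steps is hypothetical.

```python
# Sketch only: the expected order is an assumption based on this report,
# not taken from the engine source.
EXPECTED_LSM_SEQUENCE = [
    "CloneImageGroupStructureVDSCommand",
    "VmReplicateDiskStartVDSCommand",
    "SyncImageGroupDataVDSCommand",
    "VmReplicateDiskFinishVDSCommand",
    "DeleteImageGroupVDSCommand",
]


def missing_lsm_steps(observed):
    """Return the expected commands that never appear in `observed`."""
    seen = set(observed)
    return [cmd for cmd in EXPECTED_LSM_SEQUENCE if cmd not in seen]


# Commands seen in the engine log for the disk that stayed locked.
observed = [
    "CloneImageGroupStructureVDSCommand",
    "VmReplicateDiskStartVDSCommand",
    "SyncImageGroupDataVDSCommand",
]

for cmd in missing_lsm_steps(observed):
    print("never executed:", cmd)
```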

I have been able to recreate this several times by live migrating 20 (arbitrary number) disks at once.

Version-Release number of selected component (if applicable):

Test environment:

   - RHEV-M 3.4.3 
   - Single host with (essentially) vdsm-4.14.13-2

How reproducible:

Every time I have tried to live migrate 20 disks at once, I have encountered this problem.

Steps to Reproduce:

1. In my specific case I created a pool of 20 VMs based off a template in an NFS data domain. 
2. I then started all 20 VMs. 
3. I then copied the template to a second NFS domain.
4. I then live migrated all 20 disks to the second NFS domain.

Actual results:

One of the 20 failed as described above.

Expected results:

All of the LSMs should complete and the associated jobs in the database be marked as "FINISHED".

Comment 8 Daniel Erez 2014-11-18 14:44:31 UTC
Should be fixed in 3.5 build.

Comment 9 lkuchlan 2014-12-01 15:20:29 UTC
Created attachment 963328
images

Tested using RHEVM 3.5 vt11.
All of the LSMs completed and the associated jobs in the database were marked as "FINISHED".

Comment 10 Allon Mureinik 2015-02-16 19:09:01 UTC
RHEV-M 3.5.0 has been released, closing this bug.