Description of problem:

We have seen two customer cases where a live storage migration has essentially completed, but the associated job table entry in the RHEV database is never marked as "FINISHED", e.g.

    action_type   |                     description                      | status  |         start_time         | end_time
  ----------------+------------------------------------------------------+---------+----------------------------+----------
  LiveMigrateDisk | Migrating Disk lsm-vm_Disk1 from LSM_GFW to NFS_GFW  | STARTED | 2014-11-04 17:34:20.213-05 |

In the Admin Portal the Tasks pane shows that the storage migration is still in progress.

The engine sequence below completes:

  CloneImageGroupStructureVDSCommand
  VmReplicateDiskStartVDSCommand
  SyncImageGroupDataVDSCommand
  VmReplicateDiskFinishVDSCommand
  DeleteImageGroupVDSCommand

On the SPM host these have completed successfully, and looking at the storage domains, the disk images exist only in the destination domain. However, what is missing from the engine sequence is the following:

2014-10-22 17:55:26,184 INFO [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (org.ovirt.thread.pool-4-thread-46) [295a737f] Ending command successfully: org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand
2014-10-22 17:55:26,184 INFO [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (org.ovirt.thread.pool-4-thread-46) [295a737f] Lock freed to object EngineLock [exclusiveLocks= , sharedLocks= key: 6e0cbf5c-b52c-488d-b1dd-a0565dd31ba7 value: VM
2014-10-22 17:55:27,717 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-4-thread-46) [295a737f] Correlation ID: 295a737f, Job ID: c55f0eb3-3b37-48a9-b3af-6a3c4740ae4e, Call Stack: null, Custom Event ID: -1, Message: User admin finished moving disk VM_Disk1 to domain SD-A.

I have been able to recreate this several times by live migrating 20 (an arbitrary number) disks at once.

Version-Release number of selected component (if applicable):

My test environment:
- RHEV-M 3.4.3
- Single host with (essentially) vdsm-4.14.13-2

Customer environments:
- RHEV-M 3.3.4
  - Hosts with vdsm-4.14.11-5
- RHEV-M 3.4.2
  - 2 hosts with vdsm-4.13.2-0.13
  - 1 host with vdsm-4.14.11-5 (SPM)

How reproducible:
Every time I have live migrated 20 disks at once.

Steps to Reproduce:
1. Create a pool of 20 VMs based on a template in an NFS data domain.
2. Start all 20 VMs.
3. Copy the template to a second NFS domain.
4. Live migrate all 20 disks to the second NFS domain.

Actual results:
One of the 20 migrations failed as described above. Another failed in a different way, which will be described in a separate bug: its disk remained "locked" and the LSM sequence did not complete on the engine side; VmReplicateDiskFinishVDSCommand was never executed.

Expected results:
All of the LSMs complete and the associated jobs in the database are marked as "FINISHED".

Additional info:
Supporting data will be added shortly.
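For anyone checking an affected setup, a query along the following lines can list LSM jobs that are stuck in STARTED. This is only a sketch: it assumes the standard oVirt engine "job" table (whose columns match the psql output above) and is run with psql against the engine database.

  -- Sketch only: assumes the standard engine "job" table, with the columns
  -- shown in the output above; run against the engine database via psql.
  SELECT action_type, description, status, start_time, end_time
    FROM job
   WHERE action_type = 'LiveMigrateDisk'
     AND status = 'STARTED'
   ORDER BY start_time;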
Hi Gordon,

Can you please attach the log of the failed migration mentioned in the bug description: "Another failed in a different way, which will be described in a separate bug. The disk for this one remained 'locked' and the LSM sequence did not complete on the engine side."
Hi Gordon,

Have you tried reproducing this on master/3.5?
Ravi, Daniel - how are we proceeding here?
Should be fixed in 3.5 build.
Eyal, can you please add the version it is fixed in? Thanks
(In reply to lkuchlan from comment #15)
> Eyal, can you please add the version it is fixed in?
> Thanks

RHEV 3.5 vt8 should have this fix.
Created attachment 963240
image

Tested using RHEV 3.5 vt11.

All of the LSMs completed successfully and the associated jobs in the database are marked as "FINISHED".
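For reference, a verification query along these lines, under the same assumptions as the sketch in the description (the standard engine "job" table), groups LiveMigrateDisk jobs by status; a clean run reports only FINISHED rows.

  -- Sketch only: assumes the standard engine "job" table as above.
  -- A clean run should report only FINISHED rows for LiveMigrateDisk.
  SELECT status, count(*) AS jobs
    FROM job
   WHERE action_type = 'LiveMigrateDisk'
   GROUP BY status;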
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-0158.html