Description of problem:

We have seen two customer cases where a live storage migration has essentially completed, but the associated job table entry in the RHEV database is never marked as "FINISHED", e.g.

    action_type   |                     description                      | status  |         start_time         | end_time
  ----------------+------------------------------------------------------+---------+----------------------------+----------
  LiveMigrateDisk | Migrating Disk lsm-vm_Disk1 from LSM_GFW to NFS_GFW  | STARTED | 2014-11-04 17:34:20.213-05 |

In the Admin Portal the Tasks pane shows that the storage migration is still in progress.

The engine sequence below completes:

  CloneImageGroupStructureVDSCommand
  VmReplicateDiskStartVDSCommand
  SyncImageGroupDataVDSCommand
  VmReplicateDiskFinishVDSCommand
  DeleteImageGroupVDSCommand

On the SPM host these have completed successfully, and looking at the storage domains, the disk images exist only in the destination domain. However, what is missing from the engine sequence is the following:

2014-10-22 17:55:26,184 INFO [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (org.ovirt.thread.pool-4-thread-46) [295a737f] Ending command successfully: org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand
2014-10-22 17:55:26,184 INFO [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (org.ovirt.thread.pool-4-thread-46) [295a737f] Lock freed to object EngineLock [exclusiveLocks= , sharedLocks= key: 6e0cbf5c-b52c-488d-b1dd-a0565dd31ba7 value: VM
2014-10-22 17:55:27,717 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-4-thread-46) [295a737f] Correlation ID: 295a737f, Job ID: c55f0eb3-3b37-48a9-b3af-6a3c4740ae4e, Call Stack: null, Custom Event ID: -1, Message: User admin finished moving disk VM_Disk1 to domain SD-A.

I have been able to recreate this several times by live migrating 20 (an arbitrary number) disks at once.

Version-Release number of selected component (if applicable):

My test environment:
- RHEV-M 3.4.3
- Single host with (essentially) vdsm-4.14.13-2

Customer environments:
- RHEV-M 3.3.4
  - Hosts with vdsm-4.14.11-5
- RHEV-M 3.4.2
  - 2 hosts with vdsm-4.13.2-0.13
  - 1 host with vdsm-4.14.11-5 (SPM)

How reproducible:
Every time I have live migrated 20 disks at once.

Steps to Reproduce:
1. Create a pool of 20 VMs based on a template in an NFS data domain.
2. Start all 20 VMs.
3. Copy the template to a second NFS domain.
4. Live migrate all 20 disks to the second NFS domain.

Actual results:
One of the 20 migrations failed as described above. Another failed in a different way, which will be described in a separate bug: its disk remained "locked" and the LSM sequence did not complete on the engine side; VmReplicateDiskFinishVDSCommand was never executed.

Expected results:
All of the LSMs complete and the associated jobs in the database are marked as "FINISHED".

Additional info:
Supporting data will be added shortly.
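For anyone checking an affected setup, a query along the following lines can list LSM jobs that are stuck in STARTED. This is only a sketch: it assumes the standard oVirt engine "job" table (whose columns match the psql output above) and is run with psql against the engine database.

  -- Sketch only: assumes the standard engine "job" table, with the columns
  -- shown in the output above; run against the engine database via psql.
  SELECT action_type, description, status, start_time, end_time
    FROM job
   WHERE action_type = 'LiveMigrateDisk'
     AND status = 'STARTED'
   ORDER BY start_time;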
Hi Gordon,

Can you please attach the log of the failed migration mentioned in the bug description: "Another failed in a different way, which will be described in a separate bug. The disk for this one remained 'locked' and the LSM sequence did not complete on the engine side."
Hi Gordon,

Have you tried reproducing this on master/3.5?
Ravi, Daniel - how are we proceeding here?
Should be fixed in 3.5 build.
Eyal, can you please add the version it is fixed in? Thanks
(In reply to lkuchlan from comment #15)
> Eyal, can you please add the version it is fixed in?
> Thanks

RHEV 3.5 vt8 should have this fix.
Created attachment 963240
image

Tested using RHEV 3.5 vt11.

All of the LSMs completed successfully and the associated jobs in the database are marked as "FINISHED".
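For reference, a verification query along these lines, under the same assumptions as the sketch in the description (the standard engine "job" table), groups LiveMigrateDisk jobs by status; a clean run reports only FINISHED rows.

  -- Sketch only: assumes the standard engine "job" table as above.
  -- A clean run should report only FINISHED rows for LiveMigrateDisk.
  SELECT status, count(*) AS jobs
    FROM job
   WHERE action_type = 'LiveMigrateDisk'
   GROUP BY status;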
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-0158.html