Bug 1318724

Summary: VmReplicateDiskFinishVDSCommand is not executed when Live Storage Migration is initiated, leaving an unfinished job
Product: Red Hat Enterprise Virtualization Manager
Reporter: Frank DeLorey <fdelorey>
Component: ovirt-engine
Assignee: Daniel Erez <derez>
Status: CLOSED WORKSFORME
QA Contact: Aharon Canan <acanan>
Severity: high
Priority: high
Version: 3.5.1
CC: acanan, amureini, derez, ebenahar, fdelorey, gwatson, jcoscia, lsurette, mgoldboi, rbalakri, Rhev-m-bugs, tnisan, yeylon, ykaul
Target Milestone: ovirt-4.0.0-rc
Keywords: Reopened
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
oVirt Team: Storage
Last Closed: 2016-06-13 11:24:32 UTC

Description Frank DeLorey 2016-03-17 15:29:45 UTC
Description of problem:

After a Live Storage Migration (LSM), a disk remains in "locked" status and the LSM sequence doesn't complete on the engine side. VmReplicateDiskFinishVDSCommand and DeleteImageGroupVDSCommand are never executed. This was reported as fixed in BZ 1161261; however, it is still happening.

The associated job table entry in the RHEV database remains "STARTED".

Version-Release number of selected component (if applicable):

engine=> select * from job order by start_time desc;
-[ RECORD 1 ]----+-------------------------------------------------------------------
job_id           | dfabc2ad-9081-4566-b35d-ccaa611d47d7
action_type      | LiveMigrateDisk
description      | Migrating Disk opsprj01_Disk1 from HostingStor02 to HostingStor04
status           | STARTED
owner_id         | 5932b845-ab89-4090-b4b0-87a82a0af8e1
visible          | t
start_time       | 2016-03-11 13:38:42.841+05:30
end_time         |
last_update_time | 2016-03-11 13:53:59.858+05:30
correlation_id   | 2fdd57df
is_external      | f
is_auto_cleared  | t
(1 row)
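
For reference, a narrower query along these lines (the job table and column names are as shown above; the filter values are illustrative) lists only the stuck LSM jobs:

engine=> select job_id, description, status, start_time, last_update_time
engine->   from job
engine->   where action_type = 'LiveMigrateDisk' and status = 'STARTED'
engine->   order by start_time desc;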

How reproducible:

N/A

Steps to Reproduce:
1. Customer was live migrating multiple VM disks from HostingStor02 to HostingStor04.
2. One LSM task did not complete for disk opsprj01_Disk1.


Actual results:

One disk failed to clean up after the LSM completed.

Expected results:

All of the LSMs should complete, and the associated jobs in the database should be marked as "FINISHED".
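
As a sanity check, a query along these lines (against the same job table shown above; the filter values are illustrative) should return no rows once all the migrations have been cleaned up:

engine=> select job_id, description, status
engine->   from job
engine->   where action_type = 'LiveMigrateDisk' and status <> 'FINISHED';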

Comment 1 Frank DeLorey 2016-03-17 16:28:49 UTC
I have all the logs from RHEVM, the host running the VM, and the SPM; however, they are too large to attach to the case.

Comment 2 Daniel Erez 2016-03-28 12:42:36 UTC
Hi Frank,

Cleaning up after a failure in live storage migration is currently not applicable in every scenario - see https://bugzilla.redhat.com/show_bug.cgi?id=1070863#c5.
In order to ensure it's indeed a similar issue, can you please try to reproduce it and attach the relevant sections from the engine and vdsm logs?

Thanks!

Comment 3 Allon Mureinik 2016-04-11 12:09:21 UTC
(In reply to Daniel Erez from comment #2)
> Hi Frank,
> 
> Cleaning up after a failure in live storage migration is currently not
> applicable in every scenario - see
> https://bugzilla.redhat.com/show_bug.cgi?id=1070863#c5.
> In order to ensure it's indeed a similar issue, can you please try to
> reproduce it and attach the relevant sections from the engine and vdsm logs?
> 
> Thanks!
Closing under the assumption this is the same scenario, since the needinfo went unanswered.

Feel free to reopen when you have those logs.

Comment 11 Daniel Erez 2016-04-24 07:18:05 UTC
Hi Javier,

To understand the status of the migrated disks, can you please attach the output of 'vdsClient -s 0 getAllTasksInfo' from the SPM?

Comment 15 Daniel Erez 2016-06-01 09:50:40 UTC
According to the engine logs ([1]/[2]/[3]), it seems there was an issue with the host right around the SyncImageGroupData initiation. Is the issue reproduced consistently, or was it a specific scenario that wasn't reproduced?

[1]
2016-04-12 11:06:09,304 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.SyncImageGroupDataVDSCommand] (org.ovirt.thread.pool-7-thread-32) [5f00c88] START, SyncImageGroupDataVDSCommand( storagePoolId = cbcfe73c-c50b-11e1-a347-00237d9cdcbd, ignoreFailoverLimit = false, storageDomainId = d1c3a3be-f87b-4b32-8753-ed8dc4d283d7, imageGroupId = 4cd6d305-edef-4125-b984-27929bb3c438, dstDomainId = b1170d78-c7ae-4eff-a36d-496995ace80b, syncType=INTERNAL), log id: 63003ab6

[2]
2016-04-12 11:06:09,719 INFO  [org.ovirt.engine.core.bll.CommandMultiAsyncTasks] (org.ovirt.thread.pool-7-thread-32) [5f00c88] [within thread]: Some of the tasks related to command id 07fc1fc0-20b2-42fe-9062-c568046f45a0 were not cleared yet (Task id 9d7302d8-aa2b-48fa-a4f7-7599c4e80751 is in state Polling).
2016-04-12 11:06:14,186 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (org.ovirt.thread.pool-7-thread-34) [214f9840] Command DestroyVDSCommand(HostName = ljrhev-h12.lojackhq.com.ar, HostId = 7d676242-fc8a-4245-a0ad-62238bdd7216, vmId=855343a9-62ab-4af0-8762-b15567ecf554, force=false, secondsToWait=30, gracefully=true, reason=) execution failed. Exception: VDSNetworkException: java.util.concurrent.TimeoutException

[3]
2016-04-12 12:30:42,794 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-2) [74e5c60c] Command GetCapabilitiesVDSCommand(HostName = ljrhev-h12.lojackhq.com.ar, HostId = 7d676242-fc8a-4245-a0ad-62238bdd7216, vds=Host[ljrhev-h12.lojackhq.com.ar,7d676242-fc8a-4245-a0ad-62238bdd7216]) execution failed. Exception: VDSRecoveringException: Recovering from crash or Initializing

Comment 16 Frank DeLorey 2016-06-02 20:49:28 UTC
This was a specific scenario that wasn't reproduced as far as I know. I do believe that there was another customer that hit this also.

Frank

Comment 17 Daniel Erez 2016-06-06 11:20:32 UTC
Couldn't find any tangible information in the logs to indicate the cause, nor have we been able to reproduce a similar scenario on the latest version. Since the issue was encountered in an early build of 3.5, it may already have been mitigated in the latest bits. Since the live migration operation in the engine was transformed onto the new CoCo infrastructure in 4.1, and the issue seems specific and not easily reproduced, moving the bug to MODIFIED for verification on 4.1.

@Tal - can you please move this to 4.1?

Comment 18 Aharon Canan 2016-06-09 14:42:49 UTC
Do we know how to reproduce this?

Comment 19 Daniel Erez 2016-06-13 05:04:41 UTC
(In reply to Aharon Canan from comment #18)
> Do we know how to reproduce this?

No, it wasn't consistently reproduced, as mentioned in: https://bugzilla.redhat.com/show_bug.cgi?id=1318724#c16

Comment 20 Aharon Canan 2016-06-13 08:22:15 UTC
We do not know how to reproduce/verify this, and therefore can't ACK this one.

Comment 21 Yaniv Kaul 2016-06-13 09:17:56 UTC
Daniel - why is it in MODIFIED state? Which patch solves this?

Comment 22 Daniel Erez 2016-06-13 09:39:31 UTC
(In reply to Yaniv Kaul from comment #21)
> Daniel - why is it in MODIFIED state? Which patch solves this?

Explained in https://bugzilla.redhat.com/show_bug.cgi?id=1318724#c17
We weren't able to reproduce it on the latest version; it seems to be mitigated by the recent work of transforming the logic onto the new CoCo infrastructure (https://gerrit.ovirt.org/#/c/52568/). If it can't be reproduced by QE, should I close it as WORKSFORME?

Comment 23 Yaniv Kaul 2016-06-13 11:20:45 UTC
(In reply to Daniel Erez from comment #22)
> (In reply to Yaniv Kaul from comment #21)
> > Daniel - why is it in MODIFIED state? Which patch solves this?
> 
> Explained in https://bugzilla.redhat.com/show_bug.cgi?id=1318724#c17
> We weren't able to reproduce it on the latest version; it seems to be
> mitigated by the recent work of transforming the logic onto the new CoCo
> infrastructure (https://gerrit.ovirt.org/#/c/52568/). If it can't be
> reproduced by QE, should I close it as WORKSFORME?

Yes. You can ask QE to run their flows to make sure nothing is broken, but it certainly should not move to MODIFIED.

Comment 24 Allon Mureinik 2016-06-13 11:24:32 UTC
We don't have any reproducer. If QA's regression testing on live merge passes, we'll consider this issue solved.