Bug 1318724 - VmReplicateDiskFinishVDSCommand is not executed when Live StorageMigration is initiated, leaving unfinished job
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.5.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.0.0-rc
Target Release: ---
Assignee: Daniel Erez
QA Contact: Aharon Canan
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-03-17 15:29 UTC by Frank DeLorey
Modified: 2019-11-14 07:37 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-06-13 11:24:32 UTC
oVirt Team: Storage
Target Upstream Version:


Links
System                            | ID      | Private | Priority | Status | Summary | Last Updated
Red Hat Knowledge Base (Solution) | 1298893 | 0       | None     | None   | None    | 2016-04-19 19:32:55 UTC

Description Frank DeLorey 2016-03-17 15:29:45 UTC
Description of problem:

After a Live Storage Migration (LSM), a disk remains in "locked" status and the LSM sequence doesn't complete on the engine side. VmReplicateDiskFinishVDSCommand and DeleteImageGroupVDSCommand are never executed. This was reported as fixed in BZ 1161261; however, it is still happening.
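
For anyone triaging a similar case, here is a minimal Python sketch of checking an engine.log for the two commands named above. It assumes the engine writes a line containing "START, <CommandName>(" when a VDS command begins; the helper name missing_commands is hypothetical and not part of any oVirt tooling.

```python
import re

# Commands that this report says were never executed.
EXPECTED_COMMANDS = (
    "VmReplicateDiskFinishVDSCommand",
    "DeleteImageGroupVDSCommand",
)


def missing_commands(log_text, expected=EXPECTED_COMMANDS):
    """Return the expected commands that never logged a START line.

    Assumption: a command start is logged as 'START, <CommandName>('.
    """
    started = set(re.findall(r"START, (\w+)\(", log_text))
    return [name for name in expected if name not in started]
```

Calling missing_commands(open("engine.log").read()) would list whichever of the two commands never started. An empty list means both were at least initiated, not that they succeeded.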

The associated job table entry in the RHEV database remains "STARTED".

Version-Release number of selected component (if applicable):

engine=> select * from job order by start_time desc;
                job_id                |   action_type   |                            description                            | status  |               owner_id               | visible |          start_time           | end_time |       last_update_time        | correlation_id | is_external | is_auto_cleared 
--------------------------------------+-----------------+-------------------------------------------------------------------+---------+--------------------------------------+---------+-------------------------------+----------+-------------------------------+----------------+-------------+-----------------
 dfabc2ad-9081-4566-b35d-ccaa611d47d7 | LiveMigrateDisk | Migrating Disk opsprj01_Disk1 from HostingStor02 to HostingStor04 | STARTED | 5932b845-ab89-4090-b4b0-87a82a0af8e1 | t       | 2016-03-11 13:38:42.841+05:30 |          | 2016-03-11 13:53:59.858+05:30 | 2fdd57df       | f           | t
(1 row)
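
A rough Python sketch of how rows like the one above could be screened for jobs that look stuck: still STARTED, with no end_time and no recent update. The dictionary keys mirror the column names in the query output; the one-hour threshold and the function names are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def parse_ts(text):
    """Parse a psql timestamp such as '2016-03-11 13:53:59.858+05:30'."""
    return datetime.fromisoformat(text)


def stale_jobs(jobs, now, max_idle=timedelta(hours=1)):
    """Return jobs that are STARTED, have no end_time, and were last
    updated more than max_idle before now (hypothetical threshold)."""
    return [
        job
        for job in jobs
        if job["status"] == "STARTED"
        and job["end_time"] is None
        and now - job["last_update_time"] > max_idle
    ]


# The row from this report, reduced to the columns the check needs.
job = {
    "job_id": "dfabc2ad-9081-4566-b35d-ccaa611d47d7",
    "action_type": "LiveMigrateDisk",
    "status": "STARTED",
    "end_time": None,
    "last_update_time": parse_ts("2016-03-11 13:53:59.858+05:30"),
}
```

Timestamps are timezone-aware, so the comparison remains valid when now is expressed in UTC and the database values carry a +05:30 offset.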

How reproducible:

N/A

Steps to Reproduce:
1. Customer was live migrating multiple vm disks from HostingStor02 to HostingStor04.
2. One LSM task did not complete for disk opsprj01_Disk1.


Actual results:

One disk (opsprj01_Disk1) was not cleaned up after its LSM and remained in "locked" status.

Expected results:

All of the LSMs should complete and the associated jobs in the database be marked as "FINISHED".

Comment 1 Frank DeLorey 2016-03-17 16:28:49 UTC
I have all the logs from RHEVM, the host running the VM, and the SPM; however, they are too large to attach to the case.

Comment 2 Daniel Erez 2016-03-28 12:42:36 UTC
Hi Frank,

Cleaning up after a failure in live storage migration is currently not applicable in every scenario - see https://bugzilla.redhat.com/show_bug.cgi?id=1070863#c5.
In order to ensure it's indeed a similar issue can you please try to reproduce it and attach the relevant sections from engine and vdsm logs.

Thanks!

Comment 3 Allon Mureinik 2016-04-11 12:09:21 UTC
(In reply to Daniel Erez from comment #2)
> Hi Frank,
> 
> Cleaning up after a failure in live storage migration is currently not
> applicable in every scenario - see
> https://bugzilla.redhat.com/show_bug.cgi?id=1070863#c5.
> In order to ensure it's indeed a similar issue can you please try to
> reproduce it and attach the relevant sections from engine and vdsm logs.
> 
> Thanks!
Closing under the assumption this is the same scenario, since the needinfo went unanswered.

Feel free to reopen when you have those logs.

Comment 11 Daniel Erez 2016-04-24 07:18:05 UTC
Hi Javier,

To understand the status of the migrated disks, can you please attach the output of 'vdsClient -s 0 getAllTasksInfo' from the SPM?

Comment 15 Daniel Erez 2016-06-01 09:50:40 UTC
According to the engine logs ([1]/[2]/[3]), it seems there was an issue with the host around the time SyncImageGroupData was initiated. Is the issue reproduced consistently, or was it a specific scenario that wasn't reproduced?

[1]
2016-04-12 11:06:09,304 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.SyncImageGroupDataVDSCommand] (org.ovirt.thread.pool-7-thread-32) [5f00c88] START, SyncImageGroupDataVDSCommand( storagePoolId = cbcfe73c-c50b-11e1-a347-00237d9cdcbd, ignoreFailoverLimit = false, storageDomainId = d1c3a3be-f87b-4b32-8753-ed8dc4d283d7, imageGroupId = 4cd6d305-edef-4125-b984-27929bb3c438, dstDomainId = b1170d78-c7ae-4eff-a36d-496995ace80b, syncType=INTERNAL), log id: 63003ab6

[2]
2016-04-12 11:06:09,719 INFO  [org.ovirt.engine.core.bll.CommandMultiAsyncTasks] (org.ovirt.thread.pool-7-thread-32) [5f00c88] [within thread]: Some of the tasks related to command id 07fc1fc0-20b2-42fe-9062-c568046f45a0 were not cleared yet (Task id 9d7302d8-aa2b-48fa-a4f7-7599c4e80751 is in state Polling).
2016-04-12 11:06:14,186 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (org.ovirt.thread.pool-7-thread-34) [214f9840] Command DestroyVDSCommand(HostName = ljrhev-h12.lojackhq.com.ar, HostId = 7d676242-fc8a-4245-a0ad-62238bdd7216, vmId=855343a9-62ab-4af0-8762-b15567ecf554, force=false, secondsToWait=30, gracefully=true, reason=) execution failed. Exception: VDSNetworkException: java.util.concurrent.TimeoutException

[3]
2016-04-12 12:30:42,794 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-2) [74e5c60c] Command GetCapabilitiesVDSCommand(HostName = ljrhev-h12.lojackhq.com.ar, HostId = 7d676242-fc8a-4245-a0ad-62238bdd7216, vds=Host[ljrhev-h12.lojackhq.com.ar,7d676242-fc8a-4245-a0ad-62238bdd7216]) execution failed. Exception: VDSRecoveringException: Recovering from crash or Initializing
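
A minimal Python sketch of how excerpts like [1]-[3] could be pulled out of a large engine.log. The regular expression is inferred only from the line layout visible above (timestamp, level, logger in brackets, thread in parentheses, correlation ID in brackets, message), so it may not match every engine log line. The names EngineLogLine, parse_line and errors_for_host are hypothetical.

```python
import re
from typing import NamedTuple


class EngineLogLine(NamedTuple):
    timestamp: str
    level: str
    logger: str
    thread: str
    correlation_id: str
    message: str


# Layout assumed from the excerpts in this comment.
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<logger>[^\]]+)\]\s+"
    r"\((?P<thread>[^)]+)\)\s+"
    r"\[(?P<correlation_id>[^\]]*)\]\s+"
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Split one engine log line into fields, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    return EngineLogLine(**match.groupdict()) if match else None


def errors_for_host(lines, host_name):
    """Return parsed ERROR lines whose message mentions 'HostName = <host_name>'."""
    parsed = (parse_line(line) for line in lines)
    return [
        entry
        for entry in parsed
        if entry is not None
        and entry.level == "ERROR"
        and f"HostName = {host_name}" in entry.message
    ]
```

Filtering on the correlation_id field (for example 5f00c88) instead of the host name would gather all lines belonging to one engine flow.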

Comment 16 Frank DeLorey 2016-06-02 20:49:28 UTC
This was a specific scenario that wasn't reproduced as far as I know. I do believe that there was another customer that hit this also.

Frank

Comment 17 Daniel Erez 2016-06-06 11:20:32 UTC
I couldn't find any tangible information in the logs that indicates the cause, nor was I able to reproduce a similar scenario on the latest version. Since the issue was encountered in an early build of 3.5, it could already have been mitigated in the latest bits. Since the live migration operation in the engine was moved to the new CoCo infrastructure in 4.1, and the issue seems to be specific and not easily reproduced, I'm moving the bug to MODIFIED for verification on 4.1.

@Tal - can you please move to 4.1.

Comment 18 Aharon Canan 2016-06-09 14:42:49 UTC
do we know how to reproduce?

Comment 19 Daniel Erez 2016-06-13 05:04:41 UTC
(In reply to Aharon Canan from comment #18)
> do we know how to reproduce?

No, it wasn't consistently reproduced, as mentioned in: https://bugzilla.redhat.com/show_bug.cgi?id=1318724#c16

Comment 20 Aharon Canan 2016-06-13 08:22:15 UTC
We do not know how to reproduce or verify this, and therefore can't ACK this one.

Comment 21 Yaniv Kaul 2016-06-13 09:17:56 UTC
Daniel - why is it in MODIFIED state? Which patch solves this?

Comment 22 Daniel Erez 2016-06-13 09:39:31 UTC
(In reply to Yaniv Kaul from comment #21)
> Daniel - why is it in MODIFIED state? Which patch solves this?

Explained in https://bugzilla.redhat.com/show_bug.cgi?id=1318724#c17
We weren't able to reproduce it on the latest version; it seems to be mitigated by the recent work of moving the logic to the new CoCo infrastructure (https://gerrit.ovirt.org/#/c/52568/). If QE can't reproduce it, should I close it as WORKSFORME?

Comment 23 Yaniv Kaul 2016-06-13 11:20:45 UTC
(In reply to Daniel Erez from comment #22)
> (In reply to Yaniv Kaul from comment #21)
> > Daniel - why is it in MODIFIED state? Which patch solves this?
> 
> Explained in https://bugzilla.redhat.com/show_bug.cgi?id=1318724#c17
> We weren't able to reproduce it on the latest version; it seems to be
> mitigated by the recent work of moving the logic to the new CoCo
> infrastructure (https://gerrit.ovirt.org/#/c/52568/). If QE can't reproduce
> it, should I close it as WORKSFORME?

Yes. You can ask QE to run their flows to make sure nothing is broken, but certainly it should not move to MODIFIED.

Comment 24 Allon Mureinik 2016-06-13 11:24:32 UTC
We don't have any reproducer. If QA's regression testing on live merge passes, we'll consider this issue as solved.

