Description of problem:
After a Live Storage Migration (LSM), a disk remains in "locked" status and the LSM sequence doesn't complete on the engine side. VmReplicateDiskFinishVDSCommand and DeleteImageGroupVDSCommand are never executed. This was reported as fixed in BZ 1161261; however, it is still happening. The associated entry in the job table of the RHEV database remains in "STARTED" status:

engine=> select * from job order by start_time desc;
                job_id                |   action_type   |                            description                            | status  |               owner_id               | visible |          start_time           | end_time |       last_update_time        | correlation_id | is_external | is_auto_cleared
--------------------------------------+-----------------+--------------------------------------------------------------------+---------+--------------------------------------+---------+-------------------------------+----------+-------------------------------+----------------+-------------+-----------------
 dfabc2ad-9081-4566-b35d-ccaa611d47d7 | LiveMigrateDisk | Migrating Disk opsprj01_Disk1 from HostingStor02 to HostingStor04 | STARTED | 5932b845-ab89-4090-b4b0-87a82a0af8e1 | t       | 2016-03-11 13:38:42.841+05:30 |          | 2016-03-11 13:53:59.858+05:30 | 2fdd57df       | f           | t
(1 row)

Version-Release number of selected component (if applicable):

How reproducible:
N/A

Steps to Reproduce:
1. Customer was live migrating multiple VM disks from HostingStor02 to HostingStor04.
2. One LSM task did not complete for disk opsprj01_Disk1.

Actual results:
One disk failed to clean up after completing the LSM.

Expected results:
All of the LSMs should complete and the associated jobs in the database should be marked as "FINISHED".
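For reference, a minimal sketch of querying for such stuck entries, assuming the default "engine" database and the job table columns shown above (adjust for your environment); any LiveMigrateDisk row still STARTED long after the migration ended points at a job that was never marked FINISHED:

engine=> select job_id, action_type, status, start_time, last_update_time
engine->   from job
engine->   where status = 'STARTED' and action_type = 'LiveMigrateDisk'
engine->   order by start_time desc;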
I have all the logs from RHEVM, the host running the VM, and the SPM; however, they are too large to attach to the case.
Hi Frank,

Cleaning up after a failure in live storage migration is currently not applicable in every scenario - see https://bugzilla.redhat.com/show_bug.cgi?id=1070863#c5. To confirm it's indeed a similar issue, can you please try to reproduce it and attach the relevant sections from the engine and vdsm logs?

Thanks!
(In reply to Daniel Erez from comment #2)
> Hi Frank,
>
> Cleaning up after a failure in live storage migration is currently not
> applicable in every scenario - see
> https://bugzilla.redhat.com/show_bug.cgi?id=1070863#c5.
> To confirm it's indeed a similar issue, can you please try to reproduce it
> and attach the relevant sections from the engine and vdsm logs?
>
> Thanks!

Closing under the assumption this is the same scenario, since the needinfo went unanswered. Feel free to reopen when you have those logs.
Hi Javier,

To understand the status of the migrated disks, can you please attach the output of 'vdsClient -s 0 getAllTasksInfo' from the SPM?
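A minimal sketch of collecting that output, assuming vdsClient is available on the SPM host and the host uses SSL (hence the -s flag); the second verb is assumed available in this vdsClient version:

# Run on the SPM host: details of all SPM tasks
vdsClient -s 0 getAllTasksInfo
# Also capture the state of each task
vdsClient -s 0 getAllTasksStatuses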
According to the engine logs ([1]/[2]/[3]), it seems there was an issue with the host around the time SyncImageGroupData was initiated. Does the issue reproduce consistently? Or was it a specific scenario that hasn't been reproduced?

[1] 2016-04-12 11:06:09,304 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.SyncImageGroupDataVDSCommand] (org.ovirt.thread.pool-7-thread-32) [5f00c88] START, SyncImageGroupDataVDSCommand( storagePoolId = cbcfe73c-c50b-11e1-a347-00237d9cdcbd, ignoreFailoverLimit = false, storageDomainId = d1c3a3be-f87b-4b32-8753-ed8dc4d283d7, imageGroupId = 4cd6d305-edef-4125-b984-27929bb3c438, dstDomainId = b1170d78-c7ae-4eff-a36d-496995ace80b, syncType=INTERNAL), log id: 63003ab6

[2] 2016-04-12 11:06:09,719 INFO  [org.ovirt.engine.core.bll.CommandMultiAsyncTasks] (org.ovirt.thread.pool-7-thread-32) [5f00c88] [within thread]: Some of the tasks related to command id 07fc1fc0-20b2-42fe-9062-c568046f45a0 were not cleared yet (Task id 9d7302d8-aa2b-48fa-a4f7-7599c4e80751 is in state Polling).
2016-04-12 11:06:14,186 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (org.ovirt.thread.pool-7-thread-34) [214f9840] Command DestroyVDSCommand(HostName = ljrhev-h12.lojackhq.com.ar, HostId = 7d676242-fc8a-4245-a0ad-62238bdd7216, vmId=855343a9-62ab-4af0-8762-b15567ecf554, force=false, secondsToWait=30, gracefully=true, reason=) execution failed. Exception: VDSNetworkException: java.util.concurrent.TimeoutException

[3] 2016-04-12 12:30:42,794 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-2) [74e5c60c] Command GetCapabilitiesVDSCommand(HostName = ljrhev-h12.lojackhq.com.ar, HostId = 7d676242-fc8a-4245-a0ad-62238bdd7216, vds=Host[ljrhev-h12.lojackhq.com.ar,7d676242-fc8a-4245-a0ad-62238bdd7216]) execution failed. Exception: VDSRecoveringException: Recovering from crash or Initializing
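For pulling the relevant sections, a minimal sketch, assuming the default log locations (/var/log/ovirt-engine/engine.log on the engine, /var/log/vdsm/vdsm.log on the hosts); the correlation and task ids are the ones from the excerpts above:

# On the engine: pull the LSM flow by correlation id and command name
grep -E '5f00c88|SyncImageGroupDataVDSCommand|VmReplicateDisk' /var/log/ovirt-engine/engine.log
# On the SPM / VM host: follow the task that stayed in Polling
grep '9d7302d8-aa2b-48fa-a4f7-7599c4e80751' /var/log/vdsm/vdsm.log*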
This was a specific scenario that wasn't reproduced, as far as I know. I believe another customer hit this as well.

Frank
Couldn't find any tangible information in the logs to indicate the cause, nor have we been able to reproduce a similar scenario on the latest version. Since the issue was encountered in an early build of 3.5, it may already have been mitigated on the latest bits. As the live migration operation in the engine was transformed into the new CoCo infrastructure in 4.1, and the issue seems specific and not easily reproduced, moving the bug to MODIFIED for verification on 4.1.

@Tal - can you please move it to 4.1?
do we know how to reproduce?
(In reply to Aharon Canan from comment #18)
> do we know how to reproduce?

No, it wasn't consistently reproduced, as mentioned in:
https://bugzilla.redhat.com/show_bug.cgi?id=1318724#c16
We do not know how to reproduce/verify, therefore can't ACK this one.
Daniel - why is it in MODIFIED state? Which patch solves this?
(In reply to Yaniv Kaul from comment #21)
> Daniel - why is it in MODIFIED state? Which patch solves this?

Explained in https://bugzilla.redhat.com/show_bug.cgi?id=1318724#c17 - we weren't able to reproduce it on the latest version; it seems to be mitigated by the recent work of transforming the logic into the new CoCo infrastructure (https://gerrit.ovirt.org/#/c/52568/). If it can't be reproduced by QE, should I close it as WORKSFORME?
(In reply to Daniel Erez from comment #22)
> (In reply to Yaniv Kaul from comment #21)
> > Daniel - why is it in MODIFIED state? Which patch solves this?
>
> Explained in https://bugzilla.redhat.com/show_bug.cgi?id=1318724#c17 - we
> weren't able to reproduce it on the latest version; it seems to be mitigated
> by the recent work of transforming the logic into the new CoCo
> infrastructure (https://gerrit.ovirt.org/#/c/52568/). If it can't be
> reproduced by QE, should I close it as WORKSFORME?

Yes. You can ask QE to run their flows to make sure nothing is broken, but it certainly should not move to MODIFIED.
We don't have any reproducer. If QA's regression testing on live storage migration passes, we'll consider this issue solved.