Bug 1318724
| Summary: | VmReplicateDiskFinishVDSCommand is not executed when Live Storage Migration is initiated, leaving an unfinished job | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Frank DeLorey <fdelorey> |
| Component: | ovirt-engine | Assignee: | Daniel Erez <derez> |
| Status: | CLOSED WORKSFORME | QA Contact: | Aharon Canan <acanan> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.5.1 | CC: | acanan, amureini, derez, ebenahar, fdelorey, gwatson, jcoscia, lsurette, mgoldboi, rbalakri, Rhev-m-bugs, tnisan, yeylon, ykaul |
| Target Milestone: | ovirt-4.0.0-rc | Keywords: | Reopened |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-06-13 11:24:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Frank DeLorey
2016-03-17 15:29:45 UTC
I have all the logs from RHEVM, the host running the VM, and the SPM; however, they are too large to attach to the case.

Hi Frank,

Cleaning up after a failure in live storage migration is currently not applicable in every scenario - see https://bugzilla.redhat.com/show_bug.cgi?id=1070863#c5. To ensure it's indeed a similar issue, can you please try to reproduce it and attach the relevant sections from the engine and vdsm logs?

Thanks!

(In reply to Daniel Erez from comment #2)
> Hi Frank,
>
> Cleaning up after a failure in live storage migration is currently not
> applicable in every scenario - see
> https://bugzilla.redhat.com/show_bug.cgi?id=1070863#c5.
> In order to ensure it's indeed a similar issue can you please try to
> reproduce it and attach the relevant sections from engine and vdsm logs.
>
> Thanks!

Closing under the assumption this is the same scenario, since the needinfo went unanswered. Feel free to reopen when you have those logs.

Hi Javier,

To understand the status of the migrated disks, can you please attach the output of 'vdsClient -s 0 getAllTasksInfo' from the SPM? According to the engine logs ([1]/[2]/[3]), there seems to have been an issue with the host around the initiation of SyncImageGroupData. Is the issue reproduced consistently, or is it a specific scenario that hasn't been reproduced?
[1] 2016-04-12 11:06:09,304 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SyncImageGroupDataVDSCommand] (org.ovirt.thread.pool-7-thread-32) [5f00c88] START, SyncImageGroupDataVDSCommand( storagePoolId = cbcfe73c-c50b-11e1-a347-00237d9cdcbd, ignoreFailoverLimit = false, storageDomainId = d1c3a3be-f87b-4b32-8753-ed8dc4d283d7, imageGroupId = 4cd6d305-edef-4125-b984-27929bb3c438, dstDomainId = b1170d78-c7ae-4eff-a36d-496995ace80b, syncType=INTERNAL), log id: 63003ab6

[2] 2016-04-12 11:06:09,719 INFO [org.ovirt.engine.core.bll.CommandMultiAsyncTasks] (org.ovirt.thread.pool-7-thread-32) [5f00c88] [within thread]: Some of the tasks related to command id 07fc1fc0-20b2-42fe-9062-c568046f45a0 were not cleared yet (Task id 9d7302d8-aa2b-48fa-a4f7-7599c4e80751 is in state Polling).
2016-04-12 11:06:14,186 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (org.ovirt.thread.pool-7-thread-34) [214f9840] Command DestroyVDSCommand(HostName = ljrhev-h12.lojackhq.com.ar, HostId = 7d676242-fc8a-4245-a0ad-62238bdd7216, vmId=855343a9-62ab-4af0-8762-b15567ecf554, force=false, secondsToWait=30, gracefully=true, reason=) execution failed. Exception: VDSNetworkException: java.util.concurrent.TimeoutException

[3] 2016-04-12 12:30:42,794 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-2) [74e5c60c] Command GetCapabilitiesVDSCommand(HostName = ljrhev-h12.lojackhq.com.ar, HostId = 7d676242-fc8a-4245-a0ad-62238bdd7216, vds=Host[ljrhev-h12.lojackhq.com.ar,7d676242-fc8a-4245-a0ad-62238bdd7216]) execution failed. Exception: VDSRecoveringException: Recovering from crash or Initializing

This was a specific scenario that wasn't reproduced, as far as I know. I do believe that another customer hit this as well.

Frank

Couldn't find any tangible information in the logs to indicate the cause, nor have we been able to reproduce a similar scenario on the latest version.
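To spot a task left behind after a failed live storage migration, the `getAllTasksInfo` output requested above can be scanned for tasks that never reached a finished state. The following is a minimal sketch, not part of the original thread: it assumes an output shape of one task UUID per header line followed by indented `key = value` lines (`verb`, `state`, etc.), which may differ between vdsm versions.

```python
# Minimal sketch: flag unfinished SPM tasks in `vdsClient -s 0 getAllTasksInfo`
# output. The assumed format (UUID header line, then indented "key = value"
# lines) is illustrative and may vary between vdsm versions.

def find_unfinished_tasks(output):
    """Return {task_id: state} for tasks whose state is not 'finished'."""
    tasks = {}
    current = None
    for line in output.splitlines():
        if line and not line[0].isspace():
            # Header line: the task UUID (possibly followed by a colon).
            current = line.rstrip(':').strip()
        elif current and '=' in line:
            key, _, value = line.partition('=')
            if key.strip() == 'state':
                tasks[current] = value.strip()
    return {tid: st for tid, st in tasks.items() if st.lower() != 'finished'}

# Hypothetical sample mirroring the task id from log excerpt [2]:
sample = """\
9d7302d8-aa2b-48fa-a4f7-7599c4e80751:
    verb = syncImageData
    state = running
07fc1fc0-20b2-42fe-9062-c568046f45a0:
    verb = copyImage
    state = finished
"""

print(find_unfinished_tasks(sample))
# → {'9d7302d8-aa2b-48fa-a4f7-7599c4e80751': 'running'}
```

Any task reported here in a non-finished state long after the migration was aborted would match the "not cleared yet" symptom in excerpt [2].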
Since the issue was encountered in an early build of 3.5, it may already have been mitigated in the latest bits. Since the live storage migration operation in the engine was moved onto the new CoCo infrastructure in 4.1, and the issue seems to be specific and not easily reproduced, moving the bug to MODIFIED for verification on 4.1. @Tal - can you please move it to 4.1?

do we know how to reproduce?

(In reply to Aharon Canan from comment #18)
> do we know how to reproduce?

No, it wasn't consistently reproduced, as mentioned in:
https://bugzilla.redhat.com/show_bug.cgi?id=1318724#c16

We do not know how to reproduce/verify, therefore we can't ACK this one.

Daniel - why is it in MODIFIED state? Which patch solves this?

(In reply to Yaniv Kaul from comment #21)
> Daniel - why is it in MODIFIED state? Which patch solves this?

Explained in https://bugzilla.redhat.com/show_bug.cgi?id=1318724#c17 - we weren't able to reproduce it on the latest version; it seems to be mitigated by the recent work of moving the logic onto the new CoCo infrastructure (https://gerrit.ovirt.org/#/c/52568/). If it can't be reproduced by QE, should I close it as WORKSFORME?

(In reply to Daniel Erez from comment #22)
> (In reply to Yaniv Kaul from comment #21)
> > Daniel - why is it in MODIFIED state? Which patch solves this?
>
> Explained in https://bugzilla.redhat.com/show_bug.cgi?id=1318724#c17
> We weren't able to reproduce it on latest version, seems to be mitigated by
> recent work of transforming the logic into the new CoCo infrastructure
> (https://gerrit.ovirt.org/#/c/52568/). If can't reproduced by the QE, should
> I close it on WORKSFORME?

Yes. You can ask QE to run their flows to make sure nothing is broken, but it certainly should not move to MODIFIED.

We don't have any reproducer. If QA's regression testing on live merge passes, we'll consider this issue solved.
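The "unfinished job" symptom in the summary amounts to a job record that stays in a started state long after the underlying tasks died. The sketch below models that check over in-memory records; the field names (`job_id`, `status`, `start_time`) loosely mirror the engine's job bookkeeping but are illustrative assumptions, not the actual engine schema.

```python
# Hypothetical sketch of the "unfinished job" symptom: find jobs that are
# still STARTED long after they began. Field names are illustrative and do
# not claim to match the real ovirt-engine database schema.
from datetime import datetime, timedelta

def stale_jobs(jobs, now, max_age=timedelta(hours=2)):
    """Return job ids that are still STARTED and older than max_age."""
    return [j['job_id'] for j in jobs
            if j['status'] == 'STARTED' and now - j['start_time'] > max_age]

now = datetime(2016, 4, 12, 15, 0)
jobs = [
    # A migration job started around the time of log excerpt [1], never ended:
    {'job_id': 'lsm-1', 'status': 'STARTED',
     'start_time': datetime(2016, 4, 12, 11, 6)},
    # A recent, properly finished job for contrast:
    {'job_id': 'snap-2', 'status': 'FINISHED',
     'start_time': datetime(2016, 4, 12, 14, 0)},
]
print(stale_jobs(jobs, now))  # → ['lsm-1']
```

A job flagged this way is exactly what the bug describes: the migration flow stopped before its finishing step (VmReplicateDiskFinishVDSCommand) ever ran, so nothing moved the job out of its started state.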