Bug 1312741 - Engine continually shows Migrating Disk job while no running tasks are reported in connected vdsm hosts
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ovirt-3.6.5
Target Release: 3.6.5
Assignee: Daniel Erez
QA Contact: Gilad Lazarovich
URL:
Whiteboard: storage
Depends On:
Blocks: 1315960
 
Reported: 2016-02-29 07:24 UTC by Gilad Lazarovich
Modified: 2016-06-23 04:52 UTC
CC: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-04-21 14:39:49 UTC
oVirt Team: Storage
amureini: ovirt-3.6.z?
glazarov: planning_ack?
rule-engine: devel_ack+
rule-engine: testing_ack+


Attachments
engine and vdsm logs (12.04 MB, application/octet-stream)
2016-02-29 07:24 UTC, Gilad Lazarovich


Links:
oVirt gerrit 54730 (Last Updated: 2016-03-15 17:08:31 UTC)

Description Gilad Lazarovich 2016-02-29 07:24:07 UTC
Created attachment 1131432 [details]
engine and vdsm logs

Description of problem:
Migrating Disk job hangs in the engine after a number of migrations are performed; the VDSM hosts no longer show any running tasks.

Version-Release number of selected component (if applicable):


How reproducible:
50% (about every second full tier 2 Live migration run)

Steps to Reproduce:
1. Run Live migrations for disks using same and different domain types
2. Repeat migrations using all available permutations

Actual results:
Migrating Disk job does not show up as complete in the engine.

Expected results:
The job/task state should be in sync between the engine and the vdsm hosts

Additional info:
Here's the start of the disk migration job:
2016-02-29 03:10:41,822 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-32) [3c171353] Correlation ID: 3c171353, Job ID: 0e96ff1f-6a49-4626-b756-929c2e44f258, Call Stack: null, Custom Event ID: -1, Message: User admin@internal moving disk disk_TestCase5988_REST_ISCSI_2016-02-29_03-01-50_Disk_virtio_cow_sparse-True_alias to domain iscsi_1

Here's the failed remove on the disk, showing the migration still reported as ongoing more than five hours later:
2016-02-29 08:45:33,457 WARN  [org.ovirt.engine.core.bll.RemoveVmCommand] (org.ovirt.thread.pool-6-thread-17) [68343745] CanDoAction of action 'RemoveVm' failed for user admin@internal. Reasons: VAR__ACTION__REMOVE,VAR__TYPE__VM,ACTION_TYPE_FAILED_DISK_IS_BEING_MIGRATED,$DiskName disk_TestCase5988_REST_ISCSI_2016-02-29_03-01-50_Disk_virtio_cow_sparse-True_alias
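A quick way to confirm how long the job stayed open is to diff the timestamps of the engine.log lines that mention the stuck disk. A minimal sketch of such a helper (hypothetical, not part of oVirt or vdsm tooling):

```python
import re
from datetime import datetime

# engine.log lines start with "YYYY-MM-DD HH:MM:SS,mmm LEVEL ..."
LOG_RE = re.compile(r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+")

def parse_ts(line):
    """Return the timestamp of an engine.log line, or None if it has none."""
    m = LOG_RE.match(line)
    return datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S") if m else None

def job_open_seconds(lines, marker):
    """Seconds between the first and last log line containing `marker`
    (e.g. a Job ID or disk alias); None if fewer than two matches."""
    stamps = [t for t in (parse_ts(l) for l in lines if marker in l) if t]
    if len(stamps) < 2:
        return None
    return (max(stamps) - min(stamps)).total_seconds()
```

Running this over the two excerpts above with the disk alias as the marker gives roughly 5.6 hours between the "moving disk" message and the failed RemoveVm validation.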

Please find attached logs

Comment 1 Allon Mureinik 2016-02-29 12:10:48 UTC
Gilad, can you please specify the engine and VDSM [rpm] versions?

Comment 2 Gilad Lazarovich 2016-03-02 08:42:50 UTC
Allon, sure:
Engine: 3.6.3.2-0.1
VDSM: 4.17.23-0

Comment 3 Daniel Erez 2016-03-14 16:25:17 UTC
Hi Gilad,

After reproducing the described scenario:
* What is the status of the disks? I.e. are they locked?
* Did any of the disks fail to migrate?
* Does it reproduce only in scale tests?
* What was the magnitude of migration in the described flow?

Thanks!
Daniel

Comment 4 Allon Mureinik 2016-03-27 14:30:07 UTC
Daniel, https://gerrit.ovirt.org/#/c/54730/ and its backport https://gerrit.ovirt.org/54776 are both merged. Is there anything else we need for this BZ? If not, can it be moved to MODIFIED?

Comment 5 Daniel Erez 2016-03-28 08:38:25 UTC
(In reply to Allon Mureinik from comment #4)
> Daniel, https://gerrit.ovirt.org/#/c/54730/ and its backport
> https://gerrit.ovirt.org/54776 are both merged. Is there anything else we
> need for this BZ? If not, can it be moved to MODIFIED?

Yes, it should be rechecked on the latest build.

Comment 6 Eyal Edri 2016-03-31 08:36:09 UTC
Bugs were moved prematurely to ON_QA since they didn't have a target release.
Note that only bugs with a set target release will move to ON_QA.

Comment 7 Gilad Lazarovich 2016-04-10 12:55:05 UTC
Verified as fixed with 3.6.5: ran 5 full live storage migration runs; no stuck tasks were encountered.

