Bug 1312741

Summary: Engine continually shows Migrating Disk job while no running tasks are reported in connected vdsm hosts
Product: [oVirt] ovirt-engine
Reporter: Gilad Lazarovich <glazarov>
Component: BLL.Storage
Assignee: Daniel Erez <derez>
Status: CLOSED CURRENTRELEASE
QA Contact: Gilad Lazarovich <glazarov>
Severity: high
Docs Contact:
Priority: medium
Version: 3.6.0
CC: acanan, amureini, bugs, derez, glazarov
Target Milestone: ovirt-3.6.5
Keywords: Automation, AutomationBlocker
Target Release: 3.6.5
Flags: amureini: ovirt-3.6.z?
       glazarov: planning_ack?
       rule-engine: devel_ack+
       rule-engine: testing_ack+
Hardware: Unspecified
OS: Unspecified
Whiteboard: storage
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-04-21 14:39:49 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On:
Bug Blocks: 1315960
Attachments: engine and vdsm logs (no flags)

Description Gilad Lazarovich 2016-02-29 07:24:07 UTC
Created attachment 1131432 [details]
engine and vdsm logs

Description of problem:
The Migrating Disk job hangs in the engine after a number of live disk migrations are performed, while the VDSM hosts no longer report any running tasks.

Version-Release number of selected component (if applicable):


How reproducible:
50% (about every second full tier 2 Live migration run)

Steps to Reproduce:
1. Run live migrations for disks using the same and different storage domain types
2. Repeat migrations using all available permutations

Actual results:
The Migrating Disk job never shows up as complete in the engine.

Expected results:
The job/task state should be in sync between the engine and the VDSM hosts.
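The expected invariant above can be expressed as a small cross-check: any engine job whose VDSM tasks no longer appear on any host is a candidate for the stuck state described in this bug. A minimal, hypothetical sketch (the data shapes below are assumptions for illustration, not the actual engine or VDSM APIs; the real engine tracks this in its async-task/job tables, and hosts report running tasks via VDSM's getAllTasksStatuses verb):

```python
def find_stale_jobs(engine_jobs, host_tasks):
    """Return engine job IDs with no matching task on any host.

    engine_jobs: dict mapping job ID -> set of VDSM task IDs the
                 engine believes are still running for that job.
    host_tasks:  dict mapping host name -> set of task IDs the host
                 actually reports as running.
    Both shapes are hypothetical and only illustrate the cross-check.
    """
    running = set()
    for tasks in host_tasks.values():
        running |= tasks
    # A job is stale when none of its tasks are alive on any host.
    return {job for job, tasks in engine_jobs.items()
            if tasks and not (tasks & running)}

# Example: the engine still tracks a migration task no host reports.
jobs = {"0e96ff1f": {"task-a"}, "live-job": {"task-b"}}
hosts = {"host1": {"task-b"}, "host2": set()}
print(find_stale_jobs(jobs, hosts))  # {'0e96ff1f'}
```

In the scenario reported here, the stuck "Migrating Disk" job would show up in the stale set: the engine keeps it open while no connected host reports a corresponding running task.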

Additional info:
Here's the start of the disk migration job:
2016-02-29 03:10:41,822 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-32) [3c171353] Correlation ID: 3c171353, Job ID: 0e96ff1f-6a49-4626-b756-929c2e44f258, Call Stack: null, Custom Event ID: -1, Message: User admin@internal moving disk disk_TestCase5988_REST_ISCSI_2016-02-29_03-01-50_Disk_virtio_cow_sparse-True_alias to domain iscsi_1

Here's the failed VM removal, showing that the engine still considers the disk migration ongoing more than 6.5 hours later:
2016-02-29 08:45:33,457 WARN  [org.ovirt.engine.core.bll.RemoveVmCommand] (org.ovirt.thread.pool-6-thread-17) [68343745] CanDoAction of action 'RemoveVm' failed for user admin@internal. Reasons: VAR__ACTION__REMOVE,VAR__TYPE__VM,ACTION_TYPE_FAILED_DISK_IS_BEING_MIGRATED,$DiskName disk_TestCase5988_REST_ISCSI_2016-02-29_03-01-50_Disk_virtio_cow_sparse-True_alias

Please find the attached logs.
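For log triage, the stuck job can be tied back to its start via the IDs in the audit line quoted above. A small parsing sketch; the regexes are written against this exact quoted line, as an illustration rather than a general engine-log parser:

```python
import re

# The engine audit line quoted above.
line = ("2016-02-29 03:10:41,822 INFO  "
        "[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] "
        "(org.ovirt.thread.pool-6-thread-32) [3c171353] Correlation ID: 3c171353, "
        "Job ID: 0e96ff1f-6a49-4626-b756-929c2e44f258, Call Stack: null, "
        "Custom Event ID: -1, Message: User admin@internal moving disk "
        "disk_TestCase5988_REST_ISCSI_2016-02-29_03-01-50_Disk_virtio_cow_sparse-True_alias "
        "to domain iscsi_1")

# Extract the job UUID and the disk alias being migrated.
job_id = re.search(r"Job ID: ([0-9a-f-]{36})", line).group(1)
disk = re.search(r"moving disk (\S+) to domain", line).group(1)
print(job_id)  # 0e96ff1f-6a49-4626-b756-929c2e44f258
```

Grepping for that Job ID across the engine log then shows whether the job ever reached a completion event, which is exactly what fails to appear in this bug.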

Comment 1 Allon Mureinik 2016-02-29 12:10:48 UTC
Gilad, can you please specify the engine and VDSM [rpm] versions?

Comment 2 Gilad Lazarovich 2016-03-02 08:42:50 UTC
Allon, sure:
Engine: 3.6.3.2-0.1
VDSM: 4.17.23-0

Comment 3 Daniel Erez 2016-03-14 16:25:17 UTC
Hi Gilad,

After reproducing the described scenario:
* What is the status of the disks? I.e. are they locked?
* Did any of the disks fail to migrate?
* Does it reproduce only in scale tests?
* How many migrations were performed in the described flow?

Thanks!
Daniel

Comment 4 Allon Mureinik 2016-03-27 14:30:07 UTC
Daniel, https://gerrit.ovirt.org/#/c/54730/ and its backport https://gerrit.ovirt.org/54776 are both merged. Is there anything else we need for this BZ? If not, can it be moved to MODIFIED?

Comment 5 Daniel Erez 2016-03-28 08:38:25 UTC
(In reply to Allon Mureinik from comment #4)
> Daniel, https://gerrit.ovirt.org/#/c/54730/ and its backport
> https://gerrit.ovirt.org/54776 are both merged. Is there anything else we
> need for this BZ? If not, can it be moved to MODIFIED?

Yes, it should be rechecked on the latest build.

Comment 6 Eyal Edri 2016-03-31 08:36:09 UTC
Bugs were moved prematurely to ON_QA since they didn't have a target release.
Note that only bugs with a target release set will move to ON_QA.

Comment 7 Gilad Lazarovich 2016-04-10 12:55:05 UTC
Verified as fixed in 3.6.5: ran 5 full live storage migration runs with no stuck tasks encountered.