Bug 889154

Summary: [ovirt-engine] host stuck in 'unassigned' forever in case activate is performed during 'preparing for maintenance' state (deadlock!)
Product: Red Hat Enterprise Virtualization Manager
Reporter: Haim <hateya>
Component: ovirt-engine
Assignee: Roy Golan <rgolan>
Status: CLOSED DUPLICATE
QA Contact: Pavel Stehlik <pstehlik>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.1.0
CC: acathrow, bazulay, iheim, jkt, lpeer, michal.skrivanek, pstehlik, Rhev-m-bugs, yeylon, yzaslavs
Target Milestone: ---
Target Release: 3.2.3
Hardware: x86_64
OS: Linux
Whiteboard: virt
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-08-12 13:39:32 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  Description: engine, server, console logs.
  Flags: none

Description Haim 2012-12-20 10:59:11 UTC
Description of problem:

The bug was found on the QE production setup.

Case:

- host is running several VMs
- put the host into maintenance
- host moves to 'preparing for maintenance'
- VMs start to migrate
- after 10 minutes, with some VMs having failed to migrate and stuck in 'migration from',
  I hit Activate
- host moves to 'unassigned' and stays that way for an additional 2 hours, until I restart the oVirt engine service

No command was sent to vdsm during that time.

I tried sending kill -3 to the Java process to get a thread dump, but I didn't see any deadlock reported there (attached).
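
For reference, the same deadlock check that a kill -3 dump performs can also be run from inside a JVM through the standard java.lang.management API. The sketch below is only an illustration: the class name DeadlockCheck is made up, and it inspects the JVM it runs in, so checking the engine this way would need the code to run inside the engine process or over a JMX connection. Note that this detection only reports cycles of threads blocked on monitors or ownable synchronizers; a lock that was simply never released would not be reported as a deadlock, which may be why the dump shows nothing.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of ovirt-engine.
class DeadlockCheck {
    /** Returns a description of deadlocked threads, or an empty string if none are detected. */
    static String report() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Returns null when no deadlock cycle is found in the current JVM.
        long[] ids = mx.findDeadlockedThreads();
        if (ids == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        // Include locked monitors and synchronizers in the per-thread details.
        for (ThreadInfo info : mx.getThreadInfo(ids, true, true)) {
            if (info != null) {
                sb.append(info);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String r = report();
        System.out.println(r.isEmpty() ? "no deadlock detected" : r);
    }
}
```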

Comment 1 Haim 2012-12-20 11:10:18 UTC
I have a theory: I guess it happens when some VMs go into the paused state during the migrate VM command.

2012-12-19 14:33:48,694 ERROR [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (pool-3-thread-8) [2671efd9] ResourceManager::vdsMaintenance - Failed migrating desktop indigo-vdc
2012-12-19 14:33:48,704 ERROR [org.ovirt.engine.core.engineencryptutils.EncryptionUtils] (QuartzScheduler_Worker-65) Failed to decrypt Data must start with zero
2012-12-19 14:33:48,733 WARN  [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (pool-3-thread-8) [777cfdd] CanDoAction of action InternalMigrateVm failed. Reasons:MIGRATE_PAUSED_VM_IS_UNSUPPORTED,VAR__ACTION__MIGRATE,VAR__TYPE__VM
2012-12-19 14:33:48,733 ERROR [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (pool-3-thread-8) [777cfdd] ResourceManager::vdsMaintenance - Failed migrating desktop Lime-VDC
2012-12-19 14:33:48,759 WARN  [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (pool-3-thread-8) [6251a4ab] CanDoAction of action InternalMigrateVm failed. Reasons:MIGRATE_PAUSED_VM_IS_UNSUPPORTED,VAR__ACTION__MIGRATE,VAR__TYPE__VM
2012-12-19 14:33:48,759 ERROR [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (pool-3-thread-8) [6251a4ab] ResourceManager::vdsMaintenance - Failed migrating desktop Selenium
2012-12-19 14:33:48,775 WARN  [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (pool-3-thread-8) [2d134059] CanDoAction of action InternalMigrateVm failed. Reasons:MIGRATE_PAUSED_VM_IS_UNSUPPORTED,VAR__ACTION__MIGRATE,VAR__TYPE__VM
2012-12-19 14:33:48,775 ERROR [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (pool-3-thread-8) [2d134059] ResourceManager::vdsMaintenance - Failed migrating desktop gena-31
2012-12-19 14:33:48,792 WARN  [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (pool-3-thread-8) [698bbdd6] CanDoAction of action InternalMigrateVm failed. Reasons:MIGRATE_PAUSED_VM_IS_UNSUPPORTED,VAR__ACTION__MIGRATE,VAR__TYPE__VM
2012-12-19 14:33:48,792 ERROR [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (pool-3-thread-8) [698bbdd6] ResourceManager::vdsMaintenance - Failed migrating desktop paikov-rhevm-gluster3
2012-12-19 14:33:48,808 WARN  [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (pool-3-thread-8) [38e9afaf] CanDoAction of action InternalMigrateVm failed. Reasons:MIGRATE_PAUSED_VM_IS_UNSUPPORTED,VAR__ACTION__MIGRATE,VAR__TYPE__VM
2012-12-19 14:33:48,808 ERROR [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (pool-3-thread-8) [38e9afaf] ResourceManager::vdsMaintenance - Failed migrating desktop ART-mlnx-setup
2012-12-19 14:33:48,826 WARN  [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (pool-3-thread-8) [2dfb0161] CanDoAction of action InternalMigrateVm failed. Reasons:MIGRATE_PAUSED_VM_IS_UNSUPPORTED,VAR__ACTION__MIGRATE,VAR__TYPE__VM
2012-12-19 14:33:48,827 ERROR [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (pool-3-thread-8) [2dfb0161] ResourceManager::vdsMaintenance - Failed migrating desktop wheat-vdc
2012-12-19 14:33:48,850 WARN  [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (pool-3-thread-8) [7b1ff695] CanDoAction of action InternalMigrateVm failed. Reasons:MIGRATE_PAUSED_VM_IS_UNSUPPORTED,VAR__ACTION__MIGRATE,VAR__TYPE__VM
2012-12-19 14:33:48,850 ERROR [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (pool-3-thread-8) [7b1ff695] ResourceManager::vdsMaintenance - Failed migrating desktop monique-vdc0
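
To list the affected VMs from a full engine.log, something like the following could be used. It is a rough sketch, assuming each log entry sits on a single line as it does in the log file itself, and matching only the "Failed migrating desktop" message text seen in this excerpt. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ovirt-engine.
class FailedMigrations {
    // The VM name is taken to be the non-whitespace token after "desktop ".
    private static final Pattern FAILED =
            Pattern.compile("vdsMaintenance - Failed migrating desktop (\\S+)");

    /** Returns the VM names from every line that reports a failed migration. */
    static List<String> vmNames(List<String> logLines) {
        List<String> names = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = FAILED.matcher(line);
            if (m.find()) {
                names.add(m.group(1));
            }
        }
        return names;
    }
}
```

Lines that do not contain the message, such as the EncryptionUtils error, are skipped.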

Comment 2 Haim 2012-12-20 11:17:26 UTC
Created attachment 666612 [details]
engine, server, console logs.

Comment 8 Eli Mesika 2013-03-11 14:38:00 UTC
Please add an Expected results section to the bug.
Should we block the activate in the migration phase?
The problem is clear, but the BZ description does not specify how it should work.

Comment 9 Barak 2013-03-21 17:56:13 UTC
Simon, please answer questions in comment #8

Comment 10 Simon Grinberg 2013-03-21 18:17:20 UTC
(In reply to comment #8)
> Should we block the activate in the migration phase?
> The problem is clear, but the BZ description does not specify how it should work.

Well, the proper solution is to allow cancel = 'Stop all current migrations and set the host back to up'. But since this is not trivial until we have a good infra for task management (I guess), the easy solution for 3.2 will be to block the activate button until maintenance either fails or ends successfully.
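
The easy solution could look roughly like the check below. This is only a sketch of the idea, not the engine's actual code: the enum, its values, and the method name are hypothetical, and it covers only the one state discussed here, leaving the rules for other states out.

```java
// Hypothetical names, for illustration only.
class ActivateGuard {
    // Assumed subset of host states, named after the statuses mentioned in this bug.
    enum HostStatus { UP, PREPARING_FOR_MAINTENANCE, MAINTENANCE, UNASSIGNED }

    /** Activate is refused while the host is still preparing for maintenance. */
    static boolean activateBlocked(HostStatus status) {
        return status == HostStatus.PREPARING_FOR_MAINTENANCE;
    }
}
```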

Haim, please open an RFE to allow cancelling 'preparing for maintenance' by hitting the activate button.

Thanks, 
Simon

Comment 12 Haim 2013-03-24 13:27:51 UTC
Removing need-info; I opened an RFE, but I still expect a fix here.

Comment 20 Michal Skrivanek 2013-07-09 12:14:05 UTC
After the scrub mtg, removing Regression.

Comment 22 Barak 2013-07-22 06:40:24 UTC
There are 2 different issues in this bug:
1. Activating a host that is in "preparing for maintenance" and has active
   migrations going on leaves it stuck in unassigned.
2. VMs get stuck in "Migrating from" status although their migration had
   probably failed, as they had moved to PAUSED (the only reason I can think of is EIO).

Comment 23 Michal Skrivanek 2013-07-23 12:31:43 UTC
I'd be surprised if issue 2 is still valid.

Comment 24 Andrew Cathrow 2013-07-23 12:32:18 UTC
Still in progress; for now moving to 3.2.3.
Will review at the next mtg (30th July).