Description of problem:
When maintenance mode is requested on a host with running VMs, there is a 10-minute delay before live migration of the running VMs out of the host starts.

Version-Release number of selected component (if applicable):
vdsm-4.13.2-0.13.el6ev

How reproducible:
Always, at every maintenance mode request.

Steps to Reproduce:
1. Request maintenance mode on a host with running VMs.

Actual results:
VMs start live migrating out of the host after a 10-minute delay.

Expected results:
VMs start live migrating out of the host immediately.

Additional info:
This regression could have been introduced by the fix for bug 966503 / bug 1030367.

Starting scenario: two-host cluster
  rhevh3-375  2c88b5ff-e3d9-4e6b-b2a1-4c2a2cb8ff6a
  rhevh4-375  92b13c1a-c175-4e3e-b468-163f9863c854 (current SPM host)
with two VMs running on rhevh4-375 and none running on rhevh3-375:
  rhel5a  6c31bb12-8e98-454d-92ab-2d61ed0dd2d8
  rhel5b  1088854b-4156-490b-840a-8a9f2773cd12

- Maintenance mode on rhevh4-375 requested on 2014-05-12 at 18:10:

engine.log

2014-05-12 18:10:05,897 INFO [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (ajp-/127.0.0.1:8702-6) [331b8c11] Lock Acquired to object EngineLock [exclusiveLocks= , sharedLocks= key: 076c1118-82b3-40c3-b0c6-cf9d6c7aaf04 value: POOL]
2014-05-12 18:10:05,933 INFO [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (pool-4-thread-50) [331b8c11] Running command: MaintenanceNumberOfVdssCommand internal: false. Entities affected : ID: 92b13c1a-c175-4e3e-b468-163f9863c854 Type: VDS
2014-05-12 18:10:05,934 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (pool-4-thread-50) [331b8c11] START, SetVdsStatusVDSCommand(HostName = rhevh4-375, HostId = 92b13c1a-c175-4e3e-b468-163f9863c854, status=PreparingForMaintenance, nonOperationalReason=NONE, stopSpmFailureLogged=true), log id: 739dce3f

- SPM role is taken out of rhevh4-375:

2014-05-12 18:10:05,934 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (pool-4-thread-50) [331b8c11] VDS rhevh4-375 is spm and moved from up calling resetIrs.
2014-05-12 18:10:05,934 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.ResetIrsVDSCommand] (pool-4-thread-50) [331b8c11] START, ResetIrsVDSCommand( storagePoolId = 076c1118-82b3-40c3-b0c6-cf9d6c7aaf04, ignoreFailoverLimit = false, vdsId = 92b13c1a-c175-4e3e-b468-163f9863c854, ignoreStopFailed = false), log id: 2b20d3ed
2014-05-12 18:10:05,936 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (pool-4-thread-50) [331b8c11] START, SpmStopVDSCommand(HostName = rhevh4-375, HostId = 92b13c1a-c175-4e3e-b468-163f9863c854, storagePoolId = 076c1118-82b3-40c3-b0c6-cf9d6c7aaf04), log id: 39aac16e
2014-05-12 18:10:05,937 DEBUG [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (pool-4-thread-50) [331b8c11] START, HSMGetAllTasksStatusesVDSCommand(HostName = rhevh4-375, HostId = 92b13c1a-c175-4e3e-b468-163f9863c854), log id: 79813a55
2014-05-12 18:10:05,961 DEBUG [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (pool-4-thread-50) [331b8c11] FINISH, HSMGetAllTasksStatusesVDSCommand, return: {}, log id: 79813a55
2014-05-12 18:10:05,961 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (pool-4-thread-50) [331b8c11] SpmStopVDSCommand::Stopping SPM on vds rhevh4-375, pool id 076c1118-82b3-40c3-b0c6-cf9d6c7aaf04
2014-05-12 18:10:07,689 DEBUG [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-71) START, SpmStatusVDSCommand(HostName = rhevh4-375, HostId = 92b13c1a-c175-4e3e-b468-163f9863c854, storagePoolId = 076c1118-82b3-40c3-b0c6-cf9d6c7aaf04), log id: 17f27d83

- Live migration of rhel5b fails:

2014-05-12 18:10:07,875 INFO [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (pool-4-thread-50) [1c0a2eec] Lock Acquired to object EngineLock [exclusiveLocks= key: 1088854b-4156-490b-840a-8a9f2773cd12 value: VM, sharedLocks= ]
2014-05-12 18:10:07,929 INFO [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (DefaultQuartzScheduler_Worker-78) [49d83eb3] Failed to Acquire Lock to object EngineLock [exclusiveLocks= key: 1088854b-4156-490b-840a-8a9f2773cd12 value: VM, sharedLocks= ]
2014-05-12 18:10:07,930 WARN [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (DefaultQuartzScheduler_Worker-78) [49d83eb3] CanDoAction of action InternalMigrateVm failed. Reasons:VAR__ACTION__MIGRATE,VAR__TYPE__VM,ACTION_TYPE_FAILED_VM_IS_BEING_MIGRATED,$VmName rhel5b
2014-05-12 18:10:07,962 INFO [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (pool-4-thread-50) [1c0a2eec] Running command: InternalMigrateVmCommand internal: true. Entities affected : ID: 1088854b-4156-490b-840a-8a9f2773cd12 Type: VM
2014-05-12 18:10:07,971 ERROR [org.ovirt.engine.core.bll.job.ExecutionHandler] (DefaultQuartzScheduler_Worker-78) [49d83eb3] java.lang.NullPointerException
2014-05-12 18:10:07,971 ERROR [org.ovirt.engine.core.bll.MaintenanceVdsCommand] (DefaultQuartzScheduler_Worker-78) [49d83eb3] ResourceManager::vdsMaintenance - Failed migrating desktop rhel5b

- Live migration of rhel5a is requested:

2014-05-12 18:10:07,988 INFO [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (DefaultQuartzScheduler_Worker-78) [3a41deac] Lock Acquired to object EngineLock [exclusiveLocks= key: 6c31bb12-8e98-454d-92ab-2d61ed0dd2d8 value: VM , sharedLocks= ]
2014-05-12 18:10:07,996 INFO [org.ovirt.engine.core.vdsbroker.MigrateVDSCommand] (pool-4-thread-50) [1c0a2eec] START, MigrateVDSCommand(HostName = rhevh4-375, HostId = 92b13c1a-c175-4e3e-b468-163f9863c854, vmId=1088854b-4156-490b-840a-8a9f2773cd12, srcHost=rhevh4-375.usersys.redhat.com, dstVdsId=2c88b5ff-e3d9-4e6b-b2a1-4c2a2cb8ff6a, dstHost=rhevh3-375.usersys.redhat.com:54321, migrationMethod=ONLINE, tunnelMigration=false), log id: 46dfd02e
2014-05-12 18:10:08,037 INFO [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (DefaultQuartzScheduler_Worker-78) [3a41deac] Running command: InternalMigrateVmCommand internal: true. Entities affected : ID: 6c31bb12-8e98-454d-92ab-2d61ed0dd2d8 Type: VM

- SPM starts on rhevh3-375:

2014-05-12 18:10:08,055 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-71) [7d5dda46] START, SpmStartVDSCommand(HostName = rhevh3-375, HostId = 2c88b5ff-e3d9-4e6b-b2a1-4c2a2cb8ff6a, storagePoolId = 076c1118-82b3-40c3-b0c6-cf9d6c7aaf04, prevId=2, prevLVER=9, storagePoolFormatType=V3, recoveryMode=Manual, SCSIFencing=true), log id: e465ab2
2014-05-12 18:10:08,088 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-71) [7d5dda46] spmStart polling started: taskId = d2e7ed36-86d2-43d7-aa1c-85d74ed38375
2014-05-12 18:10:09,113 DEBUG [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-71) [7d5dda46] spmStart polling - task status: running
2014-05-12 18:10:10,399 DEBUG [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-71) [7d5dda46] spmStart polling - task status: running
2014-05-12 18:10:11,458 DEBUG [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-71) [7d5dda46] spmStart polling - task status: running
2014-05-12 18:10:12,491 DEBUG [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-71) [7d5dda46] spmStart polling - task status: finished
2014-05-12 18:10:12,492 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-71) [7d5dda46] spmStart polling ended: taskId = d2e7ed36-86d2-43d7-aa1c-85d74ed38375 task status = finished
2014-05-12 18:10:12,493 DEBUG [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-71) [7d5dda46] START, SpmStatusVDSCommand(HostName = rhevh3-375, HostId = 2c88b5ff-e3d9-4e6b-b2a1-4c2a2cb8ff6a, storagePoolId = 076c1118-82b3-40c3-b0c6-cf9d6c7aaf04), log id: 52df5d3d
2014-05-12 18:10:12,522 DEBUG [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-71) [7d5dda46] FINISH, SpmStatusVDSCommand, return: org.ovirt.engine.core.common.businessentities.SpmStatusResult@752795fb, log id: 52df5d3d
2014-05-12 18:10:12,522 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-71) [7d5dda46] spmStart polling ended, spm status: SPM
2014-05-12 18:10:12,523 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (DefaultQuartzScheduler_Worker-71) [7d5dda46] START, HSMClearTaskVDSCommand(HostName = rhevh3-375, HostId = 2c88b5ff-e3d9-4e6b-b2a1-4c2a2cb8ff6a, taskId=d2e7ed36-86d2-43d7-aa1c-85d74ed38375), log id: 79e7cdd3

- 10 minutes later:

2014-05-12 18:20:07,971 ERROR [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (Transaction Reaper Worker 0) Transaction rolled-back for command: org.ovirt.engine.core.bll.InternalMigrateVmCommand.
2014-05-12 18:20:07,977 INFO [org.ovirt.engine.core.vdsbroker.MigrateVDSCommand] (DefaultQuartzScheduler_Worker-78) [3a41deac] START, MigrateVDSCommand(HostName = rhevh4-375, HostId = 92b13c1a-c175-4e3e-b468-163f9863c854, vmId=6c31bb12-8e98-454d-92ab-2d61ed0dd2d8, srcHost=rhevh4-375.usersys.redhat.com, dstVdsId=2c88b5ff-e3d9-4e6b-b2a1-4c2a2cb8ff6a, dstHost=rhevh3-375.usersys.redhat.com:54321, migrationMethod=ONLINE, tunnelMigration=false), log id: 6a203ced
2014-05-12 18:20:07,980 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateBrokerVDSCommand] (DefaultQuartzScheduler_Worker-78) [3a41deac] VdsBroker::migrate::Entered (vm_guid=6c31bb12-8e98-454d-92ab-2d61ed0dd2d8, srcHost=rhevh4-375.usersys.redhat.com, dstHost=rhevh3-375.usersys.redhat.com:54321, method=online
2014-05-12 18:20:07,980 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateBrokerVDSCommand] (DefaultQuartzScheduler_Worker-78) [3a41deac] START, MigrateBrokerVDSCommand(HostName = rhevh4-375, HostId = 92b13c1a-c175-4e3e-b468-163f9863c854, vmId=6c31bb12-8e98-454d-92ab-2d61ed0dd2d8, srcHost=rhevh4-375.usersys.redhat.com, dstVdsId=2c88b5ff-e3d9-4e6b-b2a1-4c2a2cb8ff6a, dstHost=rhevh3-375.usersys.redhat.com:54321, migrationMethod=ONLINE, tunnelMigration=false), log id: 35c0eed9
2014-05-12 18:20:08,041 ERROR [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (Transaction Reaper Worker 0) Transaction rolled-back for command: org.ovirt.engine.core.bll.InternalMigrateVmCommand.
2014-05-12 18:20:08,201 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateBrokerVDSCommand] (DefaultQuartzScheduler_Worker-78) [3a41deac] FINISH, MigrateBrokerVDSCommand, log id: 35c0eed9
2014-05-12 18:20:08,203 ERROR [org.ovirt.engine.core.vdsbroker.MigrateVDSCommand] (DefaultQuartzScheduler_Worker-78) [3a41deac] Command MigrateVDS execution failed. Exception: CannotGetJdbcConnectionException: Could not get JDBC Connection; nested exception is java.sql.SQLException: javax.resource.ResourceException: IJ000460: Error checking for a transaction
2014-05-12 18:20:08,203 INFO [org.ovirt.engine.core.vdsbroker.MigrateVDSCommand] (DefaultQuartzScheduler_Worker-78) [3a41deac] FINISH, MigrateVDSCommand, log id: 6a203ced
2014-05-12 18:20:08,210 INFO [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (DefaultQuartzScheduler_Worker-78) [3a41deac] Lock freed to object EngineLock [exclusiveLocks= key: 6c31bb12-8e98-454d-92ab-2d61ed0dd2d8 value: VM, sharedLocks= ]
2014-05-12 18:20:08,210 ERROR [org.ovirt.engine.core.bll.MaintenanceVdsCommand] (DefaultQuartzScheduler_Worker-78) [3a41deac] ResourceManager::vdsMaintenance - Failed migrating desktop rhel5a
2014-05-12 18:20:08,233 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-78) [3a41deac] Correlation ID: 6c1c8897, Job ID: 2b419968-4822-4c8d-9e73-99a78751c9f0, Call Stack: null, Custom Event ID: -1, Message: Failed to switch Host rhevh4-375 to Maintenance mode.
2014-05-12 18:20:08,400 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateBrokerVDSCommand] (pool-4-thread-50) [1c0a2eec] VdsBroker::migrate::Entered (vm_guid=1088854b-4156-490b-840a-8a9f2773cd12, srcHost=rhevh4-375.usersys.redhat.com, dstHost=rhevh3-375.usersys.redhat.com:54321, method=online

- In vdsm.log on rhevh4-375 (source) and rhevh3-375 (destination) the first migration call from the engine is at 18:20 (10 minutes after maintenance mode has been requested):

rhevh4:
Thread-638::DEBUG::2014-05-12 18:20:08,180::BindingXMLRPC::975::vds::(wrapper) client [10.33.20.200]::call vmMigrate with ({'tunneled': 'false', 'src': 'rhevh4-375.usersys.redhat.com', 'dst': 'rhevh3-375.usersys.redhat.com:54321', 'vmId': '6c31bb12-8e98-454d-92ab-2d61ed0dd2d8', 'abortOnError': 'true', 'method': 'online'},) {} flowID [3a41deac]
Thread-638::DEBUG::2014-05-12 18:20:08,180::API::469::vds::(migrate) {'tunneled': 'false', 'src': 'rhevh4-375.usersys.redhat.com', 'dst': 'rhevh3-375.usersys.redhat.com:54321', 'vmId': '6c31bb12-8e98-454d-92ab-2d61ed0dd2d8', 'abortOnError': 'true', 'method': 'online'}

rhevh3:
Thread-197::DEBUG::2014-05-12 18:20:08,558::BindingXMLRPC::975::vds::(wrapper) client [10.33.20.104]::call vmMigrationCreate with ({'username': 'Unknown', 'acpiEnable': 'true', 'emulatedMachine': 'rhel6.5.0', 'afterMigrationStatus': '', 'guestFQDN': '', 'vmId': '6c31bb12-8e98-454d-92ab-2d61ed0dd2d8' ...
This can be consistently reproduced on vdsm-4.13.2-0.13.el6ev and works as expected (no delay) on vdsm-4.13.2-0.11.el6ev.
(In reply to Julio Entrena Perez from comment #2)
> This can be consistently reproduced on vdsm-4.13.2-0.13.el6ev and works as
> expected (no delay) on vdsm-4.13.2-0.11.el6ev.

I'm not sure it's related - can you reconfirm that, and also make sure the host you take down is the one that is SPM?
The above-mentioned fix may cause this, but only in 3.3, because the TX has since been removed from the Migrate commands.

Anyhow, I think the reason the TX times out (10 minutes) is that we have 2 MaintenanceNumberOfVds commands - one is invoked by the user and the other by the host monitoring (VURTI). This leads to a situation where:
- the first is holding a TX and a row lock in the DB on this VDS and waiting for the VDS lockObj
- the second is holding the VDS lock and waiting to persist some VDS data to the DB

This explains why we don't see any GetList or GetVdsStats during these 10 minutes.

Root cause is twofold:
- the Migrate command opens a global TX in 3.3
- the handling of PreparingForMaintenance kicks in too early (it interleaves with the user's call to do a Maintenance)
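For readers less familiar with the engine internals, below is a minimal, self-contained Java sketch of the lock-ordering cycle described above. This is not engine code - the lock objects and names are made up stand-ins: one thread holds a DB row lock (via its open transaction) and waits for the in-memory VDS lock, while the other holds the VDS lock and waits for the DB row. Running it simply hangs, which is the point; in the real flow the stalemate only ends when the transaction reaper fires after the 10-minute timeout.

import java.util.concurrent.locks.ReentrantLock;

// Minimal sketch (NOT engine code): two locks standing in for the resources
// named above - a DB row lock held by an open transaction, and the engine's
// in-memory lock object for the VDS.
public class MaintenanceDeadlockSketch {
    private static final ReentrantLock dbRowLock = new ReentrantLock();  // ~ row lock on the vds in the DB
    private static final ReentrantLock vdsLockObj = new ReentrantLock(); // ~ in-memory VDS lock

    public static void main(String[] args) {
        // Thread 1: user-invoked maintenance - opens a transaction (takes the
        // DB row lock) and then waits for the in-memory VDS lock.
        Thread userMaintenance = new Thread(() -> {
            dbRowLock.lock();
            try {
                sleep(100);            // transaction stays open while migration proceeds
                vdsLockObj.lock();     // blocks: the monitoring thread holds it
                try {
                    System.out.println("user maintenance: acquired both locks");
                } finally {
                    vdsLockObj.unlock();
                }
            } finally {
                dbRowLock.unlock();
            }
        }, "user-maintenance");

        // Thread 2: monitoring-invoked maintenance - already holds the VDS lock
        // and then tries to persist VDS data, i.e. needs the DB row lock.
        Thread monitoringMaintenance = new Thread(() -> {
            vdsLockObj.lock();
            try {
                sleep(100);
                dbRowLock.lock();      // blocks: the user thread's transaction holds it
                try {
                    System.out.println("monitoring: acquired both locks");
                } finally {
                    dbRowLock.unlock();
                }
            } finally {
                vdsLockObj.unlock();
            }
        }, "monitoring-maintenance");

        userMaintenance.start();
        monitoringMaintenance.start();
        // Neither thread ever prints: each waits on the lock the other holds.
        // In the real flow the stalemate only ends when the transaction reaper
        // rolls the transaction back after its 10-minute timeout.
    }

    private static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}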
(In reply to Roy Golan from comment #3)
> I'm not sure it's related - can you reconfirm that, and also make sure the
> host you take down is the one that is SPM?

Indeed, I tried again and I can now reproduce the problem on a host running vdsm-4.13.2-0.11.el6ev too.
Roy, so is the 3.4 implementation sufficient? If so, can we close it in 3.4 GA?
This issue seems to happen only if maintenance mode is requested on the current SPM host. If it is requested on a non-SPM host (or if the SPM role is relocated to another host beforehand), VMs start migrating out of the host immediately after maintenance mode is requested.
(In reply to Michal Skrivanek from comment #6)
> Roy, so is the 3.4 implementation sufficient? If so, can we close it in 3.4
> GA?

Yes.
A bunch of patches that were already merged in 3.4 & u/s are needed to solve this bug in 3.3. Basically, what happens is as follows:

Let's say we have 2 VMs, vm1 & vm2, which are running on host vds1. The user triggers 'switch to maintenance' on vds1:
- vm1 begins to migrate, and as part of the migration the pending memory on the destination host vds2 is increased within a transaction.
- The monitoring decides to re-trigger the 'switch to maintenance' operation (because of a bug that was already fixed u/s).
- The reattempt to migrate vm1 fails on can-do-action (vm1 is locked by the previous migration).
- Let's say that before the first migration of vm1 tries to lock vds1 (in order to send the migrate operation to VDSM), the second 'switch to maintenance' tries to migrate vm2: it tries to increase the pending memory on the destination host vds2 (assuming both VMs are migrating to the same host), but it is stuck because that vds is locked by the previous transaction.

Now, since the second 'switch to maintenance' is invoked from within the monitoring, vds1 is locked by the monitoring. When the first migration continues and tries to lock vds1, we get into a deadlock:
- The migration of vm1 holds vds2 in the DB (because of the transaction) and wants to lock vds1.
- The migration of vm2 holds the lock of vds1 and tries to update vds2 in the DB.

The deadlock is resolved only when the transaction times out, which is after 10 minutes.

To verify this bug I suggest having an environment of 2 hosts, running many VMs on one of them, and then switching it to maintenance. I don't think the host the VMs run on must be the SPM; I don't see how it affects this flow. You should see two MaintenanceNumberOfVdssCommand-s, and that after the second one migrations fail on can-do-action (VM already being migrated), and after that the migrations run.
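To make the ordering concrete, here is an illustrative-only Java sketch of the problematic 3.3-style flow versus an ordering that avoids the deadlock window (commit a short pending-memory transaction before taking any in-memory lock). This is not the actual 3.4/upstream patch; all method names (reservePendingMemory, acquireVdsLock, sendMigrateToVdsm, ...) are hypothetical stand-ins.

// Illustrative-only sketch of the remediation direction described above (the
// actual 3.4/upstream patches differ): keep the pending-memory update in a
// short transaction that commits before any engine lock is taken, so no DB
// row lock is held while waiting for the in-memory VDS lock.
public class MigrationOrderingSketch {

    /** Problematic ordering (3.3-style): the global transaction stays open,
     *  holding the destination-host row lock, while we wait for the VDS lock. */
    static void migrateHoldingGlobalTx(String vmId, String dstVdsId) {
        beginTx();
        try {
            reservePendingMemory(dstVdsId, vmId);   // row lock on the destination host taken here
            acquireVdsLock(dstVdsId);               // may block behind monitoring -> deadlock window
            try {
                sendMigrateToVdsm(vmId, dstVdsId);
            } finally {
                releaseVdsLock(dstVdsId);
            }
            commitTx();
        } catch (RuntimeException e) {
            rollbackTx();
            throw e;
        }
    }

    /** Safer ordering: commit the short bookkeeping transaction first, then
     *  take the in-memory lock and talk to VDSM with no DB row lock held. */
    static void migrateWithShortTx(String vmId, String dstVdsId) {
        beginTx();
        try {
            reservePendingMemory(dstVdsId, vmId);
            commitTx();                             // row lock released immediately
        } catch (RuntimeException e) {
            rollbackTx();
            throw e;
        }
        acquireVdsLock(dstVdsId);
        try {
            sendMigrateToVdsm(vmId, dstVdsId);      // pending memory would be compensated if this fails
        } finally {
            releaseVdsLock(dstVdsId);
        }
    }

    // --- stubs so the sketch compiles; the real engine uses Spring TX + EngineLock ---
    static void beginTx() {}
    static void commitTx() {}
    static void rollbackTx() {}
    static void reservePendingMemory(String vds, String vm) {}
    static void acquireVdsLock(String vds) {}
    static void releaseVdsLock(String vds) {}
    static void sendMigrateToVdsm(String vm, String vds) {}

    public static void main(String[] args) {
        migrateWithShortTx("vm1", "vds2");
        System.out.println("sketch ran: ordering only, no real engine calls");
    }
}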
*** Bug 1105699 has been marked as a duplicate of this bug. ***
What's the status on the z-streams here? 3.3.z? 3.4.z?

I have a customer currently looking to deploy 3.3.z; too much is done to switch to 3.4 at this time. So what about BZs for 3.3.z and 3.4.z?

Not sure if this gives more insight into the issue: https://bugzilla.redhat.com/show_bug.cgi?id=1105699

I'll see if I can answer the SPM/non-SPM question from comment #8.
Created attachment 928254 [details]
Logs from verification process on ovirt-rc1
(In reply to Lukas Svaty from comment #21)

Changing it back to ON_QA as it is not the reported bug - you don't see a delay of 10 minutes, right?

Now, regarding the reported scenario: are those the only 2 hosts in the environment? If so, what would you expect to happen?
After further consultation with Arik, the scenario mentioned in comment #21 is correct. However, for verification we should run this test multiple times.

Steps for verification proposed by Arik:
1. Have multiple VMs (12) in an environment with 2 hosts.
2. Move the host running the VMs to maintenance (SPM or non-SPM should not matter).
3. Wait 15 minutes.
4. In the logs, search for:
   2014-05-12 18:20:07,971 ERROR [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (Transaction Reaper Worker 0) Transaction rolled-back for command: org.ovirt.engine.core.bll.InternalMigrateVmCommand.

For complete verification this will be run repeatedly through the night and I'll check the results tomorrow; the bug will be VERIFIED/FailedQA based on those results.
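For the repeated overnight runs, a small helper along these lines can scan the logs automatically. This is only an illustrative sketch, not part of any product or test suite; the engine.log path and the class/marker names are assumptions to adjust for your environment.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Illustrative helper for step 4 above: scan an engine.log for the
// transaction-reaper rollback that marks the 10-minute stall, and count
// MaintenanceNumberOfVdssCommand invocations.
public class EngineLogCheck {
    private static final String ROLLBACK_MARKER =
            "Transaction rolled-back for command: org.ovirt.engine.core.bll.InternalMigrateVmCommand";
    private static final String MAINTENANCE_MARKER =
            "Running command: MaintenanceNumberOfVdssCommand";

    public static void main(String[] args) throws IOException {
        // Default path is an assumption; pass the log file as the first argument to override it.
        Path log = Path.of(args.length > 0 ? args[0] : "/var/log/ovirt-engine/engine.log");
        List<String> lines = Files.readAllLines(log);

        long rollbacks = lines.stream().filter(l -> l.contains(ROLLBACK_MARKER)).count();
        long maintenanceRuns = lines.stream().filter(l -> l.contains(MAINTENANCE_MARKER)).count();

        System.out.println("MaintenanceNumberOfVdssCommand runs: " + maintenanceRuns);
        System.out.println("InternalMigrateVmCommand rollbacks:  " + rollbacks);
        // Any rollback hit means the deadlock (and hence the 10-minute delay) occurred.
        System.out.println(rollbacks == 0 ? "PASS: no transaction rollback found"
                                          : "FAIL: transaction rollback found");
    }
}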
verified in ovirt-rc1
Description of original problem:
When maintenance mode is requested on a host with running VMs, there is a 10-minute delay before live migration of the running VMs out of the host starts.

Version-Release number of selected component (if applicable):
vdsm-4.13.2-0.13.el6ev

How reproducible:
Always, at every maintenance mode request.

Steps to Reproduce:
1. Request maintenance mode on a host with running VMs.

Actual results:
VMs start live migrating out of the host after a 10-minute delay.

Expected results:
VMs start live migrating out of the host immediately.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-0158.html