Description of problem:
When maintenance mode is requested on a host with running VMs, there is a 10-minute delay before live migration of the running VMs out of the host starts.

Version-Release number of selected component (if applicable):
vdsm-4.13.2-0.13.el6ev

How reproducible:
Always, at every maintenance mode request.

Steps to Reproduce:
1. Request maintenance mode on a host with running VMs.

Actual results:
VMs start live migrating out of the host after a 10-minute delay.

Expected results:
VMs start live migrating out of the host immediately.

Additional info:
This regression could have been introduced by the fix for bug 966503 / bug 1030367.

Starting scenario: two-host cluster
  rhevh3-375  2c88b5ff-e3d9-4e6b-b2a1-4c2a2cb8ff6a
  rhevh4-375  92b13c1a-c175-4e3e-b468-163f9863c854 (current SPM host)
with two VMs running on rhevh4-375 and none running on rhevh3-375:
  rhel5a  6c31bb12-8e98-454d-92ab-2d61ed0dd2d8
  rhel5b  1088854b-4156-490b-840a-8a9f2773cd12

- Maintenance mode on rhevh4-375 requested on 2014-05-12 at 18:10:

engine.log

2014-05-12 18:10:05,897 INFO [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (ajp-/127.0.0.1:8702-6) [331b8c11] Lock Acquired to object EngineLock [exclusiveLocks= , sharedLocks= key: 076c1118-82b3-40c3-b0c6-cf9d6c7aaf04 value: POOL]
2014-05-12 18:10:05,933 INFO [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (pool-4-thread-50) [331b8c11] Running command: MaintenanceNumberOfVdssCommand internal: false. Entities affected : ID: 92b13c1a-c175-4e3e-b468-163f9863c854 Type: VDS
2014-05-12 18:10:05,934 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (pool-4-thread-50) [331b8c11] START, SetVdsStatusVDSCommand(HostName = rhevh4-375, HostId = 92b13c1a-c175-4e3e-b468-163f9863c854, status=PreparingForMaintenance, nonOperationalReason=NONE, stopSpmFailureLogged=true), log id: 739dce3f

- SPM role is taken out of rhevh4-375:

2014-05-12 18:10:05,934 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (pool-4-thread-50) [331b8c11] VDS rhevh4-375 is spm and moved from up calling resetIrs.
2014-05-12 18:10:05,934 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.ResetIrsVDSCommand] (pool-4-thread-50) [331b8c11] START, ResetIrsVDSCommand( storagePoolId = 076c1118-82b3-40c3-b0c6-cf9d6c7aaf04, ignoreFailoverLimit = false, vdsId = 92b13c1a-c175-4e3e-b468-163f9863c854, ignoreStopFailed = false), log id: 2b20d3ed
2014-05-12 18:10:05,936 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (pool-4-thread-50) [331b8c11] START, SpmStopVDSCommand(HostName = rhevh4-375, HostId = 92b13c1a-c175-4e3e-b468-163f9863c854, storagePoolId = 076c1118-82b3-40c3-b0c6-cf9d6c7aaf04), log id: 39aac16e
2014-05-12 18:10:05,937 DEBUG [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (pool-4-thread-50) [331b8c11] START, HSMGetAllTasksStatusesVDSCommand(HostName = rhevh4-375, HostId = 92b13c1a-c175-4e3e-b468-163f9863c854), log id: 79813a55
2014-05-12 18:10:05,961 DEBUG [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (pool-4-thread-50) [331b8c11] FINISH, HSMGetAllTasksStatusesVDSCommand, return: {}, log id: 79813a55
2014-05-12 18:10:05,961 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (pool-4-thread-50) [331b8c11] SpmStopVDSCommand::Stopping SPM on vds rhevh4-375, pool id 076c1118-82b3-40c3-b0c6-cf9d6c7aaf04
2014-05-12 18:10:07,689 DEBUG [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-71) START, SpmStatusVDSCommand(HostName = rhevh4-375, HostId = 92b13c1a-c175-4e3e-b468-163f9863c854, storagePoolId = 076c1118-82b3-40c3-b0c6-cf9d6c7aaf04), log id: 17f27d83

- Live migration of rhel5b fails:

2014-05-12 18:10:07,875 INFO [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (pool-4-thread-50) [1c0a2eec] Lock Acquired to object EngineLock [exclusiveLocks= key: 1088854b-4156-490b-840a-8a9f2773cd12 value: VM, sharedLocks= ]
2014-05-12 18:10:07,929 INFO [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (DefaultQuartzScheduler_Worker-78) [49d83eb3] Failed to Acquire Lock to object EngineLock [exclusiveLocks= key: 1088854b-4156-490b-840a-8a9f2773cd12 value: VM, sharedLocks= ]
2014-05-12 18:10:07,930 WARN [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (DefaultQuartzScheduler_Worker-78) [49d83eb3] CanDoAction of action InternalMigrateVm failed. Reasons:VAR__ACTION__MIGRATE,VAR__TYPE__VM,ACTION_TYPE_FAILED_VM_IS_BEING_MIGRATED,$VmName rhel5b
2014-05-12 18:10:07,962 INFO [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (pool-4-thread-50) [1c0a2eec] Running command: InternalMigrateVmCommand internal: true. Entities affected : ID: 1088854b-4156-490b-840a-8a9f2773cd12 Type: VM
2014-05-12 18:10:07,971 ERROR [org.ovirt.engine.core.bll.job.ExecutionHandler] (DefaultQuartzScheduler_Worker-78) [49d83eb3] java.lang.NullPointerException
2014-05-12 18:10:07,971 ERROR [org.ovirt.engine.core.bll.MaintenanceVdsCommand] (DefaultQuartzScheduler_Worker-78) [49d83eb3] ResourceManager::vdsMaintenance - Failed migrating desktop rhel5b

- Live migration of rhel5a is requested:

2014-05-12 18:10:07,988 INFO [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (DefaultQuartzScheduler_Worker-78) [3a41deac] Lock Acquired to object EngineLock [exclusiveLocks= key: 6c31bb12-8e98-454d-92ab-2d61ed0dd2d8 value: VM , sharedLocks= ]
2014-05-12 18:10:07,996 INFO [org.ovirt.engine.core.vdsbroker.MigrateVDSCommand] (pool-4-thread-50) [1c0a2eec] START, MigrateVDSCommand(HostName = rhevh4-375, HostId = 92b13c1a-c175-4e3e-b468-163f9863c854, vmId=1088854b-4156-490b-840a-8a9f2773cd12, srcHost=rhevh4-375.usersys.redhat.com, dstVdsId=2c88b5ff-e3d9-4e6b-b2a1-4c2a2cb8ff6a, dstHost=rhevh3-375.usersys.redhat.com:54321, migrationMethod=ONLINE, tunnelMigration=false), log id: 46dfd02e
2014-05-12 18:10:08,037 INFO [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (DefaultQuartzScheduler_Worker-78) [3a41deac] Running command: InternalMigrateVmCommand internal: true. Entities affected : ID: 6c31bb12-8e98-454d-92ab-2d61ed0dd2d8 Type: VM

- SPM starts on rhevh3-375:

2014-05-12 18:10:08,055 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-71) [7d5dda46] START, SpmStartVDSCommand(HostName = rhevh3-375, HostId = 2c88b5ff-e3d9-4e6b-b2a1-4c2a2cb8ff6a, storagePoolId = 076c1118-82b3-40c3-b0c6-cf9d6c7aaf04, prevId=2, prevLVER=9, storagePoolFormatType=V3, recoveryMode=Manual, SCSIFencing=true), log id: e465ab2
2014-05-12 18:10:08,088 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-71) [7d5dda46] spmStart polling started: taskId = d2e7ed36-86d2-43d7-aa1c-85d74ed38375
2014-05-12 18:10:09,113 DEBUG [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-71) [7d5dda46] spmStart polling - task status: running
2014-05-12 18:10:10,399 DEBUG [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-71) [7d5dda46] spmStart polling - task status: running
2014-05-12 18:10:11,458 DEBUG [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-71) [7d5dda46] spmStart polling - task status: running
2014-05-12 18:10:12,491 DEBUG [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-71) [7d5dda46] spmStart polling - task status: finished
2014-05-12 18:10:12,492 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-71) [7d5dda46] spmStart polling ended: taskId = d2e7ed36-86d2-43d7-aa1c-85d74ed38375 task status = finished
2014-05-12 18:10:12,493 DEBUG [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-71) [7d5dda46] START, SpmStatusVDSCommand(HostName = rhevh3-375, HostId = 2c88b5ff-e3d9-4e6b-b2a1-4c2a2cb8ff6a, storagePoolId = 076c1118-82b3-40c3-b0c6-cf9d6c7aaf04), log id: 52df5d3d
2014-05-12 18:10:12,522 DEBUG [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-71) [7d5dda46] FINISH, SpmStatusVDSCommand, return: org.ovirt.engine.core.common.businessentities.SpmStatusResult@752795fb, log id: 52df5d3d
2014-05-12 18:10:12,522 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-71) [7d5dda46] spmStart polling ended, spm status: SPM
2014-05-12 18:10:12,523 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (DefaultQuartzScheduler_Worker-71) [7d5dda46] START, HSMClearTaskVDSCommand(HostName = rhevh3-375, HostId = 2c88b5ff-e3d9-4e6b-b2a1-4c2a2cb8ff6a, taskId=d2e7ed36-86d2-43d7-aa1c-85d74ed38375), log id: 79e7cdd3

- 10 minutes later:

2014-05-12 18:20:07,971 ERROR [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (Transaction Reaper Worker 0) Transaction rolled-back for command: org.ovirt.engine.core.bll.InternalMigrateVmCommand.
2014-05-12 18:20:07,977 INFO [org.ovirt.engine.core.vdsbroker.MigrateVDSCommand] (DefaultQuartzScheduler_Worker-78) [3a41deac] START, MigrateVDSCommand(HostName = rhevh4-375, HostId = 92b13c1a-c175-4e3e-b468-163f9863c854, vmId=6c31bb12-8e98-454d-92ab-2d61ed0dd2d8, srcHost=rhevh4-375.usersys.redhat.com, dstVdsId=2c88b5ff-e3d9-4e6b-b2a1-4c2a2cb8ff6a, dstHost=rhevh3-375.usersys.redhat.com:54321, migrationMethod=ONLINE, tunnelMigration=false), log id: 6a203ced
2014-05-12 18:20:07,980 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateBrokerVDSCommand] (DefaultQuartzScheduler_Worker-78) [3a41deac] VdsBroker::migrate::Entered (vm_guid=6c31bb12-8e98-454d-92ab-2d61ed0dd2d8, srcHost=rhevh4-375.usersys.redhat.com, dstHost=rhevh3-375.usersys.redhat.com:54321, method=online
2014-05-12 18:20:07,980 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateBrokerVDSCommand] (DefaultQuartzScheduler_Worker-78) [3a41deac] START, MigrateBrokerVDSCommand(HostName = rhevh4-375, HostId = 92b13c1a-c175-4e3e-b468-163f9863c854, vmId=6c31bb12-8e98-454d-92ab-2d61ed0dd2d8, srcHost=rhevh4-375.usersys.redhat.com, dstVdsId=2c88b5ff-e3d9-4e6b-b2a1-4c2a2cb8ff6a, dstHost=rhevh3-375.usersys.redhat.com:54321, migrationMethod=ONLINE, tunnelMigration=false), log id: 35c0eed9
2014-05-12 18:20:08,041 ERROR [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (Transaction Reaper Worker 0) Transaction rolled-back for command: org.ovirt.engine.core.bll.InternalMigrateVmCommand.
2014-05-12 18:20:08,201 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateBrokerVDSCommand] (DefaultQuartzScheduler_Worker-78) [3a41deac] FINISH, MigrateBrokerVDSCommand, log id: 35c0eed9
2014-05-12 18:20:08,203 ERROR [org.ovirt.engine.core.vdsbroker.MigrateVDSCommand] (DefaultQuartzScheduler_Worker-78) [3a41deac] Command MigrateVDS execution failed. Exception: CannotGetJdbcConnectionException: Could not get JDBC Connection; nested exception is java.sql.SQLException: javax.resource.ResourceException: IJ000460: Error checking for a transaction
2014-05-12 18:20:08,203 INFO [org.ovirt.engine.core.vdsbroker.MigrateVDSCommand] (DefaultQuartzScheduler_Worker-78) [3a41deac] FINISH, MigrateVDSCommand, log id: 6a203ced
2014-05-12 18:20:08,210 INFO [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (DefaultQuartzScheduler_Worker-78) [3a41deac] Lock freed to object EngineLock [exclusiveLocks= key: 6c31bb12-8e98-454d-92ab-2d61ed0dd2d8 value: VM, sharedLocks= ]
2014-05-12 18:20:08,210 ERROR [org.ovirt.engine.core.bll.MaintenanceVdsCommand] (DefaultQuartzScheduler_Worker-78) [3a41deac] ResourceManager::vdsMaintenance - Failed migrating desktop rhel5a
2014-05-12 18:20:08,233 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-78) [3a41deac] Correlation ID: 6c1c8897, Job ID: 2b419968-4822-4c8d-9e73-99a78751c9f0, Call Stack: null, Custom Event ID: -1, Message: Failed to switch Host rhevh4-375 to Maintenance mode.
2014-05-12 18:20:08,400 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateBrokerVDSCommand] (pool-4-thread-50) [1c0a2eec] VdsBroker::migrate::Entered (vm_guid=1088854b-4156-490b-840a-8a9f2773cd12, srcHost=rhevh4-375.usersys.redhat.com, dstHost=rhevh3-375.usersys.redhat.com:54321, method=online

- In vdsm.log on rhevh4-375 (source) and rhevh3-375 (destination) the first migration call from the engine is at 18:20 (10 minutes after maintenance mode has been requested):

rhevh4:
Thread-638::DEBUG::2014-05-12 18:20:08,180::BindingXMLRPC::975::vds::(wrapper) client [10.33.20.200]::call vmMigrate with ({'tunneled': 'false', 'src': 'rhevh4-375.usersys.redhat.com', 'dst': 'rhevh3-375.usersys.redhat.com:54321', 'vmId': '6c31bb12-8e98-454d-92ab-2d61ed0dd2d8', 'abortOnError': 'true', 'method': 'online'},) {} flowID [3a41deac]
Thread-638::DEBUG::2014-05-12 18:20:08,180::API::469::vds::(migrate) {'tunneled': 'false', 'src': 'rhevh4-375.usersys.redhat.com', 'dst': 'rhevh3-375.usersys.redhat.com:54321', 'vmId': '6c31bb12-8e98-454d-92ab-2d61ed0dd2d8', 'abortOnError': 'true', 'method': 'online'}

rhevh3:
Thread-197::DEBUG::2014-05-12 18:20:08,558::BindingXMLRPC::975::vds::(wrapper) client [10.33.20.104]::call vmMigrationCreate with ({'username': 'Unknown', 'acpiEnable': 'true', 'emulatedMachine': 'rhel6.5.0', 'afterMigrationStatus': '', 'guestFQDN': '', 'vmId': '6c31bb12-8e98-454d-92ab-2d61ed0dd2d8' ...
This can be consistently reproduced on vdsm-4.13.2-0.13.el6ev and works as expected (no delay) on vdsm-4.13.2-0.11.el6ev.
(In reply to Julio Entrena Perez from comment #2)
> This can be consistently reproduced on vdsm-4.13.2-0.13.el6ev and works as
> expected (no delay) on vdsm-4.13.2-0.11.el6ev.

I'm not sure it's related - can you reconfirm that, and also make sure the host you take down is the one that is SPM?
The above-mentioned fix may cause this, but only in 3.3, because the TX has since been removed from the Migrate commands.

Anyhow, I think the reason the TX times out (10 minutes) is that we have 2 MaintenanceNumberOfVds commands - one is invoked by the user and the other by the host monitoring (VURTI). This leads to a situation where:
- the first is holding a TX and a row lock in the DB on this VDS and waiting for the VDS lockObj
- the second is holding the VDS lock and waiting to persist some VDS data to the DB

This explains why we don't see any GetList or GetVdsStats during these 10 minutes.

Root cause is twofold:
- the Migrate command opens a global TX in 3.3
- the handling of PreparingForMaintenance kicks in too early (it interleaves with the user's call to do a Maintenance)
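For readers less familiar with the engine internals, below is a minimal, self-contained Java sketch of the lock-ordering cycle described above. This is not engine code - the lock objects and names are made up stand-ins: one thread holds a DB row lock (via its open transaction) and waits for the in-memory VDS lock, while the other holds the VDS lock and waits for the DB row. Running it simply hangs, which is the point; in the real flow the stalemate only ends when the transaction reaper fires after the 10-minute timeout.

import java.util.concurrent.locks.ReentrantLock;

// Minimal sketch (NOT engine code): two locks standing in for the resources
// named above - a DB row lock held by an open transaction, and the engine's
// in-memory lock object for the VDS.
public class MaintenanceDeadlockSketch {
    private static final ReentrantLock dbRowLock = new ReentrantLock();  // ~ row lock on the vds in the DB
    private static final ReentrantLock vdsLockObj = new ReentrantLock(); // ~ in-memory VDS lock

    public static void main(String[] args) {
        // Thread 1: user-invoked maintenance - opens a transaction (takes the
        // DB row lock) and then waits for the in-memory VDS lock.
        Thread userMaintenance = new Thread(() -> {
            dbRowLock.lock();
            try {
                sleep(100);            // transaction stays open while migration proceeds
                vdsLockObj.lock();     // blocks: the monitoring thread holds it
                try {
                    System.out.println("user maintenance: acquired both locks");
                } finally {
                    vdsLockObj.unlock();
                }
            } finally {
                dbRowLock.unlock();
            }
        }, "user-maintenance");

        // Thread 2: monitoring-invoked maintenance - already holds the VDS lock
        // and then tries to persist VDS data, i.e. needs the DB row lock.
        Thread monitoringMaintenance = new Thread(() -> {
            vdsLockObj.lock();
            try {
                sleep(100);
                dbRowLock.lock();      // blocks: the user thread's transaction holds it
                try {
                    System.out.println("monitoring: acquired both locks");
                } finally {
                    dbRowLock.unlock();
                }
            } finally {
                vdsLockObj.unlock();
            }
        }, "monitoring-maintenance");

        userMaintenance.start();
        monitoringMaintenance.start();
        // Neither thread ever prints: each waits on the lock the other holds.
        // In the real flow the stalemate only ends when the transaction reaper
        // rolls the transaction back after its 10-minute timeout.
    }

    private static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}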
(In reply to Roy Golan from comment #3)
> I'm not sure it's related - can you reconfirm that, and also make sure the
> host you take down is the one that is SPM?

Indeed, I tried again and I can now reproduce the problem on a host running vdsm-4.13.2-0.11.el6ev too.
Roy, so is the 3.4 implementation sufficient? If so, can we close it in 3.4 GA?
This issue seems to happen only if maintenance mode is requested on the current SPM host. If it is requested on a non-SPM host (or if the SPM role is relocated to another host beforehand), VMs start migrating out of the host immediately after maintenance mode is requested.
(In reply to Michal Skrivanek from comment #6)
> Roy, so is the 3.4 implementation sufficient? If so, can we close it in 3.4
> GA?

Yes.
A bunch of patches that were already merged in 3.4 & u/s are needed to solve this bug in 3.3. Basically, what happens is as follows:

Let's say we have 2 VMs, vm1 & vm2, which are running on host vds1. The user triggers 'switch to maintenance' on vds1:
- vm1 begins to migrate, and as part of the migration the pending memory on the destination host vds2 is increased within a transaction.
- The monitoring decides to re-trigger the 'switch to maintenance' operation (because of a bug that was already fixed u/s).
- The reattempt to migrate vm1 fails on can-do-action (vm1 is locked by the previous migration).
- Let's say that before the first migration of vm1 tries to lock vds1 (in order to send the migrate operation to VDSM), the second 'switch to maintenance' tries to migrate vm2: it tries to increase the pending memory on the destination host vds2 (assuming both VMs are migrating to the same host), but it is stuck because that vds is locked by the previous transaction.

Now, since the second 'switch to maintenance' is invoked from within the monitoring, vds1 is locked by the monitoring. When the first migration continues and tries to lock vds1, we get into a deadlock:
- The migration of vm1 holds vds2 in the DB (because of the transaction) and wants to lock vds1.
- The migration of vm2 holds the lock of vds1 and tries to update vds2 in the DB.

The deadlock is resolved only when the transaction times out, which is after 10 minutes.

To verify this bug I suggest having an environment of 2 hosts, running many VMs on one of them, and then switching it to maintenance. I don't think the host the VMs run on must be the SPM; I don't see how it affects this flow. You should see two MaintenanceNumberOfVdssCommand-s, and that after the second one migrations fail on can-do-action (VM already being migrated), and after that the migrations run.
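To make the ordering concrete, here is an illustrative-only Java sketch of the problematic 3.3-style flow versus an ordering that avoids the deadlock window (commit a short pending-memory transaction before taking any in-memory lock). This is not the actual 3.4/upstream patch; all method names (reservePendingMemory, acquireVdsLock, sendMigrateToVdsm, ...) are hypothetical stand-ins.

// Illustrative-only sketch of the remediation direction described above (the
// actual 3.4/upstream patches differ): keep the pending-memory update in a
// short transaction that commits before any engine lock is taken, so no DB
// row lock is held while waiting for the in-memory VDS lock.
public class MigrationOrderingSketch {

    /** Problematic ordering (3.3-style): the global transaction stays open,
     *  holding the destination-host row lock, while we wait for the VDS lock. */
    static void migrateHoldingGlobalTx(String vmId, String dstVdsId) {
        beginTx();
        try {
            reservePendingMemory(dstVdsId, vmId);   // row lock on the destination host taken here
            acquireVdsLock(dstVdsId);               // may block behind monitoring -> deadlock window
            try {
                sendMigrateToVdsm(vmId, dstVdsId);
            } finally {
                releaseVdsLock(dstVdsId);
            }
            commitTx();
        } catch (RuntimeException e) {
            rollbackTx();
            throw e;
        }
    }

    /** Safer ordering: commit the short bookkeeping transaction first, then
     *  take the in-memory lock and talk to VDSM with no DB row lock held. */
    static void migrateWithShortTx(String vmId, String dstVdsId) {
        beginTx();
        try {
            reservePendingMemory(dstVdsId, vmId);
            commitTx();                             // row lock released immediately
        } catch (RuntimeException e) {
            rollbackTx();
            throw e;
        }
        acquireVdsLock(dstVdsId);
        try {
            sendMigrateToVdsm(vmId, dstVdsId);      // pending memory would be compensated if this fails
        } finally {
            releaseVdsLock(dstVdsId);
        }
    }

    // --- stubs so the sketch compiles; the real engine uses Spring TX + EngineLock ---
    static void beginTx() {}
    static void commitTx() {}
    static void rollbackTx() {}
    static void reservePendingMemory(String vds, String vm) {}
    static void acquireVdsLock(String vds) {}
    static void releaseVdsLock(String vds) {}
    static void sendMigrateToVdsm(String vm, String vds) {}

    public static void main(String[] args) {
        migrateWithShortTx("vm1", "vds2");
        System.out.println("sketch ran: ordering only, no real engine calls");
    }
}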
*** Bug 1105699 has been marked as a duplicate of this bug. ***
What's the status on the z-streams here? 3.3.z? 3.4.z?

I have a customer currently looking to deploy 3.3.z; too much is done to switch to 3.4 at this time. So what about BZs for 3.3.z and 3.4.z?

Not sure if this gives more insight into the issue: https://bugzilla.redhat.com/show_bug.cgi?id=1105699

I'll see if I can answer the SPM/non-SPM question from comment #8.
Created attachment 928254 [details]
Logs from verification process on ovirt-rc1
(In reply to Lukas Svaty from comment #21)

Changing it back to ON_QA as it is not the reported bug - you don't see a delay of 10 minutes, right?

Now, regarding the reported scenario: are those the only 2 hosts in the environment? If so, what would you expect to happen?
After further consultation with Arik, the scenario mentioned in comment #21 is correct. However, for verification we should run this test multiple times.

Steps for verification proposed by Arik:
1. Have multiple VMs (12) in an environment with 2 hosts.
2. Move the host running the VMs to maintenance (SPM or non-SPM should not matter).
3. Wait 15 minutes.
4. In the logs, search for:
   2014-05-12 18:20:07,971 ERROR [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (Transaction Reaper Worker 0) Transaction rolled-back for command: org.ovirt.engine.core.bll.InternalMigrateVmCommand.

For complete verification this will be run repeatedly through the night and I'll check the results tomorrow; the bug will be VERIFIED/FailedQA based on those results.
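For the repeated overnight runs, a small helper along these lines can scan the logs automatically. This is only an illustrative sketch, not part of any product or test suite; the engine.log path and the class/marker names are assumptions to adjust for your environment.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Illustrative helper for step 4 above: scan an engine.log for the
// transaction-reaper rollback that marks the 10-minute stall, and count
// MaintenanceNumberOfVdssCommand invocations.
public class EngineLogCheck {
    private static final String ROLLBACK_MARKER =
            "Transaction rolled-back for command: org.ovirt.engine.core.bll.InternalMigrateVmCommand";
    private static final String MAINTENANCE_MARKER =
            "Running command: MaintenanceNumberOfVdssCommand";

    public static void main(String[] args) throws IOException {
        // Default path is an assumption; pass the log file as the first argument to override it.
        Path log = Path.of(args.length > 0 ? args[0] : "/var/log/ovirt-engine/engine.log");
        List<String> lines = Files.readAllLines(log);

        long rollbacks = lines.stream().filter(l -> l.contains(ROLLBACK_MARKER)).count();
        long maintenanceRuns = lines.stream().filter(l -> l.contains(MAINTENANCE_MARKER)).count();

        System.out.println("MaintenanceNumberOfVdssCommand runs: " + maintenanceRuns);
        System.out.println("InternalMigrateVmCommand rollbacks:  " + rollbacks);
        // Any rollback hit means the deadlock (and hence the 10-minute delay) occurred.
        System.out.println(rollbacks == 0 ? "PASS: no transaction rollback found"
                                          : "FAIL: transaction rollback found");
    }
}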
verified in ovirt-rc1
Description of original problem:
When maintenance mode is requested on a host with running VMs, there is a 10-minute delay before live migration of the running VMs out of the host starts.

Version-Release number of selected component (if applicable):
vdsm-4.13.2-0.13.el6ev

How reproducible:
Always, at every maintenance mode request.

Steps to Reproduce:
1. Request maintenance mode on a host with running VMs.

Actual results:
VMs start live migrating out of the host after a 10-minute delay.

Expected results:
VMs start live migrating out of the host immediately.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-0158.html