Bug 1097256
| Summary: | 10 minute delay on migrating VMs out after requesting maintenance mode | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Julio Entrena Perez <jentrena> |
| Component: | ovirt-engine | Assignee: | Arik <ahadas> |
| Status: | CLOSED ERRATA | QA Contact: | Lukas Svaty <lsvaty> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.3.0 | CC: | abisogia, adahms, ahadas, bazulay, dsulliva, flo_bugzilla, iheim, jentrena, lpeer, lsvaty, michal.skrivanek, mkalinin, ofrenkel, pbandark, pdwyer, rbalakri, rgolan, Rhev-m-bugs, sherold, tdosek, yeylon, ylavi |
| Target Milestone: | --- | Keywords: | Regression, ZStream |
| Target Release: | 3.5.0 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | virt | | |
| Fixed In Version: | vt1.3 | Doc Type: | Bug Fix |
| Doc Text: | Previously, virtual machines that failed to migrate to another host due to a maintenance operation on a host would cause deadlocks in the engine database. This would result in maintenance operations taking a long time to complete when virtual machines failed to migrate. Now, deadlocks no longer occur, allowing maintenance operations to complete more quickly when virtual machines fail to migrate. | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| : | 1110126 1110146 (view as bug list) | Environment: | |
| Last Closed: | 2015-02-11 18:01:43 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1110126, 1110146, 1142923, 1156165 | | |
| Attachments: | | | |
Description: Julio Entrena Perez, 2014-05-13 12:44:14 UTC
---

Comment 2 (Julio Entrena Perez):

This can be consistently reproduced on vdsm-4.13.2-0.13.el6ev and works as expected (no delay) on vdsm-4.13.2-0.11.el6ev.

---

Comment 3 (Roy Golan):

(In reply to Julio Entrena Perez from comment #2)
> This can be consistently reproduced on vdsm-4.13.2-0.13.el6ev and works as
> expected (no delay) on vdsm-4.13.2-0.11.el6ev.

I'm not sure it's related - can you reconfirm that, and also make sure the host you take down is the one which is the SPM? The above-mentioned fix may cause this, but only in 3.3, because we removed the TX from the Migrate commands.

Anyhow, I think the reason the TX times out (10 minutes) is that we have two MaintenanceNumberOfVds commands - one is invoked by the user and the other by the host monitoring (VURTI). This leads to a situation where the first holds a TX and a row lock in the DB on this vds while waiting for the vds lockObj, and the second holds the vds lock while waiting to persist some vds data to the DB. This explains why we don't see any GetList or GetVdsStats during these 10 minutes.

Root cause is twofold: the Migrate command opens a global TX in 3.3, and the handling of PreparingForMaintenance kicks in too early (it interleaves with the user's call to start Maintenance).

---

Comment (Julio Entrena Perez):

(In reply to Roy Golan from comment #3)
> I'm not sure it's related - can you reconfirm that and also make sure the
> host you take down is the one which is the SPM?

Indeed, I tried again and I can now reproduce the problem on a host running vdsm-4.13.2-0.11.el6ev too.

---

Comment 6 (Michal Skrivanek):

Roy, so is the 3.4 implementation sufficient? If so, can we close it in 3.4 GA?

---

Comment 8:

This issue only seems to happen if maintenance mode is requested on the current SPM host. If it is requested on a non-SPM host (or if the SPM role is relocated to another host beforehand), VMs start migrating out of the host immediately after maintenance mode is requested.

---

Comment (Roy Golan):

(In reply to Michal Skrivanek from comment #6)
> Roy, so is the 3.4 implementation sufficient? If so, can we close it in 3.4 GA?

Yes.

---

A bunch of patches which were already merged in 3.4 and upstream are needed to solve this bug in 3.3. Basically, what happens is as follows. Let's say we have two VMs, vm1 and vm2, running on host vds1, and the user triggers 'switch to maintenance' on vds1:

- vm1 begins to migrate, and as part of the migration the pending memory on the destination host vds2 is increased inside a transaction.
- The monitoring decides to re-trigger the 'switch to maintenance' operation (because of a bug that was already fixed upstream).
- The reattempt to migrate vm1 fails on can-do-action (vm1 is locked by the previous migration).
- Let's say that before the first migration of vm1 tries to lock vds1 (in order to send the migrate operation to VDSM), the second 'switch to maintenance' tries to migrate vm2: it tries to increase the pending memory on the destination host vds2 (assuming both VMs are migrating to the same host), but it is stuck because that vds row is locked by the previous transaction.

Now, since the second 'switch to maintenance' is invoked from within the monitoring, vds1 is locked by the monitoring. When the first migration continues and tries to lock vds1, we get a deadlock:

- The migration of vm1 holds vds2 in the DB (because of the transaction) and wants to lock vds1.
- The migration of vm2 holds the lock of vds1 and tries to update vds2 in the DB.

The deadlock is resolved only when the transaction times out, which is after 10 minutes.

To verify this bug I suggest having an environment of 2 hosts, running many VMs on one of them, and then switching it to maintenance. I don't think the host the VMs run on must be the SPM; I don't see how it affects this flow. You should see two MaintenanceNumberOfVdssCommand-s, and that after the second one, migrations fail due to can-do-action (VM already being migrated); after that, migrations run.

---

*** Bug 1105699 has been marked as a duplicate of this bug. ***

---

What's the status on the zstreams here? 3.3.z? 3.4.z?
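The wait-for cycle described above can be sketched as a small standalone program. This is not ovirt-engine code: the two `ReentrantLock`s stand in for the DB row lock on vds2 (held by the open transaction) and the in-memory engine lock on vds1 (held by the monitoring), and a 2-second `tryLock` timeout stands in for the 10-minute transaction reaper.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

class MaintenanceDeadlockSketch {
    // Stand-ins for the two resources in the cycle: the DB row lock on the
    // destination host (held by the open transaction) and the in-memory
    // engine lock on the source host (held by the monitoring).
    static final ReentrantLock dbRowVds2 = new ReentrantLock();
    static final ReentrantLock engineVds1 = new ReentrantLock();
    static final CountDownLatch bothHoldFirstLock = new CountDownLatch(2);
    static final AtomicBoolean vm1RolledBack = new AtomicBoolean();
    static final AtomicBoolean vm2Proceeded = new AtomicBoolean();

    public static void main(String[] args) throws InterruptedException {
        // Migration of vm1: its transaction already locked the vds2 row;
        // it then needs the vds1 lock to send the migrate verb to VDSM.
        Thread migrateVm1 = new Thread(() -> {
            dbRowVds2.lock();
            try {
                sync();
                try {
                    // 2 seconds stands in for the 10-minute transaction timeout.
                    if (engineVds1.tryLock(2, TimeUnit.SECONDS)) {
                        engineVds1.unlock();
                    } else {
                        vm1RolledBack.set(true); // the "Transaction Reaper" rolls back
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            } finally {
                dbRowVds2.unlock(); // rollback releases the row lock
            }
        });

        // Migration of vm2 (from the re-triggered maintenance): the monitoring
        // already holds the vds1 lock; it then must update the vds2 row.
        Thread migrateVm2 = new Thread(() -> {
            engineVds1.lock();
            try {
                sync();
                dbRowVds2.lock(); // blocks until vm1's transaction ends
                dbRowVds2.unlock();
                vm2Proceeded.set(true);
            } finally {
                engineVds1.unlock();
            }
        });

        migrateVm1.start();
        migrateVm2.start();
        migrateVm1.join();
        migrateVm2.join();
        System.out.println("vm1 rolled back=" + vm1RolledBack + ", vm2 proceeded=" + vm2Proceeded);
    }

    // Make sure each thread holds its first lock before either tries the second.
    static void sync() {
        bothHoldFirstLock.countDown();
        try {
            bothHoldFirstLock.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The program deadlocks exactly as in the report; nothing moves until the timeout fires, after which vm1's "transaction" rolls back and vm2 can finally proceed.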
---

I have a customer currently looking to deploy 3.3.z - too much done to switch to 3.4 at this time. So what about BZs for the 3.3.z and 3.4.z?

---

Not sure if this gives more information into the issue: https://bugzilla.redhat.com/show_bug.cgi?id=1105699

I'll see if I can answer the SPM/non-SPM reference from comment #8.

---

Created attachment 928254 [details]
Logs from verification process on ovirt-rc1
---

(In reply to Lukas Svaty from comment #21)

Changing it back to ON_QA as it is not the reported bug - you don't see a delay of 10 minutes, right? Now, regarding the reported scenario: are those the only 2 hosts in the environment? If so, what would you expect to happen?

---

Comment (Lukas Svaty):

After further consultation with Arik, the scenario mentioned in comment #21 is correct. However, for verification we should run this test multiple times.

Steps for verification proposed by Arik:
1. Have multiple VMs (12) in an environment with 2 hosts.
2. Switch the host (SPM or non-SPM should not matter) running the VMs to maintenance.
3. Wait 15 minutes.
4. In the logs, search for:
   2014-05-12 18:20:07,971 ERROR [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (Transaction Reaper Worker 0) Transaction rolled-back for command: org.ovirt.engine.core.bll.InternalMigrateVmCommand.

For complete verification this will be run through the night endlessly and I'll check the results tomorrow. The bug will be VERIFIED/FailedQA based on tomorrow's results.

---

Verified in ovirt-rc1.

---

Description of original problem:
When maintenance mode is requested on a host with running VMs, there is a 10 minute delay before the running VMs start live migrating out.

Version-Release number of selected component (if applicable):
vdsm-4.13.2-0.13.el6ev

How reproducible:
Always, at every maintenance mode request.

Steps to Reproduce:
1. Request maintenance mode on a host with running VMs.

Actual results:
VMs start live migrating out of the host after a 10 minute delay.

Expected results:
VMs start live migrating out of the host immediately.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0158.html
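For illustration, the direction of the fix discussed in the comments (removing the global TX from the Migrate commands) can be sketched as follows. The class and method names here are hypothetical, not actual ovirt-engine code: the point is only that if the pending-memory update commits in its own short transaction, no DB row lock is held while waiting for the in-memory host lock, so the wait-for cycle cannot form.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the fix direction (not actual ovirt-engine code):
// commit the pending-memory update in its own short transaction *before*
// taking the in-memory host lock.
class ShortTransactionSketch {
    static final ReentrantLock dbRowVds2 = new ReentrantLock();  // DB row lock stand-in
    static final ReentrantLock engineVds1 = new ReentrantLock(); // host lock stand-in

    // Stand-in for "increase pending memory on the destination in its own TX".
    static void updatePendingMemoryCommitted() {
        dbRowVds2.lock();
        try {
            // update + commit; the row lock is released right away
        } finally {
            dbRowVds2.unlock();
        }
    }

    static void migrateVm() {
        updatePendingMemoryCommitted(); // no lock is held past this point
        engineVds1.lock();              // so this wait cannot be part of a cycle
        try {
            // send the migrate verb to VDSM
        } finally {
            engineVds1.unlock();
        }
    }

    public static void main(String[] args) {
        migrateVm();
        System.out.println("migration dispatched without holding a row lock");
    }
}
```

Each thread now acquires at most one of the two resources at a time, which is the standard way to break a lock-order deadlock.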