Bug 878131
| Summary: | Race between VM migration and other virt/storage operations |
|---|---|
| Product: | Red Hat Enterprise Virtualization Manager |
| Component: | ovirt-engine |
| Status: | CLOSED CURRENTRELEASE |
| Severity: | high |
| Priority: | high |
| Version: | 3.1.0 |
| Target Milestone: | --- |
| Target Release: | 3.2.0 |
| Hardware: | x86_64 |
| OS: | Linux |
| Whiteboard: | virt |
| Fixed In Version: | sf15 |
| Doc Type: | Bug Fix |
| Type: | Bug |
| Regression: | --- |
| Reporter: | Dafna Ron <dron> |
| Assignee: | Arik <ahadas> |
| QA Contact: | Dafna Ron <dron> |
| CC: | bazulay, dyasny, eblake, hateya, iheim, lpeer, ofrenkel, Rhev-m-bugs, scohen, sgrinber, ykaul |
| : | 952147 (view as bug list) |
| Bug Blocks: | 952147 |
The engine should prevent the two actions (live migration and live storage migration) from taking place at the same time.

qemu can't support live VM migration while a block mirror job is active, at least not until it supports persistent bitmaps. This is because libvirt has no way to recreate the mirroring on a new qemu instance on the destination of the VM migration. Libvirt should be rejecting migration attempts while any block job is active, to reflect this current qemu limitation. You have to finish any block mirroring job before attempting a VM migration. Similarly, it should not be possible to start a disk mirroring operation while live migration is underway - you have to finish VM migration before starting a block job.

Thanks for the clarification, Eric! We'll block it in the engine then.

Fede, any update on this?

(In reply to comment #8)
> Fede, any update on this?

We should decide whether to block this only in VDSM or in the engine too. On the VDSM side it would be best if libvirt blocked it for us (as it needs to anyway), returning a specific error. This would let us automatically consume the persistent bitmaps once they are ready. Eventually we can block it ourselves based on the VM status (but then we'll have to remove this check later on). Ayal, do you want to block this on the engine side too? (Same thing here: if we block it on the engine side we might have a better UI experience, but then we'll need to remove the checks later.)

(In reply to comment #5)
> qemu can't support live VM migration while a block mirror job is active, at
> least not until it supports persistent bitmaps. This is because libvirt has
> no way to recreate the mirroring on a new qemu instance on the destination
> of the VM migration. Libvirt should be rejecting migration attempts while
> any block job is active, to reflect this current qemu limitation.

Eric, are you working on this? We need both checks: no migration during a block job, and no block jobs during migration.

Eric, any update?
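Eric's constraint above boils down to two symmetric guards. A minimal sketch (the function names and data shapes here are illustrative, not the actual libvirt/VDSM/engine APIs) of refusing migration while any disk has an active block job, and vice versa:

```python
def migration_allowed(block_jobs):
    """Refuse live VM migration while any disk has an active block job,
    mirroring the qemu limitation described above.

    block_jobs: dict mapping disk name -> block-job info dict;
    an empty dict means no active job on that disk (hypothetical shape).
    """
    active = sorted(disk for disk, job in block_jobs.items() if job)
    if active:
        return False, "block job active on: " + ", ".join(active)
    return True, "ok"


def block_job_allowed(vm_migrating):
    """The symmetric check: no new block job while migration is underway."""
    if vm_migrating:
        return False, "VM migration in progress"
    return True, "ok"
```

Once qemu gains persistent bitmaps, both guards could be dropped, which is why the comments above prefer libvirt to enforce this rather than hard-coding it in the engine.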
On the engine side we need to grey out the relevant button if possible, and add a canDoAction check to block running these operations concurrently.

(In reply to comment #10)
> Eric, any update?

Still on my list of things to improve on the libvirt side.

Merged the engine fix.

Tested on sf9: the migrate button is still available and there is no CanDoAction:

2013-03-05 05:33:08,501 INFO [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (pool-3-thread-41) [b8148cd] Running command: MigrateVmToServerCommand internal: false. Entities affected : ID: 11b50fec-8aba-417c-9bf1-71b9d8b3613f Type: VM
2013-03-05 05:33:08,503 INFO [org.ovirt.engine.core.bll.VdsSelector] (pool-3-thread-41) [b8148cd] Checking for a specific VDS only - id:83834e1f-9e60-41b5-a9cc-16460a8a2fe2, name:gold-vdsd, host_name(ip):gold-vdsd.qa.lab.tlv.redhat.com
2013-03-05 05:33:08,516 INFO [org.ovirt.engine.core.vdsbroker.MigrateVDSCommand] (pool-3-thread-41) [b8148cd] START, MigrateVDSCommand(HostName = gold-vdsc, HostId = 2982e993-2ca5-42bb-86ed-8db10986c47e, vmId=11b50fec-8aba-417c-9bf1-71b9d8b3613f, srcHost=gold-vdsc.qa.lab.tlv.redhat.com, dstVdsId=83834e1f-9e60-41b5-a9cc-16460a8a2fe2, dstHost=gold-vdsd.qa.lab.tlv.redhat.com:54321, migrationMethod=ONLINE), log id: 59a098a4

Full logs will be attached.

Created attachment 705511 [details]
logs
The issue here is broader than Live Storage Migration. MigrateVmCommand fails to take proper locks on the VM and its disks, which will inevitably cause a race between it and any other command that modifies the volumes the VM sees, e.g., Live Snapshot or Hotplug Disk. The abandoned patch mentioned above (http://gerrit.ovirt.org/#/c/13263/) introduces such a mechanism, which works for the initial run of the command, but cannot solve this bug due to the way MigrateVm handles re-runs, without retaking any locks or checking canDoAction(). Moving to virt for a proper solution.

Tested on sf15. Both buttons are still visible in the UI - move disk is available when the image is not yet locked, and although a CanDoAction is blocking the migration operation, it is not surfaced in the UI (neither as a CanDoAction error window nor in the event log). So I am moving this back to devel: if the buttons should be grayed out, then they are not; if only the CanDoAction should resolve the issue, then please make sure the user is aware of the failure.

2013-05-06 13:20:17,042 WARN [org.ovirt.engine.core.bll.MoveDisksCommand] (ajp-/127.0.0.1:8702-11) [23b810ef] CanDoAction of action MoveDisks failed. Reasons:VAR__ACTION__MOVE,VAR__TYPE__VM_DISK,$VmName Bla,ACTION_TYPE_FAILED_VM_IS_NOT_DOWN_OR_UP
2013-05-06 13:21:08,050 WARN [org.ovirt.engine.core.bll.MigrateVmCommand] (ajp-/127.0.0.1:8702-11) CanDoAction of action MigrateVm failed. Reasons:ACTION_TYPE_FAILED_DISKS_LOCKED,$diskAliases Bla_Disk1,VAR__ACTION__MIGRATE,VAR__TYPE__VM

It's not practical to gray out the buttons - there are too many races. So, as requested by Dafna, I'm switching the status back to ON_QA, since it seems the can-do-action window isn't shown in her environment in other scenarios as well.

Following comment #27, I am moving this to VERIFIED on sf15.

3.2 has been released
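The locking gap described above (a command racing with any other command that touches the same VM or disks) can be sketched with a shared lock table that every command must consult, including on re-run. This is an illustrative Python model, not the actual ovirt-engine lock manager:

```python
import threading


class EntityLocks:
    """Illustrative exclusive-lock table: a command acquires locks on the
    VM and every disk it touches before running, and a re-run must
    re-acquire them (the missing step in the original MigrateVmCommand)."""

    def __init__(self):
        self._guard = threading.Lock()
        self._held = set()

    def try_acquire(self, entity_ids):
        with self._guard:
            if any(e in self._held for e in entity_ids):
                return False  # another command holds a conflicting lock
            self._held.update(entity_ids)
            return True

    def release(self, entity_ids):
        with self._guard:
            self._held.difference_update(entity_ids)


def run_migrate(locks, vm_id, disk_ids):
    """canDoAction-style validation: refuse to run if the VM or any of
    its disks is locked by another operation (hypothetical helper)."""
    ids = [vm_id] + list(disk_ids)
    if not locks.try_acquire(ids):
        return "CanDoAction failed: entity locked"
    try:
        return "migrating"
    finally:
        locks.release(ids)
```

With this model, a live storage migration that holds the disk lock makes a concurrent `run_migrate` fail its validation instead of racing, which is the behavior the eventual engine fix enforces via canDoAction.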
Created attachment 647901 [details]
log

Description of problem:
I migrated a VM during live storage migration. The VM migration fails because of a libvirt error, and then the live storage migration fails on disk replication because of:
TimeoutError: Timed out during operation: cannot acquire state change lock

Version-Release number of selected component (if applicable):
si24.2
vdsm-4.9.6-43.0.el6_3.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create and start a VM on a two-host cluster.
2. Start live migration of the VM's disk.
3. Try to migrate the VM to the second host.

Actual results:
Migration fails, and then disk replication fails as well.

Expected results:
We should be able to complete the live storage migration.

Additional info:
logs

This error can be found in the HSM log (on the host where the VM was running):

Thread-8550::ERROR::2012-11-19 19:09:15,591::libvirtvm::2062::vm.Vm::(diskReplicateStart) vmId=`e3cb41b3-ab82-4803-a4cd-843093a32423`::Cannot complete the disk replication process
Traceback (most recent call last):
  File "/usr/share/vdsm/libvirtvm.py", line 2053, in diskReplicateStart
    libvirt.VIR_DOMAIN_BLOCK_REBASE_SHALLOW
  File "/usr/share/vdsm/libvirtvm.py", line 526, in f
    raise toe
TimeoutError: Timed out during operation: cannot acquire state change lock
Thread-8550::DEBUG::2012-11-19 19:09:15,623::BindingXMLRPC::900::vds::(wrapper) return vmDiskReplicateStart with {'status': {'message': 'Drive replication error', 'code': 55}}
Thread-8575::DEBUG::2012-11-19 19:09:15,642::BindingXMLRPC::171::vds::(wrapper) [10.35.97.65]
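The "cannot acquire state change lock" failure in the traceback is a timed-out lock acquisition: the disk-replication call waits for a per-domain lock that the in-flight VM migration already holds, and gives up after a bound. A self-contained sketch of that failure mode, with a plain `threading.Lock` standing in for libvirt's per-domain job lock (names are illustrative):

```python
import threading


class ReplicationTimeoutError(Exception):
    """Stand-in for the TimeoutError raised in the traceback above."""


# Models libvirt's per-domain state-change/job lock.
state_change_lock = threading.Lock()


def disk_replicate_start(timeout=0.1):
    """Wait a bounded time for the domain lock, as the VDSM wrapper does,
    and raise on expiry, producing the error seen in the log."""
    if not state_change_lock.acquire(timeout=timeout):
        raise ReplicationTimeoutError(
            "Timed out during operation: cannot acquire state change lock")
    try:
        return "replication started"
    finally:
        state_change_lock.release()
```

If a concurrent migration is holding `state_change_lock` when `disk_replicate_start` runs, the bounded `acquire` fails and the replication aborts, which is exactly the interaction the engine-side check was added to prevent.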