Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 878131

Summary: Race between VM migration and other virt/storage operations
Product: Red Hat Enterprise Virtualization Manager
Component: ovirt-engine
Reporter: Dafna Ron <dron>
Assignee: Arik <ahadas>
QA Contact: Dafna Ron <dron>
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: high
Version: 3.1.0
Target Release: 3.2.0
Hardware: x86_64
OS: Linux
Whiteboard: virt
Fixed In Version: sf15
Doc Type: Bug Fix
Type: Bug
CC: bazulay, dyasny, eblake, hateya, iheim, lpeer, ofrenkel, Rhev-m-bugs, scohen, sgrinber, ykaul
Clones: 952147
Bug Blocks: 952147
Attachments: log, logs

Description Dafna Ron 2012-11-19 17:27:28 UTC
Created attachment 647901 [details]
log

Description of problem:

I migrated a VM during live storage migration.
The VM migration fails because of a libvirt error, and then the live storage migration fails on disk replication with "TimeoutError: Timed out during operation: cannot acquire state change lock".

Version-Release number of selected component (if applicable):

si24.2
vdsm-4.9.6-43.0.el6_3.x86_64

How reproducible:

100%

Steps to Reproduce:
1. Create and start a VM on a two-host cluster
2. Start a live migration of the VM's disk (live storage migration)
3. Try to migrate the VM to the second host
  
Actual results:

Migration fails and then disk replication fails as well.

Expected results:

We should be able to complete the live storage migration.

Additional info: logs

This error can be found on the HSM host (where the VM was running).

Thread-8550::ERROR::2012-11-19 19:09:15,591::libvirtvm::2062::vm.Vm::(diskReplicateStart) vmId=`e3cb41b3-ab82-4803-a4cd-843093a32423`::Cannot complete the disk replication process
Traceback (most recent call last):
  File "/usr/share/vdsm/libvirtvm.py", line 2053, in diskReplicateStart
    libvirt.VIR_DOMAIN_BLOCK_REBASE_SHALLOW
  File "/usr/share/vdsm/libvirtvm.py", line 526, in f
    raise toe
TimeoutError: Timed out during operation: cannot acquire state change lock
Thread-8550::DEBUG::2012-11-19 19:09:15,623::BindingXMLRPC::900::vds::(wrapper) return vmDiskReplicateStart with {'status': {'message': 'Drive replication error', 'code': 55}}
Thread-8575::DEBUG::2012-11-19 19:09:15,642::BindingXMLRPC::171::vds::(wrapper) [10.35.97.65]

Comment 1 Federico Simoncelli 2012-11-21 13:49:25 UTC
The engine should prevent the two actions (live migration and live storage migration) from taking place at the same time.

Comment 5 Eric Blake 2013-01-14 16:30:21 UTC
qemu can't support live VM migration while a block mirror job is active, at least not until it supports persistent bitmaps.  This is because libvirt has no way to recreate the mirroring on a new qemu instance on the destination of the VM migration.  Libvirt should be rejecting migration attempts while any block job is active, to reflect this current qemu limitation.  You have to finish any block mirroring job before attempting a VM migration.

Comment 6 Eric Blake 2013-01-14 16:38:27 UTC
Similarly, it should not be possible to start a disk mirroring operation while live migration is underway - you have to finish VM migration before starting a block job.
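
Comments 5 and 6 together describe a mutual exclusion: no VM migration while a block (mirror) job is active, and no block job while a migration is underway. A minimal Python sketch of that rule (all names hypothetical; this is not the actual vdsm or libvirt code):

```python
# Hypothetical sketch of the mutual exclusion described in comments 5 and 6.
# Names and structure are illustrative, not the real vdsm/libvirt code.

class OperationConflict(Exception):
    """Raised when two mutually exclusive operations are requested."""

class VmOps:
    def __init__(self):
        self.migrating = False
        self.active_block_jobs = set()  # disk aliases with a running job

    def start_migration(self):
        # Comment 5: qemu cannot migrate while a block mirror job is
        # active, so reject the migration attempt up front.
        if self.active_block_jobs:
            raise OperationConflict(
                "cannot migrate: block job active on %s"
                % ", ".join(sorted(self.active_block_jobs)))
        self.migrating = True

    def start_block_job(self, disk):
        # Comment 6: the reverse check, no new block job during migration.
        if self.migrating:
            raise OperationConflict(
                "cannot start block job: migration in progress")
        self.active_block_jobs.add(disk)

    def finish_block_job(self, disk):
        self.active_block_jobs.discard(disk)
```

Either operation is allowed again once the conflicting one finishes, which matches Eric's point that the block mirroring job must complete before a VM migration can be attempted.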

Comment 7 Ayal Baron 2013-01-14 23:00:34 UTC
Thanks for the clarification Eric!
We'll block it in engine then.

Comment 8 Ayal Baron 2013-01-28 14:38:19 UTC
Fede, any update on this?

Comment 9 Federico Simoncelli 2013-01-29 11:45:04 UTC
(In reply to comment #8)
> Fede, any update on this?

We should decide whether to block this only in VDSM or in the engine too.
On the VDSM side it would be best if libvirt blocked this for us (as it needs the check anyway), returning a specific error. That would let us automatically take advantage of persistent bitmaps once they are ready.

Alternatively, we can block it ourselves based on the VM status (but then we'll have to remove this check later on).

Ayal, do you want to block this on the engine side too? (Same here: if we block it on the engine side we might have a better UI experience, but then we'll need to remove the checks later.)

(In reply to comment #5)
> qemu can't support live VM migration while a block mirror job is active, at
> least not until it supports persistent bitmaps.  This is because libvirt has
> no way to recreate the mirroring on a new qemu instance on the destination
> of the VM migration.  Libvirt should be rejecting migration attempts while
> any block job is active, to reflect this current qemu limitation.

Eric are you working on this?
We need both checks: no migration during blockjob and no blockjobs during migration.

Comment 10 Ayal Baron 2013-02-04 08:42:27 UTC
Eric, any update?

Comment 11 Ayal Baron 2013-02-04 09:31:59 UTC
On the engine side we need to grey out the relevant button if possible, and add a canDoAction check to block running these operations concurrently.
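
The engine-side check requested here is what later surfaces in the logs of comment 26. ovirt-engine is Java; the following is only a small Python sketch of the canDoAction logic (hypothetical names, with the failure reasons borrowed from the real audit messages quoted in comment 26):

```python
# Hypothetical Python sketch of a canDoAction-style validation; the real
# implementation lives in ovirt-engine (Java). The reason strings mirror
# the audit messages quoted later in comment 26.

LOCKED = "LOCKED"
OK = "OK"

def can_do_migrate_vm(vm):
    """Return (True, []) if migration may run, else (False, reasons)."""
    locked = [d["alias"] for d in vm["disks"] if d["status"] == LOCKED]
    if locked:
        # A live storage migration holds the disk in LOCKED state, so
        # VM migration must be refused while it is in progress.
        return False, ["ACTION_TYPE_FAILED_DISKS_LOCKED"] + locked
    return True, []

def can_do_move_disk(vm):
    """Moving a disk requires the VM to be steadily Up or Down."""
    if vm["status"] not in ("Up", "Down"):
        return False, ["ACTION_TYPE_FAILED_VM_IS_NOT_DOWN_OR_UP"]
    return True, []
```

With both checks in place, whichever operation starts first causes the other to fail validation instead of racing at the VDSM/libvirt level.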

Comment 12 Eric Blake 2013-02-07 13:09:46 UTC
(In reply to comment #10)
> Eric, any update?

Still on my list of things to improve on the libvirt side.

Comment 13 Allon Mureinik 2013-02-25 09:23:51 UTC
Merged the engine fix.

Comment 14 Dafna Ron 2013-03-05 15:35:55 UTC
Tested on sf9.
The Migrate button is still available and there is no canDoAction check.

2013-03-05 05:33:08,501 INFO  [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (pool-3-thread-41) [b8148cd] Running command: MigrateVmToServerCommand internal: false. Entities affected :  ID: 11b50fec-8aba-417c-9bf1-71b9d8b3613f Type: VM
2013-03-05 05:33:08,503 INFO  [org.ovirt.engine.core.bll.VdsSelector] (pool-3-thread-41) [b8148cd] Checking for a specific VDS only - id:83834e1f-9e60-41b5-a9cc-16460a8a2fe2, name:gold-vdsd, host_name(ip):gold-vdsd.qa.lab.tlv.redhat.com
2013-03-05 05:33:08,516 INFO  [org.ovirt.engine.core.vdsbroker.MigrateVDSCommand] (pool-3-thread-41) [b8148cd] START, MigrateVDSCommand(HostName = gold-vdsc, HostId = 2982e993-2ca5-42bb-86ed-8db10986c47e, vmId=11b50fec-8aba-417c-9bf1-71b9d8b3613f, srcHost=gold-vdsc.qa.lab.tlv.redhat.com, dstVdsId=83834e1f-9e60-41b5-a9cc-16460a8a2fe2, dstHost=gold-vdsd.qa.lab.tlv.redhat.com:54321, migrationMethod=ONLINE), log id: 59a098a4

full logs will be attached

Comment 15 Dafna Ron 2013-03-05 15:37:03 UTC
Created attachment 705511 [details]
logs

Comment 16 Allon Mureinik 2013-04-03 06:16:56 UTC
The issue here is broader than Live Storage Migration.

MigrateVmCommand fails to take proper locks on the VM and its disks, which will inevitably cause a race between it and any other command that modifies the volumes the VM sees, e.g., Live Snapshot or Hotplug Disk.

The abandoned patch mentioned above (http://gerrit.ovirt.org/#/c/13263/) introduces such a locking mechanism, which works for the initial run of the command, but cannot solve this bug because of the way MigrateVm handles re-runs: it re-runs without retaking any locks or checking canDoAction().

Moving to virt for a proper solution.
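
The re-run problem described above can be sketched as follows: the fix is to hold the lock for the command's whole lifetime, so re-runs happen while the lock is still held. This is a hypothetical Python illustration, not the ovirt-engine Java locking code:

```python
# Hypothetical sketch of the re-run issue from comment 16: a lock taken
# only for the initial run is released before a re-run, leaving a window
# for a conflicting command. Holding it for the command lifetime closes
# that window. Not the actual ovirt-engine implementation.

class LockManager:
    def __init__(self):
        self._held = set()

    def acquire(self, key):
        if key in self._held:
            return False  # a conflicting command already holds the lock
        self._held.add(key)
        return True

    def release(self, key):
        self._held.discard(key)

class MigrateVmCommand:
    """Holds the VM lock for the whole command lifetime, re-runs included."""

    def __init__(self, locks, vm_id):
        self.locks = locks
        self.vm_id = vm_id

    def execute(self, runs):
        if not self.locks.acquire(self.vm_id):
            return "blocked"
        try:
            # Re-runs happen here, while the lock is still held, so a
            # concurrent command on the same VM is rejected throughout.
            for _ in range(runs):
                pass  # placeholder for one migration attempt
            return "done"
        finally:
            self.locks.release(self.vm_id)
```

If the lock were instead released between attempts, a Live Snapshot or Hotplug Disk command could slip in between re-runs, which is exactly the race this bug describes.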

Comment 26 Dafna Ron 2013-05-06 10:30:21 UTC
Tested on sf15.
Both buttons are still visible in the UI: Move Disk is available while the image is not yet locked, and although a canDoAction check blocks the migration operation, the failure is not surfaced in the UI (neither in a canDoAction error window nor in the event log).
So I am moving this back to devel:
if the buttons should be grayed out, then they are not;
if a canDoAction check alone should resolve the issue, then please make sure the user is aware of the failure.

2013-05-06 13:20:17,042 WARN  [org.ovirt.engine.core.bll.MoveDisksCommand] (ajp-/127.0.0.1:8702-11) [23b810ef] CanDoAction of action MoveDisks failed. Reasons:VAR__ACTION__MOVE,VAR__TYPE__VM_DISK,$VmName Bla,ACTION_TYPE_FAILED_VM_IS_NOT_DOWN_OR_UP


2013-05-06 13:21:08,050 WARN  [org.ovirt.engine.core.bll.MigrateVmCommand] (ajp-/127.0.0.1:8702-11) CanDoAction of action MigrateVm failed. Reasons:ACTION_TYPE_FAILED_DISKS_LOCKED,$diskAliases Bla_Disk1,VAR__ACTION__MIGRATE,VAR__TYPE__VM

Comment 27 Arik 2013-05-07 06:24:02 UTC
It's not practical to gray out the buttons: there are too many races.
So, as requested by Dafna, I am switching the status back to ON_QA, since it seems that the canDoAction window isn't shown in her environment in other scenarios as well.

Comment 28 Dafna Ron 2013-05-07 07:28:24 UTC
Following comment #27, I am moving this to VERIFIED on sf15.

Comment 29 Itamar Heim 2013-06-11 09:00:34 UTC
3.2 has been released
