Bug 783977 - [ovirt] [vdsm] call destroy is sent to vm when migration is canceled
Summary: [ovirt] [vdsm] call destroy is sent to vm when migration is canceled
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: oVirt
Classification: Retired
Component: vdsm
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Shahar Havivi
QA Contact:
URL:
Whiteboard: virt
Depends On:
Blocks:
Reported: 2012-01-23 12:30 UTC by Haim
Modified: 2014-01-13 00:50 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-10-18 07:43:13 UTC
oVirt Team: ---


Attachments
vdsm log (432.98 KB, application/x-gzip)
2012-01-23 12:32 UTC, Haim

Description Haim 2012-01-23 12:30:00 UTC
Description of problem:

scenario:

- start a migration from web-admin
- using vdsClient, send the migrateCancel command to cancel the migration
- the migration stops; however, the backend then sends a 'destroy' call, which kills the vm

flow:

- backend calls migrate
- migration starts
- the migration is aborted from the client (vdsClient)
- migration is aborted
- backend sends getVmStats, and vdsm reports the vm as if the migration had succeeded
- backend kills the vm
- vm moves to unknown on the backend, and to down after several minutes

why: it smells like a nasty race.

let's take the following case:

- vmId = 75f8b814-8a85-4bb0-a428-523e0ec6875c

Thread-2429::DEBUG::2012-01-23 04:36:14,256::clientIF::76::vds::(wrapper) [10.16.144.104]::call migrate with ({'src': '10.16.144.166', 'dst': '10.16.144.164:54321', 'vmId': '75f8b814-8a85-4bb0-a428-523e0ec6875c', 'method': 'online'},) {}

- migration starts, and runs in a different thread:

Thread-2430::DEBUG::2012-01-23 04:36:14,261::vm::120::vm.Vm::(_setupVdsConnection) vmId=`75f8b814-8a85-4bb0-a428-523e0ec6875c`::Initiating connection with destination

- now I send the migration cancel command:

Thread-2432::DEBUG::2012-01-23 04:36:16,688::clientIF::76::vds::(wrapper) [10.16.144.166]::call migrateCancel with ('75f8b814-8a85-4bb0-a428-523e0ec6875c',) {}

- yet vdsm reports that the migration finished successfully:

Thread-2438::DEBUG::2012-01-23 04:36:22,063::libvirtvm::317::vm.Vm::(run) vmId=`75f8b814-8a85-4bb0-a428-523e0ec6875c`::migration downtime thread started
Thread-2439::DEBUG::2012-01-23 04:36:22,065::libvirtvm::345::vm.Vm::(run) vmId=`75f8b814-8a85-4bb0-a428-523e0ec6875c`::starting migration monitor thread
Thread-2430::DEBUG::2012-01-23 04:36:22,065::libvirtvm::332::vm.Vm::(cancel) vmId=`75f8b814-8a85-4bb0-a428-523e0ec6875c`::canceling migration downtime thread
Thread-2430::DEBUG::2012-01-23 04:36:22,079::libvirtvm::382::vm.Vm::(stop) vmId=`75f8b814-8a85-4bb0-a428-523e0ec6875c`::stopping migration monitor thread
Thread-2438::DEBUG::2012-01-23 04:36:22,080::libvirtvm::329::vm.Vm::(run) vmId=`75f8b814-8a85-4bb0-a428-523e0ec6875c`::migration downtime thread exiting
Thread-2430::DEBUG::2012-01-23 04:36:22,187::vm::898::vm.Vm::(setDownStatus) vmId=`75f8b814-8a85-4bb0-a428-523e0ec6875c`::Changed state to Down: Migration succeeded
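
To follow the interleaving of threads in excerpts like the ones above, it can help to filter the vdsm log down to a single vmId and split each line into its fields. Below is a minimal sketch, assuming the `::`-separated layout seen in these excerpts (thread, level, timestamp, module, line number, logger, message). The helper names `parse_vdsm_line` and `events_for_vm` are made up for illustration and are not part of vdsm.

```python
def parse_vdsm_line(line):
    """Split one vdsm log line into fields.

    Assumes the layout seen in this report:
    thread::level::timestamp::module::lineno::logger::message
    Returns None for lines that do not match (e.g. traceback lines).
    """
    parts = line.rstrip("\n").split("::", 6)
    if len(parts) < 7:
        return None
    thread, level, timestamp, module, lineno, logger, message = parts
    return {
        "thread": thread,
        "level": level,
        "timestamp": timestamp,
        "module": module,
        "lineno": lineno,
        "logger": logger,
        "message": message,
    }


def events_for_vm(lines, vm_id):
    """Yield parsed records for every line that mentions vm_id."""
    for line in lines:
        if vm_id not in line:
            continue
        record = parse_vdsm_line(line)
        if record is not None:
            yield record


if __name__ == "__main__":
    import sys

    # Usage (hypothetical script name): python vmlog.py <vmid> < vdsm.log
    for rec in events_for_vm(sys.stdin, sys.argv[1]):
        print(rec["timestamp"], rec["thread"], rec["message"])
```

Printing timestamp, thread and message side by side makes it easier to see which thread handled migrateCancel and which one later called setDownStatus.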

notes: 

- we need a way to avoid such races (add locking?)
- we need to define a point after which migration cancel returns an error, saying that the migration can no longer be aborted.
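
The two notes above (locking, plus a point after which cancel is refused) could be combined roughly as follows. This is only a sketch under stated assumptions, not vdsm code: `MigrationGuard`, `CancelTooLate` and the state names are hypothetical, and where exactly the "point of no return" sits in the real migration flow is left open.

```python
import threading


class CancelTooLate(Exception):
    """Raised when a cancel arrives after the point of no return."""


class MigrationGuard:
    """Hypothetical helper that serialises the migrate and cancel paths."""

    PREPARING = "preparing"
    RUNNING = "running"
    COMMITTED = "committed"
    CANCELED = "canceled"

    def __init__(self):
        self._lock = threading.Lock()
        self.state = self.PREPARING

    def start_transfer(self):
        """Migration thread calls this right before starting the transfer.

        Returns False if a cancel was already requested, in which case
        the caller should not start the migration at all.
        """
        with self._lock:
            if self.state == self.CANCELED:
                return False
            self.state = self.RUNNING
            return True

    def commit(self):
        """Migration thread calls this at the point of no return.

        Returns False if the migration was canceled in the meantime, so
        the caller must not report success or set the VM to Down.
        """
        with self._lock:
            if self.state == self.CANCELED:
                return False
            self.state = self.COMMITTED
            return True

    def cancel(self):
        """Cancel path. Returns True if a running transfer must be aborted.

        Raises CancelTooLate once the migration has been committed.
        """
        with self._lock:
            if self.state == self.COMMITTED:
                raise CancelTooLate("migration can no longer be aborted")
            was_running = self.state == self.RUNNING
            self.state = self.CANCELED
            return was_running
```

With a guard like this, the migration thread would only change the VM state to Down when `commit()` returns True, and a late migrateCancel would get an explicit error instead of a success reply. The lock covers only the state change, so a cancel that returns True still has to abort the transfer that is already in progress.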

attached logs.

Comment 1 Haim 2012-01-23 12:32:44 UTC
Created attachment 556957
vdsm log

Comment 2 Dan Kenigsberg 2012-01-23 12:44:07 UTC
Shahar, please see if this is the documented problem of

                #FIXME: there still a race here with libvirt,
                # if we call stop() and libvirt migrateToURI2 didn't start
                # we may return migration stop but it will start at libvirt
                # side

Comment 3 Shahar Havivi 2012-02-02 14:21:25 UTC
it looks like vdsm is moving the VM status to 'down',
and when the engine sees the VM in status 'down' it calls destroy().

this happens when starting a migration on the source and, from vdsClient, running an endless loop to stop the migration:

# while true; do vdsClient -s 0 migrateCancel <vmid>; done

Comment 4 Shahar Havivi 2012-06-06 08:28:27 UTC
patch sent:
http://gerrit.ovirt.org/#/c/2533/

Comment 5 Shahar Havivi 2012-08-30 10:24:03 UTC
please check if still relevant

Comment 6 Haim 2012-08-30 10:29:29 UTC
(In reply to comment #5)
> please check if still relevant

no capacity to test the flow again.

Comment 7 Michal Skrivanek 2012-10-19 10:51:33 UTC
different, but perhaps related bug: 867439

