Bug 783977

Summary: [ovirt] [vdsm] call destroy is sent to vm when migration is canceled
Product: [Retired] oVirt
Reporter: Haim <hateya>
Component: vdsm
Assignee: Shahar Havivi <shavivi>
Status: CLOSED WORKSFORME
QA Contact:
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: unspecified
CC: abaron, acathrow, bazulay, iheim, mgoldboi, michal.skrivanek, yeylon, ykaul
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: virt
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-10-18 07:43:13 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Attachments:
  vdsm log (flags: none)

Description Haim 2012-01-23 12:30:00 UTC
Description of problem:

scenario: 

- start a migration using web-admin
- using vdsClient, send the command that cancels the migration thread (migrateCancel)
- the migration is stopped; however, the backend sends a 'destroy' call, killing the VM

flow: 

- migration is called from the backend
- migration is started
- the migration is aborted from the client
- migration is aborted
- the backend sends getVmStats, and vdsm reports the migration as having succeeded
- the VM gets killed by the backend
- the VM moves to Unknown on the backend, and to Down after several minutes

Why? It smells like a nasty race.

Let's take the following case:

- vmId = 75f8b814-8a85-4bb0-a428-523e0ec6875c

Thread-2429::DEBUG::2012-01-23 04:36:14,256::clientIF::76::vds::(wrapper) [10.16.144.104]::call migrate with ({'src': '10.16.144.166', 'dst': '10.16.144.164:54321', 'vmId': '75f8b814-8a85-4bb0-a428-523e0ec6875c', 'method': 'online'},) {}

- the migration starts and runs in a different thread:

Thread-2430::DEBUG::2012-01-23 04:36:14,261::vm::120::vm.Vm::(_setupVdsConnection) vmId=`75f8b814-8a85-4bb0-a428-523e0ec6875c`::Initiating connection with destination

- I then send the migration cancel command:

Thread-2432::DEBUG::2012-01-23 04:36:16,688::clientIF::76::vds::(wrapper) [10.16.144.166]::call migrateCancel with ('75f8b814-8a85-4bb0-a428-523e0ec6875c',) {}

- the migration finished successfully:

Thread-2438::DEBUG::2012-01-23 04:36:22,063::libvirtvm::317::vm.Vm::(run) vmId=`75f8b814-8a85-4bb0-a428-523e0ec6875c`::migration downtime thread started
Thread-2439::DEBUG::2012-01-23 04:36:22,065::libvirtvm::345::vm.Vm::(run) vmId=`75f8b814-8a85-4bb0-a428-523e0ec6875c`::starting migration monitor thread
Thread-2430::DEBUG::2012-01-23 04:36:22,065::libvirtvm::332::vm.Vm::(cancel) vmId=`75f8b814-8a85-4bb0-a428-523e0ec6875c`::canceling migration downtime thread
Thread-2430::DEBUG::2012-01-23 04:36:22,079::libvirtvm::382::vm.Vm::(stop) vmId=`75f8b814-8a85-4bb0-a428-523e0ec6875c`::stopping migration monitor thread
Thread-2438::DEBUG::2012-01-23 04:36:22,080::libvirtvm::329::vm.Vm::(run) vmId=`75f8b814-8a85-4bb0-a428-523e0ec6875c`::migration downtime thread exiting
Thread-2430::DEBUG::2012-01-23 04:36:22,187::vm::898::vm.Vm::(setDownStatus) vmId=`75f8b814-8a85-4bb0-a428-523e0ec6875c`::Changed state to Down: Migration succeeded
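To line up the interleaving across threads, excerpts like the ones above can be parsed into a time-ordered list of events. This is a minimal sketch, assuming the `Thread::LEVEL::timestamp::module::lineno::logger::(function) message` layout inferred from these lines (not a documented vdsm format); `parse_line`, `timeline` and `LogEvent` are hypothetical names.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional

# Layout inferred from the excerpts in this report (an assumption, not a spec):
# Thread::LEVEL::timestamp::module::lineno::logger::(function) message
_FUNC_RE = re.compile(r"^\((?P<func>[^)]*)\)\s*(?P<msg>.*)$")


@dataclass
class LogEvent:
    thread: str
    level: str
    when: datetime
    module: str
    lineno: int
    func: str
    message: str


def parse_line(line: str) -> Optional[LogEvent]:
    """Split one log line into fields; return None if it does not match."""
    parts = line.strip().split("::", 6)  # the message itself may contain '::'
    if len(parts) != 7:
        return None
    thread, level, stamp, module, lineno, _logger, rest = parts
    m = _FUNC_RE.match(rest)
    if m is None or not lineno.isdigit():
        return None
    try:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")
    except ValueError:
        return None
    return LogEvent(thread, level, when, module, int(lineno), m["func"], m["msg"])


def timeline(lines: Iterable[str]) -> List[LogEvent]:
    """Parse lines and sort them by timestamp (ties keep input order)."""
    events = [e for e in map(parse_line, lines) if e is not None]
    return sorted(events, key=lambda e: e.when)
```

Printing `e.when`, `e.thread` and `e.func` for each event makes the ordering explicit: Thread-2432's migrateCancel at 04:36:16 arrives several seconds before Thread-2430 sets the VM to Down with "Migration succeeded".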

notes: 

- we need a way to avoid such races (add locking?)
- we need to define a point after which migrateCancel returns an error saying that the migration can no longer be aborted.
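The notes above suggest locking plus a point after which migrateCancel refuses. A minimal sketch of that idea, assuming a single lock guards both the migration thread's phase transitions and the cancel request; `MigrationJob`, `Phase`, `start`, `commit` and `abort_hook` are hypothetical names, not vdsm's API, and where the point of no return actually sits in a libvirt migration is left open.

```python
import threading
from enum import Enum, auto


class Phase(Enum):
    PENDING = auto()    # connection setup; hypervisor call not issued yet
    RUNNING = auto()    # hypervisor migration in progress, still abortable
    COMMITTED = auto()  # past the point of no return
    CANCELED = auto()


class MigrationJob:
    """Hypothetical sketch: phase changes and cancel requests share one
    lock, so a cancel can never slip in between a check and a transition."""

    def __init__(self, abort_hook=None):
        self._lock = threading.Lock()
        self._phase = Phase.PENDING
        # Hypothetical hook, e.g. asking the hypervisor to abort its job.
        self._abort_hook = abort_hook or (lambda: None)

    @property
    def phase(self):
        with self._lock:
            return self._phase

    def cancel(self):
        """Return (ok, message); refuse once the migration is committed."""
        with self._lock:
            if self._phase is Phase.PENDING:
                self._phase = Phase.CANCELED  # start() will see this and bail out
                return True, "canceled before start"
            if self._phase is Phase.RUNNING:
                self._abort_hook()            # only a request; see note below
                self._phase = Phase.CANCELED
                return True, "abort requested"
            if self._phase is Phase.CANCELED:
                return True, "already canceled"
            return False, "migration past the point of no return; cannot abort"

    def start(self):
        """Called by the migration thread right before the hypervisor call."""
        with self._lock:
            if self._phase is not Phase.PENDING:
                return False                  # a cancel arrived first: do not start
            self._phase = Phase.RUNNING
            return True

    def commit(self):
        """Called once the migration can no longer be rolled back."""
        with self._lock:
            if self._phase is not Phase.RUNNING:
                return False                  # canceled while running: report failure
            self._phase = Phase.COMMITTED
            return True
```

Calling `abort_hook` while holding the lock keeps the sketch short. A real implementation would still have to treat the hypervisor's own result as authoritative, since an abort request can lose to a migration that completes anyway.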

Logs attached.

Comment 1 Haim 2012-01-23 12:32:44 UTC
Created attachment 556957 [details]
vdsm log

Comment 2 Dan Kenigsberg 2012-01-23 12:44:07 UTC
Shahar, please check whether this is the problem documented in:

                #FIXME: there still a race here with libvirt,
                # if we call stop() and libvirt migrateToURI2 didn't start
                # we may return migration stop but it will start at libvirt
                # side
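The FIXME describes a lost-cancel race: the cancel acts on something that only exists once the hypervisor call has started. The sketch below replays it deterministically, using `threading.Event` objects to force the cancel into the window after setup but before the call; `NaiveMigration` and `replay_race` are hypothetical stand-ins, and setting `job_running` merely models migrateToURI2 starting.

```python
import threading


class NaiveMigration:
    """Hypothetical model of the FIXME: cancel can only abort a job that
    already exists, yet it reports success either way."""

    def __init__(self):
        self.job_running = False
        self.aborted = False
        self.completed = False

    def cancel(self):
        if self.job_running:
            self.aborted = True
        # Nothing was aborted if the job had not started, but the caller
        # is still told the migration stopped.
        return "migration stopped"

    def run(self, before_call, cancel_done):
        before_call.set()          # setup finished, about to issue the call
        cancel_done.wait()         # force the bad interleaving for the demo
        self.job_running = True    # stands in for migrateToURI2 starting
        if not self.aborted:
            self.completed = True  # the migration goes ahead regardless
        self.job_running = False


def replay_race():
    m = NaiveMigration()
    before_call, cancel_done = threading.Event(), threading.Event()
    t = threading.Thread(target=m.run, args=(before_call, cancel_done))
    t.start()
    before_call.wait()
    reply = m.cancel()             # lands in the window before the call
    cancel_done.set()
    t.join()
    return reply, m.completed
```

`replay_race()` returns `('migration stopped', True)`: the caller is told the migration stopped, yet it completes. Closing the window requires serializing the cancel request with the start of the call, for example by recording the cancel under a lock that the migration thread checks right before issuing it.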

Comment 3 Shahar Havivi 2012-02-02 14:21:25 UTC
It looks like vdsm moves the VM status to 'down', and when the engine sees the VM in status 'down', it calls destroy().

This happens when a migration is started on the source and, from vdsClient, an endless loop is run to cancel the migration:

# while true; do vdsClient -s 0 migrateCancel <vmid>; done
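For repeated reproduction attempts, the shell loop can be scripted with a bound so it does not run forever. This is a minimal sketch, assuming vdsClient is on PATH and accepts the same `-s 0 migrateCancel <vmid>` arguments used in the loop; `hammer_cancel` and its parameters are hypothetical, and `run` is injectable so the loop can be exercised without a host.

```python
import subprocess
import time


def hammer_cancel(vm_id, attempts=50, delay=0.1, run=subprocess.run):
    """Call `vdsClient -s 0 migrateCancel <vmid>` a bounded number of
    times and return the exit code of each call."""
    cmd = ["vdsClient", "-s", "0", "migrateCancel", vm_id]
    codes = []
    for _ in range(attempts):
        # capture_output/text keep vdsClient's replies out of the terminal
        result = run(cmd, capture_output=True, text=True)
        codes.append(result.returncode)
        time.sleep(delay)  # small pause between cancel requests
    return codes
```

Start the migration from web-admin first, then call `hammer_cancel("<vmid>")` while it runs; the collected exit codes show which cancel requests vdsClient reported as failed.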

Comment 4 Shahar Havivi 2012-06-06 08:28:27 UTC
patch sent:
http://gerrit.ovirt.org/#/c/2533/

Comment 5 Shahar Havivi 2012-08-30 10:24:03 UTC
please check if still relevant

Comment 6 Haim 2012-08-30 10:29:29 UTC
(In reply to comment #5)
> please check if still relevant

No capacity to test the flow again.

Comment 7 Michal Skrivanek 2012-10-19 10:51:33 UTC
Different, but perhaps related, bug: 867439