Description of problem:

Scenario:
- Start a migration using the web admin.
- Using vdsClient, send the command to cancel the migration thread.
- The migration is stopped; however, the backend sends a 'destroy' command, killing the VM.

Flow:
- Migration is called from the backend.
- Migration starts.
- From the client, abort the migration.
- The migration is aborted.
- The backend sends getVmStats, and vdsm reports as if the migration had succeeded.
- The VM gets killed by the backend.
- The VM moves to Unknown on the backend, and to Down after several minutes.

Why: it smells like a nasty race. Let's take the following case:

- vmId = 75f8b814-8a85-4bb0-a428-523e0ec6875c

Thread-2429::DEBUG::2012-01-23 04:36:14,256::clientIF::76::vds::(wrapper) [10.16.144.104]::call migrate with ({'src': '10.16.144.166', 'dst': '10.16.144.164:54321', 'vmId': '75f8b814-8a85-4bb0-a428-523e0ec6875c', 'method': 'online'},) {}

- The migration starts and runs in a different thread:

Thread-2430::DEBUG::2012-01-23 04:36:14,261::vm::120::vm.Vm::(_setupVdsConnection) vmId=`75f8b814-8a85-4bb0-a428-523e0ec6875c`::Initiating connection with destination

- Now I send the migration cancel command:

Thread-2432::DEBUG::2012-01-23 04:36:16,688::clientIF::76::vds::(wrapper) [10.16.144.166]::call migrateCancel with ('75f8b814-8a85-4bb0-a428-523e0ec6875c',) {}

- Yet the migration finishes "successfully":

Thread-2438::DEBUG::2012-01-23 04:36:22,063::libvirtvm::317::vm.Vm::(run) vmId=`75f8b814-8a85-4bb0-a428-523e0ec6875c`::migration downtime thread started
Thread-2439::DEBUG::2012-01-23 04:36:22,065::libvirtvm::345::vm.Vm::(run) vmId=`75f8b814-8a85-4bb0-a428-523e0ec6875c`::starting migration monitor thread
Thread-2430::DEBUG::2012-01-23 04:36:22,065::libvirtvm::332::vm.Vm::(cancel) vmId=`75f8b814-8a85-4bb0-a428-523e0ec6875c`::canceling migration downtime thread
Thread-2430::DEBUG::2012-01-23 04:36:22,079::libvirtvm::382::vm.Vm::(stop) vmId=`75f8b814-8a85-4bb0-a428-523e0ec6875c`::stopping migration monitor thread
Thread-2438::DEBUG::2012-01-23 04:36:22,080::libvirtvm::329::vm.Vm::(run) vmId=`75f8b814-8a85-4bb0-a428-523e0ec6875c`::migration downtime thread exiting
Thread-2430::DEBUG::2012-01-23 04:36:22,187::vm::898::vm.Vm::(setDownStatus) vmId=`75f8b814-8a85-4bb0-a428-523e0ec6875c`::Changed state to Down: Migration succeeded

Notes:
- We need a way to avoid such races (add locking? see the sketch below).
- We need to define a point past which migrateCancel returns an error, saying: at this point the migration can no longer be aborted.

Logs attached.
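A minimal sketch of the locking idea from the notes, in Python. All names here (MigrationSource, the STATE_* constants, _finish) are hypothetical and not taken from the vdsm code base; this only illustrates how a single lock can serialize migrateCancel against the completion path, so a canceled migration is never reported as "Migration succeeded":

import threading

STATE_STARTED, STATE_RUNNING, STATE_COMPLETED, STATE_ABORTED = range(4)

class MigrationSource(object):
    def __init__(self):
        self._lock = threading.Lock()
        self._state = STATE_STARTED

    def cancel(self):
        # Called by the migrateCancel verb.
        with self._lock:
            if self._state == STATE_COMPLETED:
                # Past the point of no return: refuse with an error
                # instead of racing with the completion path.
                raise RuntimeError('migration already completed, cannot abort')
            self._state = STATE_ABORTED

    def _finish(self):
        # Called by the migration thread when libvirt reports success.
        with self._lock:
            if self._state == STATE_ABORTED:
                # cancel() won the race: do not set the VM Down with
                # "Migration succeeded", so the engine has no reason
                # to call destroy() on a still-running VM.
                return False
            self._state = STATE_COMPLETED
            return True

With something like this, getVmStats would keep reporting the VM as Up after a successful cancel, rather than Down with "Migration succeeded".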
Created attachment 556957 [details] vdsm log
Shahar, please see if this is the documented problem from this FIXME in the code:

# FIXME: there still a race here with libvirt,
# if we call stop() and libvirt migrateToURI2 didn't start
# we may return migration stop but it will start at libvirt
# side
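For reference, a hypothetical sketch of how that window could be narrowed. migrateToURI2() and abortJob() are real libvirt-python calls; everything else (the class, _aborted, _lock, _duri, _muri) is invented for illustration and is not the actual vdsm code:

import threading
import libvirt

class MigrationSourceThread(object):
    def __init__(self, dom, duri, muri):
        self._dom = dom      # libvirt.virDomain of the migrating VM
        self._duri = duri    # destination connection URI
        self._muri = muri    # migration URI
        self._lock = threading.Lock()
        self._aborted = False

    def cancel(self):
        with self._lock:
            self._aborted = True
        try:
            # abortJob() only works once the migration job exists in
            # libvirt; before that it fails -- exactly the window the
            # FIXME describes.
            self._dom.abortJob()
        except libvirt.libvirtError:
            pass  # no active job yet; run() will honor self._aborted

    def run(self):
        with self._lock:
            if self._aborted:
                # cancel() arrived before libvirt was involved:
                # never start the job at all.
                raise RuntimeError('migration canceled before start')
        # migrateToURI2() blocks for the whole migration, so it must be
        # called outside the lock; abortJob() can now interrupt it.
        # (dxml/dname None, flags/bandwidth 0, for brevity.)
        self._dom.migrateToURI2(self._duri, self._muri, None, 0, None, 0)

A small window still remains between run() releasing the lock and libvirt creating the job; the completion-path check sketched in the description covers that case by refusing to report success for an aborted migration.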
It looks like vdsm moves the VM status to 'down', and when the engine sees the VM in status 'down' it calls destroy(). This case happens when starting a migration on the source while running an endless loop from vdsClient to cancel the migration:

# while true; do vdsClient -s 0 migrateCancel <vmid>; done
Patch sent: http://gerrit.ovirt.org/#/c/2533/
Please check if this is still relevant.
(In reply to comment #5)
> please check if still relevant

No capacity to test the flow again.
A different, but perhaps related bug: 867439