Description of problem:
After attempting to cancel multiple live migrations, some guests are killed.

Version-Release number of selected component (if applicable):
rhevm-backend-3.2.3-0.43.el6ev
libvirt-0.10.2-18.el6_4.14
vdsm-4.10.2-25.1.el6ev

How reproducible:
Always.

Steps to Reproduce:
1. Start 5 live migrations.
2. Wait for the webadmin portal to display the hourglass for each guest.
3. Cancel the 5 live migrations.

Actual results:
3 of the guests are killed.

Expected results:
None of the guests are killed.

Additional info:
The engine calls DestroyVDSCommand twice, on both the source and destination hosts, after CancelMigrateVDS fails.
Two hosts:
rhevh1 dec9a48a-6ad4-4ee3-b248-ae4fa51c1d05
rhevh2 02427f0e-a117-11e2-b83d-0010182d03a8

Five guests:
mc1 a1226783-b858-4c08-a333-16fd59e42c36
mc2 96d065d6-61f8-4c70-ac8b-b76553e498b5
mc3 c291a3e9-9952-4162-822d-348d041e066d
mc4 1d661d96-422a-45b0-a43f-f940900a8339
rhel5a 69b87dab-b675-4efd-8075-d3fc312b3b2c

On 24/10/2013 at 16:17 all 5 guests were requested to live migrate from rhevh1 to rhevh2. Shortly afterwards, cancellation was requested for all five live migrations. Guests mc2 and mc3 survive on host rhevh1 as expected; guests mc1, mc4 and rhel5a are killed.
Created attachment 815855 [details] snip of engine.log for mc1 guest (killed)
Created attachment 815856 [details] snip of engine.log for mc2 guest (not killed)
Created attachment 815857 [details] snip of engine.log for mc3 guest (not killed)
Created attachment 815858 [details] snip of engine.log for mc4 guest (killed)
Created attachment 815859 [details] snip of engine.log for rhel5a guest (killed)
qemu kills the VM on the source host rhevh1 during migration, due to a network-interface-related error.

### libvirt.log
2013-10-24 15:18:11.983+0000: 2392: debug : qemuMonitorIOProcess:354 : QEMU_MONITOR_IO_PROCESS: mon=0x7f7a780cf520 buf={"timestamp": {"seconds": 1382627891, "microseconds": 983395}, "event": "SHUTDOWN"} len=85
2013-10-24 15:18:11.983+0000: 2392: debug : qemuMonitorEmitShutdown:988 : mon=0x7f7a780cf520
2013-10-24 15:18:11.983+0000: 2392: debug : qemuProcessHandleShutdown:654 : vm=0x7f7a780d5ce0
2013-10-24 15:18:12.367+0000: 2404: debug : qemuDomainObjBeginJobInternal:808 : Starting job: destroy (async=none)
2013-10-24 15:18:12.407+0000: 2404: debug : qemuProcessStop:4193 : Shutting down VM 'mc1' pid=11077 flags=0
2013-10-24 15:18:12.408+0000: 2404: error : virNWFilterDHCPSnoopEnd:2131 : internal error ifname "vnet1" not in key map
2013-10-24 15:18:12.458+0000: 2404: error : virNetDevGetIndex:653 : Unable to get index for interface vnet1: No such device

VDSM on the source sets the VM to Down, so the engine sends a destroy. VDSM on the destination sees no update on the socket for 5 minutes, so it also sets the VM to Down, and the engine sends the second destroy.
This is a VDSM bug: cancelMigrate happened before migrate happened, which leaves the boolean flag _migrationCanceledEvt set to True, so migrateToURI2 is never called. From vm.py:

    if not self._migrationCanceledEvt:
        self._vm._dom.migrateToURI2(

But the finally clause still cleans up the VM, which also clears the interfaces and sets the status to Down:

    startUnderlyingMigration()
    ...
    finally:
        t.cancel()
        if MigrationMonitorThread._MIGRATION_MONITOR_INTERVAL:
            self._monitorThread.stop()

t.cancel() sets the _stop event, which shuts down the VM.
Correction: it is the finishSuccessfully method which is being called; it sets the VM status to Down, and then the backend sends a destroy as part of the regular migration flow.
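The race described in the two comments above can be modeled with a minimal, self-contained sketch. All class and method names here are hypothetical stand-ins except _migrationCanceledEvt and migrateToURI2, which come from the vm.py snippet; this is an illustration of the failure mode, not VDSM's actual code:

```python
class FakeDom:
    """Stand-in for the libvirt domain; records whether migration ran."""
    def __init__(self):
        self.migrated = False

    def migrateToURI2(self, uri):
        self.migrated = True


class SourceThread:
    """Minimal model of the migration source thread (hypothetical names)."""
    def __init__(self, dom):
        self._dom = dom
        self._migrationCanceledEvt = False
        self.vm_status = "Up"

    def cancel(self):
        # cancelMigrate arriving before the migration thread runs.
        self._migrationCanceledEvt = True

    def run_buggy(self):
        # Buggy flow: the finally clause tears the VM down even though
        # migrateToURI2 was never called because the cancel flag was set.
        try:
            if not self._migrationCanceledEvt:
                self._dom.migrateToURI2("qemu+tls://rhevh2/system")
        finally:
            # finishSuccessfully-style cleanup runs unconditionally.
            self.vm_status = "Down"

    def run_fixed(self):
        # Fixed flow: only tear down when the migration actually happened,
        # leaving the VM running on the source after an early cancel.
        migrated = False
        try:
            if not self._migrationCanceledEvt:
                self._dom.migrateToURI2("qemu+tls://rhevh2/system")
                migrated = True
        finally:
            if migrated:
                self.vm_status = "Down"


vm = SourceThread(FakeDom())
vm.cancel()      # cancelMigrate lands before migrate
vm.run_buggy()
print(vm.vm_status)   # "Down" -> engine destroys a live guest

vm2 = SourceThread(FakeDom())
vm2.cancel()
vm2.run_fixed()
print(vm2.vm_status)  # "Up" -> guest survives on the source host
```

In the buggy path the cleanup is unconditional, so a cancel that wins the race still reports the VM as Down and triggers the engine's destroy; guarding the teardown on whether migration actually started avoids that.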
Change merged to u/s master as http://gerrit.ovirt.org/gitweb?p=vdsm.git;a=commit;h=4b369364084d662460967029fc0d64cb60884ed7
Verified on is27, by the flow in the description. None of the guests were killed.

engine.log for a VM for which migration was cancelled:

2013-12-16 17:00:44,337 INFO  [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (pool-4-thread-49) [4c001553] Candidate host silver-vdsc.qa.lab.tlv.redhat.com (5f9f30ff-9bfd-449d-97eb-6e8d6e0d7a02) was filtered out by VAR__FILTERTYPE__INTERNAL filter Memory (correlation id: 4c001553)
2013-12-16 17:00:44,339 ERROR [org.ovirt.engine.core.bll.MigrateVmCommand] (pool-4-thread-49) [4c001553] Command org.ovirt.engine.core.bll.MigrateVmCommand throw Vdc Bll exception. With error message VdcBLLException: RESOURCE_MANAGER_VDS_NOT_FOUND (Failed with error RESOURCE_MANAGER_VDS_NOT_FOUND and code 5004)
2013-12-16 17:00:44,350 ERROR [org.ovirt.engine.core.bll.MigrateVmCommand] (pool-4-thread-49) [4c001553] Transaction rolled-back for command: org.ovirt.engine.core.bll.MigrateVmCommand.
2013-12-16 17:00:44,373 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-4-thread-49) [4c001553] Correlation ID: 4c001553, Job ID: 0846178a-1420-4b0c-9a0b-885e96bce384, Call Stack: null, Custom Event ID: -1, Message: Migration failed (VM: lin3, Source: cyan-vdse.qa.lab.tlv.redhat.com, Destination: <UNKNOWN>).
(In reply to Ilanit Stein from comment #15)
> Verified on is27.
>
> By flow in description.
>
> None of the guests were killed.
>
> engine.log for a VM for which migration cancelled:
>
> 2013-12-16 17:00:44,337 INFO [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (pool-4-thread-49) [4c001553] Candidate host silver-vdsc.qa.lab.tlv.redhat.com (5f9f30ff-9bfd-449d-97eb-6e8d6e0d7a02) was filtered out by VAR__FILTERTYPE__INTERNAL filter Memory (correlation id: 4c001553)
> 2013-12-16 17:00:44,339 ERROR [org.ovirt.engine.core.bll.MigrateVmCommand] (pool-4-thread-49) [4c001553] Command org.ovirt.engine.core.bll.MigrateVmCommand throw Vdc Bll exception. With error message VdcBLLException: RESOURCE_MANAGER_VDS_NOT_FOUND (Failed with error RESOURCE_MANAGER_VDS_NOT_FOUND and code 5004)
> 2013-12-16 17:00:44,350 ERROR [org.ovirt.engine.core.bll.MigrateVmCommand] (pool-4-thread-49) [4c001553] Transaction rolled-back for command: org.ovirt.engine.core.bll.MigrateVmCommand.
> 2013-12-16 17:00:44,373 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-4-thread-49) [4c001553] Correlation ID: 4c001553, Job ID: 0846178a-1420-4b0c-9a0b-885e96bce384, Call Stack: null, Custom Event ID: -1, Message: Migration failed (VM: lin3, Source: cyan-vdse.qa.lab.tlv.redhat.com, Destination: <UNKNOWN>).

This doesn't look good at all. The migrate command was never called:

> org.ovirt.engine.core.bll.MigrateVmCommand throw Vdc Bll exception. With error message VdcBLLException: RESOURCE_MANAGER_VDS_NOT_FOUND (Failed with error RESOURCE_MANAGER_VDS_NOT_FOUND and code 5004)

This needs verification again.
I tried to verify this one:

1. 5 VMs running.
2. Migrate all.
3. While migrating, cancel the migrations.

3 VMs migrated, 2 didn't.

From the engine log:

2014-01-02 17:22:45,757 ERROR [org.ovirt.engine.core.bll.CancelMigrateVmCommand] (pool-5-thread-40) [64d51a89] Command org.ovirt.engine.core.bll.CancelMigrateVmCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Cancel migration has failed. Please try again in a few moments and track the VM's event log for details (Failed with error MIGRATION_CANCEL_ERROR_NO_VM and code 5100)

All VMs are up (both engine and qemu processes). In the audit log I can see a message that one of the VMs is down, although all qemu processes are up. I am not sure whether we are good or not (no VMs are down, but there is an error in the log and only part of the VMs migrated).

Relevant hosts are camel-vdsb and camel-vdsc.

Please advise so I can reopen/verify.
Created attachment 844597 [details] logs
Verified on is30.
Created attachment 847101 [details] engine log for verification on is30 migration started & cancelled @ 14:13 Jan 8
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2014-0038.html
(In reply to Aharon Canan from comment #17)
> I tried to verify this one -
>
> 1. 5 VMs running,
> 2. Migrate all
> 3. while migration cancel migration
>
> 3 VMs migrated, 2 didn't
> from engine log:
> 2014-01-02 17:22:45,757 ERROR [org.ovirt.engine.core.bll.CancelMigrateVmCommand] (pool-5-thread-40) [64d51a89] Command org.ovirt.engine.core.bll.CancelMigrateVmCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Cancel migration has failed. Please try again in a few moments and track the VM's event log for details (Failed with error MIGRATION_CANCEL_ERROR_NO_VM and code 5100)

Chances are that those 2 VMs were still waiting in the queue for migration, so the cancel migration action didn't find anything to cancel.

> all VMs are up (both engine and qemu processes)
> In audit I can see message the one of the VMs is down although all qemu processes are up
> I am not sure if we are good or not (as no VMs down but error in log and part of the VMs migrated)
>
> relevant hosts are camel-vdsb and camel-vdsc
>
> please advice so I will reopen/verify