Bug 1023131
| Field | Value |
|---|---|
| Summary | DestroyVDSCommand called after CancelMigrateVDSCommand failure when attempting to cancel multiple live migrations at a time |
| Product | Red Hat Enterprise Virtualization Manager |
| Component | ovirt-engine |
| Version | 3.2.0 |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | urgent |
| Keywords | Triaged, ZStream |
| Hardware | All |
| OS | Linux |
| Whiteboard | virt |
| Target Milestone | --- |
| Target Release | 3.3.0 |
| Fixed In Version | is25 |
| Reporter | Julio Entrena Perez <jentrena> |
| Assignee | Vinzenz Feenstra [evilissimo] <vfeenstr> |
| QA Contact | meital avital <mavital> |
| Docs Contact | |
| CC | acanan, acathrow, flo_bugzilla, hchiramm, iheim, istein, jentrena, juzhang, lpeer, lyarwood, mavital, michal.skrivanek, rgolan, Rhev-m-bugs, sherold, vfeenstr, yeylon |
| Doc Type | Bug Fix |
| Doc Text | Previously, after attempting to cancel multiple live migrations, some virtual machines were killed. With this fix, when a migration is cancelled, libvirt raises an error that prevents the operation from proceeding, which also avoids calling the destination VDSM to create the virtual machine instance. |
| Story Points | --- |
| Clone Of | |
| Clones | 1033153 (view as bug list) |
| Environment | |
| Type | Bug |
| Last Closed | 2014-01-21 17:38:14 UTC |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | |
| Bug Blocks | 1033153, 1038284 |
| Attachments | See the comments below. |
Description (Julio Entrena Perez, 2013-10-24 17:03:36 UTC)
Two hosts:

- rhevh1 dec9a48a-6ad4-4ee3-b248-ae4fa51c1d05
- rhevh2 02427f0e-a117-11e2-b83d-0010182d03a8

Five guests:

- mc1 a1226783-b858-4c08-a333-16fd59e42c36
- mc2 96d065d6-61f8-4c70-ac8b-b76553e498b5
- mc3 c291a3e9-9952-4162-822d-348d041e066d
- mc4 1d661d96-422a-45b0-a43f-f940900a8339
- rhel5a 69b87dab-b675-4efd-8075-d3fc312b3b2c

On 24/10/2013 at 16:17 all 5 guests are requested to live migrate from rhevh1 to rhevh2. Shortly afterwards all five live migrations are requested to be cancelled. Guests mc2 and mc3 survive on host rhevh1 as expected. Guests mc1, mc4 and rhel5a are killed.

Created attachment 815855 [details]
snip of engine.log for mc1 guest (killed)
Created attachment 815856 [details]
snip of engine.log for mc2 guest (not killed)
Created attachment 815857 [details]
snip of engine.log for mc3 guest (not killed)
Created attachment 815858 [details]
snip of engine.log for mc4 guest (killed)
Created attachment 815859 [details]
snip of engine.log for rhel5a guest (killed)
qemu kills the VM on the source host rhevh1 during migration, due to a network interface related error:

### libvirt.log

```
2013-10-24 15:18:11.983+0000: 2392: debug : qemuMonitorIOProcess:354 : QEMU_MONITOR_IO_PROCESS: mon=0x7f7a780cf520 buf={"timestamp": {"seconds": 1382627891, "microseconds": 983395}, "event": "SHUTDOWN"} len=85
2013-10-24 15:18:11.983+0000: 2392: debug : qemuMonitorEmitShutdown:988 : mon=0x7f7a780cf520
2013-10-24 15:18:11.983+0000: 2392: debug : qemuProcessHandleShutdown:654 : vm=0x7f7a780d5ce0
2013-10-24 15:18:12.367+0000: 2404: debug : qemuDomainObjBeginJobInternal:808 : Starting job: destroy (async=none)
2013-10-24 15:18:12.407+0000: 2404: debug : qemuProcessStop:4193 : Shutting down VM 'mc1' pid=11077 flags=0
2013-10-24 15:18:12.408+0000: 2404: error : virNWFilterDHCPSnoopEnd:2131 : internal error ifname "vnet1" not in key map
2013-10-24 15:18:12.458+0000: 2404: error : virNetDevGetIndex:653 : Unable to get index for interface vnet1: No such device
```

VDSM on the source sets the VM to Down, so the engine sends a destroy. VDSM on the destination sees no update on the socket for 5 minutes and also sets the VM to Down, so the engine sends the second destroy.

This is a VDSM bug: the cancelMigrate happened before the Migrate happened, which leaves the boolean flag _migrationCanceledEvt set to True, so migrateToURI2 is never called. From vm.py:

```python
if not self._migrationCanceledEvt:
    self._vm._dom.migrateToURI2(
```

But the finally clause is clearing the VM, which also clears the interfaces and sets the status to Down:

```python
startUnderlyingMigration()
...
finally:
    t.cancel()
    if MigrationMonitorThread._MIGRATION_MONITOR_INTERVAL:
        self._monitorThread.stop()
```

t.cancel() sets the _stop event, which shuts down the VM.

Correction: it is the finishSuccessfully method which is being called; it sets the VM status to Down, and then the backend sends a destroy as part of a regular migration flow.

Change merged to u/s master as http://gerrit.ovirt.org/gitweb?p=vdsm.git;a=commit;h=4b369364084d662460967029fc0d64cb60884ed7
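To make the race easier to follow, below is a minimal, self-contained Python sketch of the flow described above. It is not the vdsm code or the merged patch: the classes, the stub domain, and the MigrationCancelledError type are invented for illustration, and only _migrationCanceledEvt, migrateToURI2 and the "finish successfully" cleanup mirror the excerpts quoted above. The point it demonstrates is that the buggy flow runs the "migration finished" cleanup even though migrateToURI2 was never called, while turning an early cancel into an error keeps the VM Up on the source and never asks the destination to create it.

```python
import threading


class MigrationCancelledError(Exception):
    """Hypothetical error type for this sketch only; the real patch lets the
    libvirt error surface instead of defining a new exception."""


class DomainStub:
    """Stands in for the libvirt domain object; prints instead of migrating."""

    def migrateToURI2(self, duri):
        print("migrateToURI2 called, migrating to", duri)


class SourceThreadSketch:
    """Toy model of vdsm's source migration thread; not the real vm.py code."""

    def __init__(self, dom):
        self._dom = dom
        self._migrationCanceledEvt = threading.Event()  # set by cancel()

    def cancel(self):
        # What CancelMigrateVDSCommand ultimately triggers on the source host.
        self._migrationCanceledEvt.set()

    def run_buggy(self):
        # Pre-fix flow: if the cancel won the race, migrateToURI2() is silently
        # skipped, but the cleanup still runs the "migration finished" path,
        # reporting the VM as Down, so the engine destroys it.
        try:
            if not self._migrationCanceledEvt.is_set():
                self._dom.migrateToURI2("qemu+tls://destination/system")
        finally:
            self._finish_successfully()

    def run_fixed(self):
        # Post-fix flow (sketch): an early cancel becomes an error, the
        # "finished" cleanup never runs, the VM stays Up on the source and
        # the destination is never asked to create it.
        try:
            if self._migrationCanceledEvt.is_set():
                raise MigrationCancelledError("cancelled before migration started")
            self._dom.migrateToURI2("qemu+tls://destination/system")
            self._finish_successfully()
        except MigrationCancelledError as exc:
            self._report_failure(exc)

    def _finish_successfully(self):
        print("VM status -> Down; engine will send DestroyVDSCommand")

    def _report_failure(self, exc):
        print("migration aborted (%s); VM stays Up on the source" % exc)


if __name__ == "__main__":
    thread = SourceThreadSketch(DomainStub())
    thread.cancel()      # cancelMigrate arrives before the migration starts
    thread.run_buggy()   # reproduces the reported behaviour
    thread.run_fixed()   # the VM survives the cancelled migration
```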
Verified on is27, by the flow in the description. None of the guests were killed.

engine.log for a VM for which migration was cancelled:

```
2013-12-16 17:00:44,337 INFO  [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (pool-4-thread-49) [4c001553] Candidate host silver-vdsc.qa.lab.tlv.redhat.com (5f9f30ff-9bfd-449d-97eb-6e8d6e0d7a02) was filtered out by VAR__FILTERTYPE__INTERNAL filter Memory (correlation id: 4c001553)
2013-12-16 17:00:44,339 ERROR [org.ovirt.engine.core.bll.MigrateVmCommand] (pool-4-thread-49) [4c001553] Command org.ovirt.engine.core.bll.MigrateVmCommand throw Vdc Bll exception. With error message VdcBLLException: RESOURCE_MANAGER_VDS_NOT_FOUND (Failed with error RESOURCE_MANAGER_VDS_NOT_FOUND and code 5004)
2013-12-16 17:00:44,350 ERROR [org.ovirt.engine.core.bll.MigrateVmCommand] (pool-4-thread-49) [4c001553] Transaction rolled-back for command: org.ovirt.engine.core.bll.MigrateVmCommand.
2013-12-16 17:00:44,373 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-4-thread-49) [4c001553] Correlation ID: 4c001553, Job ID: 0846178a-1420-4b0c-9a0b-885e96bce384, Call Stack: null, Custom Event ID: -1, Message: Migration failed (VM: lin3, Source: cyan-vdse.qa.lab.tlv.redhat.com, Destination: <UNKNOWN>).
```

(In reply to Ilanit Stein from comment #15)
> Verified on is27, by the flow in the description. None of the guests were killed.
>
> engine.log for a VM for which migration was cancelled:
>
> 2013-12-16 17:00:44,337 INFO [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (pool-4-thread-49) [4c001553] Candidate host silver-vdsc.qa.lab.tlv.redhat.com (5f9f30ff-9bfd-449d-97eb-6e8d6e0d7a02) was filtered out by VAR__FILTERTYPE__INTERNAL filter Memory (correlation id: 4c001553)
> 2013-12-16 17:00:44,339 ERROR [org.ovirt.engine.core.bll.MigrateVmCommand] (pool-4-thread-49) [4c001553] Command org.ovirt.engine.core.bll.MigrateVmCommand throw Vdc Bll exception. With error message VdcBLLException: RESOURCE_MANAGER_VDS_NOT_FOUND (Failed with error RESOURCE_MANAGER_VDS_NOT_FOUND and code 5004)
> 2013-12-16 17:00:44,350 ERROR [org.ovirt.engine.core.bll.MigrateVmCommand] (pool-4-thread-49) [4c001553] Transaction rolled-back for command: org.ovirt.engine.core.bll.MigrateVmCommand.
> 2013-12-16 17:00:44,373 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-4-thread-49) [4c001553] Correlation ID: 4c001553, Job ID: 0846178a-1420-4b0c-9a0b-885e96bce384, Call Stack: null, Custom Event ID: -1, Message: Migration failed (VM: lin3, Source: cyan-vdse.qa.lab.tlv.redhat.com, Destination: <UNKNOWN>).

This doesn't look good at all. The migrate command was never called:

> org.ovirt.engine.core.bll.MigrateVmCommand throw Vdc Bll exception. With error message VdcBLLException: RESOURCE_MANAGER_VDS_NOT_FOUND (Failed with error RESOURCE_MANAGER_VDS_NOT_FOUND and code 5004)

This needs verification again.

I tried to verify this one:

1. 5 VMs running
2. Migrate all
3. While the migrations are in progress, cancel them

3 VMs migrated, 2 didn't. From engine log:

```
2014-01-02 17:22:45,757 ERROR [org.ovirt.engine.core.bll.CancelMigrateVmCommand] (pool-5-thread-40) [64d51a89] Command org.ovirt.engine.core.bll.CancelMigrateVmCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Cancel migration has failed. Please try again in a few moments and track the VM's event log for details (Failed with error MIGRATION_CANCEL_ERROR_NO_VM and code 5100)
```

All VMs are up (both engine and qemu processes). In the audit log I can see a message that one of the VMs is down, although all qemu processes are up. I am not sure if we are good or not (no VMs are down, but there is an error in the log and only part of the VMs migrated).

Relevant hosts are camel-vdsb and camel-vdsc.

Please advise so I will reopen/verify.

Created attachment 844597 [details]
logs
Verified on is30.

Created attachment 847101 [details]
engine log for verification on is30
migration started & cancelled @ 14:13 Jan 8
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0038.html

(In reply to Aharon Canan from comment #17)
> I tried to verify this one:
>
> 1. 5 VMs running
> 2. Migrate all
> 3. While the migrations are in progress, cancel them
>
> 3 VMs migrated, 2 didn't. From engine log:
>
> 2014-01-02 17:22:45,757 ERROR [org.ovirt.engine.core.bll.CancelMigrateVmCommand] (pool-5-thread-40) [64d51a89] Command org.ovirt.engine.core.bll.CancelMigrateVmCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Cancel migration has failed. Please try again in a few moments and track the VM's event log for details (Failed with error MIGRATION_CANCEL_ERROR_NO_VM and code 5100)

Chances are that those 2 VMs were still waiting in the queue for migration, so the cancel migration action didn't find anything to cancel.

> All VMs are up (both engine and qemu processes). In the audit log I can see a message that one of the VMs is down, although all qemu processes are up. I am not sure if we are good or not (no VMs are down, but there is an error in the log and only part of the VMs migrated).
>
> Relevant hosts are camel-vdsb and camel-vdsc.
>
> Please advise so I will reopen/verify.
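To illustrate the explanation in the reply above, here is a toy Python sketch of why cancelling a migration that has not actually started can fail with MIGRATION_CANCEL_ERROR_NO_VM. Everything in it (the queue, the concurrency limit, the function names) is an assumption made for illustration; it is not engine or vdsm code.

```python
from collections import deque

# Purely illustrative model; the concurrency limit is an assumption.
MAX_CONCURRENT_MIGRATIONS = 3

running = set()    # migrations that have actually started on the host
queued = deque()   # migrations still waiting for a free slot


def request_migration(vm):
    """Start the migration if a slot is free, otherwise queue it."""
    if len(running) < MAX_CONCURRENT_MIGRATIONS:
        running.add(vm)
    else:
        queued.append(vm)


def cancel_migration(vm):
    """Cancel an in-flight migration; a queued one has nothing to cancel yet."""
    if vm in running:
        running.remove(vm)
        return "migration cancelled"
    # Nothing is in flight for this VM, which is the
    # MIGRATION_CANCEL_ERROR_NO_VM case seen in comment 17.
    return "MIGRATION_CANCEL_ERROR_NO_VM"


for vm in ["vm1", "vm2", "vm3", "vm4", "vm5"]:
    request_migration(vm)

print(cancel_migration("vm1"))  # started, so the cancel succeeds
print(cancel_migration("vm5"))  # still queued, so there is nothing to cancel
```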