Bug 1023131
| Field | Value |
|---|---|
| Summary | DestroyVDSCommand called after CancelMigrateVDSCommand failure when attempting to cancel multiple live migrations at a time |
| Product | Red Hat Enterprise Virtualization Manager |
| Component | ovirt-engine |
| Version | 3.2.0 |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | urgent |
| Keywords | Triaged, ZStream |
| Hardware | All |
| OS | Linux |
| Whiteboard | virt |
| Target Milestone | --- |
| Target Release | 3.3.0 |
| Fixed In Version | is25 |
| Reporter | Julio Entrena Perez <jentrena> |
| Assignee | Vinzenz Feenstra [evilissimo] <vfeenstr> |
| QA Contact | meital avital <mavital> |
| Docs Contact | |
| CC | acanan, acathrow, flo_bugzilla, hchiramm, iheim, istein, jentrena, juzhang, lpeer, lyarwood, mavital, michal.skrivanek, rgolan, Rhev-m-bugs, sherold, vfeenstr, yeylon |
| Doc Type | Bug Fix |
| Doc Text | Previously, after attempting to cancel multiple live migrations, some virtual machines were killed. With this fix, when a migration is cancelled, libvirt raises an error that prevents the operation from proceeding, which also avoids calling the destination VDSM to create the virtual machine instance. |
| Story Points | --- |
| Clone Of | |
| Clones | 1033153 (view as bug list) |
| Environment | |
| Type | Bug |
| Last Closed | 2014-01-21 17:38:14 UTC |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | |
| Bug Blocks | 1033153, 1038284 |
| Attachments | See the comments below. |
Description (Julio Entrena Perez, 2013-10-24 17:03:36 UTC)
Two hosts:

- rhevh1 dec9a48a-6ad4-4ee3-b248-ae4fa51c1d05
- rhevh2 02427f0e-a117-11e2-b83d-0010182d03a8

Five guests:

- mc1 a1226783-b858-4c08-a333-16fd59e42c36
- mc2 96d065d6-61f8-4c70-ac8b-b76553e498b5
- mc3 c291a3e9-9952-4162-822d-348d041e066d
- mc4 1d661d96-422a-45b0-a43f-f940900a8339
- rhel5a 69b87dab-b675-4efd-8075-d3fc312b3b2c

On 24/10/2013 at 16:17 all 5 guests are requested to live migrate from rhevh1 to rhevh2. Shortly afterwards all five live migrations are requested to be cancelled. Guests mc2 and mc3 survive on host rhevh1 as expected. Guests mc1, mc4 and rhel5a are killed.

Created attachment 815855 [details]
snip of engine.log for mc1 guest (killed)
Created attachment 815856 [details]
snip of engine.log for mc2 guest (not killed)
Created attachment 815857 [details]
snip of engine.log for mc3 guest (not killed)
Created attachment 815858 [details]
snip of engine.log for mc4 guest (killed)
Created attachment 815859 [details]
snip of engine.log for rhel5a guest (killed)
qemu kills the VM on the source host rhevh1 during migration, due to a network interface related error:

### libvirt.log

```
2013-10-24 15:18:11.983+0000: 2392: debug : qemuMonitorIOProcess:354 : QEMU_MONITOR_IO_PROCESS: mon=0x7f7a780cf520 buf={"timestamp": {"seconds": 1382627891, "microseconds": 983395}, "event": "SHUTDOWN"} len=85
2013-10-24 15:18:11.983+0000: 2392: debug : qemuMonitorEmitShutdown:988 : mon=0x7f7a780cf520
2013-10-24 15:18:11.983+0000: 2392: debug : qemuProcessHandleShutdown:654 : vm=0x7f7a780d5ce0
2013-10-24 15:18:12.367+0000: 2404: debug : qemuDomainObjBeginJobInternal:808 : Starting job: destroy (async=none)
2013-10-24 15:18:12.407+0000: 2404: debug : qemuProcessStop:4193 : Shutting down VM 'mc1' pid=11077 flags=0
2013-10-24 15:18:12.408+0000: 2404: error : virNWFilterDHCPSnoopEnd:2131 : internal error ifname "vnet1" not in key map
2013-10-24 15:18:12.458+0000: 2404: error : virNetDevGetIndex:653 : Unable to get index for interface vnet1: No such device
```

VDSM on the source sets the VM to Down, so the engine sends a destroy. VDSM on the destination sees no update on the socket for 5 minutes and also sets the VM to Down, so the engine sends the second destroy.

This is a VDSM bug: the cancelMigrate happened before the Migrate happened, which leaves the boolean flag _migrationCanceledEvt set to True, so migrateToURI2 is never called. From vm.py:

```python
if not self._migrationCanceledEvt:
    self._vm._dom.migrateToURI2(
```

But the finally clause is clearing the VM, which also clears the interfaces and sets the status to Down:

```python
startUnderlyingMigration()
...
finally:
    t.cancel()
    if MigrationMonitorThread._MIGRATION_MONITOR_INTERVAL:
        self._monitorThread.stop()
```

t.cancel() sets the _stop event, which shuts down the VM.

Correction: it is the finishSuccessfully method which is being called; it sets the VM status to Down, and then the backend sends a destroy as part of a regular migration flow.

Change merged to u/s master as http://gerrit.ovirt.org/gitweb?p=vdsm.git;a=commit;h=4b369364084d662460967029fc0d64cb60884ed7
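To make the race easier to follow, below is a minimal, self-contained Python sketch of the flow described above. It is not the vdsm code or the merged patch: the classes, the stub domain, and the MigrationCancelledError type are invented for illustration, and only _migrationCanceledEvt, migrateToURI2 and the "finish successfully" cleanup mirror the excerpts quoted above. The point it demonstrates is that the buggy flow runs the "migration finished" cleanup even though migrateToURI2 was never called, while turning an early cancel into an error keeps the VM Up on the source and never asks the destination to create it.

```python
import threading


class MigrationCancelledError(Exception):
    """Hypothetical error type for this sketch only; the real patch lets the
    libvirt error surface instead of defining a new exception."""


class DomainStub:
    """Stands in for the libvirt domain object; prints instead of migrating."""

    def migrateToURI2(self, duri):
        print("migrateToURI2 called, migrating to", duri)


class SourceThreadSketch:
    """Toy model of vdsm's source migration thread; not the real vm.py code."""

    def __init__(self, dom):
        self._dom = dom
        self._migrationCanceledEvt = threading.Event()  # set by cancel()

    def cancel(self):
        # What CancelMigrateVDSCommand ultimately triggers on the source host.
        self._migrationCanceledEvt.set()

    def run_buggy(self):
        # Pre-fix flow: if the cancel won the race, migrateToURI2() is silently
        # skipped, but the cleanup still runs the "migration finished" path,
        # reporting the VM as Down, so the engine destroys it.
        try:
            if not self._migrationCanceledEvt.is_set():
                self._dom.migrateToURI2("qemu+tls://destination/system")
        finally:
            self._finish_successfully()

    def run_fixed(self):
        # Post-fix flow (sketch): an early cancel becomes an error, the
        # "finished" cleanup never runs, the VM stays Up on the source and
        # the destination is never asked to create it.
        try:
            if self._migrationCanceledEvt.is_set():
                raise MigrationCancelledError("cancelled before migration started")
            self._dom.migrateToURI2("qemu+tls://destination/system")
            self._finish_successfully()
        except MigrationCancelledError as exc:
            self._report_failure(exc)

    def _finish_successfully(self):
        print("VM status -> Down; engine will send DestroyVDSCommand")

    def _report_failure(self, exc):
        print("migration aborted (%s); VM stays Up on the source" % exc)


if __name__ == "__main__":
    thread = SourceThreadSketch(DomainStub())
    thread.cancel()      # cancelMigrate arrives before the migration starts
    thread.run_buggy()   # reproduces the reported behaviour
    thread.run_fixed()   # the VM survives the cancelled migration
```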
Verified on is27, by the flow in the description. None of the guests were killed.

engine.log for a VM for which migration was cancelled:

```
2013-12-16 17:00:44,337 INFO  [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (pool-4-thread-49) [4c001553] Candidate host silver-vdsc.qa.lab.tlv.redhat.com (5f9f30ff-9bfd-449d-97eb-6e8d6e0d7a02) was filtered out by VAR__FILTERTYPE__INTERNAL filter Memory (correlation id: 4c001553)
2013-12-16 17:00:44,339 ERROR [org.ovirt.engine.core.bll.MigrateVmCommand] (pool-4-thread-49) [4c001553] Command org.ovirt.engine.core.bll.MigrateVmCommand throw Vdc Bll exception. With error message VdcBLLException: RESOURCE_MANAGER_VDS_NOT_FOUND (Failed with error RESOURCE_MANAGER_VDS_NOT_FOUND and code 5004)
2013-12-16 17:00:44,350 ERROR [org.ovirt.engine.core.bll.MigrateVmCommand] (pool-4-thread-49) [4c001553] Transaction rolled-back for command: org.ovirt.engine.core.bll.MigrateVmCommand.
2013-12-16 17:00:44,373 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-4-thread-49) [4c001553] Correlation ID: 4c001553, Job ID: 0846178a-1420-4b0c-9a0b-885e96bce384, Call Stack: null, Custom Event ID: -1, Message: Migration failed (VM: lin3, Source: cyan-vdse.qa.lab.tlv.redhat.com, Destination: <UNKNOWN>).
```

(In reply to Ilanit Stein from comment #15)
> Verified on is27, by the flow in the description. None of the guests were killed.
>
> engine.log for a VM for which migration was cancelled:
>
> 2013-12-16 17:00:44,337 INFO [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (pool-4-thread-49) [4c001553] Candidate host silver-vdsc.qa.lab.tlv.redhat.com (5f9f30ff-9bfd-449d-97eb-6e8d6e0d7a02) was filtered out by VAR__FILTERTYPE__INTERNAL filter Memory (correlation id: 4c001553)
> 2013-12-16 17:00:44,339 ERROR [org.ovirt.engine.core.bll.MigrateVmCommand] (pool-4-thread-49) [4c001553] Command org.ovirt.engine.core.bll.MigrateVmCommand throw Vdc Bll exception. With error message VdcBLLException: RESOURCE_MANAGER_VDS_NOT_FOUND (Failed with error RESOURCE_MANAGER_VDS_NOT_FOUND and code 5004)
> 2013-12-16 17:00:44,350 ERROR [org.ovirt.engine.core.bll.MigrateVmCommand] (pool-4-thread-49) [4c001553] Transaction rolled-back for command: org.ovirt.engine.core.bll.MigrateVmCommand.
> 2013-12-16 17:00:44,373 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-4-thread-49) [4c001553] Correlation ID: 4c001553, Job ID: 0846178a-1420-4b0c-9a0b-885e96bce384, Call Stack: null, Custom Event ID: -1, Message: Migration failed (VM: lin3, Source: cyan-vdse.qa.lab.tlv.redhat.com, Destination: <UNKNOWN>).

This doesn't look good at all. The migrate command was never called:

> org.ovirt.engine.core.bll.MigrateVmCommand throw Vdc Bll exception. With error message VdcBLLException: RESOURCE_MANAGER_VDS_NOT_FOUND (Failed with error RESOURCE_MANAGER_VDS_NOT_FOUND and code 5004)

This needs verification again.

I tried to verify this one:

1. 5 VMs running
2. Migrate all
3. While the migrations are in progress, cancel them

3 VMs migrated, 2 didn't. From engine log:

```
2014-01-02 17:22:45,757 ERROR [org.ovirt.engine.core.bll.CancelMigrateVmCommand] (pool-5-thread-40) [64d51a89] Command org.ovirt.engine.core.bll.CancelMigrateVmCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Cancel migration has failed. Please try again in a few moments and track the VM's event log for details (Failed with error MIGRATION_CANCEL_ERROR_NO_VM and code 5100)
```

All VMs are up (both engine and qemu processes). In the audit log I can see a message that one of the VMs is down, although all qemu processes are up. I am not sure if we are good or not (no VMs are down, but there is an error in the log and only part of the VMs migrated).

Relevant hosts are camel-vdsb and camel-vdsc.

Please advise so I will reopen/verify.

Created attachment 844597 [details]
logs
Verified on is30.

Created attachment 847101 [details]
engine log for verification on is30
migration started & cancelled @ 14:13 Jan 8
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0038.html

(In reply to Aharon Canan from comment #17)
> I tried to verify this one:
>
> 1. 5 VMs running
> 2. Migrate all
> 3. While the migrations are in progress, cancel them
>
> 3 VMs migrated, 2 didn't. From engine log:
>
> 2014-01-02 17:22:45,757 ERROR [org.ovirt.engine.core.bll.CancelMigrateVmCommand] (pool-5-thread-40) [64d51a89] Command org.ovirt.engine.core.bll.CancelMigrateVmCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Cancel migration has failed. Please try again in a few moments and track the VM's event log for details (Failed with error MIGRATION_CANCEL_ERROR_NO_VM and code 5100)

Chances are that those 2 VMs were still waiting in the queue for migration, so the cancel migration action didn't find anything to cancel.

> All VMs are up (both engine and qemu processes). In the audit log I can see a message that one of the VMs is down, although all qemu processes are up. I am not sure if we are good or not (no VMs are down, but there is an error in the log and only part of the VMs migrated).
>
> Relevant hosts are camel-vdsb and camel-vdsc.
>
> Please advise so I will reopen/verify.
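To illustrate the explanation in the reply above, here is a toy Python sketch of why cancelling a migration that has not actually started can fail with MIGRATION_CANCEL_ERROR_NO_VM. Everything in it (the queue, the concurrency limit, the function names) is an assumption made for illustration; it is not engine or vdsm code.

```python
from collections import deque

# Purely illustrative model; the concurrency limit is an assumption.
MAX_CONCURRENT_MIGRATIONS = 3

running = set()    # migrations that have actually started on the host
queued = deque()   # migrations still waiting for a free slot


def request_migration(vm):
    """Start the migration if a slot is free, otherwise queue it."""
    if len(running) < MAX_CONCURRENT_MIGRATIONS:
        running.add(vm)
    else:
        queued.append(vm)


def cancel_migration(vm):
    """Cancel an in-flight migration; a queued one has nothing to cancel yet."""
    if vm in running:
        running.remove(vm)
        return "migration cancelled"
    # Nothing is in flight for this VM, which is the
    # MIGRATION_CANCEL_ERROR_NO_VM case seen in comment 17.
    return "MIGRATION_CANCEL_ERROR_NO_VM"


for vm in ["vm1", "vm2", "vm3", "vm4", "vm5"]:
    request_migration(vm)

print(cancel_migration("vm1"))  # started, so the cancel succeeds
print(cancel_migration("vm5"))  # still queued, so there is nothing to cancel
```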