1566699 – live-migration never hits "Migration operation thread has finished" and remains paused on both source and destination

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-nova
Sub Component:
Version:	12.0 (Pike)
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	OSP DFG:Compute
QA Contact:	OSP DFG:Compute
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-04-12 19:58 UTC by David Vallee Delisle
Modified:	2023-03-21 18:47 UTC
CC List:	14 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-04-26 08:04:12 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	OSP-8941	0	None	None	None	2022-08-09 10:57:15 UTC

Description David Vallee Delisle 2018-04-12 19:58:25 UTC

Description of problem:
When migrating ~20 VMs from cmp0 to cmp1, and then on to cmp2, roughly 10% of the migrations fail on most runs. When a migration fails, the instance is left paused on both the source and the destination compute node (for example, cmp1 and cmp2).

This is the exact same VM, with no load on the guest.

We increased max_workers and max_requests from the default of 20 to 25 and disabled SELinux on both the source and destination compute nodes.

I compared the nova-compute logs from successful and failed attempts. Here are my observations:

- When plugging the VIF, cleaned was set to False on the successful migration and True on the failed one.

- After running for 70 seconds, the successful migration logs VM Paused (Lifecycle Event) on the source host, followed by "Migration operation has completed".

- After running for 140 seconds, the failed migration logs the same VM Paused message on the source host and the same "Migration operation has completed" message as the successful one.

- On both migrations, we see "_post_live_migration() is started"

- On success, this is immediately followed by "Migration operation thread has finished _live_migration_operation" and by "Migration operation thread notification thread_finished". The VM is then resumed on the destination host and stopped on the source host, as expected.

- On failure, we hit "_post_live_migration() is started", then nothing for 52 seconds apart from regular nova calls, and then "Calling driver.unfilter_instance from _post_live_migration _post_live_migration".

- On failure, we never hit "Migration operation thread has finished _live_migration_operation" or "Migration operation thread notification thread_finished".

- In the failed scenario, after driver.unfilter_instance, we see the VIF data. Again it shows clean_attempts=1 and cleaned=True, as opposed to clean_attempts=0 for the successful migration.

- On both the failed and successful migrations, we end up hitting "Post operation of migration started".
- On both migrations, we get "Deleting instance files" on the source host.

- On the failed migration, we end up getting recurring messages like this:
~~~
Periodic task is updating the host stat, it is trying to get disk instance-0000079c, but disk file was removed by concurrent operations such as resize.: OSError: [Errno 2] No such file or directory: '/var/lib/nova/instances/f025b152-12bd-418e-859e-0fcc399c08b4/disk.config'
~~~
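
The log comparison above can be automated roughly as follows. This is a minimal sketch, not part of nova: it assumes each relevant nova-compute line carries an "[instance: <uuid>]" tag, and it treats a migration as stuck when "_post_live_migration() is started" appears without a later "Migration operation thread has finished" for the same instance. The function name classify_migrations is hypothetical; the marker strings are the ones quoted in this report.

```python
import re
from collections import defaultdict

# Marker strings quoted in the observations above.
POST_STARTED = "_post_live_migration() is started"
THREAD_FINISHED = "Migration operation thread has finished"

# Assumption: relevant log lines are tagged with "[instance: <uuid>]".
INSTANCE_RE = re.compile(r"\[instance: ([0-9a-f-]{36})\]")


def classify_migrations(lines):
    """Map instance UUID -> 'completed' or 'stuck'.

    Hypothetical helper: an instance is 'stuck' if its post-migration
    step started but the operation thread never reported finishing.
    """
    seen = defaultdict(set)
    for line in lines:
        match = INSTANCE_RE.search(line)
        if not match:
            continue
        uuid = match.group(1)
        if POST_STARTED in line:
            seen[uuid].add("post_started")
        if THREAD_FINISHED in line:
            seen[uuid].add("thread_finished")

    result = {}
    for uuid, markers in seen.items():
        if "post_started" not in markers:
            continue  # migration never reached the post step in these lines
        result[uuid] = "completed" if "thread_finished" in markers else "stuck"
    return result
```

Any iterable of lines works as input, so an open nova-compute log file can be passed directly. The check only looks at the two markers, so it will also flag a migration that is still in progress when the log was captured.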

Version-Release number of selected component (if applicable):
openstack-nova-compute-16.0.2-9.el7ost.noarch
qemu-kvm-rhev-2.10.0-21.el7_5.1.x86_64
libvirt-3.2.0-14.el7_4.7.x86_64

How reproducible:
Most of the time

Comment 8 Lee Yarwood 2018-04-26 08:04:12 UTC

Closing this out as CURRENTRELEASE. Please feel free to reopen if you encounter the issue again with the latest containers.