Bug 1566699

Summary: live-migration never hits "Migration operation thread has finished" and remains paused on both source and destination
Product: Red Hat OpenStack
Reporter: David Vallee Delisle <dvd>
Component: openstack-nova
Assignee: OSP DFG:Compute <osp-dfg-compute>
Status: CLOSED CURRENTRELEASE
QA Contact: OSP DFG:Compute <osp-dfg-compute>
Severity: high
Docs Contact:
Priority: high
Version: 12.0 (Pike)
CC: berrange, dasmith, dhill, dvd, eglynn, jhakimra, kchamart, lyarwood, marjones, sbauza, sferdjao, sgordon, srevivo, vromanso
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-04-26 08:04:12 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description David Vallee Delisle 2018-04-12 19:58:25 UTC
Description of problem:
When migrating ~20 VMs from cmp0 to cmp1, and then to cmp2, we see a failure rate of ~10% most of the time. When a migration fails, the instance is left paused on both the source and destination compute nodes (cmp1 and cmp2, for example).

This is the exact same VM, with no load on the guest.

We increased max_workers and max_requests from the default of 20 to 25, and we disabled SELinux on both the source and destination compute nodes.

When this happens, I compared the nova-compute logs from both successful and failed attempts and here are my observations:

- When plugging the VIF, cleaned was set to False on the successful migration and True on the failed one.

- After running for 70 seconds, on successful migration we see VM Paused (Lifecycle Event) on source host and we see "Migration operation has completed"

- After running for 140 seconds, on failed migration, we do see the same messages about VM Paused on source host and we do see the "Migration operation has completed" like on successful migration.

- On both migrations, we see "_post_live_migration() is started"

- On success, it's immediately followed by "Migration operation thread has finished _live_migration_operation" and by "Migration operation thread notification thread_finished", and VM is resumed on destination host and stopped on source host, as we could expect.

- On the failed migration, we hit "_post_live_migration() is started", then nothing for 52 seconds except regular nova calls, and then we hit "Calling driver.unfilter_instance from _post_live_migration _post_live_migration"

- We never hit "Migration operation thread has finished _live_migration_operation" or "Migration operation thread notification thread_finished"

- In the failed scenario, after the driver.unfilter_instance call, we see the VIF data. Again, we see clean_attempts=1 and cleaned=True, as opposed to clean_attempts=0 for the successful migration

- On both the failed and successful migrations, we end up hitting "Post operation of migration started"
- On both migrations, we get a "Deleting instance files" on the source host. 

- On the failed migration, we end up getting recurring messages like this:
~~~
Periodic task is updating the host stat, it is trying to get disk instance-0000079c, but disk file was removed by concurrent operations such as resize.: OSError: [Errno 2] No such file or directory: '/var/lib/nova/instances/f025b152-12bd-418e-859e-0fcc399c08b4/disk.config'
~~~
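
The log comparison described above can be partly automated. The following is a minimal sketch, not part of the original report, that checks a set of nova-compute log lines for the milestone messages quoted in the observations and reports which ones never appear. The function name missing_milestones and the sample log lines are hypothetical. Only the message strings come from this report, and real log lines carry timestamps, request IDs, and instance UUIDs that this sketch ignores.

```python
# Milestone messages quoted in this report, listed in the order the
# report mentions them for a successful live migration.
MILESTONES = [
    "Migration operation has completed",
    "_post_live_migration() is started",
    "Migration operation thread has finished",
    "Migration operation thread notification",
    "Post operation of migration started",
]


def missing_milestones(log_lines):
    """Return the milestone messages that appear in none of the log lines."""
    # Materialize first so that iterators (e.g. an open file) can be
    # scanned once per milestone without being exhausted.
    lines = list(log_lines)
    return [m for m in MILESTONES if not any(m in line for line in lines)]


# Hypothetical, simplified excerpt modeled on the failed migration above.
failed_log = [
    "DEBUG Migration operation has completed",
    "DEBUG _post_live_migration() is started",
    "DEBUG Calling driver.unfilter_instance from _post_live_migration",
    "DEBUG Post operation of migration started",
]

for message in missing_milestones(failed_log):
    print("never logged:", message)
```

In practice, the input would presumably be the source host's nova-compute log, pre-filtered to the lines for a single instance UUID and a single migration attempt. Otherwise a milestone logged for a different instance would mask a missing one.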

Version-Release number of selected component (if applicable):
openstack-nova-compute-16.0.2-9.el7ost.noarch
qemu-kvm-rhev-2.10.0-21.el7_5.1.x86_64
libvirt-3.2.0-14.el7_4.7.x86_64

How reproducible:
Most of the time

Comment 8 Lee Yarwood 2018-04-26 08:04:12 UTC
Closing this out as CURRENTRELEASE. Please feel free to reopen if you encounter the issue again with the latest containers.