Description of problem:
When migrating ~20 VMs from cmp0 to cmp1, and then to cmp2, we see a failure rate of ~10% most of the time. When a migration fails, the instance is left paused on both the source and destination compute nodes (for example, cmp1 and cmp2).
This is the exact same VM, with no load on the guest.
We increased max_workers and max_requests from the default of 20 to 25 and disabled SELinux on both the source and destination compute nodes.
I compared the nova-compute logs from both successful and failed attempts, and here are my observations:
- When plugging the VIF, cleaned was set to False on the successful migration and True on the failed one.
- After running for 70 seconds, on the successful migration we see VM Paused (Lifecycle Event) on the source host and we see "Migration operation has completed"
- After running for 140 seconds, on the failed migration, we see the same VM Paused message on the source host and the same "Migration operation has completed" as on the successful migration.
- On both migrations, we see "_post_live_migration() is started"
- On success, it's immediately followed by "Migration operation thread has finished _live_migration_operation" and by "Migration operation thread notification thread_finished", and the VM is resumed on the destination host and stopped on the source host, as expected.
- On failure, we hit "_post_live_migration() is started", then nothing for 52 seconds except regular nova calls, and then we hit "Calling driver.unfilter_instance from _post_live_migration _post_live_migration"
- We never hit "Migration operation thread has finished _live_migration_operation" or "Migration operation thread notification thread_finished"
- On the failed scenario, and after the driver.unfilter_instance, we see the vif data. Again, we see the clean_attempts=1 and cleaned=True, as opposed to clean_attempts=0 for successful migration
- On both the failed and successful migrations, we end up hitting "Post operation of migration started"
- On both migrations, we get a "Deleting instance files" on the source host.
- On the failed migration, we end up getting recurring messages like this:
Periodic task is updating the host stat, it is trying to get disk instance-0000079c, but disk file was removed by concurrent operations such as resize.: OSError: [Errno 2] No such file or directory: '/var/lib/nova/instances/f025b152-12bd-418e-859e-0fcc399c08b4/disk.config'
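The comparison above was done by reading the logs by hand. A minimal sketch of how it could be scripted is below. It assumes each log line starts with a "YYYY-MM-DD HH:MM:SS.mmm" timestamp. The helper names (first_seen, gap_seconds, missing) are hypothetical, and the sample lines are made up for illustration; they only reuse the message strings quoted in this report and are not real log output.

```python
import re
from datetime import datetime

# Message strings quoted in this report, in the order they are expected.
MARKERS = [
    "Migration operation has completed",
    "_post_live_migration() is started",
    "Migration operation thread has finished _live_migration_operation",
    "Migration operation thread notification thread_finished",
    "Calling driver.unfilter_instance from _post_live_migration",
    "Post operation of migration started",
    "Deleting instance files",
]

# Assumption: every log line begins with "YYYY-MM-DD HH:MM:SS.mmm".
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})")


def first_seen(lines):
    """Map each marker to the timestamp of its first occurrence, or None."""
    seen = {marker: None for marker in MARKERS}
    for line in lines:
        match = TS_RE.match(line)
        if not match:
            continue  # skip lines without a timestamp (e.g. traceback lines)
        ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        for marker in MARKERS:
            if seen[marker] is None and marker in line:
                seen[marker] = ts
    return seen


def gap_seconds(seen, start, end):
    """Seconds between the first occurrences of two markers, or None."""
    if seen[start] is None or seen[end] is None:
        return None
    return (seen[end] - seen[start]).total_seconds()


def missing(seen):
    """Markers that never appeared in the log."""
    return [marker for marker in MARKERS if seen[marker] is None]


# Hypothetical lines shaped like the failed case described above.
SAMPLE_FAILED = [
    "2019-01-01 10:00:00.000 DEBUG Migration operation has completed",
    "2019-01-01 10:00:00.100 DEBUG _post_live_migration() is started",
    "2019-01-01 10:00:52.100 DEBUG Calling driver.unfilter_instance from _post_live_migration",
    "2019-01-01 10:00:52.500 DEBUG Post operation of migration started",
    "2019-01-01 10:00:53.000 INFO Deleting instance files",
]

seen = first_seen(SAMPLE_FAILED)
print("missing markers:", missing(seen))
print("stall after _post_live_migration start (s):",
      gap_seconds(seen,
                  "_post_live_migration() is started",
                  "Calling driver.unfilter_instance from _post_live_migration"))
```

To use it on a real log, pass the open file object (or a list of its lines) for one migration to first_seen(). Running it on both a successful and a failed attempt would show which messages are missing and how long the stall after "_post_live_migration() is started" lasted.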
Version-Release number of selected component (if applicable):
How reproducible:
Most of the time
Closing this out as CURRENTRELEASE. Please feel free to reopen if you encounter the issue again with the latest containers.