Red Hat Bugzilla – Bug 1473863
live-migration with boot from volume + config_drive takes a very long time.
Last modified: 2017-08-11 09:45:44 EDT
Description of problem:
Using openstack server migrate --live takes a long time to move specific instances:
boot from volume no config_drive it took less than 5 seconds
boot from image with config_drive it took less than 5 seconds
boot from volume with config_drive it took between 40 and 90 minutes
We migrated 4 Virtual machines between compute nodes in a RHOSP10 deployed overcloud.
one migration resulted in a huge wait for the VM to migrate. /var/log/nova/nova-compute.log shows that this single migration took more than 45+ minutes without any real activity until the end of the migration
openstack server migrate --live edtnabtfhv23.localdomain 416db9ca-e3ee-4062-8560-60823f9a0350
Migrating 3 other VM's between the same 2 physical servers took far less time. These three instances took mere seconds to complete.
openstack server migrate --block-migration --live edtnabtfhv23.localdomain 657b8e74-2879-4177-8f5f-bd3cfb81a259
openstack server migrate --live edtnabtfhv23.localdomain 256850f7-0efc-43bc-be69-fb3b380f071d
openstack server migrate --live edtnabtfhv23.localdomain 09e5084f-e8a7-4c6a-aa6c-aca6d91b63e9
The virtual machines that are booted from volume without config drive take seconds to migrate.
The virtual machines that are booted from image with or without config drive take 10's of seconds to migrate.
The virtual machine booted from volume with config drive took a huge amount of time.
sosreports and testing details can be found here -->
Can we get some traction for this BZ?
I've had a look through the logs in the sosreports. Unfortunately I don't see any end-to-end migration logs. I've put the request ids below, along with instance uuid and the first and last log timestamp I see for that request:
The first request seems to complete quite quickly. The other 3 all take a suspiciously similar amount of time, like the delay has some common cause. Perhaps it was blocked on some error elsewhere?
The only logs we have are nova-api.logs from seemingly all controllers, and nova-compute.log from edtnabtfhv23.localdomain. All of the above logs are from requests where edtnabtfhv23.localdomain is the destination compute.
To try to get to the bottom of this we need:
* nova-api.log from all controllers
* nova-conductor.log from all controllers
* nova-compute.log from both source and destination computes
My best guess is that the cause of the delay is either on the source compute or in conductor, because I don't see it on the destination.
I can't reproduce this locally, btw. Migration of boot from volume with config drive completes normally here.