Description of problem:

Live migration fails because the *source* host has insufficient disk space /
is overcommitted in OSP 13.

Version-Release number of selected component (if applicable):

[akaris@collab-shell ops-compute8]$ grep nova ./sos_commands/docker/docker_ps
1523ce7a68d4 192.168.24.1:8787/rhosp13/openstack-nova-compute:13.0-82 "kolla_start" 6 weeks ago Up 6 minutes (healthy) nova_compute
1480a6d3a5c2 192.168.24.1:8787/rhosp13/openstack-nova-compute:13.0-82 "kolla_start" 6 weeks ago Up 5 minutes (healthy) nova_migration_target
6ac2a9fe4448 192.168.24.1:8787/rhosp13/openstack-nova-libvirt:13.0-85 "kolla_start" 6 weeks ago Up 5 minutes nova_libvirt
67911457d447 192.168.24.1:8787/rhosp13/openstack-nova-libvirt:13.0-85 "kolla_start" 6 weeks ago Up 5 minutes nova_virtlogd

[akaris@collab-shell ops-compute8]$ grep nova installed-rpms
puppet-nova-12.4.0-16.el7ost.noarch        Fri Mar 29 20:49:30 2019
python2-novaclient-10.1.0-1.el7ost.noarch  Fri Mar 29 20:50:22 2019
[akaris@collab-shell ops-compute8]$

Additional info:

ISSUE:

Looking at the logs we saw the warning below, which ultimately causes the
migration to fail:

/var/log/containers/nova/nova-conductor.log:2019-05-17 00:05:35.513 27 WARNING nova.scheduler.client.report [req-3b20e41e-86e3-44e8-8deb-fae0ca976cb5 0a1c543eec7c47faaab1f5d7717b2e38 72fefc72b09e425fb151478146da823b - default default] Unable to post allocations for instance 0d11d5a7-8f75-4403-b2b8-46e47acf4197 (409 {"errors": [{"status": 409, "request_id": "req-73718779-f6c5-4283-b913-b616a9309557", "detail": "There was a conflict when trying to complete your request.\n\n Unable to allocate inventory: Unable to create allocation for 'DISK_GB' on resource provider '29daeda0-40e4-42be-8207-6e11dcf9a040'. The requested amount would exceed the capacity. ", "title": "Conflict"}]})

Specifically looking at this piece:

Unable to create allocation for 'DISK_GB' on resource provider
'29daeda0-40e4-42be-8207-6e11dcf9a040'.
We found yesterday from the nova hypervisor-list that this provider is the
source:

nova hypervisor-list | grep 29daeda0-40e4-42be-8207-6e11dcf9a040
| 29daeda0-40e4-42be-8207-6e11dcf9a040 | ops-compute8 | up | disabled |

(FYI, whether that hypervisor is enabled or disabled doesn't make a difference.)

Therefore the live/cold migrations are failing because the source host is
over-committed on disk usage.

I think there are two BUGS here:

1. The nova scheduler should not care about the disk space being
over-committed on the source. Why does it matter that the source disk
capacity is too low? We are about to migrate off of the source; shouldn't
the nova scheduler only be concerned about the disk space on the destination
server?

2. The disk space utilized by these instances is on a separate cinder
backend; we aren't using any local disk on the compute8 server. So even if
the first bug is resolved and the nova scheduler is looking at the
destination host, it is still generating a false value for the used disk
space, or DISK_GB.

ops-compute8 ~]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2       559G   17G  542G   3% /

There needs to be some way to tell nova that the local disk on the compute
host is not being used, and that it is the cinder backend that is holding
the volume.

RESOLUTION for Live Migration:

On the source compute host, compute8, we had to do the following:

sudo crudini --get /var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf DEFAULT disk_allocation_ratio
1.0
sudo crudini --set /var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf DEFAULT disk_allocation_ratio 1.5
sudo docker ps | awk '/nova_/ {print $NF}' | xargs -I {} sudo docker restart {}

Then live migration works:

openstack server migrate bbf06dd1-4f86-4b46-850f-08fc37c3f6cb --live ops-compute14 --wait
Progress: 100
Complete
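The disk_allocation_ratio workaround works because placement computes a
provider's capacity as (total - reserved) * allocation_ratio and rejects any
allocation that would push usage past it. A minimal sketch of that check
(illustrative numbers and function name, not nova source code):

```python
# Sketch of placement's per-resource-class capacity check. The function
# name and the numbers are made up for illustration; the formula
# capacity = (total - reserved) * allocation_ratio matches placement's.

def allocation_fits(total_gb, reserved_gb, allocation_ratio, used_gb, requested_gb):
    """Return True if a new allocation would fit within capacity."""
    capacity = (total_gb - reserved_gb) * allocation_ratio
    return used_gb + requested_gb <= capacity

# A 559 GB disk already fully claimed by instance allocations: a further
# 40 GB claim fails at the default ratio of 1.0 ...
print(allocation_fits(559, 0, 1.0, used_gb=559, requested_gb=40))  # False
# ... but fits once the ratio is raised to 1.5, as in the workaround.
print(allocation_fits(559, 0, 1.5, used_gb=559, requested_gb=40))  # True
```

Raising the ratio only widens the headroom; it doesn't stop nova from
claiming DISK_GB for boot-from-volume instances in the first place.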
(In reply to Andreas Karis from comment #0)
> Additional info:
>
> ISSUE:
>
> Looking at the logs we saw the warning below, which ultimately causes the
> migration to fail:
>
> /var/log/containers/nova/nova-conductor.log:2019-05-17 00:05:35.513 27
> WARNING nova.scheduler.client.report
> [req-3b20e41e-86e3-44e8-8deb-fae0ca976cb5 0a1c543eec7c47faaab1f5d7717b2e38
> 72fefc72b09e425fb151478146da823b - default default] Unable to post
> allocations for instance 0d11d5a7-8f75-4403-b2b8-46e47acf4197 (409
> {"errors": [{"status": 409, "request_id":
> "req-73718779-f6c5-4283-b913-b616a9309557", "detail": "There was a conflict
> when trying to complete your request.\n\n Unable to allocate inventory:
> Unable to create allocation for 'DISK_GB' on resource provider
> '29daeda0-40e4-42be-8207-6e11dcf9a040'. The requested amount would exceed
> the capacity. ", "title": "Conflict"}]})
>
> Specifically looking at this piece:
>
> Unable to create allocation for 'DISK_GB' on resource provider
> '29daeda0-40e4-42be-8207-6e11dcf9a040'.
>
> We found yesterday from the nova hypervisor-list that this provider is the
> source:
>
> nova hypervisor-list | grep 29daeda0-40e4-42be-8207-6e11dcf9a040
> | 29daeda0-40e4-42be-8207-6e11dcf9a040 | ops-compute8 | up | disabled |
>
> (FYI, whether that hypervisor is enabled or disabled doesn't make a
> difference.)
>
> Therefore the live/cold migrations are failing because the source host is
> over-committed on disk usage.
>
> I think there are two BUGS here:
>
> 1. The nova scheduler should not care about the disk space being
> over-committed on the source. Why does it matter that the source disk
> capacity is too low? We are about to migrate off of the source; shouldn't
> the nova scheduler only be concerned about the disk space on the
> destination server?

Unfortunately this is expected behaviour in Queens when booting from volumes
while also using flavors with disk set to greater than 0.
The additional claim on the source is due to the instance allocation being
recreated as a migration allocation, ready to be moved to the destination:

https://github.com/openstack/nova/blob/076c576eff7f6ef05ad1d3842e251bd7911acf29/nova/conductor/tasks/migrate.py#L27-L77

This behaviour was changed recently in Stein to an atomic move but isn't
backportable to Queens.

The real issue here, however, is with the original allocation containing
DISK_GB; the following unbackportable bugfix in stable/rocky has since
removed that for boot-from-volume instances:

https://review.opendev.org/#/q/topic:bug/1469179+(status:open+OR+status:merged)

> 2. The disk space utilized by these instances is on a separate cinder
> backend; we aren't using any local disk on the compute8 server. So even if
> the first bug is resolved and the nova scheduler is looking at the
> destination host, it is still generating a false value for the used disk
> space, or DISK_GB.
>
> ops-compute8 ~]$ df -h
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/sda2       559G   17G  542G   3% /
>
> There needs to be some way to tell nova that the local disk on the compute
> host is not being used, and that it is the cinder backend that is holding
> the volume.
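The effect of that non-atomic Queens flow can be sketched in a few lines:
the claim is copied under the migration consumer while the instance's own
claim still counts, so the source provider's DISK_GB usage briefly doubles.
This is an illustrative model with assumed names, not nova code:

```python
# Sketch (assumed names, not nova code) of why the source host needs
# headroom in Queens: the migration allocation is created as a copy of the
# instance allocation before the instance's claim is dropped, so for a
# moment both consumers count against the same provider.

def provider_usage(allocations, provider, resource_class):
    """Sum one resource class across all consumers on one provider."""
    return sum(
        alloc.get(provider, {}).get(resource_class, 0)
        for alloc in allocations.values()
    )

allocations = {
    "instance-uuid": {"source-rp": {"DISK_GB": 40}},
}
# Step 1 of the Queens flow: copy the claim under the migration consumer.
allocations["migration-uuid"] = {"source-rp": {"DISK_GB": 40}}

# Both consumers now count against the source provider at once, which is
# the claim that trips the 409 on an already over-committed host.
print(provider_usage(allocations, "source-rp", "DISK_GB"))  # 80
```

With the Stein change the move is atomic, so this transient double counting
no longer occurs.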
Again that's addressed by the following bugfix, but this isn't something we
can backport:

https://review.opendev.org/#/q/topic:bug/1469179+(status:open+OR+status:merged)

> RESOLUTION for Live Migration:
>
> On the source compute host, compute8, we had to do the following:
>
> sudo crudini --get /var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf DEFAULT disk_allocation_ratio
> 1.0
> sudo crudini --set /var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf DEFAULT disk_allocation_ratio 1.5
> sudo docker ps | awk '/nova_/ {print $NF}' | xargs -I {} sudo docker restart {}
>
> Then live migration works:
>
> openstack server migrate bbf06dd1-4f86-4b46-850f-08fc37c3f6cb --live ops-compute14 --wait
> Progress: 100
> Complete

I want to run this past the compute team but I think another workaround here
would be to rebuild these bfv instances using a flavor with disk set to 0.
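The zero-disk-flavor workaround can be sketched as follows: the DISK_GB a
flavor requests from placement comes from its root disk size, so a flavor
with disk set to 0 simply contributes no DISK_GB claim. This is an
illustrative model with an assumed helper name, not nova code:

```python
# Sketch (assumed helper, not nova code) of the resources a flavor requests
# from placement. With root disk 0, the flavor contributes no DISK_GB,
# which is why rebuilding/resizing boot-from-volume instances to a
# zero-disk flavor avoids the bogus local-disk claim on Queens.

def flavor_resources(vcpus, ram_mb, root_gb):
    """Build the placement resource request implied by a flavor."""
    resources = {"VCPU": vcpus, "MEMORY_MB": ram_mb}
    if root_gb > 0:
        resources["DISK_GB"] = root_gb
    return resources

print(flavor_resources(1, 512, 10))  # {'VCPU': 1, 'MEMORY_MB': 512, 'DISK_GB': 10}
print(flavor_resources(1, 512, 0))   # {'VCPU': 1, 'MEMORY_MB': 512}
```

This matches the allocation tables in the resize test later in this bug:
the allocation loses its DISK_GB entry after the resize to the zero-disk
flavor.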
FWIW https://bugzilla.redhat.com/show_bug.cgi?id=1643419 covers this for the initial boot of bfv instances on overcommitted hosts.
Hi,

Thanks for the info!

- Andreas
(In reply to Lee Yarwood from comment #3)
> I want to run this past the compute team but I think another workaround here
> would be to rebuild these bfv instances using a flavor with disk set to 0.

Apologies for the delay. Just to close this out, I've tested the following
workaround upstream using stable/queens to correct the DISK_GB allocations
using resize:

$ nova flavor-create bfv-with-disk 10 512 10 1
[..]
$ nova flavor-create bfv-without-disk 11 512 0 1
[..]
$ nova boot --flavor bfv-with-disk --block-device id=b001a270-09ca-47ff-af2f-a0e306384ba2,source=volume,dest=volume,bootindex=0 test
[..]
$ openstack resource provider allocation show a85e3a1f-853b-4993-bef1-d578dd3b8295
+--------------------------------------+------------+-------------------------------------------------+
| resource_provider                    | generation | resources                                       |
+--------------------------------------+------------+-------------------------------------------------+
| e6199711-d17d-44c9-8a51-d4d50146ca6f | 2          | {u'VCPU': 1, u'MEMORY_MB': 512, u'DISK_GB': 10} |
+--------------------------------------+------------+-------------------------------------------------+
[..]
$ nova stop test
[..]
$ nova resize test bfv-without-disk
[..]
$ nova resize-confirm test
[..]
$ nova start test
[..]
$ openstack resource provider allocation show a85e3a1f-853b-4993-bef1-d578dd3b8295
+--------------------------------------+------------+---------------------------------+
| resource_provider                    | generation | resources                       |
+--------------------------------------+------------+---------------------------------+
| e6199711-d17d-44c9-8a51-d4d50146ca6f | 4          | {u'VCPU': 1, u'MEMORY_MB': 512} |
+--------------------------------------+------------+---------------------------------+

Closing this out as WONTFIX given the changes previously listed are not
backportable. Let me know if you have any further questions.
Thanks for the answer! :)