Description of problem:
DiskNotFound error while running nova resize-confirm (RHOS-13) on instances which were resized on RHOS-12.

Version-Release number of selected component (if applicable): None

How reproducible: Always

Steps to Reproduce:
1. Resize instance on RHOS-12 (instance moved to VERIFY_RESIZE state)
2. Upgrade from RHOS-12 to RHOS-13
3. nova resize-confirm <instance>

Actual results:

(overcloud) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+------------+---------------+------------------------------------+--------+----------+
| ID                                   | Name       | Status        | Networks                           | Image  | Flavor   |
+--------------------------------------+------------+---------------+------------------------------------+--------+----------+
| 1374133a-2c08-4a8f-94f6-729d4e58d7e0 | test-10831 | VERIFY_RESIZE | private=192.168.100.6, 10.0.0.218  | cirros | m1.small |
| f3ed9015-3984-43f4-b4a5-c2898052b47d | test-16556 | VERIFY_RESIZE | private=192.168.100.11, 10.0.0.213 | cirros | m1.small |
+--------------------------------------+------------+---------------+------------------------------------+--------+----------+

(overcloud) [stack@undercloud-0 ~]$ nova resize-confirm test-10831

(overcloud) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+------------+---------------+------------------------------------+--------+----------+
| ID                                   | Name       | Status        | Networks                           | Image  | Flavor   |
+--------------------------------------+------------+---------------+------------------------------------+--------+----------+
| 1374133a-2c08-4a8f-94f6-729d4e58d7e0 | test-10831 | ERROR         | private=192.168.100.6, 10.0.0.218  | cirros | m1.small |
| f3ed9015-3984-43f4-b4a5-c2898052b47d | test-16556 | VERIFY_RESIZE | private=192.168.100.11, 10.0.0.213 | cirros | m1.small |
+--------------------------------------+------------+---------------+------------------------------------+--------+----------+

(overcloud) [stack@undercloud-0 ~]$ openstack server show test-10831
+-------------------------------------+----------------------------------------------------------------------------------------------------------------------------+
| Field                               | Value                                                                                                                      |
+-------------------------------------+----------------------------------------------------------------------------------------------------------------------------+
| OS-DCF:diskConfig                   | MANUAL                                                                                                                     |
| OS-EXT-AZ:availability_zone         | nova                                                                                                                       |
| OS-EXT-SRV-ATTR:host                | compute-0.localdomain                                                                                                      |
| OS-EXT-SRV-ATTR:hypervisor_hostname | compute-0.localdomain                                                                                                      |
| OS-EXT-SRV-ATTR:instance_name       | instance-00000003                                                                                                          |
| OS-EXT-STS:power_state              | Running                                                                                                                    |
| OS-EXT-STS:task_state               | None                                                                                                                       |
| OS-EXT-STS:vm_state                 | error                                                                                                                      |
| OS-SRV-USG:launched_at              | 2018-05-30T02:15:31.000000                                                                                                 |
| OS-SRV-USG:terminated_at            | None                                                                                                                       |
| accessIPv4                          |                                                                                                                            |
| accessIPv6                          |                                                                                                                            |
| addresses                           | private=192.168.100.6, 10.0.0.218                                                                                          |
| config_drive                        |                                                                                                                            |
| created                             | 2018-05-30T02:13:27Z                                                                                                       |
| fault                               | {u'message': u'Compute host compute-1.localdomain could not be found.', u'code': 404, u'created': u'2018-05-30T13:42:19Z'} |
| flavor                              | m1.small (b8e997d1-222b-43f0-96b8-62349961ce35)                                                                            |
| hostId                              | d414005f2ca351cb7cb5792ca59deb8e75611f47f11af3b13467f50b                                                                   |
| id                                  | 1374133a-2c08-4a8f-94f6-729d4e58d7e0                                                                                       |
| image                               | cirros (78cd87c9-4815-4579-a227-c38c38516073)                                                                              |
| key_name                            | None                                                                                                                       |
| name                                | test-10831                                                                                                                 |
| project_id                          | d4483d13a6ef41b2ae575ddbd0c59141                                                                                           |
| properties                          |                                                                                                                            |
| security_groups                     | name='default'                                                                                                             |
| status                              | ERROR                                                                                                                      |
| updated                             | 2018-05-30T13:42:19Z                                                                                                       |
| user_id                             | 09abbd4893bb45eea8fb1d5e40635339                                                                                           |
| volumes_attached                    |                                                                                                                            |
+-------------------------------------+----------------------------------------------------------------------------------------------------------------------------+

From the logs:

./nova-compute.log:2018-05-30 17:04:17.608 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/disk/api.py", line 147, in get_allocated_disk_size
./nova-compute.log:2018-05-30 17:04:17.608 1 ERROR nova.compute.manager     return images.qemu_img_info(path).disk_size
./nova-compute.log:2018-05-30 17:04:17.608 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/images.py", line 57, in qemu_img_info
./nova-compute.log:2018-05-30 17:04:17.608 1 ERROR nova.compute.manager     raise exception.DiskNotFound(location=path)
./nova-compute.log:2018-05-30 17:04:17.608 1 ERROR nova.compute.manager DiskNotFound: No disk at /var/lib/nova/instances/f3ed9015-3984-43f4-b4a5-c2898052b47d/disk
./nova-compute.log:2018-05-30 17:04:17.608 1 ERROR nova.compute.manager

Expected results:
Instance successfully resized

Additional info:

[heat-admin@compute-0 ~]$ sudo virsh list
 Id    Name                           State
----------------------------------------------------
 2     instance-00000003              running

[heat-admin@compute-0 ~]$ sudo virsh domblk list instance-00000003
error: unknown command: 'domblk'

[heat-admin@compute-0 ~]$ sudo virsh domblklist instance-00000003
Target     Source
------------------------------------------------
vda        /var/lib/nova/instances/1374133a-2c08-4a8f-94f6-729d4e58d7e0/disk
This isn't actually upgrade related, because the trigger happens *before* the upgrade, and it should be reproducible without an upgrade. The bug flow is as follows:

Create instance A on compute X
Resize instance A to compute Y
  Domain is powered off
  /var/lib/nova/instances/<uuid A> renamed to <uuid A>_resize on X
  Domain is *not* undefined
Resize instance B to compute X [as above]
On compute X: update_available_resource runs as a periodic task
  First action is to update self
  rt calls driver.get_available_resource()
  ...calls _get_disk_over_committed_size_total
  ...iterates over all defined domains, including the ones whose disks we renamed
  ...fails because a referenced disk no longer exists
  ...consequently never updates RT with an entry for the current nodename
  However, note that at this point an RT entry exists anyway, because the task previously ran successfully at least once.
Restart both X and Y
On compute X: update_available_resource runs as a periodic task
  update_available_resource continues to fail as above
  However, note that this time, because we restarted after the error condition already existed, we never populate RT at all.
Resize-confirm instance A on compute Y
  ...calls confirm_resize on compute X
  ...calls _delete_allocation_after_move()
  ...looks up nodename in the resource tracker
  ...doesn't find it... ComputeHostNotFound
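The flow above can be sketched as a toy model. This is illustrative code only, not Nova source; the names mirror Nova's (`DiskNotFound`, `update_available_resource`) but the bodies are simplified assumptions, showing only why one missing disk leaves the resource tracker (RT) unpopulated after a restart:

```python
class DiskNotFound(Exception):
    pass


def get_disk_over_committed_size_total(domains, disks_on_host):
    total = 0
    for dom in domains:
        # Iterates over *all* defined domains; a domain whose instance
        # directory was renamed to <uuid>_resize still references the
        # old disk path, which no longer exists.
        path = "/var/lib/nova/instances/%s/disk" % dom
        if path not in disks_on_host:
            raise DiskNotFound(path)  # one missing disk aborts the whole scan
        total += disks_on_host[path]
    return total


def update_available_resource(rt, nodename, domains, disks_on_host):
    # The disk scan runs before the RT entry for nodename is (re)created,
    # so a failure here leaves the RT without an entry for this node.
    get_disk_over_committed_size_total(domains, disks_on_host)
    rt[nodename] = {"populated": True}


rt = {}  # empty, as after a nova-compute restart
domains = ["uuid-a", "uuid-b"]
disks = {"/var/lib/nova/instances/uuid-a/disk": 1}  # uuid-b's dir was renamed

try:
    update_available_resource(rt, "compute-0", domains, disks)
except DiskNotFound:
    pass

# False: a later confirm_resize that looks this node up in the RT
# will fail with ComputeHostNotFound.
node_in_rt = "compute-0" in rt
```

With the renamed disk present, the periodic task never reaches the line that records the node, which is exactly the post-restart state described above.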
The bug here seems to be that _get_disk_over_committed_size_total can't take into account that an instance's files may have been moved because of a resize. We could fix the immediate problem by simply causing _get_disk_over_committed_size_total() not to fail entirely if it can't find a single disk, which we should likely do anyway. However, this would still leave the less serious bug that we wouldn't be accounting for space still in use by _resize files.
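A minimal sketch of the immediate fix proposed above (an assumption about its shape, not the actual patch): catch DiskNotFound per domain inside the accounting loop so one renamed `_resize` directory no longer aborts the entire periodic task.

```python
class DiskNotFound(Exception):
    pass


def disk_size(path, disks_on_host):
    # Stand-in for images.qemu_img_info(path).disk_size.
    if path not in disks_on_host:
        raise DiskNotFound(path)
    return disks_on_host[path]


def get_disk_over_committed_size_total(domains, disks_on_host):
    total = 0
    for dom in domains:
        path = "/var/lib/nova/instances/%s/disk" % dom
        try:
            total += disk_size(path, disks_on_host)
        except DiskNotFound:
            # The disk was likely moved by a resize; skip this domain
            # instead of failing the whole update. As noted above, this
            # still under-counts the space held by the <uuid>_resize copy.
            continue
    return total


disks = {"/var/lib/nova/instances/uuid-a/disk": 10}
total = get_disk_over_committed_size_total(["uuid-a", "uuid-b"], disks)
```

Here `total` is 10: the missing disk for `uuid-b` is skipped rather than fatal, so the resource tracker entry can still be populated.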
This smells like a duplicate of the deletion race we just fixed in master:

libvirt: handle DiskNotFound during update_available_resource
https://review.openstack.org/#/c/553067/

update_available_resource_for_node racing instance deletion
https://bugs.launchpad.net/nova/+bug/1662867

Backporting this upstream now:

https://review.openstack.org/#/q/Icec2769bf42455853cbe686fb30fda73df791b25
(In reply to Lee Yarwood from comment #4)
> This smells like a duplicate of the race when deleting we just fixed in
> master:
>
> libvirt: handle DiskNotFound during update_available_resource
> https://review.openstack.org/#/c/553067/
>
> update_available_resource_for_node racing instance deletion
> https://bugs.launchpad.net/nova/+bug/1662867
>
> Backporting this upstream now:
>
> https://review.openstack.org/#/q/Icec2769bf42455853cbe686fb30fda73df791b25

It's similar, but we discussed this upstream on Wednesday and it won't quite fix it. I noted this in the upstream LP bug https://bugs.launchpad.net/nova/+bug/1774249 . The problem is that there's no task state set between resize and confirm, so it will still raise the error. Adding more cruft to that test may be the way forward (I certainly don't discount it), but I'm concerned it's getting to be a bit too much sticky tape, and it requires non-local context. I think it's worth looking for a more elegant solution first.
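The "no task state set between resize and confirm" point can be illustrated with a toy guard. This is an illustrative model, not the upstream patch; the exact condition in the deletion fix may differ, but the shape of the problem is the same: any skip logic keyed on task_state never fires for an instance parked in VERIFY_RESIZE.

```python
def should_skip_missing_disk(task_state):
    # Hypothetical guard modeled on the deletion-race fix: only
    # tolerate a missing disk when some task_state marks an
    # in-flight operation on the instance.
    return task_state is not None


# Between "nova resize" and "nova resize-confirm" the instance sits in
# vm_state VERIFY_RESIZE with task_state None:
vm_state, task_state = "verify_resize", None

# False: the guard never fires, so DiskNotFound is still raised for
# the renamed <uuid>_resize directory.
skipped = should_skip_missing_disk(task_state)
```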
We would like to know the current status of this ticket, because gerrit 571410 hasn't been updated since July. We hit a live-migration failure caused by this problem. It seems that this problem makes nova-compute unstable, and this bug causes various other problems besides the failure of resize-confirm. We are hoping that the issue will be fixed as soon as possible for RHOSP 13, which is a long-term-support version that our customers are planning to use. Would that be possible?

The details are below. The following problems occurred after restarting nova-compute prior to executing resize-confirm.

- live-migration fails in _post_live_migration() due to ComputeHostNotFound.

2018-09-19 19:38:36.032 1 ERROR ... Traceback (most recent call last):
2018-09-19 19:38:36.032 1 ERROR ...   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7407, in _live_migration
2018-09-19 19:38:36.032 1 ERROR ...     finish_event, disk_paths)
2018-09-19 19:38:36.032 1 ERROR ...   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7271, in _live_migration_monitor
2018-09-19 19:38:36.032 1 ERROR ...     migrate_data)
2018-09-19 19:38:36.032 1 ERROR ...   File "/usr/lib/python2.7/site-packages/nova/exception_wrapper.py", line 76, in wrapped
2018-09-19 19:38:36.032 1 ERROR ...     function_name, call_dict, binary)
2018-09-19 19:38:36.032 1 ERROR ...   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-09-19 19:38:36.032 1 ERROR ...     self.force_reraise()
2018-09-19 19:38:36.032 1 ERROR ...   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-09-19 19:38:36.032 1 ERROR ...     six.reraise(self.type_, self.value, self.tb)
2018-09-19 19:38:36.032 1 ERROR ...   File "/usr/lib/python2.7/site-packages/nova/exception_wrapper.py", line 67, in wrapped
2018-09-19 19:38:36.032 1 ERROR ...     return f(self, context, *args, **kw)
2018-09-19 19:38:36.032 1 ERROR ...   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 214, in decorated_function
2018-09-19 19:38:36.032 1 ERROR ...     kwargs['instance'], e, sys.exc_info())
2018-09-19 19:38:36.032 1 ERROR ...   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-09-19 19:38:36.032 1 ERROR ...     self.force_reraise()
2018-09-19 19:38:36.032 1 ERROR ...   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-09-19 19:38:36.032 1 ERROR ...     six.reraise(self.type_, self.value, self.tb)
2018-09-19 19:38:36.032 1 ERROR ...   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 202, in decorated_function
2018-09-19 19:38:36.032 1 ERROR ...     return function(self, context, *args, **kwargs)
2018-09-19 19:38:36.032 1 ERROR ...   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6341, in _post_live_migration
2018-09-19 19:38:36.032 1 ERROR ...     source_node)
2018-09-19 19:38:36.032 1 ERROR ...   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 3790, in _delete_allocation_after_move
2018-09-19 19:38:36.032 1 ERROR ...     cn_uuid = rt.get_node_uuid(nodename)
2018-09-19 19:38:36.032 1 ERROR ...   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 155, in get_node_uuid
2018-09-19 19:38:36.032 1 ERROR ...     raise exception.ComputeHostNotFound(host=nodename)
2018-09-19 19:38:36.032 1 ERROR ... ComputeHostNotFound: Compute host wi0hvrtg03.localdomain could not be found.

- instance_claim(), which is always called when booting a new instance, works incorrectly.

In a normal case, instance_claim() checks the resources of the host:

==== nova-compute.log =====
2018-09-20 08:38:12.107 6810 DEBUG ... Lock "compute_resources" acquired by "nova.compute.resource_tracker.instance_claim" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:273
2018-09-20 08:38:12.109 6810 DEBUG ... Memory overhead for 512 MB instance; 0 MB instance_claim /usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py:202
2018-09-20 08:38:12.109 6810 DEBUG ... Disk overhead for 1 GB instance; 0 GB instance_claim /usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py:205
2018-09-20 08:38:12.109 6810 DEBUG ... CPU overhead for 1 vCPUs instance; 0 vCPU(s) instance_claim /usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py:208
...
2018-09-20 08:38:12.388 6810 DEBUG ... Lock "compute_resources" released by "nova.compute.resource_tracker.instance_claim" :: held 0.282s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:285
===========================

But after restarting nova-compute prior to executing resize-confirm, instance_claim() returns immediately without checking resources:

==== nova-compute.log =====
2018-09-20 18:38:24.785 11333 DEBUG ... Lock "compute_resources" acquired by "nova.compute.resource_tracker.instance_claim" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:273
2018-09-20 18:38:24.864 11333 DEBUG ... Lock "compute_resources" released by "nova.compute.resource_tracker.instance_claim" :: held 0.079s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:285
===========================

At the very least, we confirmed that the CPU pinning feature (hw:cpu_policy='dedicated') stops working correctly in that situation.

update_available_resource() becomes completely unable to work after restarting nova-compute prior to executing resize-confirm. The DiskNotFound exception occurs before update_available_resource() calls _init_compute_node(), so nova-compute can never initialize itself. Nevertheless, nova-compute keeps reporting its status as 'up' to the controller, so users cannot notice that a problem is occurring on nova-compute.
nova-compute should call _init_compute_node() even when DiskNotFound occurs, or at least nova-compute should refrain from reporting its status while _init_compute_node() hasn't been called yet.

FYI, I wrote the steps to reproduce the live-migration failure below.

1. Boot a new instance on hostA.
 # image=<image uuid>
 # net=<net uuid>
 # host=<hostA>
 # nova boot --poll --image $image --flavor 1 --nic net-id=$net --availability-zone nova:$host testvm

2. Migrate the instance to another host.
 # nova migrate testvm

3. Restart nova-compute on the source host before executing resize-confirm.
 # docker restart nova_compute

4. Boot another instance on hostA.
 # nova boot --poll --image $image --flavor 1 --nic net-id=$net --availability-zone nova:$host testvm2

5. Live-migrate the instance.
 # nova live-migration testvm2
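The second suggested remedy can be sketched as a toy model. This is assumed behavior, not Nova source (the `FakeComputeManager` class and its methods are hypothetical); it only shows the idea of withholding the 'up' report until node initialization has succeeded:

```python
class FakeComputeManager:
    """Toy model of withholding the 'up' status report (hypothetical)."""

    def __init__(self):
        self._node_initialized = False

    def init_compute_node(self):
        # In Nova this is where the ComputeNode record would be
        # created; here we only flip a flag on success.
        self._node_initialized = True

    def service_status(self):
        # Report 'up' only once initialization has succeeded, so a
        # compute stuck before _init_compute_node() is visible to
        # operators instead of silently claiming to be healthy.
        return "up" if self._node_initialized else "initializing"


mgr = FakeComputeManager()
before = mgr.service_status()   # "initializing" while stuck
mgr.init_compute_node()
after = mgr.service_status()    # "up" only after the node record exists
```

With this kind of gate, the stuck state described above would at least be observable from the controller instead of masquerading as a healthy service.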
Validation steps:

1. Create 2 instances, A and B, on compute-0
2. Cold migrate instance A to compute-1, but do not confirm it
3. Restart nova-compute on compute-0

Validation 1: Live migrate instance B to compute-1
Validation 2: Confirm the cold migration of instance A
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:1670