Description of problem:
DiskNotFound error while running nova resize-confirm (RHOS-13) on instances which were resized on RHOS-12.

Version-Release number of selected component (if applicable): None

How reproducible: Always

Steps to Reproduce:
1. Resize instance on RHOS-12 (instance moved to VERIFY_RESIZE state)
2. Upgrade from RHOS-12 to RHOS-13
3. nova resize-confirm <instance>

Actual results:

(overcloud) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+------------+---------------+------------------------------------+--------+----------+
| ID                                   | Name       | Status        | Networks                           | Image  | Flavor   |
+--------------------------------------+------------+---------------+------------------------------------+--------+----------+
| 1374133a-2c08-4a8f-94f6-729d4e58d7e0 | test-10831 | VERIFY_RESIZE | private=192.168.100.6, 10.0.0.218  | cirros | m1.small |
| f3ed9015-3984-43f4-b4a5-c2898052b47d | test-16556 | VERIFY_RESIZE | private=192.168.100.11, 10.0.0.213 | cirros | m1.small |
+--------------------------------------+------------+---------------+------------------------------------+--------+----------+

(overcloud) [stack@undercloud-0 ~]$ nova resize-confirm test-10831

(overcloud) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+------------+---------------+------------------------------------+--------+----------+
| ID                                   | Name       | Status        | Networks                           | Image  | Flavor   |
+--------------------------------------+------------+---------------+------------------------------------+--------+----------+
| 1374133a-2c08-4a8f-94f6-729d4e58d7e0 | test-10831 | ERROR         | private=192.168.100.6, 10.0.0.218  | cirros | m1.small |
| f3ed9015-3984-43f4-b4a5-c2898052b47d | test-16556 | VERIFY_RESIZE | private=192.168.100.11, 10.0.0.213 | cirros | m1.small |
+--------------------------------------+------------+---------------+------------------------------------+--------+----------+

(overcloud) [stack@undercloud-0 ~]$ openstack server show test-10831
+-------------------------------------+----------------------------------------------------------------------------------------------------------------------------+
| Field                               | Value                                                                                                                      |
+-------------------------------------+----------------------------------------------------------------------------------------------------------------------------+
| OS-DCF:diskConfig                   | MANUAL                                                                                                                     |
| OS-EXT-AZ:availability_zone         | nova                                                                                                                       |
| OS-EXT-SRV-ATTR:host                | compute-0.localdomain                                                                                                      |
| OS-EXT-SRV-ATTR:hypervisor_hostname | compute-0.localdomain                                                                                                      |
| OS-EXT-SRV-ATTR:instance_name       | instance-00000003                                                                                                          |
| OS-EXT-STS:power_state              | Running                                                                                                                    |
| OS-EXT-STS:task_state               | None                                                                                                                       |
| OS-EXT-STS:vm_state                 | error                                                                                                                      |
| OS-SRV-USG:launched_at              | 2018-05-30T02:15:31.000000                                                                                                 |
| OS-SRV-USG:terminated_at            | None                                                                                                                       |
| accessIPv4                          |                                                                                                                            |
| accessIPv6                          |                                                                                                                            |
| addresses                           | private=192.168.100.6, 10.0.0.218                                                                                          |
| config_drive                        |                                                                                                                            |
| created                             | 2018-05-30T02:13:27Z                                                                                                       |
| fault                               | {u'message': u'Compute host compute-1.localdomain could not be found.', u'code': 404, u'created': u'2018-05-30T13:42:19Z'} |
| flavor                              | m1.small (b8e997d1-222b-43f0-96b8-62349961ce35)                                                                            |
| hostId                              | d414005f2ca351cb7cb5792ca59deb8e75611f47f11af3b13467f50b                                                                   |
| id                                  | 1374133a-2c08-4a8f-94f6-729d4e58d7e0                                                                                       |
| image                               | cirros (78cd87c9-4815-4579-a227-c38c38516073)                                                                              |
| key_name                            | None                                                                                                                       |
| name                                | test-10831                                                                                                                 |
| project_id                          | d4483d13a6ef41b2ae575ddbd0c59141                                                                                           |
| properties                          |                                                                                                                            |
| security_groups                     | name='default'                                                                                                             |
| status                              | ERROR                                                                                                                      |
| updated                             | 2018-05-30T13:42:19Z                                                                                                       |
| user_id                             | 09abbd4893bb45eea8fb1d5e40635339                                                                                           |
| volumes_attached                    |                                                                                                                            |
+-------------------------------------+----------------------------------------------------------------------------------------------------------------------------+

From the logs:

./nova-compute.log:2018-05-30 17:04:17.608 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/disk/api.py", line 147, in get_allocated_disk_size
./nova-compute.log:2018-05-30 17:04:17.608 1 ERROR nova.compute.manager     return images.qemu_img_info(path).disk_size
./nova-compute.log:2018-05-30 17:04:17.608 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/images.py", line 57, in qemu_img_info
./nova-compute.log:2018-05-30 17:04:17.608 1 ERROR nova.compute.manager     raise exception.DiskNotFound(location=path)
./nova-compute.log:2018-05-30 17:04:17.608 1 ERROR nova.compute.manager DiskNotFound: No disk at /var/lib/nova/instances/f3ed9015-3984-43f4-b4a5-c2898052b47d/disk
./nova-compute.log:2018-05-30 17:04:17.608 1 ERROR nova.compute.manager

Expected results:
Instance successfully resized

Additional info:

[heat-admin@compute-0 ~]$ sudo virsh list
 Id    Name                           State
----------------------------------------------------
 2     instance-00000003              running

[heat-admin@compute-0 ~]$ sudo virsh domblk list instance-00000003
error: unknown command: 'domblk'

[heat-admin@compute-0 ~]$ sudo virsh domblklist instance-00000003
Target     Source
------------------------------------------------
vda        /var/lib/nova/instances/1374133a-2c08-4a8f-94f6-729d4e58d7e0/disk
This isn't actually upgrade related, because the trigger happens *before* the upgrade, and it should be reproducible without an upgrade. The bug flow is as follows:

Create instance A on compute X
Resize instance A to compute Y
  Domain is powered off
  /var/lib/nova/instances/<uuid A> renamed to <uuid A>_resize on X
  Domain is *not* undefined
Resize instance B to compute X [as above]
On compute X: update_available_resource runs as a periodic task
  First action is to update self
  rt calls driver.get_available_resource()
  ...calls _get_disk_over_committed_size_total
  ...iterates over all defined domains, including the ones whose disks we renamed
  ...fails because a referenced disk no longer exists
  ...consequently never updates RT with an entry for the current nodename
  However, note that at this point an RT entry exists anyway, because the task previously ran successfully at least once.
Restart both X and Y
On compute X: update_available_resource runs as a periodic task
  update_available_resource continues to fail as above
  However, note that this time, because we restarted after the error condition already existed, we never populate RT at all.
Resize-confirm instance A on compute Y
  ...calls confirm_resize on compute X
  ...calls _delete_allocation_after_move()
  ...looks up nodename in the resource tracker
  ...doesn't find it... ComputeHostNotFound
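The flow above can be sketched as a toy model. This is illustrative code only, not Nova source; the names mirror Nova's (`DiskNotFound`, `update_available_resource`) but the bodies are simplified assumptions, showing only why one missing disk leaves the resource tracker (RT) unpopulated after a restart:

```python
class DiskNotFound(Exception):
    pass


def get_disk_over_committed_size_total(domains, disks_on_host):
    total = 0
    for dom in domains:
        # Iterates over *all* defined domains; a domain whose instance
        # directory was renamed to <uuid>_resize still references the
        # old disk path, which no longer exists.
        path = "/var/lib/nova/instances/%s/disk" % dom
        if path not in disks_on_host:
            raise DiskNotFound(path)  # one missing disk aborts the whole scan
        total += disks_on_host[path]
    return total


def update_available_resource(rt, nodename, domains, disks_on_host):
    # The disk scan runs before the RT entry for nodename is (re)created,
    # so a failure here leaves the RT without an entry for this node.
    get_disk_over_committed_size_total(domains, disks_on_host)
    rt[nodename] = {"populated": True}


rt = {}  # empty, as after a nova-compute restart
domains = ["uuid-a", "uuid-b"]
disks = {"/var/lib/nova/instances/uuid-a/disk": 1}  # uuid-b's dir was renamed

try:
    update_available_resource(rt, "compute-0", domains, disks)
except DiskNotFound:
    pass

# False: a later confirm_resize that looks this node up in the RT
# will fail with ComputeHostNotFound.
node_in_rt = "compute-0" in rt
```

With the renamed disk present, the periodic task never reaches the line that records the node, which is exactly the post-restart state described above.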
The bug here seems to be that _get_disk_over_committed_size_total can't take into account that an instance's files may have been moved because of a resize. We could fix the immediate problem by simply causing _get_disk_over_committed_size_total() not to fail entirely if it can't find a single disk, which we should likely do anyway. However, this would still leave the less serious bug that we wouldn't be accounting for space still in use by _resize files.
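A minimal sketch of the immediate fix proposed above (an assumption about its shape, not the actual patch): catch DiskNotFound per domain inside the accounting loop so one renamed `_resize` directory no longer aborts the entire periodic task.

```python
class DiskNotFound(Exception):
    pass


def disk_size(path, disks_on_host):
    # Stand-in for images.qemu_img_info(path).disk_size.
    if path not in disks_on_host:
        raise DiskNotFound(path)
    return disks_on_host[path]


def get_disk_over_committed_size_total(domains, disks_on_host):
    total = 0
    for dom in domains:
        path = "/var/lib/nova/instances/%s/disk" % dom
        try:
            total += disk_size(path, disks_on_host)
        except DiskNotFound:
            # The disk was likely moved by a resize; skip this domain
            # instead of failing the whole update. As noted above, this
            # still under-counts the space held by the <uuid>_resize copy.
            continue
    return total


disks = {"/var/lib/nova/instances/uuid-a/disk": 10}
total = get_disk_over_committed_size_total(["uuid-a", "uuid-b"], disks)
```

Here `total` is 10: the missing disk for `uuid-b` is skipped rather than fatal, so the resource tracker entry can still be populated.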
This smells like a duplicate of the deletion race we just fixed in master:

libvirt: handle DiskNotFound during update_available_resource
https://review.openstack.org/#/c/553067/

update_available_resource_for_node racing instance deletion
https://bugs.launchpad.net/nova/+bug/1662867

Backporting this upstream now:

https://review.openstack.org/#/q/Icec2769bf42455853cbe686fb30fda73df791b25
(In reply to Lee Yarwood from comment #4)
> This smells like a duplicate of the race when deleting we just fixed in
> master:
>
> libvirt: handle DiskNotFound during update_available_resource
> https://review.openstack.org/#/c/553067/
>
> update_available_resource_for_node racing instance deletion
> https://bugs.launchpad.net/nova/+bug/1662867
>
> Backporting this upstream now:
>
> https://review.openstack.org/#/q/Icec2769bf42455853cbe686fb30fda73df791b25

It's similar, but we discussed this upstream on Wednesday and it won't quite fix it. I noted this in the upstream LP bug https://bugs.launchpad.net/nova/+bug/1774249 . The problem is that there's no task state set between resize and confirm, so it will still raise the error. Adding more cruft to that test may be the way forward (I certainly don't discount it), but I'm concerned it's getting to be a bit too much sticky tape, and it requires non-local context. I think it's worth looking for a more elegant solution first.
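The "no task state set between resize and confirm" point can be illustrated with a toy guard. This is an illustrative model, not the upstream patch; the exact condition in the deletion fix may differ, but the shape of the problem is the same: any skip logic keyed on task_state never fires for an instance parked in VERIFY_RESIZE.

```python
def should_skip_missing_disk(task_state):
    # Hypothetical guard modeled on the deletion-race fix: only
    # tolerate a missing disk when some task_state marks an
    # in-flight operation on the instance.
    return task_state is not None


# Between "nova resize" and "nova resize-confirm" the instance sits in
# vm_state VERIFY_RESIZE with task_state None:
vm_state, task_state = "verify_resize", None

# False: the guard never fires, so DiskNotFound is still raised for
# the renamed <uuid>_resize directory.
skipped = should_skip_missing_disk(task_state)
```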
We would like to know the current status of this ticket, because gerrit 571410 hasn't been updated since July. We hit a live-migration failure caused by this problem. It seems that this problem makes nova-compute unstable, and this bug causes various other problems besides the failure of resize-confirm. We are hoping that the issue will be fixed as soon as possible for RHOSP 13, which is a long-term-support version that our customers are planning to use. Would that be possible?

The details are below. The following problems occurred after restarting nova-compute prior to executing resize-confirm.

- live-migration fails in _post_live_migration() due to ComputeHostNotFound.

2018-09-19 19:38:36.032 1 ERROR ... Traceback (most recent call last):
2018-09-19 19:38:36.032 1 ERROR ...   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7407, in _live_migration
2018-09-19 19:38:36.032 1 ERROR ...     finish_event, disk_paths)
2018-09-19 19:38:36.032 1 ERROR ...   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7271, in _live_migration_monitor
2018-09-19 19:38:36.032 1 ERROR ...     migrate_data)
2018-09-19 19:38:36.032 1 ERROR ...   File "/usr/lib/python2.7/site-packages/nova/exception_wrapper.py", line 76, in wrapped
2018-09-19 19:38:36.032 1 ERROR ...     function_name, call_dict, binary)
2018-09-19 19:38:36.032 1 ERROR ...   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-09-19 19:38:36.032 1 ERROR ...     self.force_reraise()
2018-09-19 19:38:36.032 1 ERROR ...   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-09-19 19:38:36.032 1 ERROR ...     six.reraise(self.type_, self.value, self.tb)
2018-09-19 19:38:36.032 1 ERROR ...   File "/usr/lib/python2.7/site-packages/nova/exception_wrapper.py", line 67, in wrapped
2018-09-19 19:38:36.032 1 ERROR ...     return f(self, context, *args, **kw)
2018-09-19 19:38:36.032 1 ERROR ...   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 214, in decorated_function
2018-09-19 19:38:36.032 1 ERROR ...     kwargs['instance'], e, sys.exc_info())
2018-09-19 19:38:36.032 1 ERROR ...   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-09-19 19:38:36.032 1 ERROR ...     self.force_reraise()
2018-09-19 19:38:36.032 1 ERROR ...   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-09-19 19:38:36.032 1 ERROR ...     six.reraise(self.type_, self.value, self.tb)
2018-09-19 19:38:36.032 1 ERROR ...   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 202, in decorated_function
2018-09-19 19:38:36.032 1 ERROR ...     return function(self, context, *args, **kwargs)
2018-09-19 19:38:36.032 1 ERROR ...   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6341, in _post_live_migration
2018-09-19 19:38:36.032 1 ERROR ...     source_node)
2018-09-19 19:38:36.032 1 ERROR ...   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 3790, in _delete_allocation_after_move
2018-09-19 19:38:36.032 1 ERROR ...     cn_uuid = rt.get_node_uuid(nodename)
2018-09-19 19:38:36.032 1 ERROR ...   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 155, in get_node_uuid
2018-09-19 19:38:36.032 1 ERROR ...     raise exception.ComputeHostNotFound(host=nodename)
2018-09-19 19:38:36.032 1 ERROR ... ComputeHostNotFound: Compute host wi0hvrtg03.localdomain could not be found.

- instance_claim(), which is always called when booting a new instance, works incorrectly.

In a normal case, instance_claim() checks the resources of the host:

==== nova-compute.log =====
2018-09-20 08:38:12.107 6810 DEBUG ... Lock "compute_resources" acquired by "nova.compute.resource_tracker.instance_claim" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:273
2018-09-20 08:38:12.109 6810 DEBUG ... Memory overhead for 512 MB instance; 0 MB instance_claim /usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py:202
2018-09-20 08:38:12.109 6810 DEBUG ... Disk overhead for 1 GB instance; 0 GB instance_claim /usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py:205
2018-09-20 08:38:12.109 6810 DEBUG ... CPU overhead for 1 vCPUs instance; 0 vCPU(s) instance_claim /usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py:208
...
2018-09-20 08:38:12.388 6810 DEBUG ... Lock "compute_resources" released by "nova.compute.resource_tracker.instance_claim" :: held 0.282s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:285
===========================

But after restarting nova-compute prior to executing resize-confirm, instance_claim() returns immediately without checking resources:

==== nova-compute.log =====
2018-09-20 18:38:24.785 11333 DEBUG ... Lock "compute_resources" acquired by "nova.compute.resource_tracker.instance_claim" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:273
2018-09-20 18:38:24.864 11333 DEBUG ... Lock "compute_resources" released by "nova.compute.resource_tracker.instance_claim" :: held 0.079s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:285
===========================

At the very least, we confirmed that the CPU pinning feature (hw:cpu_policy='dedicated') stops working correctly in that situation.

update_available_resource() becomes completely unable to work after restarting nova-compute prior to executing resize-confirm. The DiskNotFound exception occurs before update_available_resource() calls _init_compute_node(), so nova-compute can never initialize itself. Nevertheless, nova-compute keeps reporting its status as 'up' to the controller, so users cannot notice that a problem is occurring on nova-compute.
nova-compute should call _init_compute_node() even when DiskNotFound occurs, or at least nova-compute should refrain from reporting its status while _init_compute_node() hasn't been called yet.

FYI, I wrote the steps to reproduce the live-migration failure below.

1. Boot a new instance on hostA.
 # image=<image uuid>
 # net=<net uuid>
 # host=<hostA>
 # nova boot --poll --image $image --flavor 1 --nic net-id=$net --availability-zone nova:$host testvm

2. Migrate the instance to another host.
 # nova migrate testvm

3. Restart nova-compute on the source host before executing resize-confirm.
 # docker restart nova_compute

4. Boot another instance on hostA.
 # nova boot --poll --image $image --flavor 1 --nic net-id=$net --availability-zone nova:$host testvm2

5. Live-migrate the instance.
 # nova live-migration testvm2
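The second suggested remedy can be sketched as a toy model. This is assumed behavior, not Nova source (the `FakeComputeManager` class and its methods are hypothetical); it only shows the idea of withholding the 'up' report until node initialization has succeeded:

```python
class FakeComputeManager:
    """Toy model of withholding the 'up' status report (hypothetical)."""

    def __init__(self):
        self._node_initialized = False

    def init_compute_node(self):
        # In Nova this is where the ComputeNode record would be
        # created; here we only flip a flag on success.
        self._node_initialized = True

    def service_status(self):
        # Report 'up' only once initialization has succeeded, so a
        # compute stuck before _init_compute_node() is visible to
        # operators instead of silently claiming to be healthy.
        return "up" if self._node_initialized else "initializing"


mgr = FakeComputeManager()
before = mgr.service_status()   # "initializing" while stuck
mgr.init_compute_node()
after = mgr.service_status()    # "up" only after the node record exists
```

With this kind of gate, the stuck state described above would at least be observable from the controller instead of masquerading as a healthy service.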
Validation steps:

1. Create 2 instances, A and B, on compute-0
2. Cold migrate instance A to compute-1, but do not confirm it
3. Restart nova-compute on compute-0

Validation 1: Live migrate instance B to compute-1
Validation 2: Confirm the cold migration of instance A
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:1670