Bug 1669225

Summary:	[BACKPORT Request] Nova returns a traceback when it's unable to detach a volume still in use
Product:	Red Hat OpenStack	Reporter:	David Vallee Delisle <dvd>
Component:	openstack-nova	Assignee:	Matthew Booth <mbooth>
Status:	CLOSED ERRATA	QA Contact:	OSP DFG:Compute <osp-dfg-compute>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	10.0 (Newton)	CC:	cmuresan, dasmith, dcadzow, eglynn, gkadam, jhakimra, jhardee, kchamart, lyarwood, mbooth, mburns, mwitt, ramishra, sbaker, sbauza, sgordon, shardy, vromanso
Target Milestone:	async	Keywords:	Patch, Triaged, ZStream
Target Release:	10.0 (Newton)
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	openstack-nova-14.1.0-54.el7ost	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:	1551733	Environment:
Last Closed:	2019-09-03 16:36:34 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1551733
Bug Blocks:	1557938

Description David Vallee Delisle 2019-01-24 16:16:41 UTC

+++ This bug was initially created as a clone of Bug #1551733 +++

Description
===========
If libvirt is unable to detach a volume because it's still in-use by the guest (either mounted and/or file opened), nova returns a traceback.

Steps to reproduce
==================

* Create an instance with volume attached using heat
* Make sure there's activity on the volume
* Delete stack

Expected result
===============
We would expect nova to not return a traceback but a clean log about its incapacity to detach volume. If possible, that would be great if that exception was raised back to either cinder or heat.

Actual result
=============
```
21495 2018-02-14 20:31:09.735 1 ERROR oslo.service.loopingcall [-] Dynamic interval looping call 'oslo_service.loopingcall._func' failed: DeviceDetachFailed: Device detach failed for vdf: Unable to detach from guest transient domain.
21496 2018-02-14 20:31:09.735 1 ERROR oslo.service.loopingcall Traceback (most recent call last):
21497 2018-02-14 20:31:09.735 1 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/oslo_service/loopingcall.py", line 137, in _run_loop
21498 2018-02-14 20:31:09.735 1 ERROR oslo.service.loopingcall result = func(*self.args, **self.kw)
21499 2018-02-14 20:31:09.735 1 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/oslo_service/loopingcall.py", line 415, in _func
21500 2018-02-14 20:31:09.735 1 ERROR oslo.service.loopingcall return self._sleep_time
21501 2018-02-14 20:31:09.735 1 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
21502 2018-02-14 20:31:09.735 1 ERROR oslo.service.loopingcall self.force_reraise()
21503 2018-02-14 20:31:09.735 1 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
21504 2018-02-14 20:31:09.735 1 ERROR oslo.service.loopingcall six.reraise(self.type_, self.value, self.tb)
21505 2018-02-14 20:31:09.735 1 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/oslo_service/loopingcall.py", line 394, in _func
21506 2018-02-14 20:31:09.735 1 ERROR oslo.service.loopingcall result = f(*args, **kwargs)
21507 2018-02-14 20:31:09.735 1 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 462, in _do_wait_and_retry_detach
21508 2018-02-14 20:31:09.735 1 ERROR oslo.service.loopingcall device=alternative_device_name, reason=reason)
21509 2018-02-14 20:31:09.735 1 ERROR oslo.service.loopingcall DeviceDetachFailed: Device detach failed for vdf: Unable to detach from guest transient domain.
```

Environment
===========
* Red Hat Openstack 10
```
libvirt-3.2.0-14.el7_4.3.x86_64                             Wed Nov  8 17:13:02 2017
libvirt-client-3.2.0-14.el7_4.3.x86_64                      Wed Nov  8 17:04:54 2017
libvirt-daemon-3.2.0-14.el7_4.3.x86_64                      Wed Nov  8 17:04:50 2017
libvirt-daemon-config-network-3.2.0-14.el7_4.3.x86_64       Wed Nov  8 17:04:55 2017
libvirt-daemon-config-nwfilter-3.2.0-14.el7_4.3.x86_64      Wed Nov  8 17:04:54 2017
libvirt-daemon-driver-interface-3.2.0-14.el7_4.3.x86_64     Wed Nov  8 17:04:54 2017
libvirt-daemon-driver-lxc-3.2.0-14.el7_4.3.x86_64           Wed Nov  8 17:04:55 2017
libvirt-daemon-driver-network-3.2.0-14.el7_4.3.x86_64       Wed Nov  8 17:04:51 2017
libvirt-daemon-driver-nodedev-3.2.0-14.el7_4.3.x86_64       Wed Nov  8 17:04:53 2017
libvirt-daemon-driver-nwfilter-3.2.0-14.el7_4.3.x86_64      Wed Nov  8 17:04:51 2017
libvirt-daemon-driver-qemu-3.2.0-14.el7_4.3.x86_64          Wed Nov  8 17:04:52 2017
libvirt-daemon-driver-secret-3.2.0-14.el7_4.3.x86_64        Wed Nov  8 17:04:53 2017
libvirt-daemon-driver-storage-3.2.0-14.el7_4.3.x86_64       Wed Nov  8 17:05:01 2017
libvirt-daemon-driver-storage-core-3.2.0-14.el7_4.3.x86_64  Wed Nov  8 17:04:51 2017
libvirt-daemon-driver-storage-disk-3.2.0-14.el7_4.3.x86_64  Wed Nov  8 17:05:01 2017
libvirt-daemon-driver-storage-gluster-3.2.0-14.el7_4.3.x86_64 Wed Nov  8 17:05:00 2017
libvirt-daemon-driver-storage-iscsi-3.2.0-14.el7_4.3.x86_64 Wed Nov  8 17:04:58 2017
libvirt-daemon-driver-storage-logical-3.2.0-14.el7_4.3.x86_64 Wed Nov  8 17:05:00 2017
libvirt-daemon-driver-storage-mpath-3.2.0-14.el7_4.3.x86_64 Wed Nov  8 17:04:59 2017
libvirt-daemon-driver-storage-rbd-3.2.0-14.el7_4.3.x86_64   Wed Nov  8 17:04:59 2017
libvirt-daemon-driver-storage-scsi-3.2.0-14.el7_4.3.x86_64  Wed Nov  8 17:05:00 2017
libvirt-daemon-kvm-3.2.0-14.el7_4.3.x86_64                  Wed Nov  8 17:10:02 2017
libvirt-libs-3.2.0-14.el7_4.3.x86_64                        Wed Nov  8 17:04:50 2017
libvirt-python-3.2.0-3.el7_4.1.x86_64                       Wed Nov  8 17:04:52 2017
openstack-nova-api-14.0.8-5.el7ost.noarch                   Wed Nov  8 17:12:52 2017
openstack-nova-cert-14.0.8-5.el7ost.noarch                  Wed Nov  8 17:12:53 2017
openstack-nova-common-14.0.8-5.el7ost.noarch                Wed Nov  8 17:09:57 2017
openstack-nova-compute-14.0.8-5.el7ost.noarch               Wed Nov  8 17:10:15 2017
openstack-nova-conductor-14.0.8-5.el7ost.noarch             Wed Nov  8 17:12:51 2017
openstack-nova-console-14.0.8-5.el7ost.noarch               Wed Nov  8 17:12:52 2017
openstack-nova-migration-14.0.8-5.el7ost.noarch             Wed Nov  8 17:12:45 2017
openstack-nova-novncproxy-14.0.8-5.el7ost.noarch            Wed Nov  8 17:12:52 2017
openstack-nova-scheduler-14.0.8-5.el7ost.noarch             Wed Nov  8 17:12:52 2017
puppet-nova-9.6.0-2.el7ost.noarch                           Wed Nov  8 17:19:32 2017
python-nova-14.0.8-5.el7ost.noarch                          Wed Nov  8 17:09:56 2017
python-novaclient-6.0.2-1.el7ost.noarch                     Wed Nov  8 16:52:12 2017
qemu-guest-agent-2.8.0-2.el7.x86_64                         Wed Nov  8 16:17:13 2017
qemu-img-rhev-2.9.0-10.el7.x86_64                           Wed Nov  8 17:03:28 2017
qemu-kvm-common-rhev-2.9.0-10.el7.x86_64                    Wed Nov  8 17:04:46 2017
qemu-kvm-rhev-2.9.0-10.el7.x86_64                           Wed Nov  8 17:10:02 2017

```

Comment 6 Matthew Booth 2019-06-14 16:14:54 UTC

There's nothing here specific to heat, so I'm going to simplify the reproducer steps to Nova:

* Create an instance with a volume attached
* Log in to the instance and mount the volume
* Detach the volume

Confirm:
* Nova logs 'Guest refused to detach volume'
* Volume returns to in-use and is no longer detaching.

Comment 7 Matthew Booth 2019-06-14 17:16:04 UTC

Initial testing note: Cirros doesn't appear to prevent removal of the volume even when it's mounted with a process actively writing to an open file descriptor. Can't immediately reproduce.

Comment 8 Matthew Booth 2019-06-17 11:10:52 UTC

I have retried this with a CentOS 7 image, and I am still unable to make the qemu refuse to detach the disk. This is what I tried:

[stack@undercloud-0 ~]$ openstack server create --image CentOS-7-x86_64-GenericCloud.qcow2c --flavor m1.me
dium --nic net-id=private --key-name defaultkey testinstance1
+--------------------------------------+-----------------------------------------------------------------+
| Field                                | Value                                                           |
+--------------------------------------+-----------------------------------------------------------------+
| OS-DCF:diskConfig                    | MANUAL                                                          |
| OS-EXT-AZ:availability_zone          |                                                                 |
| OS-EXT-SRV-ATTR:host                 | None                                                            |
| OS-EXT-SRV-ATTR:hypervisor_hostname  | None                                                            |
| OS-EXT-SRV-ATTR:instance_name        |                                                                 |
| OS-EXT-STS:power_state               | NOSTATE                                                         |
| OS-EXT-STS:task_state                | scheduling                                                      |
| OS-EXT-STS:vm_state                  | building                                                        |
| OS-SRV-USG:launched_at               | None                                                            |
| OS-SRV-USG:terminated_at             | None                                                            |
| accessIPv4                           |                                                                 |
| accessIPv6                           |                                                                 |
| addresses                            |                                                                 |
| adminPass                            | NwC35m3Cy5vj                                                    |
| config_drive                         |                                                                 |
| created                              | 2019-06-17T10:32:32Z                                            |
| flavor                               | m1.medium (1582e22e-a904-4b18-aa0c-0d1b094d3224)                |
| hostId                               |                                                                 |
| id                                   | 8b3c4755-d652-4cad-be83-6d0d064cec7e                            |
| image                                | CentOS-7-x86_64-GenericCloud.qcow2c (fdda862a-957a-             |
|                                      | 4b23-8fb1-fbd98b192913)                                         |
| key_name                             | defaultkey                                                      |
| name                                 | testinstance1                                                   |
| os-extended-volumes:volumes_attached | []                                                              |
| progress                             | 0                                                               |
| project_id                           | 9f8895b7cb374eae8531bd7f04276447                                |
| properties                           |                                                                 |
| security_groups                      | [{u'name': u'default'}]                                         |
| status                               | BUILD                                                           |
| updated                              | 2019-06-17T10:32:33Z                                            |
| user_id                              | 837fb6d1122e410fb9c93a21002207c9                                |
+--------------------------------------+-----------------------------------------------------------------+
[stack@undercloud-0 ~]$ openstack server add floating ip 8b3c4755-d652-4cad-be83-6d0d064cec7e 10.0.0.215
[stack@undercloud-0 ~]$ openstack volume create --size 1 testvol1
+---------------------+--------------------------------------+
| Field               | Value                                |
+---------------------+--------------------------------------+
| attachments         | []                                   |
| availability_zone   | nova                                 |
| bootable            | false                                |
| consistencygroup_id | None                                 |
| created_at          | 2019-06-17T10:38:48.180819           |
| description         | None                                 |
| encrypted           | False                                |
| id                  | 08284d9a-cf7a-4af7-8da4-86eb511d2464 |
| migration_status    | None                                 |
| multiattach         | False                                |
| name                | testvol1                             |
| properties          |                                      |
| replication_status  | disabled                             |
| size                | 1                                    |
| snapshot_id         | None                                 |
| source_volid        | None                                 |
| status              | creating                             |
| type                | None                                 |
| updated_at          | None                                 |
| user_id             | 837fb6d1122e410fb9c93a21002207c9     |
+---------------------+--------------------------------------+


[stack@undercloud-0 ~]$ ssh centos.0.215
Warning: Permanently added '10.0.0.215' (ECDSA) to the list of known hosts.
[centos@testinstance1 ~]$ sudo su -
[root@testinstance1 ~]# mke2fs /dev/vdb
mke2fs 1.42.9 (28-Dec-2013)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
65536 inodes, 262144 blocks
13107 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=268435456
8 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376

Allocating group tables: done
Writing inode tables: done
Writing superblocks and filesystem accounting information: done
[root@testinstance1 ~]# mount /dev/vdb /mnt
[root@testinstance1 ~]# cd /mnt
[root@testinstance1 mnt]# touch foo


[stack@undercloud-0 ~]$ openstack server remove volume 8b3c4755-d652-4cad-be83-6d0d064cec7e 08284d9a-cf7a-4af7-8da4-86eb511d2464
[stack@undercloud-0 ~]$ openstack volume list
+--------------------------------------+--------------+-----------+------+-------------+
| ID                                   | Display Name | Status    | Size | Attached to |
+--------------------------------------+--------------+-----------+------+-------------+
| 08284d9a-cf7a-4af7-8da4-86eb511d2464 | testvol1     | available |    1 |             |
+--------------------------------------+--------------+-----------+------+-------------+


[root@testinstance1 mnt]# ls
ls: reading directory .: Input/output error


I also tried again with a process writing data to the volume:

# while true; do date; sleep 1; done | tee foo

The result was the same: the volume was removed. dmesg in the guest does not note any activity prior to the removal of the device.


[  418.589239] pci 0000:00:06.0: [1af4:1001] type 00 class 0x010000
[  418.590872] pci 0000:00:06.0: reg 0x10: [io  0x0000-0x003f]
[  418.591447] pci 0000:00:06.0: reg 0x14: [mem 0x00000000-0x00000fff]
[  418.593578] pci 0000:00:06.0: reg 0x20: [mem 0x00000000-0x00003fff 64bit pref]
[  418.603881] pci 0000:00:06.0: BAR 4: assigned [mem 0x100000000-0x100003fff 64bit pref]
[  418.616969] pci 0000:00:06.0: BAR 1: assigned [mem 0x80000000-0x80000fff]
[  418.628270] pci 0000:00:06.0: BAR 0: assigned [io  0x1000-0x103f]
[  418.653220] virtio-pci 0000:00:06.0: enabling device (0000 -> 0003)
[  418.701889] ACPI: PCI Interrupt Link [LNKB] enabled at IRQ 11
[  418.724575] virtio-pci 0000:00:06.0: irq 29 for MSI/MSI-X
[  418.724793] virtio-pci 0000:00:06.0: irq 30 for MSI/MSI-X
[  418.729286] virtio_blk virtio3: [vdb] 2097152 512-byte logical blocks (1.07 GB/1.00 GiB)
[  522.679222] EXT4-fs (vdb): mounting ext2 file system using the ext4 subsystem
[  522.712497] EXT4-fs (vdb): mounted filesystem without journal. Opts: (null)
[  732.365265] EXT4-fs warning (device vdb): __ext4_read_dirblock:902: error reading directory block (ino
2, block 0)
[  732.378969] EXT4-fs error (device vdb): __ext4_get_inode_loc:4239: inode #2: block 67: comm ls: unable
to read itable block
[  732.393652] EXT4-fs error (device vdb) in ext4_reserve_inode_write:5246: IO failure
[  762.448359] EXT4-fs error (device vdb): __ext4_get_inode_loc:4239: inode #2: block 67: comm kworker/u2:
2: unable to read itable block

Comment 9 Matthew Booth 2019-06-17 11:49:10 UTC

Note that this backport requires an additional backport of https://review.opendev.org/#/c/590439/ or the volume will not return to the in-use state even if we do work out how to reproduce this.

Comment 27 Matthew Booth 2019-07-11 12:33:27 UTC

Here's a local demonstration of the issue:

=== Define a local disk
# cat disk.xml
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/test2.qcow2'/>
      <backingStore/>
      <target dev='vdb' bus='virtio'/>
    </disk>

=== Attach it to a test domain
# virsh attach-device test --live --config disk.xml
Device attached successfully

=== Detach the config only -> SUCCESS
# virsh detach-device test --config disk.xml
Device detached successfully

=== Attempt to detach config again -> no target device
# virsh detach-device test --config disk.xml
error: Failed to detach device from disk.xml
error: device not found: no target device vdb

=== Attempt to detach both live and config -> no target device
# virsh detach-device test --live --config disk.xml
error: Failed to detach device from disk.xml
error: device not found: no target device vdb

=== Attempt to detach live only -> SUCCESS
# virsh detach-device test --live disk.xml
Device detached successfully

=== Attempt to detach live again -> disk not found
# virsh detach-device test --live disk.xml
error: Failed to detach device from disk.xml
error: operation failed: disk vdb not found

Comment 28 Matthew Booth 2019-07-11 13:01:20 UTC

https://review.opendev.org/#/c/584433/ removed the reference to the transient domain upstream.

Comment 29 Matthew Booth 2019-07-11 13:33:42 UTC

<danpb> mdbooth: it will only save the changes to the inactive config, if the live config was successfully updated
<mdbooth> danpb: And it's synchronous, right? i.e. if it returns without an error that means it has succeeded?
<mdbooth> Returns 0 in case of success, -1 in case of failure.
<mdbooth> success == device *has been removed*, not *has been scheduled for removal, which may fail later*
<danpb> mdbooth: no, if it returns success there is still a possibility that the device stll exists in the guest
<danpb> mdbooth: because the actual hotunplug is async
<mdbooth> danpb: That explains what the nova libvirt driver is doing, then
<danpb> i guess I should have said 
<danpb> it will only save the changes to the inactive config, if the live config was successfully /requested/ to unplug the device

Comment 39 Matthew Booth 2019-07-22 10:02:07 UTC

QA: PLEASE IGNORE COMMENTS 12 TO 38 ENTIRELY! They are unrelated to this bug.

Comment 47 errata-xmlrpc 2019-09-03 16:36:34 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:2631