+++ This bug was initially created as a clone of Bug #1551733 +++

Description
===========
If libvirt is unable to detach a volume because it is still in use by the guest (mounted and/or with open files), nova logs a traceback.

Steps to reproduce
==================
* Create an instance with a volume attached, using heat
* Make sure there is activity on the volume
* Delete the stack

Expected result
===============
We would expect nova to log a clean message about its inability to detach the volume rather than a traceback. Ideally, that exception would also be raised back to either cinder or heat.

Actual result
=============
```
2018-02-14 20:31:09.735 1 ERROR oslo.service.loopingcall [-] Dynamic interval looping call 'oslo_service.loopingcall._func' failed: DeviceDetachFailed: Device detach failed for vdf: Unable to detach from guest transient domain.
2018-02-14 20:31:09.735 1 ERROR oslo.service.loopingcall Traceback (most recent call last):
2018-02-14 20:31:09.735 1 ERROR oslo.service.loopingcall   File "/usr/lib/python2.7/site-packages/oslo_service/loopingcall.py", line 137, in _run_loop
2018-02-14 20:31:09.735 1 ERROR oslo.service.loopingcall     result = func(*self.args, **self.kw)
2018-02-14 20:31:09.735 1 ERROR oslo.service.loopingcall   File "/usr/lib/python2.7/site-packages/oslo_service/loopingcall.py", line 415, in _func
2018-02-14 20:31:09.735 1 ERROR oslo.service.loopingcall     return self._sleep_time
2018-02-14 20:31:09.735 1 ERROR oslo.service.loopingcall   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-02-14 20:31:09.735 1 ERROR oslo.service.loopingcall     self.force_reraise()
2018-02-14 20:31:09.735 1 ERROR oslo.service.loopingcall   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-02-14 20:31:09.735 1 ERROR oslo.service.loopingcall     six.reraise(self.type_, self.value, self.tb)
2018-02-14 20:31:09.735 1 ERROR oslo.service.loopingcall   File "/usr/lib/python2.7/site-packages/oslo_service/loopingcall.py", line 394, in _func
2018-02-14 20:31:09.735 1 ERROR oslo.service.loopingcall     result = f(*args, **kwargs)
2018-02-14 20:31:09.735 1 ERROR oslo.service.loopingcall   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 462, in _do_wait_and_retry_detach
2018-02-14 20:31:09.735 1 ERROR oslo.service.loopingcall     device=alternative_device_name, reason=reason)
2018-02-14 20:31:09.735 1 ERROR oslo.service.loopingcall DeviceDetachFailed: Device detach failed for vdf: Unable to detach from guest transient domain.
```
Environment
===========
* Red Hat OpenStack 10

```
libvirt-3.2.0-14.el7_4.3.x86_64                                Wed Nov 8 17:13:02 2017
libvirt-client-3.2.0-14.el7_4.3.x86_64                         Wed Nov 8 17:04:54 2017
libvirt-daemon-3.2.0-14.el7_4.3.x86_64                         Wed Nov 8 17:04:50 2017
libvirt-daemon-config-network-3.2.0-14.el7_4.3.x86_64          Wed Nov 8 17:04:55 2017
libvirt-daemon-config-nwfilter-3.2.0-14.el7_4.3.x86_64         Wed Nov 8 17:04:54 2017
libvirt-daemon-driver-interface-3.2.0-14.el7_4.3.x86_64        Wed Nov 8 17:04:54 2017
libvirt-daemon-driver-lxc-3.2.0-14.el7_4.3.x86_64              Wed Nov 8 17:04:55 2017
libvirt-daemon-driver-network-3.2.0-14.el7_4.3.x86_64          Wed Nov 8 17:04:51 2017
libvirt-daemon-driver-nodedev-3.2.0-14.el7_4.3.x86_64          Wed Nov 8 17:04:53 2017
libvirt-daemon-driver-nwfilter-3.2.0-14.el7_4.3.x86_64         Wed Nov 8 17:04:51 2017
libvirt-daemon-driver-qemu-3.2.0-14.el7_4.3.x86_64             Wed Nov 8 17:04:52 2017
libvirt-daemon-driver-secret-3.2.0-14.el7_4.3.x86_64           Wed Nov 8 17:04:53 2017
libvirt-daemon-driver-storage-3.2.0-14.el7_4.3.x86_64          Wed Nov 8 17:05:01 2017
libvirt-daemon-driver-storage-core-3.2.0-14.el7_4.3.x86_64     Wed Nov 8 17:04:51 2017
libvirt-daemon-driver-storage-disk-3.2.0-14.el7_4.3.x86_64     Wed Nov 8 17:05:01 2017
libvirt-daemon-driver-storage-gluster-3.2.0-14.el7_4.3.x86_64  Wed Nov 8 17:05:00 2017
libvirt-daemon-driver-storage-iscsi-3.2.0-14.el7_4.3.x86_64    Wed Nov 8 17:04:58 2017
libvirt-daemon-driver-storage-logical-3.2.0-14.el7_4.3.x86_64  Wed Nov 8 17:05:00 2017
libvirt-daemon-driver-storage-mpath-3.2.0-14.el7_4.3.x86_64    Wed Nov 8 17:04:59 2017
libvirt-daemon-driver-storage-rbd-3.2.0-14.el7_4.3.x86_64      Wed Nov 8 17:04:59 2017
libvirt-daemon-driver-storage-scsi-3.2.0-14.el7_4.3.x86_64     Wed Nov 8 17:05:00 2017
libvirt-daemon-kvm-3.2.0-14.el7_4.3.x86_64                     Wed Nov 8 17:10:02 2017
libvirt-libs-3.2.0-14.el7_4.3.x86_64                           Wed Nov 8 17:04:50 2017
libvirt-python-3.2.0-3.el7_4.1.x86_64                          Wed Nov 8 17:04:52 2017
openstack-nova-api-14.0.8-5.el7ost.noarch                      Wed Nov 8 17:12:52 2017
openstack-nova-cert-14.0.8-5.el7ost.noarch                     Wed Nov 8 17:12:53 2017
openstack-nova-common-14.0.8-5.el7ost.noarch                   Wed Nov 8 17:09:57 2017
openstack-nova-compute-14.0.8-5.el7ost.noarch                  Wed Nov 8 17:10:15 2017
openstack-nova-conductor-14.0.8-5.el7ost.noarch                Wed Nov 8 17:12:51 2017
openstack-nova-console-14.0.8-5.el7ost.noarch                  Wed Nov 8 17:12:52 2017
openstack-nova-migration-14.0.8-5.el7ost.noarch                Wed Nov 8 17:12:45 2017
openstack-nova-novncproxy-14.0.8-5.el7ost.noarch               Wed Nov 8 17:12:52 2017
openstack-nova-scheduler-14.0.8-5.el7ost.noarch                Wed Nov 8 17:12:52 2017
puppet-nova-9.6.0-2.el7ost.noarch                              Wed Nov 8 17:19:32 2017
python-nova-14.0.8-5.el7ost.noarch                             Wed Nov 8 17:09:56 2017
python-novaclient-6.0.2-1.el7ost.noarch                        Wed Nov 8 16:52:12 2017
qemu-guest-agent-2.8.0-2.el7.x86_64                            Wed Nov 8 16:17:13 2017
qemu-img-rhev-2.9.0-10.el7.x86_64                              Wed Nov 8 17:03:28 2017
qemu-kvm-common-rhev-2.9.0-10.el7.x86_64                       Wed Nov 8 17:04:46 2017
qemu-kvm-rhev-2.9.0-10.el7.x86_64                              Wed Nov 8 17:10:02 2017
```
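For illustration, a minimal sketch of what the expected handling could look like. Only the `DeviceDetachFailed` exception is taken from the traceback above; `detach_volume_cleanly()` and `do_detach` are hypothetical placeholders, not nova's actual code path:

```
# Sketch only: DeviceDetachFailed comes from the traceback above;
# detach_volume_cleanly() and do_detach are hypothetical placeholders.
import logging

from nova import exception

LOG = logging.getLogger(__name__)


def detach_volume_cleanly(do_detach, device_name):
    """do_detach stands in for nova's real detach call
    (_do_wait_and_retry_detach in the traceback above)."""
    try:
        do_detach()
    except exception.DeviceDetachFailed:
        # Log a clean, single-line message instead of letting the
        # looping call dump a traceback ...
        LOG.warning('Unable to detach %s from guest: device is still '
                    'in use; volume remains attached.', device_name)
        # ... and re-raise so the failure can surface to cinder/heat.
        raise
```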
There's nothing here specific to heat, so I'm going to simplify the reproducer steps to Nova (a rough python-novaclient sketch of the same steps follows below):

* Create an instance with a volume attached
* Log in to the instance and mount the volume
* Detach the volume

Confirm:

* Nova logs 'Guest refused to detach volume'
* The volume returns to in-use and is no longer detaching
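The sketch mentioned above: the same steps driven from Python with python-novaclient (the package listed in the environment). The `create_server_volume`/`delete_server_volume` calls are novaclient's documented volume-attachment helpers; credentials, endpoint, and IDs are placeholders:

```
from keystoneauth1 import loading, session
from novaclient import client

# Placeholder credentials/endpoint; adjust to the deployment under test.
loader = loading.get_plugin_loader('password')
auth = loader.load_from_options(
    auth_url='http://controller:5000/v3', username='admin',
    password='secret', project_name='admin',
    user_domain_name='Default', project_domain_name='Default')
nova = client.Client('2', session=session.Session(auth=auth))

SERVER_ID = '<instance uuid>'
VOLUME_ID = '<volume uuid>'

# Attach the volume to the instance.
nova.volumes.create_server_volume(SERVER_ID, VOLUME_ID)

# ... inside the guest: mkfs + mount the new disk, keep a file open ...

# Detach while the guest still has the filesystem mounted.
nova.volumes.delete_server_volume(SERVER_ID, VOLUME_ID)
```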
Initial testing note: Cirros doesn't appear to prevent removal of the volume even when it's mounted with a process actively writing to an open file descriptor. Can't immediately reproduce.
I have retried this with a CentOS 7 image, and I am still unable to make qemu refuse to detach the disk. This is what I tried:

```
[stack@undercloud-0 ~]$ openstack server create --image CentOS-7-x86_64-GenericCloud.qcow2c --flavor m1.medium --nic net-id=private --key-name defaultkey testinstance1
+--------------------------------------+-----------------------------------------------------------------------------+
| Field                                | Value                                                                       |
+--------------------------------------+-----------------------------------------------------------------------------+
| OS-DCF:diskConfig                    | MANUAL                                                                      |
| OS-EXT-AZ:availability_zone          |                                                                             |
| OS-EXT-SRV-ATTR:host                 | None                                                                        |
| OS-EXT-SRV-ATTR:hypervisor_hostname  | None                                                                        |
| OS-EXT-SRV-ATTR:instance_name        |                                                                             |
| OS-EXT-STS:power_state               | NOSTATE                                                                     |
| OS-EXT-STS:task_state                | scheduling                                                                  |
| OS-EXT-STS:vm_state                  | building                                                                    |
| OS-SRV-USG:launched_at               | None                                                                        |
| OS-SRV-USG:terminated_at             | None                                                                        |
| accessIPv4                           |                                                                             |
| accessIPv6                           |                                                                             |
| addresses                            |                                                                             |
| adminPass                            | NwC35m3Cy5vj                                                                |
| config_drive                         |                                                                             |
| created                              | 2019-06-17T10:32:32Z                                                        |
| flavor                               | m1.medium (1582e22e-a904-4b18-aa0c-0d1b094d3224)                            |
| hostId                               |                                                                             |
| id                                   | 8b3c4755-d652-4cad-be83-6d0d064cec7e                                        |
| image                                | CentOS-7-x86_64-GenericCloud.qcow2c (fdda862a-957a-4b23-8fb1-fbd98b192913)  |
| key_name                             | defaultkey                                                                  |
| name                                 | testinstance1                                                               |
| os-extended-volumes:volumes_attached | []                                                                          |
| progress                             | 0                                                                           |
| project_id                           | 9f8895b7cb374eae8531bd7f04276447                                            |
| properties                           |                                                                             |
| security_groups                      | [{u'name': u'default'}]                                                     |
| status                               | BUILD                                                                       |
| updated                              | 2019-06-17T10:32:33Z                                                        |
| user_id                              | 837fb6d1122e410fb9c93a21002207c9                                            |
+--------------------------------------+-----------------------------------------------------------------------------+
[stack@undercloud-0 ~]$ openstack server add floating ip 8b3c4755-d652-4cad-be83-6d0d064cec7e 10.0.0.215
[stack@undercloud-0 ~]$ openstack volume create --size 1 testvol1
+---------------------+--------------------------------------+
| Field               | Value                                |
+---------------------+--------------------------------------+
| attachments         | []                                   |
| availability_zone   | nova                                 |
| bootable            | false                                |
| consistencygroup_id | None                                 |
| created_at          | 2019-06-17T10:38:48.180819           |
| description         | None                                 |
| encrypted           | False                                |
| id                  | 08284d9a-cf7a-4af7-8da4-86eb511d2464 |
| migration_status    | None                                 |
| multiattach         | False                                |
| name                | testvol1                             |
| properties          |                                      |
| replication_status  | disabled                             |
| size                | 1                                    |
| snapshot_id         | None                                 |
| source_volid        | None                                 |
| status              | creating                             |
| type                | None                                 |
| updated_at          | None                                 |
| user_id             | 837fb6d1122e410fb9c93a21002207c9     |
+---------------------+--------------------------------------+
[stack@undercloud-0 ~]$ ssh centos@10.0.0.215
Warning: Permanently added '10.0.0.215' (ECDSA) to the list of known hosts.
```
```
[centos@testinstance1 ~]$ sudo su -
[root@testinstance1 ~]# mke2fs /dev/vdb
mke2fs 1.42.9 (28-Dec-2013)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
65536 inodes, 262144 blocks
13107 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=268435456
8 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376

Allocating group tables: done
Writing inode tables: done
Writing superblocks and filesystem accounting information: done

[root@testinstance1 ~]# mount /dev/vdb /mnt
[root@testinstance1 ~]# cd /mnt
[root@testinstance1 mnt]# touch foo

[stack@undercloud-0 ~]$ openstack server remove volume 8b3c4755-d652-4cad-be83-6d0d064cec7e 08284d9a-cf7a-4af7-8da4-86eb511d2464
[stack@undercloud-0 ~]$ openstack volume list
+--------------------------------------+--------------+-----------+------+-------------+
| ID                                   | Display Name | Status    | Size | Attached to |
+--------------------------------------+--------------+-----------+------+-------------+
| 08284d9a-cf7a-4af7-8da4-86eb511d2464 | testvol1     | available |    1 |             |
+--------------------------------------+--------------+-----------+------+-------------+

[root@testinstance1 mnt]# ls
ls: reading directory .: Input/output error
```

I also tried again with a process writing data to the volume:

```
# while true; do date; sleep 1; done | tee foo
```

The result was the same: the volume was removed. dmesg in the guest does not note any activity prior to the removal of the device:

```
[  418.589239] pci 0000:00:06.0: [1af4:1001] type 00 class 0x010000
[  418.590872] pci 0000:00:06.0: reg 0x10: [io  0x0000-0x003f]
[  418.591447] pci 0000:00:06.0: reg 0x14: [mem 0x00000000-0x00000fff]
[  418.593578] pci 0000:00:06.0: reg 0x20: [mem 0x00000000-0x00003fff 64bit pref]
[  418.603881] pci 0000:00:06.0: BAR 4: assigned [mem 0x100000000-0x100003fff 64bit pref]
[  418.616969] pci 0000:00:06.0: BAR 1: assigned [mem 0x80000000-0x80000fff]
[  418.628270] pci 0000:00:06.0: BAR 0: assigned [io  0x1000-0x103f]
[  418.653220] virtio-pci 0000:00:06.0: enabling device (0000 -> 0003)
[  418.701889] ACPI: PCI Interrupt Link [LNKB] enabled at IRQ 11
[  418.724575] virtio-pci 0000:00:06.0: irq 29 for MSI/MSI-X
[  418.724793] virtio-pci 0000:00:06.0: irq 30 for MSI/MSI-X
[  418.729286] virtio_blk virtio3: [vdb] 2097152 512-byte logical blocks (1.07 GB/1.00 GiB)
[  522.679222] EXT4-fs (vdb): mounting ext2 file system using the ext4 subsystem
[  522.712497] EXT4-fs (vdb): mounted filesystem without journal. Opts: (null)
[  732.365265] EXT4-fs warning (device vdb): __ext4_read_dirblock:902: error reading directory block (ino 2, block 0)
[  732.378969] EXT4-fs error (device vdb): __ext4_get_inode_loc:4239: inode #2: block 67: comm ls: unable to read itable block
[  732.393652] EXT4-fs error (device vdb) in ext4_reserve_inode_write:5246: IO failure
[  762.448359] EXT4-fs error (device vdb): __ext4_get_inode_loc:4239: inode #2: block 67: comm kworker/u2:2: unable to read itable block
```
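Since dmesg only shows the guest's view, it can help to confirm from the compute host whether qemu actually completed the unplug. A small sketch using the libvirt Python bindings to look for the target device in the live domain XML; the domain name below is a placeholder:

```
import xml.etree.ElementTree as ET

import libvirt


def disk_is_attached(domain_name, target_dev, uri='qemu:///system'):
    """Return True if the live domain XML still lists target_dev."""
    conn = libvirt.open(uri)
    try:
        dom = conn.lookupByName(domain_name)
        root = ET.fromstring(dom.XMLDesc(0))
        return any(target.get('dev') == target_dev
                   for target in root.findall('./devices/disk/target'))
    finally:
        conn.close()


print(disk_is_attached('instance-00000001', 'vdb'))  # placeholder domain name
```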
Note that this backport requires an additional backport of https://review.opendev.org/#/c/590439/ or the volume will not return to the in-use state even if we do work out how to reproduce this.
Here's a local demonstration of the issue:

```
=== Define a local disk

# cat disk.xml
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='/var/lib/libvirt/images/test2.qcow2'/>
  <backingStore/>
  <target dev='vdb' bus='virtio'/>
</disk>

=== Attach it to a test domain

# virsh attach-device test --live --config disk.xml
Device attached successfully

=== Detach the config only -> SUCCESS

# virsh detach-device test --config disk.xml
Device detached successfully

=== Attempt to detach config again -> no target device

# virsh detach-device test --config disk.xml
error: Failed to detach device from disk.xml
error: device not found: no target device vdb

=== Attempt to detach both live and config -> no target device

# virsh detach-device test --live --config disk.xml
error: Failed to detach device from disk.xml
error: device not found: no target device vdb

=== Attempt to detach live only -> SUCCESS

# virsh detach-device test --live disk.xml
Device detached successfully

=== Attempt to detach live again -> disk not found

# virsh detach-device test --live disk.xml
error: Failed to detach device from disk.xml
error: operation failed: disk vdb not found
```
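The virsh flag combinations above correspond to `detachDeviceFlags()` in the libvirt Python bindings, which is the call nova's libvirt driver ultimately makes. A sketch reusing the `test` domain and `disk.xml` from the transcript:

```
import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('test')
with open('disk.xml') as f:
    disk_xml = f.read()

# virsh detach-device test --live --config disk.xml is equivalent to:
dom.detachDeviceFlags(
    disk_xml,
    libvirt.VIR_DOMAIN_AFFECT_LIVE | libvirt.VIR_DOMAIN_AFFECT_CONFIG)

# --config only: pass libvirt.VIR_DOMAIN_AFFECT_CONFIG
# --live only:   pass libvirt.VIR_DOMAIN_AFFECT_LIVE
conn.close()
```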
https://review.opendev.org/#/c/584433/ removed the reference to the transient domain upstream.
```
<danpb> mdbooth: it will only save the changes to the inactive config, if the live config was successfully updated
<mdbooth> danpb: And it's synchronous, right? i.e. if it returns without an error that means it has succeeded?
<mdbooth> Returns 0 in case of success, -1 in case of failure.
<mdbooth> success == device *has been removed*, not *has been scheduled for removal, which may fail later*
<danpb> mdbooth: no, if it returns success there is still a possibility that the device still exists in the guest
<danpb> mdbooth: because the actual hotunplug is async
<mdbooth> danpb: That explains what the nova libvirt driver is doing, then
<danpb> i guess I should have said
<danpb> it will only save the changes to the inactive config, if the live config was successfully /requested/ to unplug the device
```
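This asynchronous behaviour is exactly why nova polls after requesting the unplug. A rough sketch of that wait-and-retry pattern, loosely modelled on the `_do_wait_and_retry_detach` frame in the traceback above; the retry count, sleep interval, and error handling here are illustrative, not nova's actual values:

```
import time
import xml.etree.ElementTree as ET

import libvirt


def wait_and_retry_detach(dom, disk_xml, target_dev,
                          max_attempts=7, sleep_time=2):
    for _ in range(max_attempts):
        # Success from detachDeviceFlags() only means the unplug was
        # *requested*, so check the live XML to see if it completed.
        root = ET.fromstring(dom.XMLDesc(0))
        gone = not any(t.get('dev') == target_dev
                       for t in root.findall('./devices/disk/target'))
        if gone:
            return
        try:
            dom.detachDeviceFlags(disk_xml,
                                  libvirt.VIR_DOMAIN_AFFECT_LIVE)
        except libvirt.libvirtError:
            # The device may vanish between the XML check and the retry;
            # nova's real code inspects the error code in detail.
            pass
        time.sleep(sleep_time)
    raise RuntimeError('Device %s failed to detach from guest' % target_dev)
```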
QA: PLEASE IGNORE COMMENTS 12 TO 38 ENTIRELY! They are unrelated to this bug.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:2631