Bug 1669225
| Summary: | [BACKPORT Request] Nova returns a traceback when it's unable to detach a volume still in use | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | David Vallee Delisle <dvd> |
| Component: | openstack-nova | Assignee: | Matthew Booth <mbooth> |
| Status: | CLOSED ERRATA | QA Contact: | OSP DFG:Compute <osp-dfg-compute> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 10.0 (Newton) | CC: | cmuresan, dasmith, dcadzow, eglynn, gkadam, jhakimra, jhardee, kchamart, lyarwood, mbooth, mburns, mwitt, ramishra, sbaker, sbauza, sgordon, shardy, vromanso |
| Target Milestone: | async | Keywords: | Patch, Triaged, ZStream |
| Target Release: | 10.0 (Newton) | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | openstack-nova-14.1.0-54.el7ost | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | 1551733 | Environment: | |
| Last Closed: | 2019-09-03 16:36:34 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 1551733 | ||
| Bug Blocks: | 1557938 | ||
Description
David Vallee Delisle
2019-01-24 16:16:41 UTC
There's nothing here specific to heat, so I'm going to simplify the reproducer steps to Nova:

* Create an instance with a volume attached
* Log in to the instance and mount the volume
* Detach the volume

Confirm:

* Nova logs 'Guest refused to detach volume'
* Volume returns to in-use and is no longer detaching.

Initial testing note: Cirros doesn't appear to prevent removal of the volume even when it's mounted with a process actively writing to an open file descriptor. Can't immediately reproduce.

I have retried this with a CentOS 7 image, and I am still unable to make qemu refuse to detach the disk. This is what I tried:
[stack@undercloud-0 ~]$ openstack server create --image CentOS-7-x86_64-GenericCloud.qcow2c --flavor m1.medium --nic net-id=private --key-name defaultkey testinstance1
+--------------------------------------+-----------------------------------------------------------------+
| Field | Value |
+--------------------------------------+-----------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | |
| OS-EXT-SRV-ATTR:host | None |
| OS-EXT-SRV-ATTR:hypervisor_hostname | None |
| OS-EXT-SRV-ATTR:instance_name | |
| OS-EXT-STS:power_state | NOSTATE |
| OS-EXT-STS:task_state | scheduling |
| OS-EXT-STS:vm_state | building |
| OS-SRV-USG:launched_at | None |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses | |
| adminPass | NwC35m3Cy5vj |
| config_drive | |
| created | 2019-06-17T10:32:32Z |
| flavor | m1.medium (1582e22e-a904-4b18-aa0c-0d1b094d3224) |
| hostId | |
| id | 8b3c4755-d652-4cad-be83-6d0d064cec7e |
| image | CentOS-7-x86_64-GenericCloud.qcow2c (fdda862a-957a- |
| | 4b23-8fb1-fbd98b192913) |
| key_name | defaultkey |
| name | testinstance1 |
| os-extended-volumes:volumes_attached | [] |
| progress | 0 |
| project_id | 9f8895b7cb374eae8531bd7f04276447 |
| properties | |
| security_groups | [{u'name': u'default'}] |
| status | BUILD |
| updated | 2019-06-17T10:32:33Z |
| user_id | 837fb6d1122e410fb9c93a21002207c9 |
+--------------------------------------+-----------------------------------------------------------------+
[stack@undercloud-0 ~]$ openstack server add floating ip 8b3c4755-d652-4cad-be83-6d0d064cec7e 10.0.0.215
[stack@undercloud-0 ~]$ openstack volume create --size 1 testvol1
+---------------------+--------------------------------------+
| Field | Value |
+---------------------+--------------------------------------+
| attachments | [] |
| availability_zone | nova |
| bootable | false |
| consistencygroup_id | None |
| created_at | 2019-06-17T10:38:48.180819 |
| description | None |
| encrypted | False |
| id | 08284d9a-cf7a-4af7-8da4-86eb511d2464 |
| migration_status | None |
| multiattach | False |
| name | testvol1 |
| properties | |
| replication_status | disabled |
| size | 1 |
| snapshot_id | None |
| source_volid | None |
| status | creating |
| type | None |
| updated_at | None |
| user_id | 837fb6d1122e410fb9c93a21002207c9 |
+---------------------+--------------------------------------+
[stack@undercloud-0 ~]$ ssh centos@10.0.0.215
Warning: Permanently added '10.0.0.215' (ECDSA) to the list of known hosts.
[centos@testinstance1 ~]$ sudo su -
[root@testinstance1 ~]# mke2fs /dev/vdb
mke2fs 1.42.9 (28-Dec-2013)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
65536 inodes, 262144 blocks
13107 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=268435456
8 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376
Allocating group tables: done
Writing inode tables: done
Writing superblocks and filesystem accounting information: done
[root@testinstance1 ~]# mount /dev/vdb /mnt
[root@testinstance1 ~]# cd /mnt
[root@testinstance1 mnt]# touch foo
[stack@undercloud-0 ~]$ openstack server remove volume 8b3c4755-d652-4cad-be83-6d0d064cec7e 08284d9a-cf7a-4af7-8da4-86eb511d2464
[stack@undercloud-0 ~]$ openstack volume list
+--------------------------------------+--------------+-----------+------+-------------+
| ID | Display Name | Status | Size | Attached to |
+--------------------------------------+--------------+-----------+------+-------------+
| 08284d9a-cf7a-4af7-8da4-86eb511d2464 | testvol1 | available | 1 | |
+--------------------------------------+--------------+-----------+------+-------------+
[root@testinstance1 mnt]# ls
ls: reading directory .: Input/output error
I also tried again with a process writing data to the volume:
# while true; do date; sleep 1; done | tee foo
The result was the same: the volume was removed. dmesg in the guest does not note any activity prior to the removal of the device.
[ 418.589239] pci 0000:00:06.0: [1af4:1001] type 00 class 0x010000
[ 418.590872] pci 0000:00:06.0: reg 0x10: [io 0x0000-0x003f]
[ 418.591447] pci 0000:00:06.0: reg 0x14: [mem 0x00000000-0x00000fff]
[ 418.593578] pci 0000:00:06.0: reg 0x20: [mem 0x00000000-0x00003fff 64bit pref]
[ 418.603881] pci 0000:00:06.0: BAR 4: assigned [mem 0x100000000-0x100003fff 64bit pref]
[ 418.616969] pci 0000:00:06.0: BAR 1: assigned [mem 0x80000000-0x80000fff]
[ 418.628270] pci 0000:00:06.0: BAR 0: assigned [io 0x1000-0x103f]
[ 418.653220] virtio-pci 0000:00:06.0: enabling device (0000 -> 0003)
[ 418.701889] ACPI: PCI Interrupt Link [LNKB] enabled at IRQ 11
[ 418.724575] virtio-pci 0000:00:06.0: irq 29 for MSI/MSI-X
[ 418.724793] virtio-pci 0000:00:06.0: irq 30 for MSI/MSI-X
[ 418.729286] virtio_blk virtio3: [vdb] 2097152 512-byte logical blocks (1.07 GB/1.00 GiB)
[ 522.679222] EXT4-fs (vdb): mounting ext2 file system using the ext4 subsystem
[ 522.712497] EXT4-fs (vdb): mounted filesystem without journal. Opts: (null)
[ 732.365265] EXT4-fs warning (device vdb): __ext4_read_dirblock:902: error reading directory block (ino 2, block 0)
[ 732.378969] EXT4-fs error (device vdb): __ext4_get_inode_loc:4239: inode #2: block 67: comm ls: unable to read itable block
[ 732.393652] EXT4-fs error (device vdb) in ext4_reserve_inode_write:5246: IO failure
[ 762.448359] EXT4-fs error (device vdb): __ext4_get_inode_loc:4239: inode #2: block 67: comm kworker/u2:2: unable to read itable block
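For anyone re-running the reproducer, the "Confirm" step (the volume leaves detaching and ends up either in-use or available) could be checked with a small polling helper. This is only a sketch: `get_status` is a hypothetical callable standing in for whatever returns the Cinder volume status, and `make_fake_status` is a test stand-in, not part of any OpenStack client.

```python
import time
from typing import Callable, Sequence


def wait_for_detach_outcome(
    get_status: Callable[[], str],
    timeout: float = 60.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the volume leaves 'detaching' and return the final status.

    get_status is a hypothetical callable returning the volume status
    string, for example 'detaching', 'in-use' or 'available'.
    """
    waited = 0.0
    while True:
        status = get_status()
        if status != "detaching":
            return status
        if waited >= timeout:
            raise TimeoutError("volume still 'detaching' after %.0fs" % timeout)
        sleep(interval)
        waited += interval


def make_fake_status(sequence: Sequence[str]) -> Callable[[], str]:
    """Return a callable that replays a fixed list of statuses (test stand-in)."""
    remaining = list(sequence)

    def get_status() -> str:
        # Keep returning the last status once the sequence is exhausted.
        return remaining.pop(0) if len(remaining) > 1 else remaining[0]

    return get_status


if __name__ == "__main__":
    # Guest refuses the detach: the volume is expected to fall back to 'in-use'.
    refused = make_fake_status(["detaching", "detaching", "in-use"])
    print(wait_for_detach_outcome(refused, sleep=lambda s: None))

    # Guest releases the device: the volume becomes 'available'.
    released = make_fake_status(["detaching", "available"])
    print(wait_for_detach_outcome(released, sleep=lambda s: None))
```

In the runs above the helper would have returned 'available' each time, which is why the bug could not be reproduced this way.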
Note that this backport requires an additional backport of https://review.opendev.org/#/c/590439/ or the volume will not return to the in-use state even if we do work out how to reproduce this.

Here's a local demonstration of the issue:
=== Define a local disk
# cat disk.xml
<disk type='file' device='disk'>
<driver name='qemu' type='qcow2'/>
<source file='/var/lib/libvirt/images/test2.qcow2'/>
<backingStore/>
<target dev='vdb' bus='virtio'/>
</disk>
=== Attach it to a test domain
# virsh attach-device test --live --config disk.xml
Device attached successfully
=== Detach the config only -> SUCCESS
# virsh detach-device test --config disk.xml
Device detached successfully
=== Attempt to detach config again -> no target device
# virsh detach-device test --config disk.xml
error: Failed to detach device from disk.xml
error: device not found: no target device vdb
=== Attempt to detach both live and config -> no target device
# virsh detach-device test --live --config disk.xml
error: Failed to detach device from disk.xml
error: device not found: no target device vdb
=== Attempt to detach live only -> SUCCESS
# virsh detach-device test --live disk.xml
Device detached successfully
=== Attempt to detach live again -> disk not found
# virsh detach-device test --live disk.xml
error: Failed to detach device from disk.xml
error: operation failed: disk vdb not found
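The sequence above can be modelled with a toy domain that keeps separate live and persistent device lists. This is a sketch for illustration only, not libvirt's implementation: `FakeDomain`, `DeviceNotFound` and `replay` are hypothetical names, and checking the persistent config before the live config is an assumption chosen so the error messages line up with the transcript.

```python
class DeviceNotFound(Exception):
    """Raised when a requested detach target does not hold the device."""


class FakeDomain:
    """Toy model of a domain with separate live and persistent device lists."""

    def __init__(self):
        self.live = set()
        self.config = set()

    def attach(self, dev, live=False, config=False):
        if live:
            self.live.add(dev)
        if config:
            self.config.add(dev)

    def detach(self, dev, live=False, config=False):
        # Validate every requested target before changing anything, so a
        # combined live + config request fails as a whole if either is missing.
        if config and dev not in self.config:
            raise DeviceNotFound("no target device %s" % dev)
        if live and dev not in self.live:
            raise DeviceNotFound("disk %s not found" % dev)
        if config:
            self.config.discard(dev)
        if live:
            self.live.discard(dev)


def replay(domain, dev="vdb"):
    """Replay the five detach attempts from the demonstration, in order."""
    steps = [
        dict(config=True),             # detach config only
        dict(config=True),             # detach config again
        dict(live=True, config=True),  # detach live and config
        dict(live=True),               # detach live only
        dict(live=True),               # detach live again
    ]
    results = []
    for flags in steps:
        try:
            domain.detach(dev, **flags)
            results.append("ok")
        except DeviceNotFound as exc:
            results.append(str(exc))
    return results


if __name__ == "__main__":
    dom = FakeDomain()
    dom.attach("vdb", live=True, config=True)
    for line in replay(dom):
        print(line)
```

The third step is the interesting one: in this model the combined request fails with "no target device" even though the device is still present in the live domain.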
https://review.opendev.org/#/c/584433/ removed the reference to the transient domain upstream.

<danpb> mdbooth: it will only save the changes to the inactive config, if the live config was successfully updated
<mdbooth> danpb: And it's synchronous, right? i.e. if it returns without an error that means it has succeeded?
<mdbooth> Returns 0 in case of success, -1 in case of failure.
<mdbooth> success == device *has been removed*, not *has been scheduled for removal, which may fail later*
<danpb> mdbooth: no, if it returns success there is still a possibility that the device still exists in the guest
<danpb> mdbooth: because the actual hotunplug is async
<mdbooth> danpb: That explains what the nova libvirt driver is doing, then
<danpb> i guess I should have said
<danpb> it will only save the changes to the inactive config, if the live config was successfully /requested/ to unplug the device

QA: PLEASE IGNORE COMMENTS 12 TO 38 ENTIRELY! They are unrelated to this bug.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:2631
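
The point from the IRC exchange above, that a successful return only means the unplug was requested, implies a caller has to re-check the guest and retry before declaring failure. Below is a minimal sketch of that pattern. It is not the Nova libvirt driver's code: `FakeGuest`, `detach_and_verify` and `DeviceDetachFailed` are hypothetical names, and the retry count and interval are arbitrary.

```python
import time


class DeviceDetachFailed(Exception):
    """Raised when the device is still present after all attempts."""


class FakeGuest:
    """Stand-in for a guest whose unplug completes asynchronously, if at all."""

    def __init__(self, devices, releases_after=None):
        self.devices = set(devices)
        # Number of detach requests after which the guest lets go of the
        # device; None means the guest never releases it.
        self.releases_after = releases_after
        self.requests = 0

    def request_detach(self, dev):
        # Returning without error only means the request was accepted.
        self.requests += 1
        if self.releases_after is not None and self.requests >= self.releases_after:
            self.devices.discard(dev)

    def has_device(self, dev):
        return dev in self.devices


def detach_and_verify(guest, dev, max_attempts=3, interval=1.0, sleep=time.sleep):
    """Request a detach, then confirm the device is gone; retry if not."""
    for _ in range(max_attempts):
        guest.request_detach(dev)
        sleep(interval)
        if not guest.has_device(dev):
            return
    raise DeviceDetachFailed("guest refused to detach %s" % dev)


if __name__ == "__main__":
    # A guest that releases the device on the second request.
    cooperative = FakeGuest({"vdb"}, releases_after=2)
    detach_and_verify(cooperative, "vdb", sleep=lambda s: None)
    print("detached after", cooperative.requests, "requests")

    # A guest that never releases the device.
    stubborn = FakeGuest({"vdb"})
    try:
        detach_and_verify(stubborn, "vdb", sleep=lambda s: None)
    except DeviceDetachFailed as exc:
        print(exc)
```

With this shape, a caller can catch the final failure and put the volume back to in-use rather than surfacing a traceback, which is the behaviour this backport is after.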