Bug 2012096 - [OSP 17.0] Failed to detach device from the live domain config - when attaching already attached volume
Summary: [OSP 17.0] Failed to detach device from the live domain config - when attachi...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 17.0 (Wallaby)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: OSP DFG:Compute
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Duplicates: 1977363
Depends On: 2087047 2186397
Blocks:
 
Reported: 2021-10-08 09:01 UTC by Filip Hubík
Modified: 2023-09-28 17:00 UTC
CC List: 22 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
Because of a bug in QEMU, device detach requests that do not finish cannot be retried. As a workaround, wait for the guest OS to be fully active before attempting to attach or detach any device, or upgrade to OSP 17.1 / RHEL 9.2, where the QEMU bug is fixed.
Clone Of:
Environment:
Last Closed: 2023-07-13 14:02:06 UTC
Target Upstream Version:
Embargoed:


Attachments
cirros dmesg log (54.94 KB, text/plain), attached 2021-10-08 09:09 UTC by Filip Hubík


Links
Launchpad 1957031 (last updated 2022-01-12 12:07:19 UTC)
Launchpad 1960346 (last updated 2022-07-04 15:45:18 UTC)
Red Hat Issue Tracker OSP-10285 (last updated 2021-11-10 14:27:02 UTC)

Internal Links: 2016327

Description Filip Hubík 2021-10-08 09:01:21 UTC
Description of problem:
When a new VM is created and a volume is attached to it, attaching the same volume to the VM again produces strange behavior. The second attachment request appears to try to detach the previous volume first, but that detachment fails (if these steps are performed in quick succession). When an artificial delay is inserted between these three steps (e.g. using pudb), the problem does not reproduce, which leads me to believe there is a race condition in the nova/libvirt area (which might also be affecting several other tests from the same group).

It was uncovered by the following test in CI (https://github.com/openstack/tempest/blob/master/tempest/api/compute/volumes/test_attach_volume_negative.py#L49):

    def test_attach_attached_volume_to_same_server(self):
        server = self.create_test_server(wait_until='ACTIVE')
        volume = self.create_volume()
        self.attach_volume(server, volume)
        self.assertRaises(lib_exc.BadRequest,
                          self.attach_volume, server, volume)

Or a modified version for better debugging (requires "import sys" at module level):
        server = self.create_test_server(wait_until='ACTIVE')
        volume = self.create_volume()
        self.attach_volume(server, volume)
        try:
            self.attach_volume(server, volume)
        except Exception:
            print("Details:", sys.exc_info()[0])

Though BadRequest is raised correctly, Tempest fails anyway: when the volume in the "detaching" state fails to detach, it goes back to the "in-use" state, so Tempest cannot clean it up (it expects the "available" state).
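For reference, a minimal sketch of that cleanup expectation, assuming the standard tempest waiter and the test-class context (self.volumes_client, volume) from the snippet above; the exact cleanup path inside the test class may differ:

    from tempest.common import waiters

    # The cleanup waits for the volume to return to 'available'; because the
    # half-finished detach leaves it 'in-use', this wait times out.
    waiters.wait_for_volume_resource_status(
        self.volumes_client, volume['id'], 'available')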

/var/log/containers/libvirt/libvirtd.log reports:
Successfully detached device vdb from instance dfdb4a0c-6af3-41fe-badd-1d52dda8a1a4 from the persistent domain config
error : qemuMonitorJSONCheckErrorFull:418 : internal error: unable to execute QEMU command 'device_del': Device virtio-disk1 is already in the process of unplug

Libvirt seems to be able to detach the device from the persistent config, but not from the live one.

/var/log/containers/nova/nova-compute.log:
ERROR nova.virt.libvirt.driver [req-f44c8447-216b-4dec-b845-de9f908bb42b c9f4a013038e4dd1ab9b119a677744ae 5d0350452d284503bf6644dd38f17ea6 - default default] Waiting for libvirt event about the detach of device vdb with device alias virtio-disk1 from instance dfdb4a0c-6af3-41fe-badd-1d52dda8a1a4 is timed out.
DEBUG nova.virt.libvirt.driver [req-f44c8447-216b-4dec-b845-de9f908bb42b c9f4a013038e4dd1ab9b119a677744ae 5d0350452d284503bf6644dd38f17ea6 - default default] Failed to detach device vdb with device alias virtio-disk1 from instance dfdb4a0c-6af3-41fe-badd-1d52dda8a1a4 from the live domain config. Libvirt did not report any error but the device is still in the config. _detach_from_live_with_retry /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:2397
... looping 8/8 times...times out...
ERROR oslo_messaging.rpc.server nova.exception.DeviceDetachFailed: Device detach failed for vdb: Run out of retry while detaching device vdb with device alias virtio-disk1 from instance dfdb4a0c-6af3-41fe-badd-1d52dda8a1a4 from the live domain config. Device is still attached to the guest
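
For context, a simplified, hypothetical sketch of the retry-until-gone pattern these log lines describe (send the detach request, wait for the device to disappear from the live config, retry a fixed number of times, then give up); the function and callback names are illustrative, not nova's actual implementation. With the QEMU bug, the retried request keeps failing with "already in the process of unplug", so the device never disappears and the loop exhausts its retries:

    import time

    class DeviceDetachFailed(Exception):
        """Raised when the device is still attached after all retries."""

    def detach_with_retry(request_detach, device_gone, attempts=8, wait=20):
        for attempt in range(1, attempts + 1):
            request_detach()        # e.g. ask libvirt to detach from the live config
            deadline = time.monotonic() + wait
            while time.monotonic() < deadline:
                if device_gone():   # device no longer present in the live domain XML
                    return
                time.sleep(0.5)
            # timed out waiting for the detach event; loop and retry
        raise DeviceDetachFailed(
            "Ran out of retries while detaching the device from the live domain config")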

Cirros 0.5.2 is used as VM image.

Version-Release number of selected component (if applicable):
OSP17

Steps to Reproduce:
1) Deploy lvm topology without ceph node
2) python3 -m testtools.run tempest.api.compute.volumes.test_attach_volume_negative.AttachVolumeNegativeTest.test_attach_attached_volume_to_same_server

Expected results:
Volume will detach properly and detachment won't time out

Additional info:
Container version of nova components: 17.0_20210920.1
in container:
openstack-nova-migration-23.0.3-0.20210908132428.e39bbdc.el8ost.noarch
openstack-nova-compute-23.0.3-0.20210908132428.e39bbdc.el8ost.noarch
openstack-nova-common-23.0.3-0.20210908132428.e39bbdc.el8ost.noarch

Comment 3 Filip Hubík 2021-10-08 09:09:47 UTC
Created attachment 1830652 [details]
cirros dmesg log

Comment 10 Martin Kopec 2022-01-31 08:35:05 UTC
This is not a bug in tempest. Based on the following comment in the related upstream bug:
https://bugs.launchpad.net/tripleo/+bug/1957031/comments/4
this issue is related to qemu and libvirt and is supposed to be fixed by libvirt 8.0.0.

Comment 11 Filip Hubík 2022-03-15 14:27:22 UTC
From rechecking the references here, just to make sure: my understanding is that this whole issue was caused by our preliminary testing of 8.4 ahead in CI (which ships libvirt 7.0.0), and therefore this is not a product bug, just a consequence of a temporary downstream CI configuration? (We should be able to confirm it is really fixed with libvirt 8.0.0 soon after the first passing builds reach the Tempest stage.)

Comment 34 Artom Lifshitz 2022-05-18 15:37:11 UTC
Waiting to see what qemu does with the dependent BZ before considering what, if anything, needs to be done in Nova.

Comment 35 Balazs Gibizer 2022-06-28 09:03:23 UTC
*** Bug 1977363 has been marked as a duplicate of this bug. ***

Comment 36 Yaniv Kaul 2022-06-29 09:23:00 UTC
I'm unsure why it's a high severity, high priority issue. It was only caught in CI, I don't see customers complaining, it's not in the very regular path (positive) of any flow, etc.
The QEMU bug says it's happening during boot, making it even less interesting?

Comment 37 Artom Lifshitz 2022-07-04 15:37:39 UTC
(In reply to Yaniv Kaul from comment #36)
> I'm unsure why it's a high severity, high priority issue. It was only caught
> in CI, I don't see customers complaining, it's not in the very regular path
> (positive) of any flow, etc.
> The QEMU bug says it's happening during boot, making it even less
> interesting?

This is high priority, high severity because it was blocking pretty much all of CI. We've cherry-picked the tempest changes to rhos 17.0, so CI should be unblocked. At this point this is a qemu issue; once a fix has been released, the Nova team can revisit this to determine whether we need to change anything in Nova.

Comment 38 Martin Kopec 2022-07-27 16:46:32 UTC
From the tempest perspective this is fixed: https://bugs.launchpad.net/nova/+bug/1960346 is closed as Fix Released. A series of patches has been merged in tempest: https://review.opendev.org/q/topic:wait_until_sshable_pingable
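
As a rough illustration of what those tempest changes enable, a test can now wait until the guest is actually reachable before exercising attach/detach. A hedged sketch, assuming a validatable test server (exact keyword arguments may vary between tempest versions):

        # Boot the server and wait until it answers over SSH, so the guest OS
        # is fully up before the attach/detach requests arrive.
        server = self.create_test_server(validatable=True, wait_until='SSHABLE')
        volume = self.create_volume()
        self.attach_volume(server, volume)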

Based on the latest comments here, it seems we're keeping this BZ to figure out what to do in nova after the Depends-On BZ is fixed, so I'm going to change the component of this BZ from openstack-tempest to openstack-nova.

Comment 44 Alan Pevec 2023-06-28 14:55:38 UTC
The depends-on rhbz was verified with libvirt-9.3.0-1.el9.x86_64 and qemu-kvm-8.0.0-2.el9.x86_64.

Can you confirm this with openstack-nova?

Comment 45 Artom Lifshitz 2023-06-30 16:11:29 UTC
I'm not sure why I'm needinfo'ed here. AFAIU, this was reported based on a CI job, so confirming the libvirt fix should be a matter of making sure the job passes with the correct version of libvirt, no?

Comment 46 Filip Hubík 2023-07-11 09:51:35 UTC
Not necessarily. Yes, the prerequisite https://bugzilla.redhat.com/show_bug.cgi?id=2087047 is verified, but the angles of timing, concurrency and dependency on the image used (cirros? rhel?) were also mentioned here. So from a top-level CI overview I remain concerned. Is it enough, even though rhbz#2087047 was verified manually in a simple environment, to dismiss these? I mean, even if I haven't seen this in the CI for some time, does it mean it's really fixed in the complex OSP environment? I could run a few tests in a loop (various concurrency, distros, ...) over some time to check, if that was not already done by the Compute DFG. In that case I'd need some cooperation with the Compute DFG to build the case here properly and get as close to the problematic area as possible, this time in a heavily concurrent OSP environment.

There is also a documentation angle, and my other concern remains the same as above. There was a lot of work done as part of https://review.opendev.org/q/topic:wait_until_sshable_pingable; is it, or will it be, reflected in the OSP documentation? Our adapting the testing framework to the new need to wait for the "SSHABLE" state might not align with customers' expectations/workflows.

Artom, since I am needinfo'd here also, can you or your colleagues from Compute DFG help me with these questions?

Comment 47 Balazs Gibizer 2023-07-11 10:17:50 UTC
The original issue was non-deterministic and was only seen happening with openstack-nova in a CI environment. An openstack-independent, more deterministic reproducer was created in order to file a qemu bug. In parallel with the qemu fix, tempest got the "SSHABLE" improvement to avoid the detach issue.

I think the only way to verify the fix https://bugzilla.redhat.com/show_bug.cgi?id=2087047 with openstack-nova would be to remove (or disable) all the "SSHABLE" tempest improvements and let the CI run for a while to see whether the detach issue no longer appears. Do we really want to go in this direction?

Comment 48 Artom Lifshitz 2023-07-11 10:40:21 UTC
I understand your concerns, and I'm going to try to unpack them into specific questions that I can address.

> I mean, even if I haven't seen this in the CI for some time, does it mean it's really fixed in the complex OSP environment?

From a "realpolitik" POV, I would argue that yes, if it doesn't happen in CI (and has not been reported by a customer), we can consider it fixed until new evidence arises that it wasn't. We don't have the resources to chase down potential bugs, we have to concentrate on the very real backlog of bugs that someone, either CI, QE or customer, have observed in the real world.

> There is also a documentation angle, and my other concern remains the same as above. There was a lot of work done as part of https://review.opendev.org/q/topic:wait_until_sshable_pingable; is it, or will it be, reflected in the OSP documentation? Our adapting the testing framework to the new need to wait for the "SSHABLE" state might not align with customers' expectations/workflows.

My understanding is that this was always a very CI-centric bug, triggered by the combination of the "fake" guest OS (CirrOS) and the rapid succession of operations (boot then attach) in Tempest. I wouldn't be opposed to a blurb in our product docs that explains this caveat, but I'd like the wider Compute team to sanity-check the idea. I'll put it up for discussion and report back.

Comment 53 Artom Lifshitz 2023-07-13 14:02:06 UTC
Gibi wrote in comment #49:

> I think we don't need the docs. That would only have been needed before the qemu fix landed.

So I think we can agree that at least a Known Issue would be good for 17.0. I'm setting the Doc Type to Known Issue and writing a draft doc text.

I can also close this BZ as WONTFIX (because it won't be fixed in 17.0); the closure should not affect the Known Issue, as the Docs script that picks those up does not look at BZ status.

