Description of problem:

When a new VM is created and a volume is attached to it, attaching the same volume to the VM again produces strange behavior. The second attachment request seems to try to detach the previous volume first, but the detachment fails (if these steps are performed quickly one after another). When an artificial delay is inserted between these three steps (e.g. using pudb), the problem doesn't reproduce, which leads me to believe there is a race condition in the nova/libvirt area (which might also be affecting several other tests from the same group).

It was uncovered by the following test in CI (https://github.com/openstack/tempest/blob/master/tempest/api/compute/volumes/test_attach_volume_negative.py#L49):

    def test_attach_attached_volume_to_same_server(self):
        server = self.create_test_server(wait_until='ACTIVE')
        volume = self.create_volume()

        self.attach_volume(server, volume)
        self.assertRaises(lib_exc.BadRequest,
                          self.attach_volume, server, volume)

Or a modified version for better debugging:

    import sys

    server = self.create_test_server(wait_until='ACTIVE')
    volume = self.create_volume()
    self.attach_volume(server, volume)
    try:
        self.attach_volume(server, volume)
    except Exception:
        print("Details:", sys.exc_info()[0])

Though BadRequest is raised correctly, Tempest fails anyway: when the volume in the "detaching" state fails to detach, it goes back to the "in-use" state and therefore Tempest cannot clean it up (it expects the "available" state).

/var/log/containers/libvirt/libvirtd.log reports:

    Successfully detached device vdb from instance dfdb4a0c-6af3-41fe-badd-1d52dda8a1a4 from the persistent domain config
    error : qemuMonitorJSONCheckErrorFull:418 : internal error: unable to execute QEMU command 'device_del': Device virtio-disk1 is already in the process of unplug

Libvirt seems to be able to detach the device from the persistent config, but not from the live one.

/var/log/containers/nova/nova-compute.log:

    ERROR nova.virt.libvirt.driver [req-f44c8447-216b-4dec-b845-de9f908bb42b c9f4a013038e4dd1ab9b119a677744ae 5d0350452d284503bf6644dd38f17ea6 - default default] Waiting for libvirt event about the detach of device vdb with device alias virtio-disk1 from instance dfdb4a0c-6af3-41fe-badd-1d52dda8a1a4 is timed out.
    DEBUG nova.virt.libvirt.driver [req-f44c8447-216b-4dec-b845-de9f908bb42b c9f4a013038e4dd1ab9b119a677744ae 5d0350452d284503bf6644dd38f17ea6 - default default] Failed to detach device vdb with device alias virtio-disk1 from instance dfdb4a0c-6af3-41fe-badd-1d52dda8a1a4 from the live domain config. Libvirt did not report any error but the device is still in the config. _detach_from_live_with_retry /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:2397
    ... looping 8/8 times ... times out ...
    ERROR oslo_messaging.rpc.server nova.exception.DeviceDetachFailed: Device detach failed for vdb: Run out of retry while detaching device vdb with device alias virtio-disk1 from instance dfdb4a0c-6af3-41fe-badd-1d52dda8a1a4 from the live domain config. Device is still attached to the guest

Cirros 0.5.2 is used as the VM image.
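For reference, a simplified and hypothetical sketch of the retry pattern visible in the nova-compute log above. This is not the actual nova code (that lives in _detach_from_live_with_retry in nova/virt/libvirt/driver.py); it only illustrates the behavior the log describes: try the live detach, wait, retry a fixed number of times, then give up with DeviceDetachFailed.

    import time


    class DeviceDetachFailed(Exception):
        """Stand-in for nova.exception.DeviceDetachFailed (name reused for clarity)."""


    def detach_from_live_with_retry(detach_once, device, max_attempts=8, wait=5):
        """Call detach_once(device) until it reports success or retries run out.

        detach_once is any callable returning True once libvirt reports the
        device gone from the live domain config.
        """
        for attempt in range(1, max_attempts + 1):
            if detach_once(device):
                return
            # In the log above, this is where the wait for the libvirt unplug
            # event times out ("looping 8/8 times").
            time.sleep(wait)
        raise DeviceDetachFailed(
            "Run out of retry while detaching device %s from the live domain "
            "config. Device is still attached to the guest" % device)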
Version-Release number of selected component (if applicable):
OSP17

Steps to Reproduce:
1) Deploy lvm topology without ceph node
2) python3 -m testtools.run tempest.api.compute.volumes.test_attach_volume_negative.AttachVolumeNegativeTest.test_attach_attached_volume_to_same_server

Expected results:
Volume will detach properly and the detachment won't time out

Additional info:
Container version of nova components: 17.0_20210920.1

In container:
openstack-nova-migration-23.0.3-0.20210908132428.e39bbdc.el8ost.noarch
openstack-nova-compute-23.0.3-0.20210908132428.e39bbdc.el8ost.noarch
openstack-nova-common-23.0.3-0.20210908132428.e39bbdc.el8ost.noarch
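Optionally, the stuck cleanup can be observed outside Tempest with a small polling helper. This is a minimal sketch assuming openstacksdk is installed, that a clouds.yaml entry named "overcloud" exists, and that the volume is named "vol1"; those names are placeholders, not taken from this report.

    import time

    import openstack


    def wait_for_available(volume_name="vol1", timeout=120, interval=5):
        conn = openstack.connect(cloud="overcloud")
        volume = conn.block_storage.find_volume(volume_name)
        deadline = time.time() + timeout
        while time.time() < deadline:
            volume = conn.block_storage.get_volume(volume.id)
            print("volume status:", volume.status)
            if volume.status == "available":
                return True
            time.sleep(interval)
        # On the affected setup the volume cycles detaching -> in-use and never
        # reaches "available", which is what the Tempest cleanup trips over.
        return False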
Created attachment 1830652 [details] cirros dmesg log
This is not a bug in tempest. Based on the following comment in the related upstream bug, this issue is related to qemu and libvirt and is supposed to be fixed by libvirt 8.0.0: https://bugs.launchpad.net/tripleo/+bug/1957031/comments/4
From rechecking the references here, just making sure: is my understanding correct that this whole issue was caused by our preliminary testing of 8.4 in CI ahead of time (which ships libvirt 7.0.0), and that this is therefore not a product bug, just a consequence of a temporary downstream CI configuration? (Assuming we can confirm it is really fixed with libvirt 8.0.0 soon after the first passing builds reach the Tempest stage.)
Waiting to see what qemu does with the dependent BZ before considering what, if anything, needs to be done in Nova.
*** Bug 1977363 has been marked as a duplicate of this bug. ***
I'm unsure why it's a high severity, high priority issue. It was only caught in CI, I don't see customers complaining, it's not in the very regular path (positive) of any flow, etc. The QEMU bug says it's happening during boot, making it even less interesting?
(In reply to Yaniv Kaul from comment #36)
> I'm unsure why it's a high severity, high priority issue. It was only caught
> in CI, I don't see customers complaining, it's not in the very regular path
> (positive) of any flow, etc.
> The QEMU bug says it's happening during boot, making it even less
> interesting?

This is high priority, high severity because it was blocking pretty much all of CI. We've cherry-picked the tempest changes to RHOS 17.0, so CI should be unblocked.

At this point this is a qemu issue; once a fix has been released, the Nova team can revisit this to determine whether we need to change anything in Nova.
From the tempest perspective this is fixed - https://bugs.launchpad.net/nova/+bug/1960346 is closed as Fix Released. There has been a series of patches merged in tempest: https://review.opendev.org/q/topic:wait_until_sshable_pingable

Based on the latest comments here, it seems that we're keeping this BZ to figure out what to do in nova after the Depends-On BZ is fixed - therefore I'm going to change the component of this BZ from openstack-tempest to openstack-nova.
The depends-on RHBZ was verified with libvirt-9.3.0-1.el9.x86_64 and qemu-kvm-8.0.0-2.el9.x86_64. Can you confirm this with openstack-nova?
I'm not sure why I'm needinfo'ed here. AFAIU, this was reported based on a CI job, so confirming the libvirt fix should be a matter of making sure the job passes with the correct version of libvirt, no?
Not necessarily. Yes, the prerequisite https://bugzilla.redhat.com/show_bug.cgi?id=2087047 is verified, but the angles of timing, concurrency and dependency on the image used (cirros? rhel?) were also mentioned here, so from a top-level CI overview I remain concerned. Is it enough - even though rhbz#2087047 was verified manually in a simple environment - to dismiss these? I mean, even if I haven't seen this in the CI for some time, does that mean it's really fixed in the complex OSP environment?

I could run a few tests in a loop (various concurrency, distros, ...) over some time to check it, if that was not already done by the Compute DFG. In that case I'd need some cooperation with the Compute DFG to build the case here properly and get as close to the problematic area as possible, this time in the heavily concurrent OSP env.

Also, there is a documentation angle. My other concern is still the same as above: there was a lot of work done as part of https://review.opendev.org/q/topic:wait_until_sshable_pingable - is it, or will it be, reflected in the OSP documentation? Maybe us adapting the testing framework to the new need to wait for the "SSHABLE" state might not align with customers' expectations/workflows?

Artom, since I am needinfo'd here as well, can you or your colleagues from the Compute DFG help me with these questions?
The original issue was non-deterministic and was only seen happening with openstack-nova in the CI environment. An OpenStack-independent, more deterministic reproducer was created in order to be able to file a qemu bug. In parallel with the qemu fix, tempest got the "SSHABLE" improvement to avoid the detach issue (see the sketch below for what the tempest-side change looks like).

I think the only way to verify the fix https://bugzilla.redhat.com/show_bug.cgi?id=2087047 with openstack-nova would be to remove (or disable) all the "SSHABLE" tempest improvements and let the CI run for a while to see whether the detach issue still appears. Do we really want to go in this direction?
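For illustration, a hedged sketch of the tempest-side difference being discussed, based on the test quoted in the description. The 'SSHABLE' wait_until value comes from the upstream wait_until_sshable_pingable series; the fragment below is an approximation, not the merged tempest code, and lib_exc plus the helper methods come from tempest's base compute test class.

    def test_attach_attached_volume_to_same_server(self):
        # Original: only wait for ACTIVE, then attach immediately - the fast
        # boot-then-attach sequence that exposed the qemu/libvirt unplug race.
        server = self.create_test_server(wait_until='ACTIVE')

        # Mitigated: wait until the guest actually answers over SSH before
        # attaching, so the guest OS has finished booting.
        # server = self.create_test_server(wait_until='SSHABLE')

        volume = self.create_volume()
        self.attach_volume(server, volume)
        self.assertRaises(lib_exc.BadRequest,
                          self.attach_volume, server, volume)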
I understand your concerns, and I'm going to try to unpack them into specific questions that I can address.

> I mean, even if I haven't seen this in the CI for some time, does that mean it's really fixed in the complex OSP environment?

From a "realpolitik" POV, I would argue that yes, if it doesn't happen in CI (and has not been reported by a customer), we can consider it fixed until new evidence arises that it wasn't. We don't have the resources to chase down potential bugs; we have to concentrate on the very real backlog of bugs that someone - either CI, QE or a customer - has observed in the real world.

> Also, there is a documentation angle. My other concern is still the same as above: there was a lot of work done as part of https://review.opendev.org/q/topic:wait_until_sshable_pingable - is it, or will it be, reflected in the OSP documentation? Maybe us adapting the testing framework to the new need to wait for the "SSHABLE" state might not align with customers' expectations/workflows?

My understanding is that this was always a very CI-centric bug, triggered by the combination of the "fake" guest OS (CirrOS) and the rapid succession of operations (boot then attach) in Tempest. I wouldn't be opposed to a blurb in our product docs that explains this caveat, but I'd like the wider Compute team to sanity-check the idea. I'll put it up for discussion and report back.
Gibi wrote in comment #49:
> I think we don't need the docs. That would only have been needed before the qemu fix landed.

So I think we can agree that at least a Known Issue would be good for 17.0. I'm setting the Doc Type to Known Issue and writing a draft doc text.

I can also close this BZ as WONTFIX (because it won't be fixed in 17.0) - the closure should not affect the Known Issue, as the Docs script that picks those up does not look at BZ status.