Bug 2012096
| Field | Value |
|---|---|
| Summary | [OSP 17.0] Failed to detach device from the live domain config - when attaching already attached volume |
| Product | Red Hat OpenStack |
| Component | openstack-nova |
| Status | CLOSED WONTFIX |
| Severity | high |
| Priority | high |
| Version | 17.0 (Wallaby) |
| Reporter | Filip Hubík <fhubik> |
| Assignee | OSP DFG:Compute <osp-dfg-compute> |
| QA Contact | OSP DFG:Compute <osp-dfg-compute> |
| CC | alifshit, amodi, apevec, bdobreli, bgibizer, bkopilov, dasmith, eglynn, jhakimra, joflynn, jparker, kchamart, kthakre, lhh, lyarwood, sbauza, sgordon, slinaber, smooney, udesale, vromanso, wznoinsk |
| Keywords | Triaged |
| Target Milestone | --- |
| Target Release | --- |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | Known Issue |
| Doc Text | Because of a bug in QEMU, device detach requests that are not finished cannot be retried. As a workaround, wait for the guest OS to be fully active before attempting to attach or detach any device, or upgrade to OSP 17.1/RHEL 9.2, in which the QEMU bug is fixed. |
| Last Closed | 2023-07-13 14:02:06 UTC |
| Type | Bug |
| Bug Depends On | 2087047, 2186397 |
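The doc-text workaround amounts to "don't attach or detach a device until the guest OS is actually up". A minimal, hedged sketch of that idea is a poller that waits until the guest's SSH port accepts connections before proceeding; `wait_for_port` and the commented-out `attach_volume` call below are hypothetical helpers for illustration, not part of Nova or tempest.

```python
import socket
import time

def wait_for_port(host, port, timeout=300.0, interval=2.0):
    """Poll until a TCP port accepts connections, or the timeout expires.

    Returns True as soon as a connection succeeds, False on timeout.
    A reachable SSH port is a reasonable proxy for "guest OS is up".
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # interval doubles as the per-attempt connect timeout
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False

# Example usage (hypothetical helper names):
# if wait_for_port(server_ip, 22):
#     attach_volume(server_id, volume_id)
```

This is essentially what the tempest "SSHABLE" wait state does on the testing side: defer the attach until the guest is demonstrably responsive, so the subsequent detach does not race guest boot.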
Description
Filip Hubík, 2021-10-08 09:01:21 UTC

Created attachment 1830652 [details]: cirros dmesg log
---

This is not a bug in tempest. Based on the following comment in the related upstream bug, https://bugs.launchpad.net/tripleo/+bug/1957031/comments/4, this issue is related to qemu and libvirt and is supposed to be fixed by libvirt 8.0.0.

---

From rechecking the references here, just making sure: it is my understanding that this whole issue was caused by our preliminary testing of 8.4 in CI ahead of time (which ships libvirt 7.0.0), so this is not a product bug, just a consequence of temporary downstream CI configuration? (If so, we'll be able to confirm it is really fixed with libvirt 8.0.0 soon after the first passing builds reach the Tempest stage.)

---

Waiting to see what qemu does with the dependent BZ before considering what, if anything, needs to be done in Nova.

---

*** Bug 1977363 has been marked as a duplicate of this bug. ***

---

I'm unsure why it's a high severity, high priority issue. It was only caught in CI, I don't see customers complaining, and it's not in the very regular (positive) path of any flow. The QEMU bug says it's happening during boot, making it even less interesting?

---

(In reply to Yaniv Kaul from comment #36)
> I'm unsure why it's a high severity, high priority issue. It was only caught
> in CI, I don't see customers complaining, it's not in the very regular path
> (positive) of any flow, etc.
> The QEMU bug says it's happening during boot, making it even less
> interesting?

This is high priority, high severity because it was blocking pretty much all of CI. We've cherry-picked the tempest changes to rhos 17.0, so CI should be unblocked. At this point, this is a qemu issue; once a fix has been released, the Nova team can revisit this to determine whether we need to change anything in Nova.

---

From the tempest perspective this is fixed - https://bugs.launchpad.net/nova/+bug/1960346 is closed as Fix Released.
---

There has been a series of patches merged in tempest: https://review.opendev.org/q/topic:wait_until_sshable_pingable

Based on the latest comments here, it seems that we're keeping this BZ to figure out what to do in nova after the Depends-On BZ is fixed, so I'm going to change the component of this from openstack-tempest to openstack-nova.

---

The depends-on rhbz was verified with libvirt-9.3.0-1.el9.x86_64 and qemu-kvm-8.0.0-2.el9.x86_64. Can you confirm this with openstack-nova?

---

I'm not sure why I'm needinfo'ed here. AFAIU, this was reported based on a CI job, so confirming the libvirt fix should be a matter of making sure the job passes with the correct version of libvirt, no?

---

Not necessarily. Yes, the prerequisite https://bugzilla.redhat.com/show_bug.cgi?id=2087047 is verified, but the angles of timing, concurrency and image-used dependency (cirros? rhel?) were mentioned here as well, so from a top-level CI overview I remain concerned. Is it enough - even though rhbz#2087047 was verified manually in a simple environment - to dismiss these? I mean, even if I haven't seen this in the CI for some time, does that mean it's really fixed in the complex OSP environment? I could run a few tests in a loop (various concurrency, distros, ...) over some time to check it, if that was not already done by the Compute DFG. In that case I'd need some cooperation with the Compute DFG to build the case here properly and get as close to the problematic area as possible, in the heavily concurrent OSP env this time.

There is also a documentation angle, and my other concern is still the same as above: a lot of work was done as part of https://review.opendev.org/q/topic:wait_until_sshable_pingable - is it, or will it be, reflected in the OSP documentation? Us adapting the testing framework to the new need to wait for the "SSHABLE" state might not align with customers' expectations/workflows.

Artom, since I am needinfo'd here also, can you or your colleagues from the Compute DFG help me with these questions?
---

The original issue was non-deterministic and was only seen happening with openstack-nova in the CI environment. An openstack-independent reproducer that was more deterministic was created in order to file a qemu bug. In parallel with the qemu fix, tempest got the "SSHABLE" improvement to avoid the detach issue. I think the only way to verify the fix https://bugzilla.redhat.com/show_bug.cgi?id=2087047 with openstack-nova would be to remove (or disable) all the "SSHABLE" tempest improvements and let the CI run for a while to see whether the detach issue no longer appears. Do we really want to go in this direction?

---

I understand your concerns, and I'm going to try to unpack them into specific questions that I can address.

> I mean, even if I haven't seen this in the CI for some time, does it mean it's really fixed in the complex OSP environment?

From a "realpolitik" POV, I would argue that yes, if it doesn't happen in CI (and has not been reported by a customer), we can consider it fixed until new evidence arises that it wasn't. We don't have the resources to chase down potential bugs; we have to concentrate on the very real backlog of bugs that someone - either CI, QE or a customer - has observed in the real world.

> Also there is a documentation angle. My other concern is still the same as above. There was a lot of work done as part of https://review.opendev.org/q/topic:wait_until_sshable_pingable , is it/will it be reflected in the OSP documentation? Maybe us adapting the testing framework to the new need to wait for "SSHABLE" state might not align with customer's expectations/workflows?

My understanding is that this was always a very CI-centric bug, triggered by the combination of the "fake" guest OS (CirrOS) and the rapid succession of operations (boot then attach) in Tempest. I wouldn't be opposed to a blurb in our product docs that explains this caveat, but I'd like the wider Compute team to sanity-check the idea. I'll put it up for discussion and report back.
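The "SSHABLE" improvement referenced above makes tempest wait for a target guest state before running attach/detach operations. As a simplified, hedged sketch of that shape of waiter (not tempest's actual implementation, which also handles ERROR states and logging), a generic status poller might look like:

```python
import time

def wait_for_status(get_status, target, timeout=60.0, interval=1.0):
    """Poll get_status() until it returns `target` or the timeout expires.

    Raises TimeoutError if the resource never reaches the target state.
    Simplified sketch of a tempest-style waiter; `get_status` stands in
    for whatever API call reports the resource's current state.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status == target:
            return status
        if time.monotonic() > deadline:
            raise TimeoutError(
                f"resource stuck in {status!r}, wanted {target!r}")
        time.sleep(interval)

# Usage sketch: step through simulated guest states until "SSHABLE".
# statuses = iter(["BUILD", "ACTIVE", "SSHABLE"])
# wait_for_status(lambda: next(statuses), "SSHABLE", interval=0)
```

Disabling the "SSHABLE" wait, as proposed above for verification, would mean tests proceed as soon as the server is ACTIVE, re-exposing the boot-time race the qemu fix is supposed to close.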
---

Gibi wrote in comment #49:

> I think we don't need the docs. That would only have been needed before the qemu fix landed.

So I think we can agree that at least a Known Issue would be good for 17.0. I'm setting the Doc Type to Known Issue and writing a draft doc text. I can also close this BZ as WONTFIX (because it won't be fixed in 17.0); the closure should not affect the Known Issue, as Docs' script to pick those up does not look at BZ status.