Description of problem:

Running the following tempest test results in failure to detach the volume which was attached:

"tempest.api.compute.volumes.test_attach_volume_negative.AttachVolumeNegativeTest.test_attach_attached_volume_to_same_server"

tempest url: https://github.com/openstack/tempest/blob/master/tempest/api/compute/volumes/test_attach_volume_negative.py#L49

Version-Release number of selected component (if applicable):
RHOS-16.2

How reproducible:
always

Steps to Reproduce:
1. Deploy RHOS-16.2 on AMD SEV enabled computes
2. Run the tempest test "test_attach_attached_volume_to_same_server"
3. Test fails after a timeout of 300 seconds

Actual results:

{1} tempest.api.compute.volumes.test_attach_volume_negative.AttachVolumeNegativeTest.test_attach_attached_volume_to_same_server [309.578348s] ... FAILED

Captured traceback:
~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):
      File "/home/stack/.virtualenvs/.tempest/lib64/python3.6/site-packages/tempest/common/waiters.py", line 288, in wait_for_volume_resource_status
        raise lib_exc.TimeoutException(message)
    tempest.lib.exceptions.TimeoutException: Request timed out
    Details: volume 23863996-dd56-4663-9fe9-11e77775ec13 failed to reach available status (current in-use) within the required time (300 s).
~~~

~~~
testtools.runtest.MultipleExceptions: ((<class 'tempest.lib.exceptions.BadRequest'>, Bad request
Details: {'code': 400, 'message': 'Invalid volume: Volume status must be available or error or error_restoring or error_extending or error_managing and must not be migrating, attached, belong to a group, have snapshots or be disassociated from snapshots after volume transfer.'}, <traceback object at 0x7f660eb71d08>),
(<class 'tempest.lib.exceptions.TimeoutException'>, Request timed out
Details: (AttachVolumeNegativeTest:tearDownClass) Failed to delete volume 0b47fe6d-a383-47e7-949e-f0777690d158 within the required time (300 s)., <traceback object at 0x7f660e9b3b48>))
~~~

Expected results:
Test passed

Additional info:
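For reference, one way to run just this test in isolation (a rough sketch, assuming an already-configured tempest workspace and tempest.conf; the virtualenv path is taken from the traceback above and is otherwise illustrative):

  source /home/stack/.virtualenvs/.tempest/bin/activate
  tempest run --regex 'tempest.api.compute.volumes.test_attach_volume_negative.AttachVolumeNegativeTest.test_attach_attached_volume_to_same_server'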
Could we get the full logs for the nova-compute service and, if possible, the libvirt and QEMU logs for this instance?

This looks and smells like launchpad bug #1882521 [1] from upstream. In the logs you have provided, we can see:

./nova-compute.log:2021-06-28 21:18:27.994 7 WARNING nova.virt.block_device [req-d6846857-5bdb-40c0-bf6a-ac46918a01f7 dbdda0b87b394748830eea54d9b0ecee dd9fa402c56d487abb29e6cb20e8562d - default default] [instance: 9a4bf1ec-7ced-4590-906a-cb17fc9d8efb] Guest refused to detach volume 0b47fe6d-a383-47e7-949e-f0777690d158: nova.exception.DeviceDetachFailed: Device detach failed for vdb: Unable to detach the device from the live config.

We also see:

./nova-compute.log.1:2021-06-28 17:49:24.905 7 ERROR os_brick.initiator.linuxscsi [req-c4f1a0c0-032b-4148-9220-1a6d17ea9e6d d78290c797934b14ba2d7b3d53b22886 a6491f68863d4c37875914680325c773 - default default] multipathd is not running: exit code 1: oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
./nova-compute.log.1:2021-06-28 18:01:21.290 7 ERROR os_brick.initiator.linuxscsi [req-d0fe6a7e-864d-4eed-b388-bf8352ff3fd4 ee02b76b0f8649619f8dc9460b8ac150 0d9f2ad13e7242798aa2984201f2734d - default default] multipathd is not running: exit code 1: oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.

However, I suspect those are a red herring and unrelated to the actual failure, since they occur some time before the actual test executes. I wonder what Lee thinks?

[1] https://bugs.launchpad.net/nova/+bug/1882521
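(For reference, on a 16.2 compute the logs of interest usually live under the containerized log paths on the compute host; the exact layout can vary per deployment, so treat this as a sketch of what to collect:

  # nova-compute (containerized) log
  /var/log/containers/nova/nova-compute.log
  # libvirtd log
  /var/log/containers/libvirt/libvirtd.log
  # per-instance QEMU log, named after the libvirt domain, typically under
  # the libvirt qemu log directory, e.g.:
  /var/log/libvirt/qemu/instance-<id>.log

Grepping those for the instance UUID 9a4bf1ec-7ced-4590-906a-cb17fc9d8efb and the volume UUID should narrow things down.)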
The "multipathd is not running" error comes from os-brick [1] and is ignored if we don't require multipath; I'll fix this up now [2]. Looking at the logs from the RHEL 8.4 guest image run now.

[1] https://github.com/openstack/os-brick/blob/736454730fd66946bd46a95f612388189cbf3cfb/os_brick/initiator/linuxscsi.py#L222
[2] https://review.opendev.org/c/openstack/os-brick/+/799035
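(If you want to confirm locally that the multipathd errors are benign, a quick check on the compute host -- assuming multipath is simply not deployed there -- is:

  systemctl is-active multipathd

If it reports "inactive" and the deployment doesn't use multipath-backed volumes, the os-brick message is just noise, which is what [2] addresses.)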
Double-check whether this is a duplicate of the CIX https://bugzilla.redhat.com/show_bug.cgi?id=2012096.
Kashyap,

There were a number of improvements for both native PCI-E hotplug (albeit it is still slow to react, due to how it's implemented in the guest OS), and q35 now supports ACPI-based hotplug. Can you check with the latest machine type (which supposedly should use ACPI hotplug) and see if it resolves the issue?
(In reply to Igor Mammedov from comment #23)
> Kashyap,
>
> There were a number of improvements for both native PCI-E hotplug (albeit
> it is still slow to react, due to how it's implemented in the guest OS),
> and q35 now supports ACPI-based hotplug. Can you check with the latest
> machine type (which supposedly should use ACPI hotplug) and see if it
> resolves the issue?

@Igor, by "latest machine type", do you mean RHEL 8.4? Or upstream, or something else?
(In reply to Kashyap Chamarthy from comment #24)
> (In reply to Igor Mammedov from comment #23)
> > Kashyap,
> >
> > There were a number of improvements for both native PCI-E hotplug (albeit
> > it is still slow to react, due to how it's implemented in the guest OS),
> > and q35 now supports ACPI-based hotplug. Can you check with the latest
> > machine type (which supposedly should use ACPI hotplug) and see if it
> > resolves the issue?
>
> @Igor, by "latest machine type", do you mean RHEL 8.4? Or upstream, or
> something else?

The RHEL 8.6/9.0 ones.
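(To see which q35 machine types a given compute's QEMU actually provides -- a quick check, assuming the usual RHEL qemu-kvm binary path:

  /usr/libexec/qemu-kvm -machine help | grep q35
)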
(In reply to Igor Mammedov from comment #25)
> (In reply to Kashyap Chamarthy from comment #24)
> > (In reply to Igor Mammedov from comment #23)
> > > Kashyap,
> > >
> > > There were a number of improvements for both native PCI-E hotplug
> > > (albeit it is still slow to react, due to how it's implemented in the
> > > guest OS), and q35 now supports ACPI-based hotplug. Can you check with
> > > the latest machine type (which supposedly should use ACPI hotplug) and
> > > see if it resolves the issue?
> >
> > @Igor, by "latest machine type", do you mean RHEL 8.4? Or upstream, or
> > something else?
>
> The RHEL 8.6/9.0 ones.

Hi, Igor. OSP 16.2 is "fixed" to RHEL 8.4, and there is no supported OSP release on RHEL 8.6. The next in-progress OSP release (17) is on RHEL 9.

Let me check with our fine QE if they can test 16.2 on RHEL 8.6, or if there are OSP 17 builds yet to test this on RHEL 9. That said, the buggy behavior remains on RHEL 8.4, so if we're able to track down this problem, then we should do a QEMU bisect and backport the relevant commits.

@jparker: Can we please test OSP 16.2 with RHEL 8.6? (And note the QEMU version of RHEL 8.6 used for testing here.)
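(For the test itself, one way to pin new guests to a specific machine type is via the [libvirt]/hw_machine_type option in nova.conf on the computes, or per image with the hw_machine_type image property. A sketch -- the exact machine type string depends on what the installed QEMU provides:

  [libvirt]
  hw_machine_type = x86_64=pc-q35-rhel8.6.0

or:

  openstack image set --property hw_machine_type=pc-q35-rhel8.6.0 <image>
)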
Looking at the command line, it should be using ACPI-based PCI hotplug. The only difference vs. the configuration it was tested with is the SEV options. Can you check whether unplug works with a non-encrypted guest configuration? (Also make sure the guest is fully booted before trying to unplug.)
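(One way to compare directly on the compute, bypassing tempest -- a rough sketch, assuming you can reach the libvirt domain with virsh:

  # confirm the machine type in use
  virsh dumpxml <instance> | grep machine=
  # try a live detach of the data disk by its target name
  virsh detach-disk <instance> vdb --live

Repeating the same against a guest booted without SEV, i.e. without hw:mem_encryption on the flavor or hw_mem_encryption on the image, would show whether the SEV options are the differentiator.)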
Another thing to check is whether hotplug works at all (i.e. try to hot-add a virtio-blk-pci disk after the guest is booted).
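(A quick way to exercise a plain hot-add outside of Nova -- a sketch, assuming direct virsh access on the compute; Nova won't know about this disk, so detach it again afterwards:

  qemu-img create -f qcow2 /tmp/hotplug-test.qcow2 1G
  virsh attach-disk <instance> /tmp/hotplug-test.qcow2 vdz --live --subdriver qcow2
  # and the corresponding unplug
  virsh detach-disk <instance> vdz --live
)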
*** This bug has been marked as a duplicate of bug 2012096 ***