Description of problem:

Running the following tempest test results in failure to detach the volume which was attached:

"tempest.api.compute.volumes.test_attach_volume_negative.AttachVolumeNegativeTest.test_attach_attached_volume_to_same_server"

tempest url: https://github.com/openstack/tempest/blob/master/tempest/api/compute/volumes/test_attach_volume_negative.py#L49

Version-Release number of selected component (if applicable):
RHOS-16.2

How reproducible:
always

Steps to Reproduce:
1. Deploy RHOS-16.2 on AMD SEV enabled computes
2. Run the tempest test "test_attach_attached_volume_to_same_server"
3. Test fails after a timeout of 300 seconds

Actual results:

{1} tempest.api.compute.volumes.test_attach_volume_negative.AttachVolumeNegativeTest.test_attach_attached_volume_to_same_server [309.578348s] ... FAILED

Captured traceback:
~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):
      File "/home/stack/.virtualenvs/.tempest/lib64/python3.6/site-packages/tempest/common/waiters.py", line 288, in wait_for_volume_resource_status
        raise lib_exc.TimeoutException(message)
    tempest.lib.exceptions.TimeoutException: Request timed out
    Details: volume 23863996-dd56-4663-9fe9-11e77775ec13 failed to reach available status (current in-use) within the required time (300 s).
~~~

~~~
testtools.runtest.MultipleExceptions: ((<class 'tempest.lib.exceptions.BadRequest'>, Bad request
Details: {'code': 400, 'message': 'Invalid volume: Volume status must be available or error or error_restoring or error_extending or error_managing and must not be migrating, attached, belong to a group, have snapshots or be disassociated from snapshots after volume transfer.'}, <traceback object at 0x7f660eb71d08>),
(<class 'tempest.lib.exceptions.TimeoutException'>, Request timed out
Details: (AttachVolumeNegativeTest:tearDownClass) Failed to delete volume 0b47fe6d-a383-47e7-949e-f0777690d158 within the required time (300 s)., <traceback object at 0x7f660e9b3b48>))
~~~

Expected results:
Test passed

Additional info:
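For reference, one way to run just this test in isolation (a rough sketch, assuming an already-configured tempest workspace and tempest.conf; the virtualenv path is taken from the traceback above and is otherwise illustrative):

  source /home/stack/.virtualenvs/.tempest/bin/activate
  tempest run --regex 'tempest.api.compute.volumes.test_attach_volume_negative.AttachVolumeNegativeTest.test_attach_attached_volume_to_same_server'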
Could we get the full logs for the nova-compute service and, if possible, the libvirt and QEMU logs for this instance?

This looks and smells like launchpad bug #1882521 [1] from upstream. In the logs you have provided, we can see:

./nova-compute.log:2021-06-28 21:18:27.994 7 WARNING nova.virt.block_device [req-d6846857-5bdb-40c0-bf6a-ac46918a01f7 dbdda0b87b394748830eea54d9b0ecee dd9fa402c56d487abb29e6cb20e8562d - default default] [instance: 9a4bf1ec-7ced-4590-906a-cb17fc9d8efb] Guest refused to detach volume 0b47fe6d-a383-47e7-949e-f0777690d158: nova.exception.DeviceDetachFailed: Device detach failed for vdb: Unable to detach the device from the live config.

We also see:

./nova-compute.log.1:2021-06-28 17:49:24.905 7 ERROR os_brick.initiator.linuxscsi [req-c4f1a0c0-032b-4148-9220-1a6d17ea9e6d d78290c797934b14ba2d7b3d53b22886 a6491f68863d4c37875914680325c773 - default default] multipathd is not running: exit code 1: oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
./nova-compute.log.1:2021-06-28 18:01:21.290 7 ERROR os_brick.initiator.linuxscsi [req-d0fe6a7e-864d-4eed-b388-bf8352ff3fd4 ee02b76b0f8649619f8dc9460b8ac150 0d9f2ad13e7242798aa2984201f2734d - default default] multipathd is not running: exit code 1: oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.

However, I suspect those are a red herring and unrelated to the actual failure, since they occur some time before the actual test executes. I wonder what Lee thinks?

[1] https://bugs.launchpad.net/nova/+bug/1882521
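(For reference, on a 16.2 compute the logs of interest usually live under the containerized log paths on the compute host; the exact layout can vary per deployment, so treat this as a sketch of what to collect:

  # nova-compute (containerized) log
  /var/log/containers/nova/nova-compute.log
  # libvirtd log
  /var/log/containers/libvirt/libvirtd.log
  # per-instance QEMU log, named after the libvirt domain, typically under
  # the libvirt qemu log directory, e.g.:
  /var/log/libvirt/qemu/instance-<id>.log

Grepping those for the instance UUID 9a4bf1ec-7ced-4590-906a-cb17fc9d8efb and the volume UUID should narrow things down.)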
The "multipathd is not running" error comes from os-brick [1] and is ignored if we don't require multipath; I'll fix this up now [2]. Looking at the logs from the RHEL 8.4 guest image run now.

[1] https://github.com/openstack/os-brick/blob/736454730fd66946bd46a95f612388189cbf3cfb/os_brick/initiator/linuxscsi.py#L222
[2] https://review.opendev.org/c/openstack/os-brick/+/799035
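(If you want to confirm locally that the multipathd errors are benign, a quick check on the compute host -- assuming multipath is simply not deployed there -- is:

  systemctl is-active multipathd

If it reports "inactive" and the deployment doesn't use multipath-backed volumes, the os-brick message is just noise, which is what [2] addresses.)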
Double-check whether this is a duplicate of the CIX https://bugzilla.redhat.com/show_bug.cgi?id=2012096.
Kashyap,

There were a number of improvements for both native PCI-E hotplug (albeit it is still slow to react, due to how it's implemented in the guest OS), and q35 now supports ACPI-based hotplug. Can you check with the latest machine type (which supposedly should use ACPI hotplug) and see if it resolves the issue?
(In reply to Igor Mammedov from comment #23)
> Kashyap,
>
> There were a number of improvements for both native PCI-E hotplug (albeit
> it is still slow to react, due to how it's implemented in the guest OS),
> and q35 now supports ACPI-based hotplug. Can you check with the latest
> machine type (which supposedly should use ACPI hotplug) and see if it
> resolves the issue?

@Igor, by "latest machine type", do you mean RHEL 8.4? Or upstream, or something else?
(In reply to Kashyap Chamarthy from comment #24)
> (In reply to Igor Mammedov from comment #23)
> > Kashyap,
> >
> > There were a number of improvements for both native PCI-E hotplug (albeit
> > it is still slow to react, due to how it's implemented in the guest OS),
> > and q35 now supports ACPI-based hotplug. Can you check with the latest
> > machine type (which supposedly should use ACPI hotplug) and see if it
> > resolves the issue?
>
> @Igor, by "latest machine type", do you mean RHEL 8.4? Or upstream, or
> something else?

The RHEL 8.6/9.0 ones.
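(To see which q35 machine types a given compute's QEMU actually provides -- a quick check, assuming the usual RHEL qemu-kvm binary path:

  /usr/libexec/qemu-kvm -machine help | grep q35
)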
(In reply to Igor Mammedov from comment #25)
> (In reply to Kashyap Chamarthy from comment #24)
> > (In reply to Igor Mammedov from comment #23)
> > > Kashyap,
> > >
> > > There were a number of improvements for both native PCI-E hotplug
> > > (albeit it is still slow to react, due to how it's implemented in the
> > > guest OS), and q35 now supports ACPI-based hotplug. Can you check with
> > > the latest machine type (which supposedly should use ACPI hotplug) and
> > > see if it resolves the issue?
> >
> > @Igor, by "latest machine type", do you mean RHEL 8.4? Or upstream, or
> > something else?
>
> The RHEL 8.6/9.0 ones.

Hi, Igor. OSP 16.2 is "fixed" to RHEL 8.4, and there is no supported OSP release on RHEL 8.6. The next in-progress OSP release (17) is on RHEL 9.

Let me check with our fine QE if they can test 16.2 on RHEL 8.6, or if there are OSP 17 builds yet to test this on RHEL 9. That said, the buggy behavior remains on RHEL 8.4, so if we're able to track down this problem, then we should do a QEMU bisect and backport the relevant commits.

@jparker: Can we please test OSP 16.2 with RHEL 8.6? (And note the QEMU version of RHEL 8.6 used for testing here.)
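(For the test itself, one way to pin new guests to a specific machine type is via the [libvirt]/hw_machine_type option in nova.conf on the computes, or per image with the hw_machine_type image property. A sketch -- the exact machine type string depends on what the installed QEMU provides:

  [libvirt]
  hw_machine_type = x86_64=pc-q35-rhel8.6.0

or:

  openstack image set --property hw_machine_type=pc-q35-rhel8.6.0 <image>
)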
Looking at the command line, it should be using ACPI-based PCI hotplug. The only difference vs. the configuration it was tested with is the SEV options. Can you check whether unplug works with a non-encrypted guest configuration? (Also make sure the guest is fully booted before trying to unplug.)
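(One way to compare directly on the compute, bypassing tempest -- a rough sketch, assuming you can reach the libvirt domain with virsh:

  # confirm the machine type in use
  virsh dumpxml <instance> | grep machine=
  # try a live detach of the data disk by its target name
  virsh detach-disk <instance> vdb --live

Repeating the same against a guest booted without SEV, i.e. without hw:mem_encryption on the flavor or hw_mem_encryption on the image, would show whether the SEV options are the differentiator.)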
Another thing to check is whether hotplug works at all (i.e. try to hot-add a virtio-blk-pci disk after the guest is booted).
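(A quick way to exercise a plain hot-add outside of Nova -- a sketch, assuming direct virsh access on the compute; Nova won't know about this disk, so detach it again afterwards:

  qemu-img create -f qcow2 /tmp/hotplug-test.qcow2 1G
  virsh attach-disk <instance> /tmp/hotplug-test.qcow2 vdz --live --subdriver qcow2
  # and the corresponding unplug
  virsh detach-disk <instance> vdz --live
)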
*** This bug has been marked as a duplicate of bug 2012096 ***