This bug has been migrated to another issue tracking site. It has been closed here and may no longer be monitored.

If you would like to receive updates for this issue, or to participate in it, you may do so at the Red Hat Issue Tracker.
Bug 2230602 - [RHOS-17.1] Resizing compute with NVMeOF Cinder backend intermittently fails to find volume
Summary: [RHOS-17.1] Resizing compute with NVMeOF Cinder backend intermittently fails ...
Keywords:
Status: CLOSED MIGRATED
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-os-brick
Version: 17.1 (Wallaby)
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Assignee: Cinder Bugs List
QA Contact: Evelina Shames
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-08-09 17:37 UTC by James Parker
Modified: 2025-01-09 21:56 UTC
CC: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2025-01-09 21:55:46 UTC
Target Upstream Version:
Embargoed:




Links
- Launchpad 2035375 (last updated 2023-09-15 13:15:54 UTC)
- Launchpad 2035695 (last updated 2023-09-15 13:15:54 UTC)
- Launchpad 2035911 (last updated 2023-09-15 13:15:54 UTC)
- OpenStack gerrit 895192, MERGED: "Fix guard for NVMeOF volumes" (last updated 2024-12-10 14:59:21 UTC)
- OpenStack gerrit 895193, MERGED: "NVMe-oF: Fix attach when reconnecting" (last updated 2024-12-10 14:59:25 UTC)
- Red Hat Issue Tracker OSP-27370 (last updated 2023-08-09 17:37:53 UTC)
- Red Hat Issue Tracker OSP-33369 (last updated 2025-01-09 21:56:58 UTC)
- Red Hat Issue Tracker OSPRH-12844 (last updated 2025-01-09 21:55:45 UTC)

Description James Parker 2023-08-09 17:37:11 UTC
Description of problem: Resizing a guest with an NVMeOF Cinder backend is intermittently failing in phase3 CI. The test specifically triggering this failure is [1]. Full logs, deployment details, and job details will be included in a follow-up comment. Please advise if the attempted actions are unsupported for the environment.

2023-08-03 02:01:55.660 18 DEBUG placement.objects.research_context [req-37c55e62-a4c1-41d6-be37-e83b9dbe7364 d296552183034fba8b15267e5b19b68d ddccfaba37ea41a4bc85a4a418f8f9e2 - default default] found 2 providers after filtering by previous result get_provider_ids_matching /usr/lib/python3.9/site-packages/placement/objects/research_context.py:580
2023-08-03 02:01:55.661 2 DEBUG os_brick.initiator.connectors.base [req-a9e03d62-5284-4fbf-bdc8-c2bcc78d4703 571ccebe0b70413691998e6cac3046ca e40a3908acf04d27b9e8f4bbbe5005ff - default default] Lock "connect_volume" "released" by "os_brick.initiator.connectors.nvmeof.NVMeOFConnector.connect_volume" :: held 6.036s inner /usr/lib/python3.9/site-packages/os_brick/initiator/connectors/base.py:83
2023-08-03 02:01:55.661 2 DEBUG os_brick.initiator.connectors.nvmeof [req-a9e03d62-5284-4fbf-bdc8-c2bcc78d4703 571ccebe0b70413691998e6cac3046ca e40a3908acf04d27b9e8f4bbbe5005ff - default default] <== connect_volume: exception (6037ms) VolumeDeviceNotFound('Volume device not found at nqn.nvme-subsystem-compute-0.redhat.local.') trace_logging_wrapper /usr/lib/python3.9/site-packages/os_brick/utils.py:176
2023-08-03 02:01:55.662 2 ERROR nova.compute.manager [req-a9e03d62-5284-4fbf-bdc8-c2bcc78d4703 571ccebe0b70413691998e6cac3046ca e40a3908acf04d27b9e8f4bbbe5005ff - default default] [instance: 2f89eb85-eccc-4b9c-a500-490237c40fea] Setting instance vm_state to ERROR: os_brick.exception.VolumeDeviceNotFound: Volume device not found at nqn.nvme-subsystem-compute-0.redhat.local.
2023-08-03 02:01:55.662 2 ERROR nova.compute.manager [instance: 2f89eb85-eccc-4b9c-a500-490237c40fea] Traceback (most recent call last):
2023-08-03 02:01:55.662 2 ERROR nova.compute.manager [instance: 2f89eb85-eccc-4b9c-a500-490237c40fea]   File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 10256, in _error_out_instance_on_exception
2023-08-03 02:01:55.662 2 ERROR nova.compute.manager [instance: 2f89eb85-eccc-4b9c-a500-490237c40fea]     yield
2023-08-03 02:01:55.662 2 ERROR nova.compute.manager [instance: 2f89eb85-eccc-4b9c-a500-490237c40fea]   File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 5851, in _finish_resize_helper
2023-08-03 02:01:55.662 2 ERROR nova.compute.manager [instance: 2f89eb85-eccc-4b9c-a500-490237c40fea]     network_info = self._finish_resize(context, instance, migration,

....<REMOVED FOR BREVITY>....

2023-08-03 02:01:55.662 2 ERROR nova.compute.manager [instance: 2f89eb85-eccc-4b9c-a500-490237c40fea]     result = fn(*args, **kwargs)
2023-08-03 02:01:55.662 2 ERROR nova.compute.manager [instance: 2f89eb85-eccc-4b9c-a500-490237c40fea]   File "/usr/lib/python3.9/site-packages/os_brick/initiator/connectors/nvmeof.py", line 908, in _connect_target
2023-08-03 02:01:55.662 2 ERROR nova.compute.manager [instance: 2f89eb85-eccc-4b9c-a500-490237c40fea]     raise exception.VolumeDeviceNotFound(device=target.nqn)
2023-08-03 02:01:55.662 2 ERROR nova.compute.manager [instance: 2f89eb85-eccc-4b9c-a500-490237c40fea] os_brick.exception.VolumeDeviceNotFound: Volume device not found at nqn.nvme-subsystem-compute-0.redhat.local.
2023-08-03 02:01:55.662 2 ERROR nova.compute.manager [instance: 2f89eb85-eccc-4b9c-a500-490237c40fea] 

Version-Release number of selected component (if applicable):
RHOS-17.1-RHEL-9-20230712.n.1

How reproducible:
50%

Steps to Reproduce:
1. Deploy 17.1 with NVMeOF Cinder backend enabled
2. Create a server and volume and attach the volume to the server
3. Resize the server
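The steps above can be sketched with the openstacksdk Python client. This is a hypothetical repro script, not the CI's actual test code: the cloud, image, and flavor names ("overcloud", "cirros", "m1.tiny", "m1.small") are placeholders for whatever the deployment provides, and it requires a reachable 17.1 cloud with the NVMeOF Cinder backend enabled.

```python
# Hypothetical repro sketch; all names below are placeholders -- substitute
# values from the actual deployment's clouds.yaml, images, and flavors.
import openstack

conn = openstack.connect(cloud="overcloud")

# Step 2: create a server and a volume, then attach the volume.
server = conn.create_server(name="nvmeof-resize-test", image="cirros",
                            flavor="m1.tiny", wait=True)
volume = conn.create_volume(size=1, name="nvmeof-resize-vol", wait=True)
conn.attach_volume(server, volume)

# Step 3: resize the server; roughly half the runs hit
# VolumeDeviceNotFound while the destination host reconnects the
# NVMe-oF volume.
new_flavor = conn.compute.find_flavor("m1.small")
conn.compute.resize_server(server, new_flavor)
```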

Actual results:
Server resize fails because the attached volume cannot be found during the resize process.

Expected results:
Server resize action succeeds.


Additional info:
[1] https://github.com/openstack/tempest/blob/34.1.0/tempest/api/compute/servers/test_server_actions.py#L459C1-L459C58

Comment 4 Brian Rosmaita 2023-08-14 18:35:17 UTC
Marking it low priority; let's keep an eye on this issue.

Comment 8 Gorka Eguileor 2023-09-15 13:15:55 UTC
This is a complex issue that has many sides:

- Nova not requesting volumes with the latest microversion: https://bugs.launchpad.net/nova/+bug/2035375
- Nova not guarding in all operations: https://bugs.launchpad.net/nova/+bug/2035911
- os-brick's NVMe-oF connector not properly handling portals in connecting state: https://bugs.launchpad.net/os-brick/+bug/2035695
- nvmet not sending AER messages when deleting a subsystem
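The "guarding" idea from the Nova bugs above can be illustrated with a simplified, hypothetical sketch (this is not Nova's actual code; the `needs_fresh_connect` helper and the `connection_info` shape are illustrative): before reusing saved connection info on the destination host, verify that the recorded device still exists and force a fresh `connect_volume` otherwise.

```python
# Simplified, hypothetical sketch of the "guard" idea: do not trust stale
# connection_info after a migration/resize -- verify the device first.
import os

def needs_fresh_connect(connection_info: dict) -> bool:
    """Return True when the saved device path is missing or gone."""
    device_path = (connection_info or {}).get("device_path")
    return device_path is None or not os.path.exists(device_path)

# A record with no device path, or one pointing at a vanished device,
# should trigger a fresh os-brick connect_volume() call.
print(needs_fresh_connect({}))                                    # True
print(needs_fresh_connect({"device_path": "/dev/nvme-missing"}))  # True
```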

The easiest way to fix the issue is with the os-brick patch: https://review.opendev.org/c/openstack/os-brick/+/895193
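The core of that connector problem is a portal stuck in the "connecting" state being treated as unusable. A minimal sketch of the idea follows; it is a hypothetical simplification, not the actual os-brick patch, and `wait_for_live_portal` and its state strings are illustrative: poll the session state briefly and only raise `VolumeDeviceNotFound` once it fails to reach "live".

```python
# Hypothetical simplification of the fix idea: instead of giving up on a
# portal in "connecting" state, wait briefly for it to transition to
# "live" before declaring the volume device missing.
import time

class VolumeDeviceNotFound(Exception):
    pass

def wait_for_live_portal(get_state, timeout=5.0, interval=0.1):
    """Poll the portal state; return 'live' on success, raise on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        state = get_state()
        if state == "live":
            return state
        if time.monotonic() >= deadline:
            raise VolumeDeviceNotFound(f"portal stuck in state {state!r}")
        time.sleep(interval)

# Simulated portal that needs two polls before the connection completes:
states = iter(["connecting", "connecting", "live"])
print(wait_for_live_portal(lambda: next(states), interval=0.01))  # live
```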

