2280332 – nvmeof GW exits after fully startup is not performed after brought back post failover leading to WAIT_FAILBACK_PREPARED ana_state


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Ceph Storage
Classification:	Red Hat Storage
Component:	NVMeOF
Sub Component:
Version:	7.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	7.1
Assignee:	Aviv Caro
QA Contact:	Rahul Lepakshi
Docs Contact:	ceph-doc-bot
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2024-05-14 08:27 UTC by Rahul Lepakshi
Modified:	2024-06-13 14:32 UTC
CC List:	6 users
Fixed In Version:	ceph-18.2.1-176.el9cp
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2024-06-13 14:32:49 UTC
Embargoed:
Dependent Products:
Flags:	rlepaksh: needinfo- rlepaksh: needinfo-

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	RHCEPH-9009	0	None	None	None	2024-05-14 08:28:44 UTC
Red Hat Product Errata	RHSA-2024:3925	0	None	None	None	2024-06-13 14:32:51 UTC

Description Rahul Lepakshi 2024-05-14 08:27:30 UTC

Description of problem:
After a failover, when the failed GW is brought back up, it does not start up with the latest state stored in omap. It hangs even when asked to list subsystems, and eventually exits.

[root@argo023 nvmeof-client.nvmeof.nvmeof_pool.argo023.qmbpxi]# nvmeof subsystem list
Subsystems:
╒═══════════╤════════════════════════════╤════════════╤══════════╤══════════════════╤═════════════╤══════════════╕
│ Subtype   │ NQN                        │ HA State   │   Serial │ Controller IDs   │   Namespace │          Max │
│           │                            │            │   Number │                  │       Count │   Namespaces │
╞═══════════╪════════════════════════════╪════════════╪══════════╪══════════════════╪═════════════╪══════════════╡
│ NVMe      │ nqn.2016-06.io.spdk:cnode1 │ enabled    │        1 │ 2041-4080        │          93 │         2048 │
├───────────┼────────────────────────────┼────────────┼──────────┼──────────────────┼─────────────┼──────────────┤
│ NVMe      │ nqn.2016-06.io.spdk:cnode2 │ enabled    │        2 │ 2041-4080        │          95 │         2048 │
╘═══════════╧════════════════════════════╧════════════╧══════════╧══════════════════╧═════════════╧══════════════╛


Version-Release number of selected component (if applicable):


How reproducible: 2/2


Steps to Reproduce:
1. Deploy a Ceph cluster and the nvmeof service.
2. Perform a failover, then bring the failed GW back up for failback.
3. Observe that the GW does not fully start up and load all GW components from omap, and that CLI command output is not accurate.

Actual results: The GW does not complete a full startup. It ends up in a state where it cannot list subsystems or namespaces.


Expected results: The GW performs a full startup correctly.
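
One way to check from a script for the hang described in this report is to run the listing command under a timeout. The sketch below is only an illustration and is not part of the fix. The helper name `cli_responds` and its 30-second default are hypothetical, and it assumes the `nvmeof` command shown in the description is available in the shell where the script runs.

```python
import subprocess
import sys


def cli_responds(argv, timeout_s=30.0):
    """Return True if the command exits with status 0 within timeout_s seconds.

    A command that hangs past the timeout is killed and reported as False.
    """
    try:
        result = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


# Hypothetical usage on the gateway host, mirroring the command in this report:
#   cli_responds(["nvmeof", "subsystem", "list"])
```

A False result only says that the command hung or failed. It does not distinguish a GW that is still starting from one that has exited.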


Additional info:

Comment 1 RHEL Program Management 2024-05-14 08:27:40 UTC

Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 2 Aviv Caro 2024-05-14 08:40:25 UTC

The issue is understood. Leonid is preparing a fix.

Comment 3 Rahul Lepakshi 2024-05-14 15:22:39 UTC

@aviv.caro I am marking this as a blocker: in a 2-GW configuration, if the other GW also goes down for some reason, no GW is left to handle I/O and maintain namespaces in the Ceph cluster. We hit data unavailability in this case.

Comment 4 Aviv Caro 2024-05-16 09:03:39 UTC

Fixed by https://gitlab.cee.redhat.com/ceph/ceph/-/commit/7837cc865312b562228519e3efdf658a3cde4193

Comment 8 Rahul Lepakshi 2024-05-28 05:03:25 UTC

I am not seeing this issue on recent builds, but I have an observation on a scale cluster: the GW takes at least 2 minutes to load all namespaces (NS) after it reaches the ACTIVE STANDBY state. Successive listings show the namespace count climbing:


[root@argo023 ~]# nvmeof subsystem list
Subsystems:
╒═══════════╤════════════════════════════╤════════════╤════════════════════╤══════════════════╤═════════════╤══════════════╕
│ Subtype   │ NQN                        │ HA State   │ Serial             │ Controller IDs   │   Namespace │          Max │
│           │                            │            │ Number             │                  │       Count │   Namespaces │
╞═══════════╪════════════════════════════╪════════════╪════════════════════╪══════════════════╪═════════════╪══════════════╡
│ NVMe      │ nqn.2016-06.io.spdk:cnode1 │ enabled    │ Ceph76593830561176 │ 2041-4080        │         179 │          400 │
├───────────┼────────────────────────────┼────────────┼────────────────────┼──────────────────┼─────────────┼──────────────┤
│ NVMe      │ nqn.2016-06.io.spdk:cnode2 │ enabled    │ Ceph50770207011824 │ 2041-4080        │         182 │          400 │
╘═══════════╧════════════════════════════╧════════════╧════════════════════╧══════════════════╧═════════════╧══════════════╛
[root@argo023 ~]# nvmeof subsystem list
Subsystems:
╒═══════════╤════════════════════════════╤════════════╤════════════════════╤══════════════════╤═════════════╤══════════════╕
│ Subtype   │ NQN                        │ HA State   │ Serial             │ Controller IDs   │   Namespace │          Max │
│           │                            │            │ Number             │                  │       Count │   Namespaces │
╞═══════════╪════════════════════════════╪════════════╪════════════════════╪══════════════════╪═════════════╪══════════════╡
│ NVMe      │ nqn.2016-06.io.spdk:cnode1 │ enabled    │ Ceph76593830561176 │ 2041-4080        │         190 │          400 │
├───────────┼────────────────────────────┼────────────┼────────────────────┼──────────────────┼─────────────┼──────────────┤
│ NVMe      │ nqn.2016-06.io.spdk:cnode2 │ enabled    │ Ceph50770207011824 │ 2041-4080        │         191 │          400 │
╘═══════════╧════════════════════════════╧════════════╧════════════════════╧══════════════════╧═════════════╧══════════════╛
[root@argo023 ~]# nvmeof subsystem list
Subsystems:
╒═══════════╤════════════════════════════╤════════════╤════════════════════╤══════════════════╤═════════════╤══════════════╕
│ Subtype   │ NQN                        │ HA State   │ Serial             │ Controller IDs   │   Namespace │          Max │
│           │                            │            │ Number             │                  │       Count │   Namespaces │
╞═══════════╪════════════════════════════╪════════════╪════════════════════╪══════════════════╪═════════════╪══════════════╡
│ NVMe      │ nqn.2016-06.io.spdk:cnode1 │ enabled    │ Ceph76593830561176 │ 2041-4080        │         196 │          400 │
├───────────┼────────────────────────────┼────────────┼────────────────────┼──────────────────┼─────────────┼──────────────┤
│ NVMe      │ nqn.2016-06.io.spdk:cnode2 │ enabled    │ Ceph50770207011824 │ 2041-4080        │         197 │          400 │
╘═══════════╧════════════════════════════╧════════════╧════════════════════╧══════════════════╧═════════════╧══════════════╛
[root@argo023 ~]# nvmeof subsystem list
Subsystems:
╒═══════════╤════════════════════════════╤════════════╤════════════════════╤══════════════════╤═════════════╤══════════════╕
│ Subtype   │ NQN                        │ HA State   │ Serial             │ Controller IDs   │   Namespace │          Max │
│           │                            │            │ Number             │                  │       Count │   Namespaces │
╞═══════════╪════════════════════════════╪════════════╪════════════════════╪══════════════════╪═════════════╪══════════════╡
│ NVMe      │ nqn.2016-06.io.spdk:cnode1 │ enabled    │ Ceph76593830561176 │ 2041-4080        │         200 │          400 │
├───────────┼────────────────────────────┼────────────┼────────────────────┼──────────────────┼─────────────┼──────────────┤
│ NVMe      │ nqn.2016-06.io.spdk:cnode2 │ enabled    │ Ceph50770207011824 │ 2041-4080        │         200 │          400 │
╘═══════════╧════════════════════════════╧════════════╧════════════════════╧══════════════════╧═════════════╧══════════════╛
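
To track how the namespace count grows across listings like the ones above, the table text can be parsed into per-subsystem counts. This is a minimal sketch that assumes the seven-column layout shown in this comment. The function names `parse_namespace_counts` and `all_loaded` are hypothetical, and the sample rows are copied from the first listing above.

```python
SAMPLE = """\
│ Subtype   │ NQN                        │ HA State   │ Serial             │ Controller IDs   │   Namespace │          Max │
│           │                            │            │ Number             │                  │       Count │   Namespaces │
│ NVMe      │ nqn.2016-06.io.spdk:cnode1 │ enabled    │ Ceph76593830561176 │ 2041-4080        │         179 │          400 │
│ NVMe      │ nqn.2016-06.io.spdk:cnode2 │ enabled    │ Ceph50770207011824 │ 2041-4080        │         182 │          400 │
"""


def parse_namespace_counts(table_text):
    """Map each subsystem NQN to the namespace count shown in the table text."""
    counts = {}
    for raw in table_text.splitlines():
        line = raw.strip()
        # Data and header rows start with a vertical bar; border rows do not.
        if not line.startswith("│"):
            continue
        cells = [cell.strip() for cell in line.strip("│").split("│")]
        # Keep only rows with seven cells whose second cell is an NQN.
        if len(cells) == 7 and cells[1].startswith("nqn."):
            counts[cells[1]] = int(cells[5])
    return counts


def all_loaded(counts, expected):
    """Return True once every expected NQN reports at least its expected count."""
    return all(counts.get(nqn, 0) >= want for nqn, want in expected.items())


print(parse_namespace_counts(SAMPLE))
```

Comparing the parsed counts against the expected totals (200 per subsystem in the final listing above) gives a simple check for when the GW has finished loading namespaces.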

Comment 9 errata-xmlrpc 2024-06-13 14:32:49 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Critical: Red Hat Ceph Storage 7.1 security, enhancements, and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:3925
