Bug 2272661 - mon: add NVMe-oF gateway monitor and HA
Summary: mon: add NVMe-oF gateway monitor and HA
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: NVMeOF
Version: 7.1
Hardware: x86_64
OS: All
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 7.1
Assignee: Aviv Caro
QA Contact: Manohar Murthy
Docs Contact: Akash Raj
URL:
Whiteboard:
Depends On:
Blocks: 2267614 2298578 2298579
Reported: 2024-04-02 13:08 UTC by Aviv Caro
Modified: 2024-11-16 04:25 UTC
9 users

Fixed In Version: ceph-18.2.1-121.el9cp
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-06-13 14:31:06 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHCEPH-8720 0 None None None 2024-04-02 13:13:59 UTC
Red Hat Product Errata RHSA-2024:3925 0 None None None 2024-06-13 14:31:11 UTC

Description Aviv Caro 2024-04-02 13:08:59 UTC
Description of problem: New feature (Fixes: https://tracker.ceph.com/issues/64777) 

This PR adds high availability support for the nvmeof Ceph service. High availability means that even when a certain GW is down, the initiator still has another available path through which it can continue IO via another GW. High availability is achieved by running an nvmeof service consisting of at least 2 nvmeof GWs in the Ceph cluster. Every GW is seen by the host (initiator) as a separate path to the nvme namespaces (volumes).
The implementation consists of the following main modules:

NVMeofGWMon - a PaxosService. It is a monitor service that tracks the status of the running nvmeof services and takes action when services fail or are restored.
NVMeofGwMonitorClient – an agent that runs as part of each nvmeof GW. It sends beacons to the monitor to signal that the GW is alive. As part of the beacon, the client also sends information about the service, which the monitor uses to make decisions and perform operations.
MNVMeofGwBeacon – the message structure used by the client and the monitor to send/receive the beacons.
MNVMeofGwMap – the map tracks the status of the nvmeof GWs and defines the new role of every GW, so when GWs go down or are restored, the map reflects the new role of each GW resulting from those events. The map is distributed to the NVMeofGwMonitorClient on each GW, which then applies the required changes to the GW. (A simplified sketch of the beacon and failover flow follows this list.)
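To make the beacon/failover flow concrete, below is a minimal, illustrative C++ sketch. The names (GwBeacon, AnaState, fail_over) and the "pick the first surviving GW" policy are hypothetical simplifications for illustration only; they are not the actual MNVMeofGwBeacon/NVMeofGWMon/MNVMeofGwMap interfaces.

// Illustrative only: hypothetical types, not the actual Ceph nvmeof monitor code.
#include <iostream>
#include <map>
#include <string>

// Simplified ANA group state, matching the ACTIVE/STANDBY states visible in `ceph nvme-gw show`.
enum class AnaState { ACTIVE, STANDBY };

// Roughly the information a gateway beacon carries: identity, availability,
// and the state this gateway holds for each ANA group.
struct GwBeacon {
    std::string gw_id;                   // e.g. "client.nvmeof.<pool>.<host>.<suffix>"
    bool available;                      // the GW process is alive and beaconing
    std::map<int, AnaState> ana_states;  // ANA group id -> state on this GW
};

// Minimal failover decision: when a GW stops beaconing, mark it unavailable and
// hand each ANA group it was actively serving to a surviving GW, so the
// initiator keeps at least one usable path to every namespace.
void fail_over(std::map<std::string, GwBeacon>& gw_map, const std::string& dead_gw_id) {
    auto dead = gw_map.find(dead_gw_id);
    if (dead == gw_map.end())
        return;
    dead->second.available = false;
    for (auto& [group, state] : dead->second.ana_states) {
        if (state != AnaState::ACTIVE)
            continue;
        state = AnaState::STANDBY;
        for (auto& [id, gw] : gw_map) {          // pick the first surviving GW
            if (gw.available) {
                gw.ana_states[group] = AnaState::ACTIVE;
                break;
            }
        }
    }
}

int main() {
    // Two GWs, two ANA groups: each GW is ACTIVE for its own group, STANDBY for the other.
    std::map<std::string, GwBeacon> gw_map = {
        {"gw-a", {"gw-a", true, {{1, AnaState::ACTIVE},  {2, AnaState::STANDBY}}}},
        {"gw-b", {"gw-b", true, {{1, AnaState::STANDBY}, {2, AnaState::ACTIVE}}}},
    };
    fail_over(gw_map, "gw-a");  // gw-a's beacons stop; ANA group 1 moves to gw-b
    for (const auto& [id, gw] : gw_map)
        for (const auto& [group, state] : gw.ana_states)
            std::cout << id << " group " << group << ": "
                      << (state == AnaState::ACTIVE ? "ACTIVE" : "STANDBY") << "\n";
    return 0;
}

In the real implementation the monitor persists these decisions through Paxos and distributes the resulting map epoch to the NVMeofGwMonitorClient on each GW, which applies the new ANA states locally.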

It also adds 2 new mon commands:

nvme-gw create
nvme-gw delete

The commands are used by cephadm to tell the monitor that a new GW is deployed. The monitor updates the map accordingly and starts tracking this GW until it is deleted.
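For illustration, this is roughly how the two commands would be invoked when a gateway is deployed and later removed. The gateway id here is taken from the `nvme-gw show` output in comment 9, and the argument order (id, pool, group) is an assumption; consult the built-in mon command help on the installed build for the authoritative syntax.

# register a newly deployed GW with the monitor (assumed argument order: id, pool, group)
ceph nvme-gw create client.nvmeof.nvmeof_pool.argo023.zhdyfm nvmeof_pool ''

# stop tracking the GW when it is undeployed
ceph nvme-gw delete client.nvmeof.nvmeof_pool.argo023.zhdyfm nvmeof_pool ''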

Comment 5 tserlin 2024-04-04 20:24:54 UTC
Please remember to add the line "Resolves: rhbz#XXXXXXX" to each commit in the GitLab Merge Request. Our downstream Jenkins instance depends on this.

Thanks,

Thomas

Comment 9 harika chebrolu 2024-04-24 06:58:42 UTC
With the latest build, we are able to run the commands.
ceph version 18.2.1-149.el9cp (6944266a2186e8940baeefc45140e9c798b90141) reef (stable)
nvme image: cp.stg.icr.io/cp/ibm-ceph/nvmeof-rhel9:1.2.4-1


[ceph: root@tala014 /]# ceph nvme-gw show nvmeof_pool ''
{
    "pool": "nvmeof_pool",
    "group": "",
    "num gws": 2,
    "Anagrp list": "[ 1 2 ]"
}
{
    "gw-id": "client.nvmeof.nvmeof_pool.argo023.zhdyfm",
    "anagrp-id": 1,
    "last-gw_map-epoch-valid": 1,
    "Availability": "AVAILABLE",
    "ana states": " 1: ACTIVE , 2: STANDBY ,"
}
{
    "gw-id": "client.nvmeof.nvmeof_pool.argo024.frleqy",
    "anagrp-id": 2,
    "last-gw_map-epoch-valid": 1,
    "Availability": "AVAILABLE",
    "ana states": " 1: STANDBY , 2: ACTIVE ,"
}

Comment 12 errata-xmlrpc 2024-06-13 14:31:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Critical: Red Hat Ceph Storage 7.1 security, enhancements, and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:3925

Comment 13 Red Hat Bugzilla 2024-11-16 04:25:36 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

