Description of problem:
One of the 4 NVMe-oF gateways is down after a few attempts to start. The gateway is not able to automatically restart and come back up.

[root@ceph-mytest-8nuaob-node4 ~]# systemctl | grep nvme
● ceph-c445a1c0-1286-11ef-b00a-fa163ebefec3.ceph-mytest-8nuaob-node4.kpbdfi.service   loaded failed failed   Ceph nvmeof.nvmeof.ceph-mytest-8nuaob-node4.kpbdfi for c445a1c0-1286-11ef-b00a-fa163ebefec3

As a result, listener addition and all other commands fail against this gateway:

[root@ceph-mytest-8nuaob-node4 ~]# podman run icr.io/ibm-ceph-beta/nvmeof-cli-rhel9:1.2.7-1 --server-address 10.0.210.138 --server-port 5500 subsystem list
Failure listing subsystems:
<_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused"
        debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2024-05-15T07:14:50.167647904+00:00"}"
>

[root@ceph-mytest-8nuaob-node4 ~]# podman run icr.io/ibm-ceph-beta/nvmeof-cli-rhel9:1.2.7-1 --server-address 10.0.210.138 --server-port 5500 listener list -n nqn.2016-06.io.spdk:cnode1
Failure listing listeners:
<_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused"
        debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused {created_time:"2024-05-15T07:15:42.767433823+00:00", grpc_status:14}"
>

[root@ceph-mytest-8nuaob-node4 ~]# podman run icr.io/ibm-ceph-beta/nvmeof-cli-rhel9:1.2.7-1 --server-address 10.0.210.138 --server-port 5500 listener add -n nqn.2016-06.io.spdk:cnode3 -t ceph-mytest-8nuaob-node4 -a 10.0.210.138 -s 4420
Failure adding nqn.2016-06.io.spdk:cnode3 listener at 10.0.210.138:4420:
<_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused"
        debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused {created_time:"2024-05-15T07:09:26.026273276+00:00", grpc_status:14}"
>

[root@ceph-mytest-8nuaob-node4 ~]# podman run icr.io/ibm-ceph-beta/nvmeof-cli-rhel9:1.2.7-1 --server-address 10.0.210.138 --server-port 5500 listener add -n nqn.2016-06.io.spdk:cnode3 -t ceph-mytest-8nuaob-node4 -a 10.0.210.138 -s 4420
Failure adding nqn.2016-06.io.spdk:cnode3 listener at 10.0.210.138:4420:
'NoneType' object has no attribute 'sendall'

Version-Release number of selected component (if applicable):
Ceph image - icr.io/ibm-ceph-beta/ceph-7-rhel9:7-49
NVMe-oF container image - icr.io/ibm-ceph-beta/nvmeof-rhel9:1.2.7-1
NVMe-oF CLI image - icr.io/ibm-ceph-beta/nvmeof-cli-rhel9:1.2.7-1

How reproducible:
1/2

Steps to Reproduce:
1. Deploy a Ceph cluster with the IBM beta build.
2. Deploy 4 NVMe-oF gateways on this cluster.
3. Create subsystems and listeners on the gateways (a minimal sketch of the commands used is given after the Additional info section below).
4. The error first appeared while creating a listener on the gateway that later failed.

Actual results:
The gateway is down after a few attempts to start and does not come back up.
Expected results:
The gateway should not go down.

Additional info:
Gateway logs are attached for reference.
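For reference, a minimal sketch of the commands used for steps 2-3 and for collecting the attached gateway logs. The pool name and placement label are assumptions, and the CLI flag names follow the pattern in the calls above; verify them against the CLI help for this version.

# Step 2 (sketch): deploy the NVMe-oF gateway service via cephadm
ceph osd pool create nvmeof_pool
rbd pool init nvmeof_pool
ceph orch apply nvmeof nvmeof_pool --placement="label:nvmeof"

# Step 3 (sketch): create a subsystem and a listener through one gateway's gRPC endpoint
podman run icr.io/ibm-ceph-beta/nvmeof-cli-rhel9:1.2.7-1 --server-address 10.0.210.138 --server-port 5500 subsystem add -n nqn.2016-06.io.spdk:cnode3
podman run icr.io/ibm-ceph-beta/nvmeof-cli-rhel9:1.2.7-1 --server-address 10.0.210.138 --server-port 5500 listener add -n nqn.2016-06.io.spdk:cnode3 -t ceph-mytest-8nuaob-node4 -a 10.0.210.138 -s 4420

# Log collection on the failed gateway node (unit/daemon names taken from the systemctl output above)
systemctl status 'ceph-c445a1c0-1286-11ef-b00a-fa163ebefec3*' --no-pager
journalctl -u 'ceph-c445a1c0-1286-11ef-b00a-fa163ebefec3*' --since "1 hour ago" > nvmeof-gateway.log
cephadm logs --name nvmeof.nvmeof.ceph-mytest-8nuaob-node4.kpbdfi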
We found an issue with Prometheus accessing an uninitialized array element; we believe the fix in https://github.com/ceph/ceph-nvmeof/pull/653 will resolve this problem. Please retest with the new IBM Ceph 7.1 build (IBM-CEPH-7.1-202405200257.ci.0).
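A rough sketch of how a cephadm-managed cluster could be moved onto the new build for the retest. The image references are placeholders rather than the actual 7.1 build tags, and the container_image_nvmeof option is assumed to be available in this cephadm version.

# Upgrade the Ceph containers to the new 7.1 build (image reference is a placeholder)
ceph orch upgrade start --image <new-7.1-ceph-image>
ceph orch upgrade status

# Point cephadm at the new gateway image and redeploy the failed daemon
ceph config set mgr mgr/cephadm/container_image_nvmeof <new-nvmeof-image>
ceph orch daemon redeploy nvmeof.nvmeof.ceph-mytest-8nuaob-node4.kpbdfi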
Hi Thomas, please attach this BZ to the 7.1 Errata.
This particular error is not seen in the latest builds with NVMe-oF versions 1.2.12-1 and 1.2.13-1. However, a different SPDK crash issue has been seen starting from version 1.2.10-1, which is being tracked in a separate BZ - https://bugzilla.redhat.com/show_bug.cgi?id=2282839. Hence closing this BZ and marking it as verified; the SPDK issue will be tracked as part of the other bug mentioned above.
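For completeness, verification amounted to repeating the earlier checks against the newer builds; a sketch, assuming the CLI image follows the same naming with the 1.2.13-1 tag:

# Gateway unit should now be active/running rather than failed
systemctl | grep nvme

# The same CLI calls that previously failed with "Connection refused"
podman run icr.io/ibm-ceph-beta/nvmeof-cli-rhel9:1.2.13-1 --server-address 10.0.210.138 --server-port 5500 subsystem list
podman run icr.io/ibm-ceph-beta/nvmeof-cli-rhel9:1.2.13-1 --server-address 10.0.210.138 --server-port 5500 listener list -n nqn.2016-06.io.spdk:cnode1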
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Critical: Red Hat Ceph Storage 7.1 security, enhancements, and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:3925