Bug 2280742

Summary: [NVMe-7.1-Tracker] [NVMe HA] [4 GW] One of the NVMe gateways is down after a few tries to initialize.
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Manasa <mgowri>
Component: NVMeOF
Assignee: Aviv Caro <acaro>
Status: CLOSED ERRATA
QA Contact: Manasa <mgowri>
Severity: high
Docs Contact: ceph-doc-bot <ceph-doc-bugzilla>
Priority: unspecified
Version: 7.1
CC: cephqe-warriors, jcaratza, rlepaksh, tserlin, vereddy
Target Milestone: ---
Keywords: DeliveryBlocker, TestBlocker
Target Release: 7.1
Hardware: Unspecified
OS: Unspecified
Fixed In Version: ceph-nvmeof-container-1.2.9-1
Last Closed: 2024-06-13 14:32:49 UTC
Type: Bug

Description Manasa 2024-05-16 04:54:17 UTC
Description of problem:
One of the four NVMe gateways goes down after a few attempts to initialize, and it is not able to automatically restart and come back up.

[root@ceph-mytest-8nuaob-node4 ~]# systemctl |grep nvme
● ceph-c445a1c0-1286-11ef-b00a-fa163ebefec3.ceph-mytest-8nuaob-node4.kpbdfi.service                  loaded failed failed    Ceph nvmeof.nvmeof.ceph-mytest-8nuaob-node4.kpbdfi for c445a1c0-1286-11ef-b00a-fa163ebefec3
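
For reference, the failed unit can be inspected and a restart attempted with standard systemd/cephadm commands. The unit name below is copied from the systemctl output above and the daemon name from its description; both may need adjusting, and ceph orch must be run from a node with the admin keyring (e.g. inside cephadm shell):

# Confirm the exact name of the failed cephadm-managed unit
systemctl list-units 'ceph-*' --state=failed

# Review why the gateway unit failed
journalctl -u ceph-c445a1c0-1286-11ef-b00a-fa163ebefec3.ceph-mytest-8nuaob-node4.kpbdfi.service --no-pager -n 200

# Clear the failed state and ask the orchestrator to restart the daemon
systemctl reset-failed ceph-c445a1c0-1286-11ef-b00a-fa163ebefec3.ceph-mytest-8nuaob-node4.kpbdfi.service
ceph orch daemon restart nvmeof.nvmeof.ceph-mytest-8nuaob-node4.kpbdfi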

As a result, listener addition and all other CLI commands against this gateway fail:

[root@ceph-mytest-8nuaob-node4 ~]# podman run icr.io/ibm-ceph-beta/nvmeof-cli-rhel9:1.2.7-1 --server-address 10.0.210.138 --server-port 5500 subsystem list
Failure listing subsystems:
<_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused"
	debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2024-05-15T07:14:50.167647904+00:00"}"
>
[root@ceph-mytest-8nuaob-node4 ~]# podman run icr.io/ibm-ceph-beta/nvmeof-cli-rhel9:1.2.7-1 --server-address 10.0.210.138 --server-port 5500 listener list -n nqn.2016-06.io.spdk:cnode1
Failure listing listeners:
<_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused"
	debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused {created_time:"2024-05-15T07:15:42.767433823+00:00", grpc_status:14}"
>

[root@ceph-mytest-8nuaob-node4 ~]# podman run icr.io/ibm-ceph-beta/nvmeof-cli-rhel9:1.2.7-1 --server-address 10.0.210.138 --server-port 5500 listener add -n nqn.2016-06.io.spdk:cnode3 -t ceph-mytest-8nuaob-node4 -a 10.0.210.138 -s 4420
Failure adding nqn.2016-06.io.spdk:cnode3 listener at 10.0.210.138:4420:
<_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused"
	debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused {created_time:"2024-05-15T07:09:26.026273276+00:00", grpc_status:14}"
>

[root@ceph-mytest-8nuaob-node4 ~]# podman run icr.io/ibm-ceph-beta/nvmeof-cli-rhel9:1.2.7-1 --server-address 10.0.210.138 --server-port 5500 listener add -n nqn.2016-06.io.spdk:cnode3 -t ceph-mytest-8nuaob-node4 -a 10.0.210.138 -s 4420
Failure adding nqn.2016-06.io.spdk:cnode3 listener at 10.0.210.138:4420:
'NoneType' object has no attribute 'sendall'
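
The failures above are plain connection refusals on the gRPC control port, so basic reachability checks (standard tools; the address and port are taken from the commands above) confirm whether the gateway's gRPC server is listening at all:

# On the gateway node: is anything listening on the control port?
ss -tlnp | grep 5500

# From any node: is the port reachable?
nc -zv 10.0.210.138 5500

# Is the nvmeof gateway container running at all?
podman ps --filter name=nvmeof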


Version-Release number of selected component (if applicable):
Ceph image - icr.io/ibm-ceph-beta/ceph-7-rhel9:7-49 
Nvme container image - icr.io/ibm-ceph-beta/nvmeof-rhel9:1.2.7-1 
Nvme cli image - icr.io/ibm-ceph-beta/nvmeof-cli-rhel9:1.2.7-1 

How reproducible:
1/2

Steps to Reproduce:
1. Deploy a Ceph cluster with the IBM beta build.
2. Deploy 4 NVMe-oF gateways on this cluster.
3. Create subsystems and listeners on the gateways (example commands below).
4. The error first appeared while creating a listener on the gateway that subsequently failed.
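
For context, steps 2 and 3 correspond roughly to the commands below. The pool name and placement count are illustrative, and the subsystem/listener flags mirror the listener commands shown above; verify against the CLI help:

# Step 2 (illustrative): deploy 4 gateways backed by an RBD pool, here called 'nvmeof_pool'
ceph orch apply nvmeof nvmeof_pool --placement=4

# Step 3 (illustrative): create a subsystem and a listener on one gateway
podman run icr.io/ibm-ceph-beta/nvmeof-cli-rhel9:1.2.7-1 --server-address 10.0.210.138 --server-port 5500 subsystem add -n nqn.2016-06.io.spdk:cnode1
podman run icr.io/ibm-ceph-beta/nvmeof-cli-rhel9:1.2.7-1 --server-address 10.0.210.138 --server-port 5500 listener add -n nqn.2016-06.io.spdk:cnode1 -t ceph-mytest-8nuaob-node4 -a 10.0.210.138 -s 4420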

Actual results:
The gateway goes down after a few attempts to initialize and does not come back up.

Expected results:
The gateway should not go down.

Additional info:
Gateway logs are attached for reference.
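
In addition to the attached logs, gateway logs can be pulled from the affected node as follows (daemon name taken from the unit description above; adjust as needed):

# via cephadm (wraps journald for the daemon)
cephadm logs --name nvmeof.nvmeof.ceph-mytest-8nuaob-node4.kpbdfi

# or directly from the container runtime
podman ps -a --filter name=nvmeof
podman logs <container-id-from-previous-command>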

Comment 7 Aviv Caro 2024-05-20 12:01:49 UTC
We found an issue where Prometheus-related code accesses an uninitialized entry in an array. We believe fixing it will resolve this problem (https://github.com/ceph/ceph-nvmeof/pull/653).

Please retest with the new IBM Ceph 7.1 build (IBM-CEPH-7.1-202405200257.ci.0).

Comment 9 Veera Raghava Reddy 2024-05-24 11:16:47 UTC
Hi Thomas,
Please attach this BZ to 7.1 Errata.

Comment 12 Manasa 2024-05-30 02:29:13 UTC
This particular error is not seen in the latest builds with NVMe-oF versions 1.2.12-1 and 1.2.13-1. However, a different SPDK crash issue has been observed starting from version 1.2.10-1; it is being tracked in a separate BZ - https://bugzilla.redhat.com/show_bug.cgi?id=2282839

Hence closing this BZ and marking it as verified. The SPDK issue will be tracked as part of the other bug mentioned above.

Comment 13 errata-xmlrpc 2024-06-13 14:32:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Critical: Red Hat Ceph Storage 7.1 security, enhancements, and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:3925