Description of problem:
One of the 4 NVMe-oF gateways is down after a few attempts to start. The gateway is not able to automatically restart and come back up.

[root@ceph-mytest-8nuaob-node4 ~]# systemctl | grep nvme
● ceph-c445a1c0-1286-11ef-b00a-fa163ebefec3.ceph-mytest-8nuaob-node4.kpbdfi.service   loaded failed failed   Ceph nvmeof.nvmeof.ceph-mytest-8nuaob-node4.kpbdfi for c445a1c0-1286-11ef-b00a-fa163ebefec3

As a result, listener addition and all other commands fail against this gateway:

[root@ceph-mytest-8nuaob-node4 ~]# podman run icr.io/ibm-ceph-beta/nvmeof-cli-rhel9:1.2.7-1 --server-address 10.0.210.138 --server-port 5500 subsystem list
Failure listing subsystems:
<_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused"
        debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2024-05-15T07:14:50.167647904+00:00"}"
>

[root@ceph-mytest-8nuaob-node4 ~]# podman run icr.io/ibm-ceph-beta/nvmeof-cli-rhel9:1.2.7-1 --server-address 10.0.210.138 --server-port 5500 listener list -n nqn.2016-06.io.spdk:cnode1
Failure listing listeners:
<_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused"
        debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused {created_time:"2024-05-15T07:15:42.767433823+00:00", grpc_status:14}"
>

[root@ceph-mytest-8nuaob-node4 ~]# podman run icr.io/ibm-ceph-beta/nvmeof-cli-rhel9:1.2.7-1 --server-address 10.0.210.138 --server-port 5500 listener add -n nqn.2016-06.io.spdk:cnode3 -t ceph-mytest-8nuaob-node4 -a 10.0.210.138 -s 4420
Failure adding nqn.2016-06.io.spdk:cnode3 listener at 10.0.210.138:4420:
<_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused"
        debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused {created_time:"2024-05-15T07:09:26.026273276+00:00", grpc_status:14}"
>

[root@ceph-mytest-8nuaob-node4 ~]# podman run icr.io/ibm-ceph-beta/nvmeof-cli-rhel9:1.2.7-1 --server-address 10.0.210.138 --server-port 5500 listener add -n nqn.2016-06.io.spdk:cnode3 -t ceph-mytest-8nuaob-node4 -a 10.0.210.138 -s 4420
Failure adding nqn.2016-06.io.spdk:cnode3 listener at 10.0.210.138:4420:
'NoneType' object has no attribute 'sendall'

Version-Release number of selected component (if applicable):
Ceph image - icr.io/ibm-ceph-beta/ceph-7-rhel9:7-49
NVMe-oF container image - icr.io/ibm-ceph-beta/nvmeof-rhel9:1.2.7-1
NVMe-oF CLI image - icr.io/ibm-ceph-beta/nvmeof-cli-rhel9:1.2.7-1

How reproducible:
1/2

Steps to Reproduce:
1. Deploy a Ceph cluster with the IBM beta build.
2. Deploy 4 NVMe-oF gateways on this cluster.
3. Create subsystems and listeners on the gateways (a minimal sketch of the commands used is given after the Additional info section below).
4. The error first appeared while creating a listener on the gateway that later failed.

Actual results:
The gateway is down after a few attempts to start and does not come back up.
Expected results:
The gateway should not go down.

Additional info:
Gateway logs are attached for reference.
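For reference, a minimal sketch of the commands used for steps 2-3 and for collecting the attached gateway logs. The pool name and placement label are assumptions, and the CLI flag names follow the pattern in the calls above; verify them against the CLI help for this version.

# Step 2 (sketch): deploy the NVMe-oF gateway service via cephadm
ceph osd pool create nvmeof_pool
rbd pool init nvmeof_pool
ceph orch apply nvmeof nvmeof_pool --placement="label:nvmeof"

# Step 3 (sketch): create a subsystem and a listener through one gateway's gRPC endpoint
podman run icr.io/ibm-ceph-beta/nvmeof-cli-rhel9:1.2.7-1 --server-address 10.0.210.138 --server-port 5500 subsystem add -n nqn.2016-06.io.spdk:cnode3
podman run icr.io/ibm-ceph-beta/nvmeof-cli-rhel9:1.2.7-1 --server-address 10.0.210.138 --server-port 5500 listener add -n nqn.2016-06.io.spdk:cnode3 -t ceph-mytest-8nuaob-node4 -a 10.0.210.138 -s 4420

# Log collection on the failed gateway node (unit/daemon names taken from the systemctl output above)
systemctl status 'ceph-c445a1c0-1286-11ef-b00a-fa163ebefec3*' --no-pager
journalctl -u 'ceph-c445a1c0-1286-11ef-b00a-fa163ebefec3*' --since "1 hour ago" > nvmeof-gateway.log
cephadm logs --name nvmeof.nvmeof.ceph-mytest-8nuaob-node4.kpbdfi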
We found an issue with Prometheus accessing an uninitialized array element; we believe the fix in https://github.com/ceph/ceph-nvmeof/pull/653 will resolve this problem. Please retest with the new IBM Ceph 7.1 build (IBM-CEPH-7.1-202405200257.ci.0).
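A rough sketch of how a cephadm-managed cluster could be moved onto the new build for the retest. The image references are placeholders rather than the actual 7.1 build tags, and the container_image_nvmeof option is assumed to be available in this cephadm version.

# Upgrade the Ceph containers to the new 7.1 build (image reference is a placeholder)
ceph orch upgrade start --image <new-7.1-ceph-image>
ceph orch upgrade status

# Point cephadm at the new gateway image and redeploy the failed daemon
ceph config set mgr mgr/cephadm/container_image_nvmeof <new-nvmeof-image>
ceph orch daemon redeploy nvmeof.nvmeof.ceph-mytest-8nuaob-node4.kpbdfi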
Hi Thomas, please attach this BZ to the 7.1 Errata.
This particular error is not seen in the latest builds with NVMe-oF versions 1.2.12-1 and 1.2.13-1. However, a different SPDK crash issue has been seen starting from version 1.2.10-1, which is being tracked in a separate BZ - https://bugzilla.redhat.com/show_bug.cgi?id=2282839. Hence closing this BZ and marking it as verified; the SPDK issue will be tracked as part of the other bug mentioned above.
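For completeness, verification amounted to repeating the earlier checks against the newer builds; a sketch, assuming the CLI image follows the same naming with the 1.2.13-1 tag:

# Gateway unit should now be active/running rather than failed
systemctl | grep nvme

# The same CLI calls that previously failed with "Connection refused"
podman run icr.io/ibm-ceph-beta/nvmeof-cli-rhel9:1.2.13-1 --server-address 10.0.210.138 --server-port 5500 subsystem list
podman run icr.io/ibm-ceph-beta/nvmeof-cli-rhel9:1.2.13-1 --server-address 10.0.210.138 --server-port 5500 listener list -n nqn.2016-06.io.spdk:cnode1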
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Critical: Red Hat Ceph Storage 7.1 security, enhancements, and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:3925