Description of problem: NVMe gateway service fails after listener and host addition in the latest RC build

Version-Release number of selected component (if applicable): ceph 7.1 RC Build - 18.2.1-188

How reproducible: Tried once

Steps to Reproduce:
1. Deploy 4 NVMe gateways on the Ceph cluster.
2. Create subsystems on the gateway.
3. Add listeners and host access to the gateway.

The following sequence of commands was run to add listeners and host access. The gateway kept going down and coming back up, but during host addition it went into a failed state and never came back up.

[root@ceph-mytest-qqmwz8-node4 ~]# podman run cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.9-2 --server-address 10.0.210.138 --server-port 5500 subsystem add -n nqn.2016-06.io.spdk:cnode1 -s 1 -m 400
Adding subsystem nqn.2016-06.io.spdk:cnode1: Successful

[root@ceph-mytest-qqmwz8-node4 ~]# podman run cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.9-2 --server-address 10.0.210.138 --server-port 5500 subsystem add -n nqn.2016-06.io.spdk:cnode2 -s 2 -m 400
Adding subsystem nqn.2016-06.io.spdk:cnode2: Successful

[root@ceph-mytest-qqmwz8-node4 ~]# podman run cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.9-2 --server-address 10.0.210.138 --server-port 5500 subsystem add -n nqn.2016-06.io.spdk:cnode3 -s 3 -m 400
Adding subsystem nqn.2016-06.io.spdk:cnode3: Successful

[root@ceph-mytest-qqmwz8-node4 ~]# podman run cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.9-2 --server-address 10.0.210.138 --server-port 5500 subsystem add -n nqn.2016-06.io.spdk:cnode4 -s 4 -m 400
Adding subsystem nqn.2016-06.io.spdk:cnode4: Successful

[root@ceph-mytest-qqmwz8-node4 ~]# podman run cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.9-2 --server-address 10.0.210.138 --server-port 5500 listener add -n nqn.2016-06.io.spdk:cnode1 -t ceph-mytest-qqmwz8-node4 -a 10.0.210.138 -s 4420
Adding nqn.2016-06.io.spdk:cnode1 listener at 10.0.210.138:4420: Successful

[root@ceph-mytest-qqmwz8-node4 ~]# podman run cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.9-2 --server-address 10.0.210.138 --server-port 5500 listener add -n nqn.2016-06.io.spdk:cnode2 -t ceph-mytest-qqmwz8-node4 -a 10.0.210.138 -s 4420
Failure adding nqn.2016-06.io.spdk:cnode2 listener at 10.0.210.138:4420:
<_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused"
    debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused {created_time:"2024-05-23T03:44:29.136318146+00:00", grpc_status:14}"
>

[root@ceph-mytest-qqmwz8-node4 ~]# podman run cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.9-2 --server-address 10.0.210.138 --server-port 5500 listener add -n nqn.2016-06.io.spdk:cnode3 -t ceph-mytest-qqmwz8-node4 -a 10.0.210.138 -s 4420
Failure adding nqn.2016-06.io.spdk:cnode3 listener at 10.0.210.138:4420:
<_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused"
    debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2024-05-23T03:44:34.843914196+00:00"}"
>

[root@ceph-mytest-qqmwz8-node4 ~]# podman run cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.9-2 --server-address 10.0.210.138 --server-port 5500 listener add -n nqn.2016-06.io.spdk:cnode3 -t ceph-mytest-qqmwz8-node4 -a 10.0.210.138 -s 4420
Failure adding nqn.2016-06.io.spdk:cnode3 listener at 10.0.210.138:4420:
<_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused"
    debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused {created_time:"2024-05-23T03:45:02.714212308+00:00", grpc_status:14}"
>

[root@ceph-mytest-qqmwz8-node4 ~]# systemctl |grep nvmeof
ceph-c4522f3e-18ae-11ef-b7f3-fa163e5f9d20.ceph-mytest-qqmwz8-node4.ysrssb.service loaded activating start start Ceph nvmeof.nvmeof.ceph-mytest-qqmwz8-node4.ysrssb for c4522f3e-18ae-11ef-b7f3-fa163e5f9d20

[root@ceph-mytest-qqmwz8-node4 ~]# systemctl |grep nvmeof
ceph-c4522f3e-18ae-11ef-b7f3-fa163e5f9d20.ceph-mytest-qqmwz8-node4.ysrssb.service loaded active running Ceph nvmeof.nvmeof.ceph-mytest-qqmwz8-node4.ysrssb for c4522f3e-18ae-11ef-b7f3-fa163e5f9d20

[root@ceph-mytest-qqmwz8-node4 ~]# podman run cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.9-2 --server-address 10.0.210.138 --server-port 5500 listener add -n nqn.2016-06.io.spdk:cnode2 -t ceph-mytest-qqmwz8-node4 -a 10.0.210.138 -s 4420
Adding nqn.2016-06.io.spdk:cnode2 listener at 10.0.210.138:4420: Successful

[root@ceph-mytest-qqmwz8-node4 ~]# podman run cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.9-2 --server-address 10.0.210.138 --server-port 5500 listener add -n nqn.2016-06.io.spdk:cnode3 -t ceph-mytest-qqmwz8-node4 -a 10.0.210.138 -s 4420
Failure adding nqn.2016-06.io.spdk:cnode3 listener at 10.0.210.138:4420:
<_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused"
    debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused {created_time:"2024-05-23T03:45:45.035296346+00:00", grpc_status:14}"
>

[root@ceph-mytest-qqmwz8-node4 ~]# systemctl |grep nvmeof
ceph-c4522f3e-18ae-11ef-b7f3-fa163e5f9d20.ceph-mytest-qqmwz8-node4.ysrssb.service loaded active running Ceph nvmeof.nvmeof.ceph-mytest-qqmwz8-node4.ysrssb for c4522f3e-18ae-11ef-b7f3-fa163e5f9d20

[root@ceph-mytest-qqmwz8-node4 ~]# podman run cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.9-2 --server-address 10.0.210.138 --server-port 5500 listener add -n nqn.2016-06.io.spdk:cnode3 -t ceph-mytest-qqmwz8-node4 -a 10.0.210.138 -s 4420
Adding nqn.2016-06.io.spdk:cnode3 listener at 10.0.210.138:4420: Successful

[root@ceph-mytest-qqmwz8-node4 ~]# podman run cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.9-2 --server-address 10.0.210.138 --server-port 5500 listener add -n nqn.2016-06.io.spdk:cnode4 -t ceph-mytest-qqmwz8-node4 -a 10.0.210.138 -s 4420
Adding nqn.2016-06.io.spdk:cnode4 listener at 10.0.210.138:4420: Successful

[root@ceph-mytest-qqmwz8-node4 ~]# podman run cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.9-2 --server-address 10.0.210.138 --server-port 5500 host add -n nqn.2016-06.io.spdk:cnode1 -t "*"
Allowing open host access to nqn.2016-06.io.spdk:cnode1: Successful

[root@ceph-mytest-qqmwz8-node4 ~]# podman run cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.9-2 --server-address 10.0.210.138 --server-port 5500 host add -n nqn.2016-06.io.spdk:cnode2 -t "*"
Failure allowing open host access to nqn.2016-06.io.spdk:cnode2:
<_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused"
    debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2024-05-23T03:46:30.059226616+00:00"}"
>

[root@ceph-mytest-qqmwz8-node4 ~]# systemctl |grep nvmeof
ceph-c4522f3e-18ae-11ef-b7f3-fa163e5f9d20.ceph-mytest-qqmwz8-node4.ysrssb.service loaded activating auto-restart Ceph nvmeof.nvmeof.ceph-mytest-qqmwz8-node4.ysrssb for c4522f3e-18ae-11ef-b7f3-fa163e5f9d20

[root@ceph-mytest-qqmwz8-node4 ~]# systemctl |grep nvmeof
● ceph-c4522f3e-18ae-11ef-b7f3-fa163e5f9d20.ceph-mytest-qqmwz8-node4.ysrssb.service loaded failed failed Ceph nvmeof.nvmeof.ceph-mytest-qqmwz8-node4.ysrssb for c4522f3e-18ae-11ef-b7f3-fa163e5f9d20

[root@ceph-mytest-qqmwz8-node4 ~]# systemctl |grep nvmeof
● ceph-c4522f3e-18ae-11ef-b7f3-fa163e5f9d20.ceph-mytest-qqmwz8-node4.ysrssb.service loaded failed failed Ceph nvmeof.nvmeof.ceph-mytest-qqmwz8-node4.ysrssb for c4522f3e-18ae-11ef-b7f3-fa163e5f9d20

Actual results: Gateway service failed after trying to add listeners and hosts.

Expected results: Gateway service doesn't fail.

Additional info: Attaching complete journalctl logs of the failed gateway.
*** This bug has been marked as a duplicate of bug 2281800 ***
The error seen in this bug is seen again after upgrading to ceph version ceph-18.2.1-190.el9cp. The gateway came up to the running state once the upgrade completed; however, it went back to the failed state when I tried to list the listeners. Attached the NVMe service log for the failed gateway. Reopening this bug as per discussion with Aviv. I have also provided cluster access to the team.
We can now clearly see the SPDK segmentation fault in the RPC call that gets qpairs. We are also able to reproduce it in a unit test in the dev team. We have prepared two options for 7.1:
- Option 1 (ready in GW 1.2.11 and waiting for a downstream build) is to avoid calling this qpairs method. It means that some CLI output and statistics will be missing, but this is not critical functionality. A minimal sketch of this approach is included below.
- Option 2 (a fix is being tested in dev now) is to fix the segmentation fault in SPDK itself. This is better because we don't need to remove any functionality, but it means the build team needs to take the fix and rebuild SPDK. I assume this can be done quickly; need to verify with the build team (assuming our fix works).
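For illustration, a minimal sketch of what the Option 1 behavior amounts to (this is not the actual GW 1.2.11 patch; the enable_list_connections flag name is hypothetical, and only the spdk.rpc calls already visible in the tracebacks below are assumed):

# Option 1 sketch: skip the qpairs RPC, or tolerate its failure, so a broken
# call cannot take the whole gateway down.
from spdk.rpc import nvmf as rpc_nvmf
from spdk.rpc.client import JSONRPCException

def get_qpairs_safe(spdk_rpc_client, subsystem_nqn, enable_list_connections=False):
    if not enable_list_connections:
        # Option 1: do not call nvmf_subsystem_get_qpairs at all; connection
        # listing and the related statistics are simply reported as unavailable.
        return []
    try:
        return rpc_nvmf.nvmf_subsystem_get_qpairs(spdk_rpc_client, nqn=subsystem_nqn)
    except JSONRPCException:
        # SPDK closed the socket mid-call (e.g. after the segfault);
        # degrade gracefully instead of crashing the gateway.
        return []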
Able to reproduce this issue while adding listeners to a gateway in another cluster as well. Attaching the NVMe service logs for that cluster here. Attachment name - nvme_svc_newgw_log.txt
Also, while performing a leader mon kill (removing the leader mon from the Ceph cluster), we observed that one of the gateways which was working before the mon kill started toggling between up and down states for a while. The error seen in its service log is similar to the error reported here, so this could be the same issue described above. Attaching the logs for this scenario here as well, with the name nvme_log_leadermonkill_scenario.txt
Log file for the above comment 10 - https://drive.google.com/file/d/19tizZxn9fNBamE8aqd_GLVuux2lYwoBE/view?usp=sharing
I am also facing a similar issue in my scale cluster. Steps I followed:
1. Created 8 subsystems, configured listeners and hosts.
2. Added 400 namespaces to each subsystem and ran IO to them.
Error: Initially, adding namespaces to the second subsystem didn't work. However, after trying again later, it worked as expected. Then it failed again on the fifth subsystem, and the error logs look like it is related to the qpairs issue:
` May 27 14:41:05 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: [27-May-2024 14:41:05] INFO grpc.py:760: Received request to add bdev_5c77a3fd-1b89-4002-a38b-74ae910534c6 to nqn.2016-06.io.spdk:cnode4 with ANA group id 2 using NSID 50 and UUID 5c77a3fd-1b89-4002-a38b-74ae910534c6 May 27 14:41:27 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: [27-May-2024 14:41:27] INFO grpc.py:1232: Received request to list all namespaces for nqn.2016-06.io.spdk:cnode4, context: <grpc._server._Context object at 0x7f6fb77bfe80>, client address: IPv4 10.88.0.215:56624 May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: [27-May-2024 14:41:48] ERROR grpc.py:1967: Failure listing connections, can't get qpairs May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: Traceback (most recent call last): May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: File "/remote-source/ceph-nvmeof/app/control/grpc.py", line 1963, in list_connections_safe May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: qpair_ret = rpc_nvmf.nvmf_subsystem_get_qpairs(self.spdk_rpc_client, nqn=request.subsystem) May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: File "/usr/local/lib/python3.9/site-packages/spdk/rpc/nvmf.py", line 637, in nvmf_subsystem_get_qpairs May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: return client.call('nvmf_subsystem_get_qpairs', params) May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 196, in call May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: raise e May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 189, in call May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: response = self.recv() May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 170, in recv May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: raise JSONRPCException("Connection closed with partial response:\n%s\n" % self._recv_buf) May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: spdk.rpc.client.JSONRPCException: Connection closed with partial response: May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: May 27 
14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: [27-May-2024 14:41:48] ERROR server.py:42: GatewayServer: SIGCHLD received signum=17 May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: [27-May-2024 14:41:48] ERROR prometheus.py:201: Exporter failed to fetch connection info for nqn.2016-06.io.spdk:cnode7: Failure listing connections, can't get qpairs: Connection closed with partial response: May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: [27-May-2024 14:41:48] ERROR grpc.py:2457: get_subsystems failed May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: Traceback (most recent call last): May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: File "/remote-source/ceph-nvmeof/app/control/grpc.py", line 2455, in get_subsystems_safe May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: ret = rpc_nvmf.nvmf_get_subsystems(self.spdk_rpc_client) May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: File "/usr/local/lib/python3.9/site-packages/spdk/rpc/nvmf.py", line 177, in nvmf_get_subsystems May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: return client.call('nvmf_get_subsystems', params) May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 187, in call May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: req_id = self.send(method, params) May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 145, in send May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: self.flush() May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 141, in flush May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: self.sock.sendall(reqstr.encode("utf-8")) May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: AttributeError: 'NoneType' object has no attribute 'sendall' May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: [27-May-2024 14:41:48] ERROR grpc.py:1967: Failure listing connections, can't get qpairs May 27 14:41:48 tala018 
ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: Traceback (most recent call last): May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: File "/remote-source/ceph-nvmeof/app/control/grpc.py", line 1963, in list_connections_safe May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: qpair_ret = rpc_nvmf.nvmf_subsystem_get_qpairs(self.spdk_rpc_client, nqn=request.subsystem) May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: File "/usr/local/lib/python3.9/site-packages/spdk/rpc/nvmf.py", line 637, in nvmf_subsystem_get_qpairs May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: return client.call('nvmf_subsystem_get_qpairs', params) May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 187, in call May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: req_id = self.send(method, params) May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 145, in send May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: self.flush() May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 141, in flush May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: self.sock.sendall(reqstr.encode("utf-8")) May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: AttributeError: 'NoneType' object has no attribute 'sendall' May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: [27-May-2024 14:41:48] ERROR server.py:46: PID of terminated child process is 82 May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: [27-May-2024 14:41:48] ERROR prometheus.py:201: Exporter failed to fetch connection info for nqn.2016-06.io.spdk:cnode8: Failure listing connections, can't get qpairs: May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: 'NoneType' object has no attribute 'sendall' May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: [27-May-2024 14:41:48] ERROR server.py:110: GatewayServer exception occurred: May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: Traceback (most recent call last): May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: File "/remote-source/ceph-nvmeof/app/control/server.py", line 521, in _ping May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: ret = spdk.rpc.spdk_get_version(self.spdk_rpc_ping_client) May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: File "/usr/local/lib/python3.9/site-packages/spdk/rpc/__init__.py", line 
78, in spdk_get_version May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: return client.call('spdk_get_version') May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 189, in call May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: response = self.recv() May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 166, in recv May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: newdata = self.sock.recv(4096) May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: ConnectionResetError: [Errno 104] Connection reset by peer `
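For anyone who wants to poke at this outside the gateway process, here is a hedged repro sketch that issues the same nvmf_subsystem_get_qpairs RPC directly against the SPDK application socket inside the gateway container. The socket path /var/tmp/spdk.sock is an assumption (use whatever rpc_socket the gateway configuration points at); the subsystem NQN is taken from the log above.

from spdk.rpc.client import JSONRPCClient
from spdk.rpc import nvmf as rpc_nvmf

client = JSONRPCClient("/var/tmp/spdk.sock")
# If SPDK segfaults while serving this call, the client raises
# JSONRPCException("Connection closed with partial response"), matching the
# gateway log, and the SPDK child exits with SIGSEGV (exit_code=-11).
print(rpc_nvmf.nvmf_subsystem_get_qpairs(client, nqn="nqn.2016-06.io.spdk:cnode4"))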
+1. I am also observing the same issue, along with similar errors that likely root-cause to the same underlying problem:
bdev_rbd.c:1039:_bdev_rbd_wait_for_latest_osdmap: *ERROR*: rados_wait_for_latest_osdmap cluster: cluster_context_1_0 elapsed time bdev_rbd.c:1039:_bdev_rbd_wait_for_latest_osdmap: *ERROR*: rados_wait_for_latest_osdmap cluster: cluster_context_1_1 elapsed time bdev_rbd.c:1039:_bdev_rbd_wait_for_latest_osdmap: *ERROR*: rados_wait_for_latest_osdmap cluster: cluster_context_1_2 elapsed time bdev_rbd.c:1039:_bdev_rbd_wait_for_latest_osdmap: *ERROR*: rados_wait_for_latest_osdmap cluster: cluster_context_1_3 elapsed time bdev_rbd.c:1039:_bdev_rbd_wait_for_latest_osdmap: *ERROR*: rados_wait_for_latest_osdmap cluster: cluster_context_1_4 elapsed time bdev_rbd.c:1039:_bdev_rbd_wait_for_latest_osdmap: *ERROR*: rados_wait_for_latest_osdmap cluster: cluster_context_1_5 elapsed time bdev_rbd.c:1039:_bdev_rbd_wait_for_latest_osdmap: *ERROR*: rados_wait_for_latest_osdmap cluster: cluster_context_1_6 elapsed time May 27 12:18:42 argo023 [201546]: [2024-05-27 12:18:42.300198] tcp.c:1574:nvmf_tcp_qpair_set_recv_state: *ERROR*: The recv state of tqpair=0xbccc820 is same with the state(5) to be set May 27 12:18:42 argo023 [201546]: [2024-05-27 12:18:42.300240] tcp.c:1574:nvmf_tcp_qpair_set_recv_state: *ERROR*: The recv state of tqpair=0xbccc820 is same with the state(5) to be set May 27 12:18:51 argo023 [201546]: [2024-05-27 12:18:51.238527] ctrlr.c: 178:nvmf_ctrlr_keep_alive_poll: *NOTICE*: Disconnecting host nqn.2014-08.org.nvmexpress:uuid:4c4c4544-0056-3610-8047-b8c04f525433 from subsystem nqn.2016-06.io.spdk:cnode2 due to keep alive timeout. May 27 12:19:19 argo023 [201546]: Traceback (most recent call last): May 27 12:19:19 argo023 [201546]: File "/remote-source/ceph-nvmeof/app/control/grpc.py", line 2056, in list_connections_safe May 27 12:19:19 argo023 [201546]: one_conn = pb2.connection(nqn=hostnqn, connected=True, May 27 12:19:19 argo023 [201546]: ValueError: unknown enum label "" May 27 12:27:49 argo023 [201546]: During handling of the above exception, another exception occurred: May 27 12:27:49 argo023 [201546]: Traceback (most recent call last): May 27 12:27:49 argo023 [201546]: File "/usr/lib64/python3.9/wsgiref/handlers.py", line 137, in run May 27 12:27:49 argo023 [201546]: self.result = application(self.environ, self.start_response) May 27 12:27:49 argo023 [201546]: File "/usr/local/lib/python3.9/site-packages/prometheus_client/exposition.py", line 128, in prometheus_app May 27 12:27:49 argo023 [201546]: status, headers, output = _bake_output(registry, accept_header, accept_encoding_header, params, disable_compression) May 27 12:27:49 argo023 [201546]: File "/usr/local/lib/python3.9/site-packages/prometheus_client/exposition.py", line 104, in _bake_output May 27 12:27:49 argo023 [201546]: output = encoder(registry) May 27 12:27:49 argo023 [201546]: File "/usr/local/lib/python3.9/site-packages/prometheus_client/openmetrics/exposition.py", line 21, in generate_latest May 27 12:27:49 argo023 [201546]: for metric in registry.collect(): May 27 12:27:49 argo023 [201546]: File "/usr/local/lib/python3.9/site-packages/prometheus_client/registry.py", line 97, in collect May 27 12:27:49 argo023 [201546]: yield from collector.collect() May 27 12:27:49 argo023 [201546]: File "/remote-source/ceph-nvmeof/app/control/prometheus.py", line 43, in wrapped May 27 12:27:49 argo023 [201546]: for metric in method(self, *args, **kwargs): May 27 12:27:49 argo023 [201546]: File 
"/remote-source/ceph-nvmeof/app/control/prometheus.py", line 225, in collect May 27 12:27:49 argo023 [201546]: self._get_data() May 27 12:27:49 argo023 [201546]: File "/remote-source/ceph-nvmeof/app/control/prometheus.py", line 213, in _get_data May 27 12:27:49 argo023 [201546]: self.connections = self._get_connection_map(self.subsystems) May 27 12:27:49 argo023 [201546]: File "/remote-source/ceph-nvmeof/app/control/prometheus.py", line 57, in call May 27 12:27:49 argo023 [201546]: result = method(self, *args, **kwargs) May 27 12:27:49 argo023 [201546]: File "/remote-source/ceph-nvmeof/app/control/prometheus.py", line 199, in _get_connection_map May 27 12:27:49 argo023 [201546]: resp = self.gateway_rpc.list_connections(pb2.list_connections_req(subsystem=subsys.nqn)) May 27 12:27:49 argo023 [201546]: File "/remote-source/ceph-nvmeof/app/control/grpc.py", line 2075, in list_connections May 27 12:27:49 argo023 [201546]: return self.execute_grpc_function(self.list_connections_safe, request, context) May 27 12:27:49 argo023 [201546]: File "/remote-source/ceph-nvmeof/app/control/grpc.py", line 264, in execute_grpc_function May 27 12:27:49 argo023 [201546]: return self.omap_lock.execute_omap_locking_function(self._grpc_function_with_lock, func, request, context) May 27 12:27:49 argo023 [201546]: File "/remote-source/ceph-nvmeof/app/control/state.py", line 253, in execute_omap_locking_function May 27 12:27:49 argo023 [201546]: return grpc_func(omap_locking_func, request, context) May 27 12:27:49 argo023 [201546]: File "/remote-source/ceph-nvmeof/app/control/grpc.py", line 256, in _grpc_function_with_lock May 27 12:27:49 argo023 [201546]: return func(request, context) May 27 12:27:49 argo023 [201546]: File "/remote-source/ceph-nvmeof/app/control/grpc.py", line 2063, in list_connections_safe May 27 12:27:49 argo023 [201546]: self.logger.exception(f"{con=} parse error") May 27 12:27:49 argo023 [201546]: NameError: name 'con' is not defined May 27 11:48:39 argo023 [30679]: 4c2a-a6fb-a47c6cbf4127', 'nguid': '5D853CF96C714C2AA6FBA47C6CBF4127', 'uuid': '5d853cf9-6c71-4c2a-a6fb-a47c6cbf4127', 'anagrpid': 2}, {'nsid': 79, 'bdev_name': 'bdev_0b132b8e-b7e6-43e5-a343-50e770796cb1', 'name': 'bdev_0b132b8e-b7e6-43e5-a343-50e770796cb1', 'nguid': '0B132B8EB7E643E5A34350E770796CB1', 'uuid': '0b132b8e-b7e6-43e5-a343-50e770796cb1', 'anagrpid': 1}, {'nsid': 80, 'bdev_name': 'bdev_b7ec2076-cb6a-41e9-a5fa-77ed05a3b418', 'name': 'bdev_b7ec2076-cb6a-41e9-a5fa-77ed05a3b418', 'nguid': 'B7EC2076CB6A41E9A5FA77ED05A3B418', 'uuid': 'b7ec2076-cb6a-41e9-a5fa-77ed05a3b418', 'anagrpid': 2}, {'nsid': 81, 'bdev_name': 'bdev_75a4082e-7777-4f5f-b8c3-365536a85e26', 'name': 'bdev_75a4082e-7777-4f5f-b8c3-365536a85e26', 'nguid': '75A4082E77774F5FB8C3365536A85E26', 'uuid': '75a4082e-7777-4f5f-b8c3-365536a85e26', 'anagrpid': 1}, {'nsid': 82, 'bdev_name': 'bdev_cb49859b-e787-4f85-bd7e-25c817bfab19', 'name': 'bdev_cb49859b-e787-4f85-bd7e-25c817bfab19', 'nguid': 'CB49859BE7874F85BD7E25C817BFAB19', 'uuid': 'cb49859b-e787-4f85-bd7e-25c817bfab19', 'anagrpid': 2}, {'nsid': 83, 'bdev_name': 'bdev_19bc99fa-e652-47f8-90d2-dd5e74e70e70', 'name': 'bdev_19bc99fa-e652-47f8-90d2-dd5e74e70e70', 'nguid': '19BC99FAE65247F890D2DD5E74E70E70', 'uuid': '19bc99fa-e652-47f8-90d2-dd5e74e70e70', 'anagrpid': 1}, {'nsid': 84, 'bdev_name': 'bdev_a94840ff-0670-474c-b16f-573062948433', 'name': 'bdev_a94840ff-0670-474c-b16f-573062948433', 'nguid': 'A94840FF0670474CB16F573062948433', 'uuid': 'a94840ff-0670-474c-b16f-573062948433', 'anagrpid': 2}, {'nsid': 85, 'bdev_name': 
'bdev_13cb0586-87f4-49ec-a765-23ddeb89a948', 'name': 'bdev_13cb0586-87f4-49ec-a765-23ddeb89a948', 'nguid': '13CB058687F449ECA76523DDEB89A948', 'uuid': '13cb0586-87f4-49ec-a765-23ddeb89a948', 'anagrpid': 1}, {'nsid': 86, 'bdev_name': 'bdev_517eb881-b9e8-4183-a45f-43ce3b13f5d5', 'name': 'bdev_517eb881-b9e8-4183-a45f-43ce3b13f5d5', 'nguid': '517EB881B9E84183A45F43CE3B13F5D5', 'uuid': '517eb881-b9e8-4183-a45f-43ce3b13f5d5', 'anagrpid': 2}, {'nsid': 87, 'bdev_name': 'bdev_60760af1-fe3d-440a-83e2-8f4114eeddb9', 'name': 'bdev_60760af1-fe3d-440a-83e2-8f4114eeddb9', 'nguid': '60760AF1FE3D440A83E28F4114EEDDB9', 'uuid': '60760af1-fe3d-440a-83e2-8f4114eeddb9', 'anagrpid': 1}, {'nsid': 88, 'bdev_name': 'bdev_dcec734a-9886-4766-97df-d85f68d82a9b', 'name': 'bdev_dcec734a-9886-4766-97df-d85f68d82a9b', 'nguid': 'DCEC734A9886476697DFD85F68D82A9B', 'uuid': 'dcec734a-9886-4766-97df-d85f68d82a9b', 'anagrpid': 2}, {'nsid': 89, 'bdev_name': 'bdev_e2b05ab2-add0-41d8-8ff6-071dc289abfd', 'name': 'bdev_e2b05ab2-add0-41d8-8ff6-071dc289abfd', 'nguid': 'E2B05AB2ADD041D88FF6071DC289ABFD', 'uuid': 'e2b05ab2-add0-41d8-8ff6-071dc289abfd', 'anagrpid': 1}, {'nsid': 90, 'bdev_name': 'bdev_5e73c16b-9860-4e95-b6b3-0e3e693e1af0', 'name': 'bdev_5e73c16b-9860-4e95-b6b3-0e3e693e1af0', 'nguid': '5E73C16B98604E95B6B30E3E693E1AF0', 'uuid': '5e73c16b-9860-4e95-b6b3-0e3e693e1af0', 'anagrpid': 2}, {'nsid': 91, 'bdev_name': 'bdev_34aa1267-e173-43bb-92c7-f1115777780c', 'name': 'bdev_34aa1267-e173-43bb-92c7-f1115777780c', 'nguid': '34AA1267E17343BB92C7F1115777780C', 'uuid': '34aa1267-e173-43bb-92c7-f1115777780c', 'anagrpid': 1}, {'nsid': 92, 'bdev_name': 'bdev_8e60b104-8b4c-4d36-80a5-af9cfaf976d6', 'name': 'bdev_8e60b104-8b4c-4d36-80a5-af9cfaf976d6', 'nguid': '8E60B1048B4C4D3680A5AF9CFAF976D6', 'uuid': '8e60b104-8b4c-4d36-80a5-af9cfaf976d6', 'anagrpid': 2}, {'nsid': 93, 'bdev_name': 'bdev_568247e6-4cf3-4200-849e-bf5f714ca877', 'name': 'bdev_568247e6-4cf3-4200-849e-bf5f714ca877', 'nguid': '568247E64CF34200849EBF5F714CA877', 'uuid': '568247e6-4cf3-4200-849e-bf5f714ca877', 'anagrpid': 1}, {'nsid': 94, 'bdev_name': 'bdev_46615ebf-2a5a-43f1-9070-22be877a9346', 'name': 'bdev_46615ebf-2a5a-43f1-9070-22be877a9346', 'nguid': '46615EBF2A5A43F1907022BE877A9346', 'uuid': '46615ebf-2a5a-43f1-9070-22be877a9346', 'anagrpid': 2}, {'nsid': 95, 'bdev_name': 'bdev_764372d4-6551-4d16-9d92-22d387b3e9d7', 'name': 'bdev_764372d4-6551-4d16-9d92-22d387b3e9d7', 'nguid': '764372D465514D169D9222D387B3E9D7', 'uuid': '764372d4-6551-4d16-9d92-22d387b3e9d7', 'anagrpid': 1}, {[27-May-2024 11:48:39] DEBUG grpc.py:2083: Subsystem nqn.2016-06.io.spdk:cnode1 enable_ha: True May 27 11:48:39 argo023 [30679]: [27-May-2024 11:48:39] DEBUG grpc.py:2083: Subsystem nqn.2016-06.io.spdk:cnode2 enable_ha: True May 27 11:48:39 argo023 [30679]: [27-May-2024 11:48:39] DEBUG grpc.py:1961: Received request to list connections for nqn.2016-06.io.spdk:cnode1, context: None May 27 11:48:46 argo023 [30679]: [27-May-2024 11:48:46] ERROR server.py:42: GatewayServer: SIGCHLD received signum=17 May 27 11:48:46 argo023 [30679]: [27-May-2024 11:48:46] ERROR server.py:46: PID of terminated child process is 133 May 27 11:48:46 argo023 [30679]: [27-May-2024 11:48:46] ERROR server.py:110: GatewayServer exception occurred: May 27 11:48:46 argo023 [30679]: Traceback (most recent call last): May 27 11:48:46 argo023 [30679]: File "/remote-source/ceph-nvmeof/app/control/server.py", line 521, in _ping May 27 11:48:46 argo023 [30679]: ret = spdk.rpc.spdk_get_version(self.spdk_rpc_ping_client) May 27 11:48:46 
argo023 [30679]: File "/usr/local/lib/python3.9/site-packages/spdk/rpc/__init__.py", line 78, in spdk_get_version May 27 11:48:46 argo023 [30679]: return client.call('spdk_get_version') May 27 11:48:46 argo023 [30679]: File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 189, in call May 27 11:48:46 argo023 [30679]: response = self.recv() May 27 11:48:46 argo023 [30679]: File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 166, in recv May 27 11:48:46 argo023 [30679]: newdata = self.sock.recv(4096) May 27 11:48:46 argo023 [30679]: ConnectionResetError: [Errno 104] Connection reset by peer May 28 05:17:36 argo023 [565399]: [28-May-2024 05:17:36] INFO grpc.py:2099: Received request to create argo023 TCP ipv4 listener for nqn.2016-06.io.spdk:cnode2 at 10.8.128.223:4420, context: None May 28 05:17:56 argo023 [565399]: [2024-05-28 05:17:56.134518] ctrlr.c: 178:nvmf_ctrlr_keep_alive_poll: *NOTICE*: Disconnecting host nqn.2014-08.org.nvmexpress:uuid:4c4c4544-0056-3610-8047-b8c04f525433 from subsystem nqn.2016-06.io.spdk:cnode2 due to keep alive timeout. May 28 05:18:13 argo023 [565399]: [2024-05-28 05:18:13.021165] ctrlr.c: 178:nvmf_ctrlr_keep_alive_poll: *NOTICE*: Disconnecting host nqn.2014-08.org.nvmexpress:uuid:4c4c4544-0056-3610-8047-b8c04f525433 from subsystem nqn.2016-06.io.spdk:cnode1 due to keep alive timeout. May 28 05:18:34 argo023 [565399]: [28-May-2024 05:18:34] ERROR server.py:42: GatewayServer: SIGCHLD received signum=17 May 28 05:18:34 argo023 [565399]: [28-May-2024 05:18:34] ERROR server.py:46: PID of terminated child process is 82 May 28 05:18:34 argo023 [565399]: [28-May-2024 05:18:34] ERROR grpc.py:1967: Failure listing connections, can't get qpairs May 28 05:18:34 argo023 [565399]: Traceback (most recent call last): May 28 05:18:34 argo023 [565399]: File "/remote-source/ceph-nvmeof/app/control/grpc.py", line 1963, in list_connections_safe May 28 05:18:34 argo023 [565399]: qpair_ret = rpc_nvmf.nvmf_subsystem_get_qpairs(self.spdk_rpc_client, nqn=request.subsystem) May 28 05:18:34 argo023 [565399]: File "/usr/local/lib/python3.9/site-packages/spdk/rpc/nvmf.py", line 637, in nvmf_subsystem_get_qpairs May 28 05:18:34 argo023 [565399]: return client.call('nvmf_subsystem_get_qpairs', params) May 28 05:18:34 argo023 [565399]: File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 196, in call May 28 05:18:34 argo023 [565399]: raise e May 28 05:18:34 argo023 [565399]: File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 189, in call May 28 05:18:34 argo023 [565399]: response = self.recv() May 28 05:18:34 argo023 [565399]: File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 170, in recv May 28 05:18:34 argo023 [565399]: raise JSONRPCException("Connection closed with partial response:\n%s\n" % self._recv_buf) May 28 05:18:34 argo023 [565399]: spdk.rpc.client.JSONRPCException: Connection closed with partial response: May 28 05:18:34 argo023 [565399]: May 28 05:18:34 argo023 [565399]: May 28 05:18:34 argo023 [565399]: May 28 05:18:34 argo023 [565399]: [28-May-2024 05:18:34] ERROR server.py:110: GatewayServer exception occurred: May 28 05:18:34 argo023 [565399]: Traceback (most recent call last): May 28 05:18:34 argo023 [565399]: File "/remote-source/ceph-nvmeof/app/control/server.py", line 521, in _ping May 28 05:18:34 argo023 [565399]: ret = spdk.rpc.spdk_get_version(self.spdk_rpc_ping_client) May 28 05:18:34 argo023 [565399]: File "/usr/local/lib/python3.9/site-packages/spdk/rpc/__init__.py", line 
78, in spdk_get_version May 28 05:18:34 argo023 [565399]: return client.call('spdk_get_version') May 28 05:18:34 argo023 [565399]: File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 189, in call May 28 05:18:34 argo023 [565399]: response = self.recv() May 28 05:18:34 argo023 [565399]: File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 166, in recv May 28 05:18:34 argo023 [565399]: newdata = self.sock.recv(4096) May 28 05:18:34 argo023 [565399]: ConnectionResetError: [Errno 104] Connection reset by peer May 28 05:18:34 argo023 [565399]: May 28 05:18:34 argo023 [565399]: During handling of the above exception, another exception occurred: May 28 05:18:34 argo023 [565399]: May 28 05:18:34 argo023 [565399]: Traceback (most recent call last): May 28 05:18:34 argo023 [565399]: File "/remote-source/ceph-nvmeof/app/control/__main__.py", line 45, in <module> May 28 05:18:34 argo023 [565399]: gateway.keep_alive() May 28 05:18:34 argo023 [565399]: File "/remote-source/ceph-nvmeof/app/control/server.py", line 508, in keep_alive May 28 05:18:34 argo023 [565399]: alive = self._ping() May 28 05:18:34 argo023 [565399]: File "/remote-source/ceph-nvmeof/app/control/server.py", line 524, in _ping May 28 05:18:34 argo023 [565399]: self.logger.exception(f"spdk_get_version failed") May 28 05:18:34 argo023 [565399]: File "/usr/lib64/python3.9/logging/__init__.py", line 1481, in exception May 28 05:18:34 argo023 [565399]: self.error(msg, *args, exc_info=exc_info, **kwargs) May 28 05:18:34 argo023 [565399]: File "/usr/lib64/python3.9/logging/__init__.py", line 1475, in error May 28 05:18:34 argo023 [565399]: self._log(ERROR, msg, args, **kwargs) May 28 05:18:34 argo023 [565399]: File "/usr/lib64/python3.9/logging/__init__.py", line 1589, in _log May 28 05:18:34 argo023 [565399]: self.handle(record) May 28 05:18:34 argo023 [565399]: File "/usr/lib64/python3.9/logging/__init__.py", line 1599, in handle May 28 05:18:34 argo023 [565399]: self.callHandlers(record) May 28 05:18:34 argo023 [565399]: File "/usr/lib64/python3.9/logging/__init__.py", line 1661, in callHandlers May 28 05:18:34 argo023 [565399]: hdlr.handle(record) May 28 05:18:34 argo023 [565399]: File "/usr/lib64/python3.9/logging/__init__.py", line 952, in handle May 28 05:18:34 argo023 [565399]: self.emit(record) May 28 05:18:34 argo023 [565399]: File "/usr/lib64/python3.9/logging/handlers.py", line 73, in emit May 28 05:18:34 argo023 [565399]: if self.shouldRollover(record): May 28 05:18:34 argo023 [565399]: File "/usr/lib64/python3.9/logging/handlers.py", line 194, in shouldRollover May 28 05:18:34 argo023 [565399]: msg = "%s\n" % self.format(record) May 28 05:18:34 argo023 [565399]: File "/usr/lib64/python3.9/logging/__init__.py", line 927, in format May 28 05:18:34 argo023 [565399]: return fmt.format(record) May 28 05:18:34 argo023 [565399]: File "/usr/lib64/python3.9/logging/__init__.py", line 671, in format May 28 05:18:34 argo023 [565399]: record.exc_text = self.formatException(record.exc_info) May 28 05:18:34 argo023 [565399]: File "/usr/lib64/python3.9/logging/__init__.py", line 621, in formatException May 28 05:18:34 argo023 [565399]: traceback.print_exception(ei[0], ei[1], tb, None, sio) May 28 05:18:34 argo023 [565399]: File "/usr/lib64/python3.9/traceback.py", line 103, in print_exception May 28 05:18:34 argo023 [565399]: for line in TracebackException( May 28 05:18:34 argo023 [565399]: File "/usr/lib64/python3.9/traceback.py", line 517, in __init__ May 28 05:18:34 argo023 [565399]: self.stack = 
StackSummary.extract( May 28 05:18:34 argo023 [565399]: File "/usr/lib64/python3.9/traceback.py", line 366, in extract May 28 05:18:34 argo023 [565399]: f.line May 28 05:18:34 argo023 [565399]: File "/usr/lib64/python3.9/traceback.py", line 288, in line May 28 05:18:34 argo023 [565399]: self._line = linecache.getline(self.filename, self.lineno).strip() May 28 05:18:34 argo023 [565399]: File "/usr/lib64/python3.9/linecache.py", line 30, in getline May 28 05:18:34 argo023 [565399]: lines = getlines(filename, module_globals) May 28 05:18:34 argo023 [565399]: File "/usr/lib64/python3.9/linecache.py", line 46, in getlines May 28 05:18:34 argo023 [565399]: return updatecache(filename, module_globals) May 28 05:18:34 argo023 [565399]: File "/usr/lib64/python3.9/linecache.py", line 93, in updatecache May 28 05:18:34 argo023 [565399]: stat = os.stat(fullname) May 28 05:18:34 argo023 [565399]: File "/remote-source/ceph-nvmeof/app/control/server.py", line 55, in sigchld_handler May 28 05:18:34 argo023 [565399]: raise SystemExit(f"Gateway subprocess terminated {pid=} {exit_code=}") May 28 05:18:34 argo023 [565399]: SystemExit: Gateway subprocess terminated pid=82 exit_code=-11 May 28 05:18:34 argo023 [565399]: [28-May-2024 05:18:34] ERROR prometheus.py:201: Exporter failed to fetch connection info for nqn.2016-06.io.spdk:cnode1: Failure listing connections, can't get qpairs: Connection closed with partial response: May 28 05:18:34 argo023 [565399]: May 28 05:18:34 argo023 [565399]: May 28 05:18:34 argo023 [565399]: May 28 05:18:34 argo023 [565399]: [28-May-2024 05:18:34] ERROR grpc.py:2457: get_subsystems failed May 28 05:18:34 argo023 [565399]: Traceback (most recent call last): May 28 05:18:34 argo023 [565399]: File "/remote-source/ceph-nvmeof/app/control/grpc.py", line 2455, in get_subsystems_safe May 28 05:18:34 argo023 [565399]: ret = rpc_nvmf.nvmf_get_subsystems(self.spdk_rpc_client) May 28 05:18:34 argo023 [565399]: File "/usr/local/lib/python3.9/site-packages/spdk/rpc/nvmf.py", line 177, in nvmf_get_subsystems May 28 05:18:34 argo023 [565399]: return client.call('nvmf_get_subsystems', params) May 28 05:18:34 argo023 [565399]: File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 187, in call May 28 05:18:34 argo023 [565399]: req_id = self.send(method, params) May 28 05:18:34 argo023 [565399]: File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 145, in send May 28 05:18:34 argo023 [565399]: self.flush() May 28 05:18:34 argo023 [565399]: File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 141, in flush May 28 05:18:34 argo023 [565399]: self.sock.sendall(reqstr.encode("utf-8")) May 28 05:18:34 argo023 [565399]: AttributeError: 'NoneType' object has no attribute 'sendall' May 28 05:18:34 argo023 [565399]: [28-May-2024 05:18:34] INFO server.py:427: Aborting (client.nvmeof.nvmeof_pool.argo023.qcoxrd) pid 19... 
May 28 05:18:34 argo023 [565399]: *** Caught signal (Aborted) ** May 28 05:18:34 argo023 [565399]: in thread 7f364e1155c0 thread_name:ceph-nvmeof-mon May 28 05:18:34 argo023 [565399]: [28-May-2024 05:18:34] ERROR grpc.py:1967: Failure listing connections, can't get qpairs May 28 05:18:34 argo023 [565399]: Traceback (most recent call last): May 28 05:18:34 argo023 [565399]: File "/remote-source/ceph-nvmeof/app/control/grpc.py", line 1963, in list_connections_safe May 28 05:18:34 argo023 [565399]: qpair_ret = rpc_nvmf.nvmf_subsystem_get_qpairs(self.spdk_rpc_client, nqn=request.subsystem) May 28 05:18:34 argo023 [565399]: File "/usr/local/lib/python3.9/site-packages/spdk/rpc/nvmf.py", line 637, in nvmf_subsystem_get_qpairs May 28 05:18:34 argo023 [565399]: return client.call('nvmf_subsystem_get_qpairs', params) May 28 05:18:34 argo023 [565399]: File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 187, in call May 28 05:18:34 argo023 [565399]: req_id = self.send(method, params) May 28 05:18:34 argo023 [565399]: File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 145, in send May 28 05:18:34 argo023 [565399]: self.flush() May 28 05:18:34 argo023 [565399]: File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 141, in flush May 28 05:18:34 argo023 [565399]: self.sock.sendall(reqstr.encode("utf-8")) May 28 05:18:34 argo023 [565399]: AttributeError: 'NoneType' object has no attribute 'sendall' May 28 05:18:34 argo023 [565399]: [28-May-2024 05:18:34] ERROR prometheus.py:201: Exporter failed to fetch connection info for nqn.2016-06.io.spdk:cnode2: Failure listing connections, can't get qpairs: May 28 05:18:34 argo023 [565399]: 'NoneType' object has no attribute 'sendall' May 28 05:18:34 argo023 [565399]: ceph version 18.2.1-190.el9cp (5eee8f17de7cfe7a752abc74828d97473040534e) reef (stable) May 28 05:18:34 argo023 [565399]: 1: /lib64/libc.so.6(+0x3e6f0) [0x7f364f3816f0] May 28 05:18:34 argo023 [565399]: 2: /lib64/libc.so.6(+0x8679a) [0x7f364f3c979a] May 28 05:18:34 argo023 [565399]: 3: pthread_cond_wait() May 28 05:18:34 argo023 [565399]: 4: (std::condition_variable::wait(std::unique_lock<std::mutex>&)+0x10) [0x7f364f7196b0] May 28 05:18:34 argo023 [565399]: 5: (AsyncMessenger::wait()+0x137) [0x7f36501d6957] May 28 05:18:34 argo023 [565399]: 6: (NVMeofGwMonitorClient::main(std::vector<char const*, std::allocator<char const*> >)+0x18) [0x5591765b1fb8] May 28 05:18:34 argo023 [565399]: 7: main() May 28 05:18:34 argo023 [565399]: 8: /lib64/libc.so.6(+0x29590) [0x7f364f36c590] May 28 05:18:34 argo023 [565399]: 9: __libc_start_main() May 28 05:18:34 argo023 [565399]: 10: _start() May 28 05:18:34 argo023 systemd-coredump[566135]: Process 565423 (ceph-nvmeof-mon) of user 0 dumped core. May 28 05:18:34 argo023 [565399]: [28-May-2024 05:18:34] INFO server.py:427: Aborting (client.nvmeof.nvmeof_pool.argo023.qcoxrd) pid 82...
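Side note on the secondary tracebacks above: once the SPDK child dies, the gateway and the Prometheus exporter keep reusing an RPC client whose socket is already gone, which is what shows up as the 'NoneType' object has no attribute 'sendall' and 'Connection reset by peer' errors. A minimal defensive sketch (not the shipped fix; it only assumes the spdk.rpc.client interfaces visible in the tracebacks, and the helper name is hypothetical):

from spdk.rpc.client import JSONRPCClient, JSONRPCException

def call_rpc_safely(client: JSONRPCClient, method: str, params=None):
    # Convert the AttributeError on a closed/None socket into the same
    # JSONRPCException the gateway callers already handle.
    if getattr(client, "sock", None) is None:
        raise JSONRPCException(f"SPDK RPC socket is closed, cannot send {method}")
    return client.call(method, params) if params is not None else client.call(method)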
We prepared a fix for the SPDK segmentation fault. The fix will be included in the new GW downstream images (1.2.13). The fix is ready here - https://github.com/ceph/spdk/commit/07e69f8b0eb1ae4b4f7e675a00ea6a8fbe061666 - and the downstream SPDK build will include it.
The issue is not seen in any of our clusters running ceph version 18.2.1-191.el9cp with NVMe CLI version nvmeof-cli-rhel9:1.2.13-1. Hence marking this BZ as verified.