Bug 2282839 - [NVMe-7.1-Tracker] [NVMe HA] NVMe gateway service fails after listener and host addition in latest RC build
Summary: [NVMe-7.1-Tracker] [NVMe HA] NVMe gateway service fails after listener and host addition in latest RC build
Keywords:
Status: VERIFIED
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: NVMeOF
Version: 7.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 7.1
Assignee: Aviv Caro
QA Contact: Manasa
Docs Contact: ceph-doc-bot
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-05-23 07:01 UTC by Manasa
Modified: 2025-04-12 08:28 UTC
CC List: 5 users

Fixed In Version: ceph-nvmeof-container-1.2.13-1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-05-23 16:58:48 UTC
Embargoed:




Links:
Red Hat Issue Tracker: RHCEPH-9083 (last updated 2024-05-23 07:11:39 UTC)

Description Manasa 2024-05-23 07:01:58 UTC
Description of problem:
NVMe gateway service fails after listener and host addition in latest RC build

Version-Release number of selected component (if applicable):
ceph 7.1 RC Build - 18.2.1-188

How reproducible:
Tried once

Steps to Reproduce:
1. Deploy 4 NVMe gateways on the Ceph cluster.
2. Create subsystems on the gateway.
3. Add listeners and host access to the gateway.

The following sequence of steps was performed to add listeners and host access. The gateway kept going down and coming back up, but during host addition the gateway went into a failure state and never came back up.

[root@ceph-mytest-qqmwz8-node4 ~]# podman run cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.9-2 --server-address 10.0.210.138 --server-port 5500 subsystem add -n nqn.2016-06.io.spdk:cnode1 -s 1 -m 400
Adding subsystem nqn.2016-06.io.spdk:cnode1: Successful

[root@ceph-mytest-qqmwz8-node4 ~]# podman run cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.9-2 --server-address 10.0.210.138 --server-port 5500 subsystem add -n nqn.2016-06.io.spdk:cnode2 -s 2 -m 400
Adding subsystem nqn.2016-06.io.spdk:cnode2: Successful

[root@ceph-mytest-qqmwz8-node4 ~]# podman run cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.9-2 --server-address 10.0.210.138 --server-port 5500 subsystem add -n nqn.2016-06.io.spdk:cnode3 -s 3 -m 400
Adding subsystem nqn.2016-06.io.spdk:cnode3: Successful

[root@ceph-mytest-qqmwz8-node4 ~]# podman run cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.9-2 --server-address 10.0.210.138 --server-port 5500 subsystem add -n nqn.2016-06.io.spdk:cnode4 -s 4 -m 400
Adding subsystem nqn.2016-06.io.spdk:cnode4: Successful



[root@ceph-mytest-qqmwz8-node4 ~]# podman run cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.9-2 --server-address 10.0.210.138 --server-port 5500 listener add -n nqn.2016-06.io.spdk:cnode1 -t ceph-mytest-qqmwz8-node4 -a 10.0.210.138 -s 4420
Adding nqn.2016-06.io.spdk:cnode1 listener at 10.0.210.138:4420: Successful

[root@ceph-mytest-qqmwz8-node4 ~]# podman run cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.9-2 --server-address 10.0.210.138 --server-port 5500 listener add -n nqn.2016-06.io.spdk:cnode2 -t ceph-mytest-qqmwz8-node4 -a 10.0.210.138 -s 4420
Failure adding nqn.2016-06.io.spdk:cnode2 listener at 10.0.210.138:4420:
<_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused"
	debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused {created_time:"2024-05-23T03:44:29.136318146+00:00", grpc_status:14}"
>

[root@ceph-mytest-qqmwz8-node4 ~]# podman run cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.9-2 --server-address 10.0.210.138 --server-port 5500 listener add -n nqn.2016-06.io.spdk:cnode3 -t ceph-mytest-qqmwz8-node4 -a 10.0.210.138 -s 4420
Failure adding nqn.2016-06.io.spdk:cnode3 listener at 10.0.210.138:4420:
<_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused"
	debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2024-05-23T03:44:34.843914196+00:00"}"
>

[root@ceph-mytest-qqmwz8-node4 ~]# podman run cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.9-2 --server-address 10.0.210.138 --server-port 5500 listener add -n nqn.2016-06.io.spdk:cnode3 -t ceph-mytest-qqmwz8-node4 -a 10.0.210.138 -s 4420
Failure adding nqn.2016-06.io.spdk:cnode3 listener at 10.0.210.138:4420:
<_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused"
	debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused {created_time:"2024-05-23T03:45:02.714212308+00:00", grpc_status:14}"
>

[root@ceph-mytest-qqmwz8-node4 ~]# systemctl |grep nvmeof
  ceph-c4522f3e-18ae-11ef-b7f3-fa163e5f9d20.ceph-mytest-qqmwz8-node4.ysrssb.service                  loaded activating start     start Ceph nvmeof.nvmeof.ceph-mytest-qqmwz8-node4.ysrssb for c4522f3e-18ae-11ef-b7f3-fa163e5f9d20


[root@ceph-mytest-qqmwz8-node4 ~]# systemctl |grep nvmeof
  ceph-c4522f3e-18ae-11ef-b7f3-fa163e5f9d20.ceph-mytest-qqmwz8-node4.ysrssb.service                  loaded active running   Ceph nvmeof.nvmeof.ceph-mytest-qqmwz8-node4.ysrssb for c4522f3e-18ae-11ef-b7f3-fa163e5f9d20


[root@ceph-mytest-qqmwz8-node4 ~]# podman run cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.9-2 --server-address 10.0.210.138 --server-port 5500 listener add -n nqn.2016-06.io.spdk:cnode2 -t ceph-mytest-qqmwz8-node4 -a 10.0.210.138 -s 4420
Adding nqn.2016-06.io.spdk:cnode2 listener at 10.0.210.138:4420: Successful

[root@ceph-mytest-qqmwz8-node4 ~]# podman run cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.9-2 --server-address 10.0.210.138 --server-port 5500 listener add -n nqn.2016-06.io.spdk:cnode3 -t ceph-mytest-qqmwz8-node4 -a 10.0.210.138 -s 4420
Failure adding nqn.2016-06.io.spdk:cnode3 listener at 10.0.210.138:4420:
<_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused"
	debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused {created_time:"2024-05-23T03:45:45.035296346+00:00", grpc_status:14}"
>


[root@ceph-mytest-qqmwz8-node4 ~]# systemctl |grep nvmeof
  ceph-c4522f3e-18ae-11ef-b7f3-fa163e5f9d20.ceph-mytest-qqmwz8-node4.ysrssb.service                  loaded active running   Ceph nvmeof.nvmeof.ceph-mytest-qqmwz8-node4.ysrssb for c4522f3e-18ae-11ef-b7f3-fa163e5f9d20


[root@ceph-mytest-qqmwz8-node4 ~]# podman run cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.9-2 --server-address 10.0.210.138 --server-port 5500 listener add -n nqn.2016-06.io.spdk:cnode3 -t ceph-mytest-qqmwz8-node4 -a 10.0.210.138 -s 4420


Adding nqn.2016-06.io.spdk:cnode3 listener at 10.0.210.138:4420: Successful
[root@ceph-mytest-qqmwz8-node4 ~]# podman run cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.9-2 --server-address 10.0.210.138 --server-port 5500 listener add -n nqn.2016-06.io.spdk:cnode4 -t ceph-mytest-qqmwz8-node4 -a 10.0.210.138 -s 4420


Adding nqn.2016-06.io.spdk:cnode4 listener at 10.0.210.138:4420: Successful
[root@ceph-mytest-qqmwz8-node4 ~]# podman run cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.9-2 --server-address 10.0.210.138 --server-port 5500 host add -n nqn.2016-06.io.spdk:cnode1 -t "*"

Allowing open host access to nqn.2016-06.io.spdk:cnode1: Successful
[root@ceph-mytest-qqmwz8-node4 ~]# podman run cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.9-2 --server-address 10.0.210.138 --server-port 5500 host add -n nqn.2016-06.io.spdk:cnode2 -t "*"
Failure allowing open host access to nqn.2016-06.io.spdk:cnode2:
<_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused"
	debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:10.0.210.138:5500: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2024-05-23T03:46:30.059226616+00:00"}"
>


[root@ceph-mytest-qqmwz8-node4 ~]# systemctl |grep nvmeof
  ceph-c4522f3e-18ae-11ef-b7f3-fa163e5f9d20.ceph-mytest-qqmwz8-node4.ysrssb.service                  loaded activating auto-restart Ceph nvmeof.nvmeof.ceph-mytest-qqmwz8-node4.ysrssb for c4522f3e-18ae-11ef-b7f3-fa163e5f9d20

[root@ceph-mytest-qqmwz8-node4 ~]# systemctl |grep nvmeof
● ceph-c4522f3e-18ae-11ef-b7f3-fa163e5f9d20.ceph-mytest-qqmwz8-node4.ysrssb.service                  loaded failed failed    Ceph nvmeof.nvmeof.ceph-mytest-qqmwz8-node4.ysrssb for c4522f3e-18ae-11ef-b7f3-fa163e5f9d20

[root@ceph-mytest-qqmwz8-node4 ~]# systemctl |grep nvmeof
● ceph-c4522f3e-18ae-11ef-b7f3-fa163e5f9d20.ceph-mytest-qqmwz8-node4.ysrssb.service                  loaded failed failed    Ceph nvmeof.nvmeof.ceph-mytest-qqmwz8-node4.ysrssb for c4522f3e-18ae-11ef-b7f3-fa163e5f9d20

Actual results:
The gateway service failed after listeners and hosts were added.

Expected results:
The gateway service should not fail.

Additional info:
Attaching complete journalctl logs of the failed gateway.

Comment 4 Orit Wasserman 2024-05-23 16:58:48 UTC

*** This bug has been marked as a duplicate of bug 2281800 ***

Comment 5 Manasa 2024-05-27 09:36:10 UTC
The error seen in this bug is seen again after upgrading to ceph version ceph-18.2.1-190.el9cp.

The gateway came up to the running state once the upgrade completed. However, the gateway went back to the failed state when I tried to list the listeners.

Attached the NVMe service log for the failed gateway.

Reopening this bug as per discussion with Aviv. I have also provided cluster access to the team.

Comment 7 Aviv Caro 2024-05-28 05:48:06 UTC
We can now clearly see the SPDK segmentation fault in the RPC call to get qpairs. We are also able to reproduce it in a unit test in the dev team. We are preparing two options for 7.1:
- Option 1 (ready in GW 1.2.11 and waiting for a downstream build) is to avoid calling this qpairs method. Some CLI output and statistics will be missing, but it is not critical functionality.
- Option 2 (a fix is being tested in dev now) is to fix the segmentation fault in SPDK. This is better because we don't need to remove any functionality, but it means the build team needs to take the fix and rebuild SPDK. I assume this can be done very quickly; we need to verify with the build team (assuming our fix works).
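
For illustration only, a minimal sketch of what the Option 1 style of mitigation could look like on the gateway side, based on the list_connections_safe tracebacks in the attached logs. This is not the actual ceph-nvmeof 1.2.11 change; the ENABLE_QPAIR_QUERY switch and the helper name are hypothetical. The idea is to guard the nvmf_subsystem_get_qpairs RPC so that a skipped or failed qpairs query degrades the connection listing instead of propagating and bringing the gateway down:

# Hypothetical sketch, not the actual ceph-nvmeof change.
# Option 1 idea: skip or guard the qpairs RPC so its failure is non-fatal.
from spdk.rpc import nvmf as rpc_nvmf
from spdk.rpc.client import JSONRPCException

ENABLE_QPAIR_QUERY = False  # assumed switch; Option 1 would leave it off

def get_subsystem_qpairs(spdk_rpc_client, subsystem_nqn, logger):
    """Return the qpair list for a subsystem, or None if unavailable."""
    if not ENABLE_QPAIR_QUERY:
        # Skip the call entirely; connection statistics will be partial.
        return None
    try:
        return rpc_nvmf.nvmf_subsystem_get_qpairs(spdk_rpc_client,
                                                  nqn=subsystem_nqn)
    except (JSONRPCException, AttributeError) as ex:
        # JSONRPCException: SPDK closed the socket mid-response (e.g. after a
        # segfault). AttributeError: the RPC client socket is already None
        # ('NoneType' object has no attribute 'sendall' in the logs above).
        logger.error(f"Can't get qpairs for {subsystem_nqn}: {ex}")
        return None

With a guard like this, the Prometheus exporter and the connection-listing CLI would report missing qpair data for the affected subsystem rather than crashing the gateway process.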

Comment 8 Manasa 2024-05-28 05:55:44 UTC
I was able to reproduce this issue while adding listeners to a gateway in another cluster as well. Attaching the NVMe service logs for that cluster here too.

Attachment name - nvme_svc_newgw_log.txt

Comment 10 Manasa 2024-05-28 06:00:42 UTC
Also, while performing a leader mon kill (removing the leader mon from the Ceph cluster), we observed that one of the gateways that was working before the mon kill started toggling between on and off states for a while. The error seen in its service log is similar to the error reported here, so this could be the same issue described above.

Attaching the logs for this scenario as well, with the name nvme_log_leadermonkill_scenario.txt.

Comment 12 harika chebrolu 2024-05-28 06:47:19 UTC
I am also facing a similar issue in my scale cluster.
Steps I followed:
1. Created 8 subsystems, configured listeners and hosts.
2. Added 400 namespaces to each subsystem and ran IO on them.

Error:
Initially, adding namespaces to the second subsystem didn't work. However, after trying again later, it worked as expected. It then failed again when adding namespaces to the fifth subsystem, and the error logs look like they are related to the qpairs issue:

`
May 27 14:41:05 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: [27-May-2024 14:41:05] INFO grpc.py:760: Received request to add bdev_5c77a3fd-1b89-4002-a38b-74ae910534c6 to nqn.2016-06.io.spdk:cnode4 with ANA group id 2 using NSID 50 and UUID 5c77a3fd-1b89-4002-a38b-74ae910534c6
May 27 14:41:27 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: [27-May-2024 14:41:27] INFO grpc.py:1232: Received request to list all namespaces for nqn.2016-06.io.spdk:cnode4, context: <grpc._server._Context object at 0x7f6fb77bfe80>, client address: IPv4 10.88.0.215:56624
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: [27-May-2024 14:41:48] ERROR grpc.py:1967: Failure listing connections, can't get qpairs
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: Traceback (most recent call last):
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:   File "/remote-source/ceph-nvmeof/app/control/grpc.py", line 1963, in list_connections_safe
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:     qpair_ret = rpc_nvmf.nvmf_subsystem_get_qpairs(self.spdk_rpc_client, nqn=request.subsystem)
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:   File "/usr/local/lib/python3.9/site-packages/spdk/rpc/nvmf.py", line 637, in nvmf_subsystem_get_qpairs
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:     return client.call('nvmf_subsystem_get_qpairs', params)
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:   File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 196, in call
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:     raise e
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:   File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 189, in call
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:     response = self.recv()
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:   File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 170, in recv
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:     raise JSONRPCException("Connection closed with partial response:\n%s\n" % self._recv_buf)
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: spdk.rpc.client.JSONRPCException: Connection closed with partial response:
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: 
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: 
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: 
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: [27-May-2024 14:41:48] ERROR server.py:42: GatewayServer: SIGCHLD received signum=17
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: [27-May-2024 14:41:48] ERROR prometheus.py:201: Exporter failed to fetch connection info for nqn.2016-06.io.spdk:cnode7: Failure listing connections, can't get qpairs: Connection closed with partial response:
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: 
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: 
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: 
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: [27-May-2024 14:41:48] ERROR grpc.py:2457: get_subsystems failed
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: Traceback (most recent call last):
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:   File "/remote-source/ceph-nvmeof/app/control/grpc.py", line 2455, in get_subsystems_safe
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:     ret = rpc_nvmf.nvmf_get_subsystems(self.spdk_rpc_client)
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:   File "/usr/local/lib/python3.9/site-packages/spdk/rpc/nvmf.py", line 177, in nvmf_get_subsystems
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:     return client.call('nvmf_get_subsystems', params)
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:   File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 187, in call
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:     req_id = self.send(method, params)
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:   File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 145, in send
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:     self.flush()
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:   File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 141, in flush
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:     self.sock.sendall(reqstr.encode("utf-8"))
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: AttributeError: 'NoneType' object has no attribute 'sendall'
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: [27-May-2024 14:41:48] ERROR grpc.py:1967: Failure listing connections, can't get qpairs
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: Traceback (most recent call last):
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:   File "/remote-source/ceph-nvmeof/app/control/grpc.py", line 1963, in list_connections_safe
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:     qpair_ret = rpc_nvmf.nvmf_subsystem_get_qpairs(self.spdk_rpc_client, nqn=request.subsystem)
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:   File "/usr/local/lib/python3.9/site-packages/spdk/rpc/nvmf.py", line 637, in nvmf_subsystem_get_qpairs
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:     return client.call('nvmf_subsystem_get_qpairs', params)
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:   File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 187, in call
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:     req_id = self.send(method, params)
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:   File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 145, in send
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:     self.flush()
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:   File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 141, in flush
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:     self.sock.sendall(reqstr.encode("utf-8"))
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: AttributeError: 'NoneType' object has no attribute 'sendall'
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: [27-May-2024 14:41:48] ERROR server.py:46: PID of terminated child process is 82
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: [27-May-2024 14:41:48] ERROR prometheus.py:201: Exporter failed to fetch connection info for nqn.2016-06.io.spdk:cnode8: Failure listing connections, can't get qpairs:
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: 'NoneType' object has no attribute 'sendall'
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: [27-May-2024 14:41:48] ERROR server.py:110: GatewayServer exception occurred:
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: Traceback (most recent call last):
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:   File "/remote-source/ceph-nvmeof/app/control/server.py", line 521, in _ping
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:     ret = spdk.rpc.spdk_get_version(self.spdk_rpc_ping_client)
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:   File "/usr/local/lib/python3.9/site-packages/spdk/rpc/__init__.py", line 78, in spdk_get_version
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:     return client.call('spdk_get_version')
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:   File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 189, in call
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:     response = self.recv()
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:   File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 166, in recv
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]:     newdata = self.sock.recv(4096)
May 27 14:41:48 tala018 ceph-87e56316-1c07-11ef-9cd2-b4835101e4e4-nvmeof-nvmeof_pool-tala018-pxrker[205306]: ConnectionResetError: [Errno 104] Connection reset by peer
`

Comment 13 Rahul Lepakshi 2024-05-28 07:50:05 UTC
+1

I am also observing the same issue, along with related errors that will likely root-cause to the same problem:

bdev_rbd.c:1039:_bdev_rbd_wait_for_latest_osdmap: *ERROR*: rados_wait_for_latest_osdmap cluster: cluster_context_1_0 elapsed time
bdev_rbd.c:1039:_bdev_rbd_wait_for_latest_osdmap: *ERROR*: rados_wait_for_latest_osdmap cluster: cluster_context_1_1 elapsed time
bdev_rbd.c:1039:_bdev_rbd_wait_for_latest_osdmap: *ERROR*: rados_wait_for_latest_osdmap cluster: cluster_context_1_2 elapsed time
bdev_rbd.c:1039:_bdev_rbd_wait_for_latest_osdmap: *ERROR*: rados_wait_for_latest_osdmap cluster: cluster_context_1_3 elapsed time
bdev_rbd.c:1039:_bdev_rbd_wait_for_latest_osdmap: *ERROR*: rados_wait_for_latest_osdmap cluster: cluster_context_1_4 elapsed time
bdev_rbd.c:1039:_bdev_rbd_wait_for_latest_osdmap: *ERROR*: rados_wait_for_latest_osdmap cluster: cluster_context_1_5 elapsed time
bdev_rbd.c:1039:_bdev_rbd_wait_for_latest_osdmap: *ERROR*: rados_wait_for_latest_osdmap cluster: cluster_context_1_6 elapsed time


May 27 12:18:42 argo023 [201546]: [2024-05-27 12:18:42.300198] tcp.c:1574:nvmf_tcp_qpair_set_recv_state: *ERROR*: The recv state of tqpair=0xbccc820 is same with the state(5) to be set
May 27 12:18:42 argo023 [201546]: [2024-05-27 12:18:42.300240] tcp.c:1574:nvmf_tcp_qpair_set_recv_state: *ERROR*: The recv state of tqpair=0xbccc820 is same with the state(5) to be set
May 27 12:18:51 argo023 [201546]: [2024-05-27 12:18:51.238527] ctrlr.c: 178:nvmf_ctrlr_keep_alive_poll: *NOTICE*: Disconnecting host nqn.2014-08.org.nvmexpress:uuid:4c4c4544-0056-3610-8047-b8c04f525433 from subsystem nqn.2016-06.io.spdk:cnode2 due to keep alive timeout.
May 27 12:19:19 argo023 [201546]: Traceback (most recent call last):
May 27 12:19:19 argo023 [201546]:   File "/remote-source/ceph-nvmeof/app/control/grpc.py", line 2056, in list_connections_safe
May 27 12:19:19 argo023 [201546]:     one_conn = pb2.connection(nqn=hostnqn, connected=True,
May 27 12:19:19 argo023 [201546]: ValueError: unknown enum label ""
May 27 12:27:49 argo023 [201546]: During handling of the above exception, another exception occurred:
May 27 12:27:49 argo023 [201546]: Traceback (most recent call last):
May 27 12:27:49 argo023 [201546]:   File "/usr/lib64/python3.9/wsgiref/handlers.py", line 137, in run
May 27 12:27:49 argo023 [201546]:     self.result = application(self.environ, self.start_response)
May 27 12:27:49 argo023 [201546]:   File "/usr/local/lib/python3.9/site-packages/prometheus_client/exposition.py", line 128, in prometheus_app
May 27 12:27:49 argo023 [201546]:     status, headers, output = _bake_output(registry, accept_header, accept_encoding_header, params, disable_compression)
May 27 12:27:49 argo023 [201546]:   File "/usr/local/lib/python3.9/site-packages/prometheus_client/exposition.py", line 104, in _bake_output
May 27 12:27:49 argo023 [201546]:     output = encoder(registry)
May 27 12:27:49 argo023 [201546]:   File "/usr/local/lib/python3.9/site-packages/prometheus_client/openmetrics/exposition.py", line 21, in generate_latest
May 27 12:27:49 argo023 [201546]:     for metric in registry.collect():
May 27 12:27:49 argo023 [201546]:   File "/usr/local/lib/python3.9/site-packages/prometheus_client/registry.py", line 97, in collect
May 27 12:27:49 argo023 [201546]:     yield from collector.collect()
May 27 12:27:49 argo023 [201546]:   File "/remote-source/ceph-nvmeof/app/control/prometheus.py", line 43, in wrapped
May 27 12:27:49 argo023 [201546]:     for metric in method(self, *args, **kwargs):
May 27 12:27:49 argo023 [201546]:   File "/remote-source/ceph-nvmeof/app/control/prometheus.py", line 225, in collect
May 27 12:27:49 argo023 [201546]:     self._get_data()
May 27 12:27:49 argo023 [201546]:   File "/remote-source/ceph-nvmeof/app/control/prometheus.py", line 213, in _get_data
May 27 12:27:49 argo023 [201546]:     self.connections = self._get_connection_map(self.subsystems)
May 27 12:27:49 argo023 [201546]:   File "/remote-source/ceph-nvmeof/app/control/prometheus.py", line 57, in call
May 27 12:27:49 argo023 [201546]:     result = method(self, *args, **kwargs)
May 27 12:27:49 argo023 [201546]:   File "/remote-source/ceph-nvmeof/app/control/prometheus.py", line 199, in _get_connection_map
May 27 12:27:49 argo023 [201546]:     resp = self.gateway_rpc.list_connections(pb2.list_connections_req(subsystem=subsys.nqn))
May 27 12:27:49 argo023 [201546]:   File "/remote-source/ceph-nvmeof/app/control/grpc.py", line 2075, in list_connections
May 27 12:27:49 argo023 [201546]:     return self.execute_grpc_function(self.list_connections_safe, request, context)
May 27 12:27:49 argo023 [201546]:   File "/remote-source/ceph-nvmeof/app/control/grpc.py", line 264, in execute_grpc_function
May 27 12:27:49 argo023 [201546]:     return self.omap_lock.execute_omap_locking_function(self._grpc_function_with_lock, func, request, context)
May 27 12:27:49 argo023 [201546]:   File "/remote-source/ceph-nvmeof/app/control/state.py", line 253, in execute_omap_locking_function
May 27 12:27:49 argo023 [201546]:     return grpc_func(omap_locking_func, request, context)
May 27 12:27:49 argo023 [201546]:   File "/remote-source/ceph-nvmeof/app/control/grpc.py", line 256, in _grpc_function_with_lock
May 27 12:27:49 argo023 [201546]:     return func(request, context)
May 27 12:27:49 argo023 [201546]:   File "/remote-source/ceph-nvmeof/app/control/grpc.py", line 2063, in list_connections_safe
May 27 12:27:49 argo023 [201546]:     self.logger.exception(f"{con=} parse error")
May 27 12:27:49 argo023 [201546]: NameError: name 'con' is not defined

May 27 11:48:39 argo023 [30679]: 4c2a-a6fb-a47c6cbf4127', 'nguid': '5D853CF96C714C2AA6FBA47C6CBF4127', 'uuid': '5d853cf9-6c71-4c2a-a6fb-a47c6cbf4127', 'anagrpid': 2}, {'nsid': 79, 'bdev_name': 'bdev_0b132b8e-b7e6-43e5-a343-50e770796cb1', 'name': 'bdev_0b132b8e-b7e6-43e5-a343-50e770796cb1', 'nguid': '0B132B8EB7E643E5A34350E770796CB1', 'uuid': '0b132b8e-b7e6-43e5-a343-50e770796cb1', 'anagrpid': 1}, {'nsid': 80, 'bdev_name': 'bdev_b7ec2076-cb6a-41e9-a5fa-77ed05a3b418', 'name': 'bdev_b7ec2076-cb6a-41e9-a5fa-77ed05a3b418', 'nguid': 'B7EC2076CB6A41E9A5FA77ED05A3B418', 'uuid': 'b7ec2076-cb6a-41e9-a5fa-77ed05a3b418', 'anagrpid': 2}, {'nsid': 81, 'bdev_name': 'bdev_75a4082e-7777-4f5f-b8c3-365536a85e26', 'name': 'bdev_75a4082e-7777-4f5f-b8c3-365536a85e26', 'nguid': '75A4082E77774F5FB8C3365536A85E26', 'uuid': '75a4082e-7777-4f5f-b8c3-365536a85e26', 'anagrpid': 1}, {'nsid': 82, 'bdev_name': 'bdev_cb49859b-e787-4f85-bd7e-25c817bfab19', 'name': 'bdev_cb49859b-e787-4f85-bd7e-25c817bfab19', 'nguid': 'CB49859BE7874F85BD7E25C817BFAB19', 'uuid': 'cb49859b-e787-4f85-bd7e-25c817bfab19', 'anagrpid': 2}, {'nsid': 83, 'bdev_name': 'bdev_19bc99fa-e652-47f8-90d2-dd5e74e70e70', 'name': 'bdev_19bc99fa-e652-47f8-90d2-dd5e74e70e70', 'nguid': '19BC99FAE65247F890D2DD5E74E70E70', 'uuid': '19bc99fa-e652-47f8-90d2-dd5e74e70e70', 'anagrpid': 1}, {'nsid': 84, 'bdev_name': 'bdev_a94840ff-0670-474c-b16f-573062948433', 'name': 'bdev_a94840ff-0670-474c-b16f-573062948433', 'nguid': 'A94840FF0670474CB16F573062948433', 'uuid': 'a94840ff-0670-474c-b16f-573062948433', 'anagrpid': 2}, {'nsid': 85, 'bdev_name': 'bdev_13cb0586-87f4-49ec-a765-23ddeb89a948', 'name': 'bdev_13cb0586-87f4-49ec-a765-23ddeb89a948', 'nguid': '13CB058687F449ECA76523DDEB89A948', 'uuid': '13cb0586-87f4-49ec-a765-23ddeb89a948', 'anagrpid': 1}, {'nsid': 86, 'bdev_name': 'bdev_517eb881-b9e8-4183-a45f-43ce3b13f5d5', 'name': 'bdev_517eb881-b9e8-4183-a45f-43ce3b13f5d5', 'nguid': '517EB881B9E84183A45F43CE3B13F5D5', 'uuid': '517eb881-b9e8-4183-a45f-43ce3b13f5d5', 'anagrpid': 2}, {'nsid': 87, 'bdev_name': 'bdev_60760af1-fe3d-440a-83e2-8f4114eeddb9', 'name': 'bdev_60760af1-fe3d-440a-83e2-8f4114eeddb9', 'nguid': '60760AF1FE3D440A83E28F4114EEDDB9', 'uuid': '60760af1-fe3d-440a-83e2-8f4114eeddb9', 'anagrpid': 1}, {'nsid': 88, 'bdev_name': 'bdev_dcec734a-9886-4766-97df-d85f68d82a9b', 'name': 'bdev_dcec734a-9886-4766-97df-d85f68d82a9b', 'nguid': 'DCEC734A9886476697DFD85F68D82A9B', 'uuid': 'dcec734a-9886-4766-97df-d85f68d82a9b', 'anagrpid': 2}, {'nsid': 89, 'bdev_name': 'bdev_e2b05ab2-add0-41d8-8ff6-071dc289abfd', 'name': 'bdev_e2b05ab2-add0-41d8-8ff6-071dc289abfd', 'nguid': 'E2B05AB2ADD041D88FF6071DC289ABFD', 'uuid': 'e2b05ab2-add0-41d8-8ff6-071dc289abfd', 'anagrpid': 1}, {'nsid': 90, 'bdev_name': 'bdev_5e73c16b-9860-4e95-b6b3-0e3e693e1af0', 'name': 'bdev_5e73c16b-9860-4e95-b6b3-0e3e693e1af0', 'nguid': '5E73C16B98604E95B6B30E3E693E1AF0', 'uuid': '5e73c16b-9860-4e95-b6b3-0e3e693e1af0', 'anagrpid': 2}, {'nsid': 91, 'bdev_name': 'bdev_34aa1267-e173-43bb-92c7-f1115777780c', 'name': 'bdev_34aa1267-e173-43bb-92c7-f1115777780c', 'nguid': '34AA1267E17343BB92C7F1115777780C', 'uuid': '34aa1267-e173-43bb-92c7-f1115777780c', 'anagrpid': 1}, {'nsid': 92, 'bdev_name': 'bdev_8e60b104-8b4c-4d36-80a5-af9cfaf976d6', 'name': 'bdev_8e60b104-8b4c-4d36-80a5-af9cfaf976d6', 'nguid': '8E60B1048B4C4D3680A5AF9CFAF976D6', 'uuid': '8e60b104-8b4c-4d36-80a5-af9cfaf976d6', 'anagrpid': 2}, {'nsid': 93, 'bdev_name': 'bdev_568247e6-4cf3-4200-849e-bf5f714ca877', 'name': 'bdev_568247e6-4cf3-4200-849e-bf5f714ca877', 
'nguid': '568247E64CF34200849EBF5F714CA877', 'uuid': '568247e6-4cf3-4200-849e-bf5f714ca877', 'anagrpid': 1}, {'nsid': 94, 'bdev_name': 'bdev_46615ebf-2a5a-43f1-9070-22be877a9346', 'name': 'bdev_46615ebf-2a5a-43f1-9070-22be877a9346', 'nguid': '46615EBF2A5A43F1907022BE877A9346', 'uuid': '46615ebf-2a5a-43f1-9070-22be877a9346', 'anagrpid': 2}, {'nsid': 95, 'bdev_name': 'bdev_764372d4-6551-4d16-9d92-22d387b3e9d7', 'name': 'bdev_764372d4-6551-4d16-9d92-22d387b3e9d7', 'nguid': '764372D465514D169D9222D387B3E9D7', 'uuid': '764372d4-6551-4d16-9d92-22d387b3e9d7', 'anagrpid': 1}, {[27-May-2024 11:48:39] DEBUG grpc.py:2083: Subsystem nqn.2016-06.io.spdk:cnode1 enable_ha: True
May 27 11:48:39 argo023 [30679]: [27-May-2024 11:48:39] DEBUG grpc.py:2083: Subsystem nqn.2016-06.io.spdk:cnode2 enable_ha: True
May 27 11:48:39 argo023 [30679]: [27-May-2024 11:48:39] DEBUG grpc.py:1961: Received request to list connections for nqn.2016-06.io.spdk:cnode1, context: None
May 27 11:48:46 argo023 [30679]: [27-May-2024 11:48:46] ERROR server.py:42: GatewayServer: SIGCHLD received signum=17
May 27 11:48:46 argo023 [30679]: [27-May-2024 11:48:46] ERROR server.py:46: PID of terminated child process is 133
May 27 11:48:46 argo023 [30679]: [27-May-2024 11:48:46] ERROR server.py:110: GatewayServer exception occurred:
May 27 11:48:46 argo023 [30679]: Traceback (most recent call last):
May 27 11:48:46 argo023 [30679]:   File "/remote-source/ceph-nvmeof/app/control/server.py", line 521, in _ping
May 27 11:48:46 argo023 [30679]:     ret = spdk.rpc.spdk_get_version(self.spdk_rpc_ping_client)
May 27 11:48:46 argo023 [30679]:   File "/usr/local/lib/python3.9/site-packages/spdk/rpc/__init__.py", line 78, in spdk_get_version
May 27 11:48:46 argo023 [30679]:     return client.call('spdk_get_version')
May 27 11:48:46 argo023 [30679]:   File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 189, in call
May 27 11:48:46 argo023 [30679]:     response = self.recv()
May 27 11:48:46 argo023 [30679]:   File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 166, in recv
May 27 11:48:46 argo023 [30679]:     newdata = self.sock.recv(4096)
May 27 11:48:46 argo023 [30679]: ConnectionResetError: [Errno 104] Connection reset by peer


May 28 05:17:36 argo023 [565399]: [28-May-2024 05:17:36] INFO grpc.py:2099: Received request to create argo023 TCP ipv4 listener for nqn.2016-06.io.spdk:cnode2 at 10.8.128.223:4420, context: None
May 28 05:17:56 argo023 [565399]: [2024-05-28 05:17:56.134518] ctrlr.c: 178:nvmf_ctrlr_keep_alive_poll: *NOTICE*: Disconnecting host nqn.2014-08.org.nvmexpress:uuid:4c4c4544-0056-3610-8047-b8c04f525433 from subsystem nqn.2016-06.io.spdk:cnode2 due to keep alive timeout.
May 28 05:18:13 argo023 [565399]: [2024-05-28 05:18:13.021165] ctrlr.c: 178:nvmf_ctrlr_keep_alive_poll: *NOTICE*: Disconnecting host nqn.2014-08.org.nvmexpress:uuid:4c4c4544-0056-3610-8047-b8c04f525433 from subsystem nqn.2016-06.io.spdk:cnode1 due to keep alive timeout.
May 28 05:18:34 argo023 [565399]: [28-May-2024 05:18:34] ERROR server.py:42: GatewayServer: SIGCHLD received signum=17
May 28 05:18:34 argo023 [565399]: [28-May-2024 05:18:34] ERROR server.py:46: PID of terminated child process is 82
May 28 05:18:34 argo023 [565399]: [28-May-2024 05:18:34] ERROR grpc.py:1967: Failure listing connections, can't get qpairs
May 28 05:18:34 argo023 [565399]: Traceback (most recent call last):
May 28 05:18:34 argo023 [565399]:   File "/remote-source/ceph-nvmeof/app/control/grpc.py", line 1963, in list_connections_safe
May 28 05:18:34 argo023 [565399]:     qpair_ret = rpc_nvmf.nvmf_subsystem_get_qpairs(self.spdk_rpc_client, nqn=request.subsystem)
May 28 05:18:34 argo023 [565399]:   File "/usr/local/lib/python3.9/site-packages/spdk/rpc/nvmf.py", line 637, in nvmf_subsystem_get_qpairs
May 28 05:18:34 argo023 [565399]:     return client.call('nvmf_subsystem_get_qpairs', params)
May 28 05:18:34 argo023 [565399]:   File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 196, in call
May 28 05:18:34 argo023 [565399]:     raise e
May 28 05:18:34 argo023 [565399]:   File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 189, in call
May 28 05:18:34 argo023 [565399]:     response = self.recv()
May 28 05:18:34 argo023 [565399]:   File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 170, in recv
May 28 05:18:34 argo023 [565399]:     raise JSONRPCException("Connection closed with partial response:\n%s\n" % self._recv_buf)
May 28 05:18:34 argo023 [565399]: spdk.rpc.client.JSONRPCException: Connection closed with partial response:
May 28 05:18:34 argo023 [565399]: 
May 28 05:18:34 argo023 [565399]: 
May 28 05:18:34 argo023 [565399]: 
May 28 05:18:34 argo023 [565399]: [28-May-2024 05:18:34] ERROR server.py:110: GatewayServer exception occurred:
May 28 05:18:34 argo023 [565399]: Traceback (most recent call last):
May 28 05:18:34 argo023 [565399]:   File "/remote-source/ceph-nvmeof/app/control/server.py", line 521, in _ping
May 28 05:18:34 argo023 [565399]:     ret = spdk.rpc.spdk_get_version(self.spdk_rpc_ping_client)
May 28 05:18:34 argo023 [565399]:   File "/usr/local/lib/python3.9/site-packages/spdk/rpc/__init__.py", line 78, in spdk_get_version
May 28 05:18:34 argo023 [565399]:     return client.call('spdk_get_version')
May 28 05:18:34 argo023 [565399]:   File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 189, in call
May 28 05:18:34 argo023 [565399]:     response = self.recv()
May 28 05:18:34 argo023 [565399]:   File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 166, in recv
May 28 05:18:34 argo023 [565399]:     newdata = self.sock.recv(4096)
May 28 05:18:34 argo023 [565399]: ConnectionResetError: [Errno 104] Connection reset by peer
May 28 05:18:34 argo023 [565399]: 
May 28 05:18:34 argo023 [565399]: During handling of the above exception, another exception occurred:
May 28 05:18:34 argo023 [565399]: 
May 28 05:18:34 argo023 [565399]: Traceback (most recent call last):
May 28 05:18:34 argo023 [565399]:   File "/remote-source/ceph-nvmeof/app/control/__main__.py", line 45, in <module>
May 28 05:18:34 argo023 [565399]:     gateway.keep_alive()
May 28 05:18:34 argo023 [565399]:   File "/remote-source/ceph-nvmeof/app/control/server.py", line 508, in keep_alive
May 28 05:18:34 argo023 [565399]:     alive = self._ping()
May 28 05:18:34 argo023 [565399]:   File "/remote-source/ceph-nvmeof/app/control/server.py", line 524, in _ping
May 28 05:18:34 argo023 [565399]:     self.logger.exception(f"spdk_get_version failed")
May 28 05:18:34 argo023 [565399]:   File "/usr/lib64/python3.9/logging/__init__.py", line 1481, in exception
May 28 05:18:34 argo023 [565399]:     self.error(msg, *args, exc_info=exc_info, **kwargs)
May 28 05:18:34 argo023 [565399]:   File "/usr/lib64/python3.9/logging/__init__.py", line 1475, in error
May 28 05:18:34 argo023 [565399]:     self._log(ERROR, msg, args, **kwargs)
May 28 05:18:34 argo023 [565399]:   File "/usr/lib64/python3.9/logging/__init__.py", line 1589, in _log
May 28 05:18:34 argo023 [565399]:     self.handle(record)
May 28 05:18:34 argo023 [565399]:   File "/usr/lib64/python3.9/logging/__init__.py", line 1599, in handle
May 28 05:18:34 argo023 [565399]:     self.callHandlers(record)
May 28 05:18:34 argo023 [565399]:   File "/usr/lib64/python3.9/logging/__init__.py", line 1661, in callHandlers
May 28 05:18:34 argo023 [565399]:     hdlr.handle(record)
May 28 05:18:34 argo023 [565399]:   File "/usr/lib64/python3.9/logging/__init__.py", line 952, in handle
May 28 05:18:34 argo023 [565399]:     self.emit(record)
May 28 05:18:34 argo023 [565399]:   File "/usr/lib64/python3.9/logging/handlers.py", line 73, in emit
May 28 05:18:34 argo023 [565399]:     if self.shouldRollover(record):
May 28 05:18:34 argo023 [565399]:   File "/usr/lib64/python3.9/logging/handlers.py", line 194, in shouldRollover
May 28 05:18:34 argo023 [565399]:     msg = "%s\n" % self.format(record)
May 28 05:18:34 argo023 [565399]:   File "/usr/lib64/python3.9/logging/__init__.py", line 927, in format
May 28 05:18:34 argo023 [565399]:     return fmt.format(record)
May 28 05:18:34 argo023 [565399]:   File "/usr/lib64/python3.9/logging/__init__.py", line 671, in format
May 28 05:18:34 argo023 [565399]:     record.exc_text = self.formatException(record.exc_info)
May 28 05:18:34 argo023 [565399]:   File "/usr/lib64/python3.9/logging/__init__.py", line 621, in formatException
May 28 05:18:34 argo023 [565399]:     traceback.print_exception(ei[0], ei[1], tb, None, sio)
May 28 05:18:34 argo023 [565399]:   File "/usr/lib64/python3.9/traceback.py", line 103, in print_exception
May 28 05:18:34 argo023 [565399]:     for line in TracebackException(
May 28 05:18:34 argo023 [565399]:   File "/usr/lib64/python3.9/traceback.py", line 517, in __init__
May 28 05:18:34 argo023 [565399]:     self.stack = StackSummary.extract(
May 28 05:18:34 argo023 [565399]:   File "/usr/lib64/python3.9/traceback.py", line 366, in extract
May 28 05:18:34 argo023 [565399]:     f.line
May 28 05:18:34 argo023 [565399]:   File "/usr/lib64/python3.9/traceback.py", line 288, in line
May 28 05:18:34 argo023 [565399]:     self._line = linecache.getline(self.filename, self.lineno).strip()
May 28 05:18:34 argo023 [565399]:   File "/usr/lib64/python3.9/linecache.py", line 30, in getline
May 28 05:18:34 argo023 [565399]:     lines = getlines(filename, module_globals)
May 28 05:18:34 argo023 [565399]:   File "/usr/lib64/python3.9/linecache.py", line 46, in getlines
May 28 05:18:34 argo023 [565399]:     return updatecache(filename, module_globals)
May 28 05:18:34 argo023 [565399]:   File "/usr/lib64/python3.9/linecache.py", line 93, in updatecache
May 28 05:18:34 argo023 [565399]:     stat = os.stat(fullname)
May 28 05:18:34 argo023 [565399]:   File "/remote-source/ceph-nvmeof/app/control/server.py", line 55, in sigchld_handler
May 28 05:18:34 argo023 [565399]:     raise SystemExit(f"Gateway subprocess terminated {pid=} {exit_code=}")
May 28 05:18:34 argo023 [565399]: SystemExit: Gateway subprocess terminated pid=82 exit_code=-11
May 28 05:18:34 argo023 [565399]: [28-May-2024 05:18:34] ERROR prometheus.py:201: Exporter failed to fetch connection info for nqn.2016-06.io.spdk:cnode1: Failure listing connections, can't get qpairs: Connection closed with partial response:
May 28 05:18:34 argo023 [565399]: 
May 28 05:18:34 argo023 [565399]: 
May 28 05:18:34 argo023 [565399]: 
May 28 05:18:34 argo023 [565399]: [28-May-2024 05:18:34] ERROR grpc.py:2457: get_subsystems failed
May 28 05:18:34 argo023 [565399]: Traceback (most recent call last):
May 28 05:18:34 argo023 [565399]:   File "/remote-source/ceph-nvmeof/app/control/grpc.py", line 2455, in get_subsystems_safe
May 28 05:18:34 argo023 [565399]:     ret = rpc_nvmf.nvmf_get_subsystems(self.spdk_rpc_client)
May 28 05:18:34 argo023 [565399]:   File "/usr/local/lib/python3.9/site-packages/spdk/rpc/nvmf.py", line 177, in nvmf_get_subsystems
May 28 05:18:34 argo023 [565399]:     return client.call('nvmf_get_subsystems', params)
May 28 05:18:34 argo023 [565399]:   File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 187, in call
May 28 05:18:34 argo023 [565399]:     req_id = self.send(method, params)
May 28 05:18:34 argo023 [565399]:   File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 145, in send
May 28 05:18:34 argo023 [565399]:     self.flush()
May 28 05:18:34 argo023 [565399]:   File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 141, in flush
May 28 05:18:34 argo023 [565399]:     self.sock.sendall(reqstr.encode("utf-8"))
May 28 05:18:34 argo023 [565399]: AttributeError: 'NoneType' object has no attribute 'sendall'
May 28 05:18:34 argo023 [565399]: [28-May-2024 05:18:34] INFO server.py:427: Aborting (client.nvmeof.nvmeof_pool.argo023.qcoxrd) pid 19...
May 28 05:18:34 argo023 [565399]: *** Caught signal (Aborted) **
May 28 05:18:34 argo023 [565399]:  in thread 7f364e1155c0 thread_name:ceph-nvmeof-mon
May 28 05:18:34 argo023 [565399]: [28-May-2024 05:18:34] ERROR grpc.py:1967: Failure listing connections, can't get qpairs
May 28 05:18:34 argo023 [565399]: Traceback (most recent call last):
May 28 05:18:34 argo023 [565399]:   File "/remote-source/ceph-nvmeof/app/control/grpc.py", line 1963, in list_connections_safe
May 28 05:18:34 argo023 [565399]:     qpair_ret = rpc_nvmf.nvmf_subsystem_get_qpairs(self.spdk_rpc_client, nqn=request.subsystem)
May 28 05:18:34 argo023 [565399]:   File "/usr/local/lib/python3.9/site-packages/spdk/rpc/nvmf.py", line 637, in nvmf_subsystem_get_qpairs
May 28 05:18:34 argo023 [565399]:     return client.call('nvmf_subsystem_get_qpairs', params)
May 28 05:18:34 argo023 [565399]:   File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 187, in call
May 28 05:18:34 argo023 [565399]:     req_id = self.send(method, params)
May 28 05:18:34 argo023 [565399]:   File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 145, in send
May 28 05:18:34 argo023 [565399]:     self.flush()
May 28 05:18:34 argo023 [565399]:   File "/usr/local/lib/python3.9/site-packages/spdk/rpc/client.py", line 141, in flush
May 28 05:18:34 argo023 [565399]:     self.sock.sendall(reqstr.encode("utf-8"))
May 28 05:18:34 argo023 [565399]: AttributeError: 'NoneType' object has no attribute 'sendall'
May 28 05:18:34 argo023 [565399]: [28-May-2024 05:18:34] ERROR prometheus.py:201: Exporter failed to fetch connection info for nqn.2016-06.io.spdk:cnode2: Failure listing connections, can't get qpairs:
May 28 05:18:34 argo023 [565399]: 'NoneType' object has no attribute 'sendall'
May 28 05:18:34 argo023 [565399]:  ceph version 18.2.1-190.el9cp (5eee8f17de7cfe7a752abc74828d97473040534e) reef (stable)
May 28 05:18:34 argo023 [565399]:  1: /lib64/libc.so.6(+0x3e6f0) [0x7f364f3816f0]
May 28 05:18:34 argo023 [565399]:  2: /lib64/libc.so.6(+0x8679a) [0x7f364f3c979a]
May 28 05:18:34 argo023 [565399]:  3: pthread_cond_wait()
May 28 05:18:34 argo023 [565399]:  4: (std::condition_variable::wait(std::unique_lock<std::mutex>&)+0x10) [0x7f364f7196b0]
May 28 05:18:34 argo023 [565399]:  5: (AsyncMessenger::wait()+0x137) [0x7f36501d6957]
May 28 05:18:34 argo023 [565399]:  6: (NVMeofGwMonitorClient::main(std::vector<char const*, std::allocator<char const*> >)+0x18) [0x5591765b1fb8]
May 28 05:18:34 argo023 [565399]:  7: main()
May 28 05:18:34 argo023 [565399]:  8: /lib64/libc.so.6(+0x29590) [0x7f364f36c590]
May 28 05:18:34 argo023 [565399]:  9: __libc_start_main()
May 28 05:18:34 argo023 [565399]:  10: _start()
May 28 05:18:34 argo023 systemd-coredump[566135]: Process 565423 (ceph-nvmeof-mon) of user 0 dumped core.
May 28 05:18:34 argo023 [565399]: [28-May-2024 05:18:34] INFO server.py:427: Aborting (client.nvmeof.nvmeof_pool.argo023.qcoxrd) pid 82...

Comment 14 Aviv Caro 2024-05-28 15:40:23 UTC
We prepared a fix for the SPDK segmentation fault. The fix will be included in the new GW downstream images (1.2.13). The fix is ready at https://github.com/ceph/spdk/commit/07e69f8b0eb1ae4b4f7e675a00ea6a8fbe061666 and the downstream SPDK build will include it.

Comment 15 Manasa 2024-06-04 04:51:08 UTC
The issue is not seen in any of our clusters running ceph version 18.2.1-191.el9cp with NVMe CLI version nvmeof-cli-rhel9:1.2.13-1.
Hence, marking this BZ as verified.

