Description of problem:
Unable to discover subsystems from initiator cali012 (10.8.130.12) on GW node cali010 (10.8.130.10), creds: (root/passwd).

[ceph: root@cali001 /]# ceph version
ceph version 18.2.0-65.el9cp (93750e88ac3c3ae3750b3ab90c8b2d48c387fb5f) reef (stable)

[ceph: root@cali001 /]# ceph orch host ls
HOST     ADDR         LABELS                        STATUS
cali001  10.8.130.1   _admin,osd,mgr,installer,mon
cali004  10.8.130.4   osd,mgr,mon
cali005  10.8.130.5   osd,mon
cali008  10.8.130.8   osd
cali010  10.8.130.10  osd,nvmeof-gw
5 hosts in cluster

[root@cali012 ~]# nvme discover --transport tcp --traddr 10.8.130.10 --trsvcid 5001
failed to add controller, error Unknown error -1

[root@cali010 log]# podman run registry-proxy.engineering.redhat.com/rh-osbs/ceph-nvmeof-cli:0.0.3-1 --server-address 10.8.130.10 --server-port 5500 get_subsystems
INFO:__main__:Get subsystems:
[
    {
        "nqn": "nqn.2016-06.io.spdk:cnode1",
        "subtype": "NVMe",
        "listen_addresses": [
            {
                "transport": "TCP",
                "trtype": "TCP",
                "adrfam": "IPv4",
                "traddr": "10.8.130.10",
                "trsvcid": "5001"
            }
        ],
        "allow_any_host": true,
        "hosts": [],
        "serial_number": "1",
        "model_number": "SPDK bdev Controller",
        "max_namespaces": 256,
        "min_cntlid": 1,
        "max_cntlid": 65519,
        "namespaces": []
    }
]

From /var/log/messages on the initiator node:
Sep 29 12:10:00 cali012 kernel: nvme nvme1: Connect Invalid Data Parameter, subsysnqn "nqn.2014-08.org.nvmexpress.discovery"
Sep 29 12:10:00 cali012 kernel: nvme nvme1: failed to connect queue: 0 ret=386

Version-Release number of selected component (if applicable):
ceph version 18.2.0-65.el9cp (reef), ceph-nvmeof 0.0.4-1

But when I try discovery against a cluster running the ceph-nvmeof:0.0.3 version, discovery is successful; kernel messages and output below:
Sep 29 12:11:14 cali012 kernel: nvme nvme1: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 10.0.206.117:5001
Sep 29 12:11:14 cali012 kernel: nvme nvme1: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"

[root@cali012 ~]# nvme discover -t tcp -a 10.0.206.117 -s 5001
Discovery Log Number of Records 1, Generation counter 2
=====Discovery Log Entry 0======
trtype:  tcp
adrfam:  ipv4
subtype: nvme subsystem
treq:    not required
portid:  0
trsvcid: 5001
subnqn:  nqn.2016-06.io.spdk:cnode1
traddr:  10.0.206.117
eflags:  not specified
sectype: none

creds - (cephuser/cephuser)
# ceph orch host ls
HOST                                  ADDR          LABELS                    STATUS
ceph-rlepaksh-64mst0-node1-installer  10.0.207.98   _admin,mon,installer,mgr
ceph-rlepaksh-64mst0-node2            10.0.206.111  mon,mgr
ceph-rlepaksh-64mst0-node3            10.0.209.219  osd,mon
ceph-rlepaksh-64mst0-node4            10.0.208.84   osd,mds
ceph-rlepaksh-64mst0-node5            10.0.206.117  osd,rgw,mds,nvmeof-gw

How reproducible:

Steps to Reproduce:
1. Deploy the nvmeof service:
   ceph config set mgr mgr/cephadm/container_image_nvmeof registry-proxy.engineering.redhat.com/rh-osbs/ceph-nvmeof:latest   (pulled version 0.0.4-1)
   ceph orch apply nvmeof rbd --placement="cali010"
2. Create a subsystem and listener port, and allow open host access (a hedged sketch is included under Additional info below).
3. Install nvme-cli, run modprobe nvme-fabrics, and try to discover; discovery fails. Discovery of subsystems deployed with ceph-nvmeof:0.0.3 on ceph version 18.2.0-57.el9cp works.

Actual results:
Discovery is unsuccessful.

Expected results:
Discovery should be successful.

Additional info:
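A hedged sketch of the gateway-side configuration used in step 2, run through the same ceph-nvmeof-cli container as the get_subsystems call above. The exact subcommands and flags are assumptions based on the 0.0.x CLI and may differ between releases, so check the CLI --help output; the bdev/image names and <gw-name> are illustrative placeholders, not values from this cluster.

# shorthand for the CLI container invocation (assumption: same image and ports as above)
GWCLI="podman run registry-proxy.engineering.redhat.com/rh-osbs/ceph-nvmeof-cli:0.0.3-1 --server-address 10.8.130.10 --server-port 5500"

# create an RBD-backed bdev, a subsystem, and a namespace (flag names are assumptions)
$GWCLI create_bdev --pool rbd --image image1 --bdev bdev1
$GWCLI create_subsystem --subnqn nqn.2016-06.io.spdk:cnode1 --serial 1
$GWCLI add_namespace --subnqn nqn.2016-06.io.spdk:cnode1 --bdev bdev1

# open host access and add a TCP listener on 10.8.130.10:5001
$GWCLI add_host --subnqn nqn.2016-06.io.spdk:cnode1 --host "*"
$GWCLI create_listener --subnqn nqn.2016-06.io.spdk:cnode1 --gateway-name <gw-name> --traddr 10.8.130.10 --trsvcid 5001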
Automated test runs are also failing due to the "nvme discover" command failure on the initiator node; as a result, the host could not connect and run I/O.

2023-10-03 06:40:33,526 (cephci.test_ceph_nvmeof_gateway) [INFO] - cephci.Sanity.69.cephci.ceph.ceph.py:1558 - Running command nvme discover --transport tcp --traddr 10.0.195.85 --trsvcid 5001 --output-format json on 10.0.195.96 timeout 600
2023-10-03 06:40:34,383 (cephci.test_ceph_nvmeof_gateway) [DEBUG] - cephci.Sanity.69.cephci.tests.nvmeof.test_ceph_nvmeof_gateway.py:85 -
2023-10-03 06:40:34,384 (cephci.test_ceph_nvmeof_gateway) [ERROR] - cephci.Sanity.69.cephci.ceph.parallel.py:93 - Exception in parallel execution
Traceback (most recent call last):
  File "/home/jenkins/ceph-builds/18.2.0-70/Sanity/69/cephci/ceph/parallel.py", line 88, in __exit__
    for result in self:
  File "/home/jenkins/ceph-builds/18.2.0-70/Sanity/69/cephci/ceph/parallel.py", line 106, in __next__
    resurrect_traceback(result)
  File "/home/jenkins/ceph-builds/18.2.0-70/Sanity/69/cephci/ceph/parallel.py", line 35, in resurrect_traceback
    raise exc_info[0](exc_info[1]).with_traceback(exc_info[2])
TypeError: JSONDecodeError.__init__() missing 2 required positional arguments: 'doc' and 'pos'

Logs - http://magna002.ceph.redhat.com/cephci-jenkins/test-runs/18.2.0-70/Sanity/69/tier-0_nvmeof_sanity/Basic_E2ETest_Ceph_NVMEoF_GW_sanity_test_0.log
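As a side note on the harness failure above, a minimal shell illustration (an assumption about the failure mode, not the cephci code): when discovery fails, nvme-cli reports the error and emits no JSON on stdout, so parsing the captured output raises json.JSONDecodeError, which cephci's resurrect_traceback then cannot rebuild because that exception type requires the extra 'doc' and 'pos' constructor arguments (the TypeError seen in the traceback).

# assumption: same address/port as in the failing test log above
nvme discover --transport tcp --traddr 10.0.195.85 --trsvcid 5001 --output-format json | python3 -m json.tool
# expected: nvme prints "failed to add controller ..." to stderr, stdout stays empty,
# and json.tool fails with "Expecting value: line 1 column 1 (char 0)" (a JSONDecodeError)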
FYI that "nvme connect" works provided user/customer has to have prior knowledge of subsystem nqn to which he needs to connect. But according to documentation and procedure that customer follows this BZ still remains a blocker as correct procedure is discover(failing currently) followed by connection to subsystem [root@cali012 ~]# nvme connect --transport tcp --traddr 10.8.130.10 --trsvcid 5001 -n nqn.2016-06.io.spdk:cnode1 [root@cali012 ~]# nvme list Node Generic SN Model Namespace Usage Format FW Rev --------------------- --------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- -------- /dev/nvme1n1 /dev/ng1n1 1 SPDK bdev Controller 1 1.10 TB / 1.10 TB 4 KiB + 0 B 23.01.1 /dev/nvme0n1 /dev/ng0n1 X1N0A10VTC88 Dell Ent NVMe CM6 MU 1.6TB 1 257.30 GB / 1.60 TB 512 B + 0 B 2.1.8
Can it be fixed by changing cephadm (and made available for 7.0)? The change is simply to set "enable_discovery_controller = true" in the ceph-nvmeof.conf file. I'm checking with the cephadm team.
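A minimal sketch of where this setting lands in the gateway's ceph-nvmeof.conf (the full cephadm-generated file is shown in the verification comment below; other [gateway] keys omitted here):

[gateway]
enable_discovery_controller = true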
I am moving this to the cephadm component to make this change in the ceph-nvmeof.conf file.
(In reply to Aviv Caro from comment #3)
> Can it be fixed by changing cephadm (and made available for 7.0)? The change
> is simply to set "enable_discovery_controller = true" in the
> ceph-nvmeof.conf file. I'm checking with the cephadm team.

Patch for this change: https://gitlab.cee.redhat.com/ceph/ceph/-/commit/fd0956847f66c98c9c4a45f39e86ff6922a47262
Able to discover the subsystems from the initiator node.

[root@ceph-2sunilkumar-0kbpac-node7 cephuser]# nvme discover -t tcp -a 10.0.154.146 -s 5001
Discovery Log Number of Records 1, Generation counter 2
=====Discovery Log Entry 0======
trtype:  tcp
adrfam:  ipv4
subtype: nvme subsystem
treq:    not required
portid:  0
trsvcid: 5001
subnqn:  nqn.2016-06.io.spdk:cnode_test
traddr:  10.0.154.146
eflags:  not specified
sectype: none

sh-5.1# cat ceph-nvmeof.conf
# This file is generated by cephadm.
[gateway]
name = client.nvmeof.rbd.ceph-2sunilkumar-0kbpac-node6.wickjk
group = None
addr = 10.0.154.146
port = 5500
enable_auth = False
state_update_notify = True
state_update_interval_sec = 5
enable_discovery_controller = true

[ceph]
pool = rbd
config_file = /etc/ceph/ceph.conf
id = nvmeof.rbd.ceph-2sunilkumar-0kbpac-node6.wickjk

[mtls]
server_key = ./server.key
client_key = ./client.key
server_cert = ./server.crt
client_cert = ./client.crt

[spdk]
tgt_path = /usr/local/bin/nvmf_tgt
rpc_socket = /var/tmp/spdk.sock
timeout = 60
log_level = WARN
conn_retries = 10
transports = tcp
transport_tcp_options = {"in_capsule_data_size": 8192, "max_io_qpairs_per_ctrlr": 7}

ceph version: ceph version 18.2.0-86.el9cp (fd0956847f66c98c9c4a45f39e86ff6922a47262) reef (stable)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 7.0 Bug Fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:7780