Description of problem:
Unable to discover subsystems from initiator cali012 (10.8.130.12) on GW node cali010 (10.8.130.10), creds: (root/passwd).

[ceph: root@cali001 /]# ceph version
ceph version 18.2.0-65.el9cp (93750e88ac3c3ae3750b3ab90c8b2d48c387fb5f) reef (stable)

[ceph: root@cali001 /]# ceph orch host ls
HOST     ADDR         LABELS                        STATUS
cali001  10.8.130.1   _admin,osd,mgr,installer,mon
cali004  10.8.130.4   osd,mgr,mon
cali005  10.8.130.5   osd,mon
cali008  10.8.130.8   osd
cali010  10.8.130.10  osd,nvmeof-gw
5 hosts in cluster

[root@cali012 ~]# nvme discover --transport tcp --traddr 10.8.130.10 --trsvcid 5001
failed to add controller, error Unknown error -1

[root@cali010 log]# podman run registry-proxy.engineering.redhat.com/rh-osbs/ceph-nvmeof-cli:0.0.3-1 --server-address 10.8.130.10 --server-port 5500 get_subsystems
INFO:__main__:Get subsystems:
[
    {
        "nqn": "nqn.2016-06.io.spdk:cnode1",
        "subtype": "NVMe",
        "listen_addresses": [
            {
                "transport": "TCP",
                "trtype": "TCP",
                "adrfam": "IPv4",
                "traddr": "10.8.130.10",
                "trsvcid": "5001"
            }
        ],
        "allow_any_host": true,
        "hosts": [],
        "serial_number": "1",
        "model_number": "SPDK bdev Controller",
        "max_namespaces": 256,
        "min_cntlid": 1,
        "max_cntlid": 65519,
        "namespaces": []
    }
]

From /var/log/messages on the initiator node:
Sep 29 12:10:00 cali012 kernel: nvme nvme1: Connect Invalid Data Parameter, subsysnqn "nqn.2014-08.org.nvmexpress.discovery"
Sep 29 12:10:00 cali012 kernel: nvme nvme1: failed to connect queue: 0 ret=386

Version-Release number of selected component (if applicable):
ceph version 18.2.0-65.el9cp (reef), ceph-nvmeof 0.0.4-1

But when I try discovery against a cluster running the ceph-nvmeof:0.0.3 version, discovery is successful; kernel messages and output below:
Sep 29 12:11:14 cali012 kernel: nvme nvme1: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 10.0.206.117:5001
Sep 29 12:11:14 cali012 kernel: nvme nvme1: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"

[root@cali012 ~]# nvme discover -t tcp -a 10.0.206.117 -s 5001
Discovery Log Number of Records 1, Generation counter 2
=====Discovery Log Entry 0======
trtype:  tcp
adrfam:  ipv4
subtype: nvme subsystem
treq:    not required
portid:  0
trsvcid: 5001
subnqn:  nqn.2016-06.io.spdk:cnode1
traddr:  10.0.206.117
eflags:  not specified
sectype: none

creds - (cephuser/cephuser)
# ceph orch host ls
HOST                                  ADDR          LABELS                    STATUS
ceph-rlepaksh-64mst0-node1-installer  10.0.207.98   _admin,mon,installer,mgr
ceph-rlepaksh-64mst0-node2            10.0.206.111  mon,mgr
ceph-rlepaksh-64mst0-node3            10.0.209.219  osd,mon
ceph-rlepaksh-64mst0-node4            10.0.208.84   osd,mds
ceph-rlepaksh-64mst0-node5            10.0.206.117  osd,rgw,mds,nvmeof-gw

How reproducible:

Steps to Reproduce:
1. Deploy the nvmeof service:
   ceph config set mgr mgr/cephadm/container_image_nvmeof registry-proxy.engineering.redhat.com/rh-osbs/ceph-nvmeof:latest   (pulled version 0.0.4-1)
   ceph orch apply nvmeof rbd --placement="cali010"
2. Create a subsystem and listener port, and allow open host access (a hedged sketch is included under Additional info below).
3. Install nvme-cli, run modprobe nvme-fabrics, and try to discover; discovery fails. Discovery of subsystems deployed with ceph-nvmeof:0.0.3 on ceph version 18.2.0-57.el9cp works.

Actual results:
Discovery is unsuccessful.

Expected results:
Discovery should be successful.

Additional info:
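A hedged sketch of the gateway-side configuration used in step 2, run through the same ceph-nvmeof-cli container as the get_subsystems call above. The exact subcommands and flags are assumptions based on the 0.0.x CLI and may differ between releases, so check the CLI --help output; the bdev/image names and <gw-name> are illustrative placeholders, not values from this cluster.

# shorthand for the CLI container invocation (assumption: same image and ports as above)
GWCLI="podman run registry-proxy.engineering.redhat.com/rh-osbs/ceph-nvmeof-cli:0.0.3-1 --server-address 10.8.130.10 --server-port 5500"

# create an RBD-backed bdev, a subsystem, and a namespace (flag names are assumptions)
$GWCLI create_bdev --pool rbd --image image1 --bdev bdev1
$GWCLI create_subsystem --subnqn nqn.2016-06.io.spdk:cnode1 --serial 1
$GWCLI add_namespace --subnqn nqn.2016-06.io.spdk:cnode1 --bdev bdev1

# open host access and add a TCP listener on 10.8.130.10:5001
$GWCLI add_host --subnqn nqn.2016-06.io.spdk:cnode1 --host "*"
$GWCLI create_listener --subnqn nqn.2016-06.io.spdk:cnode1 --gateway-name <gw-name> --traddr 10.8.130.10 --trsvcid 5001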
Automated test runs are also failing due to the "nvme discover" command failure on the initiator node; as a result, the host could not connect and run I/O.

2023-10-03 06:40:33,526 (cephci.test_ceph_nvmeof_gateway) [INFO] - cephci.Sanity.69.cephci.ceph.ceph.py:1558 - Running command nvme discover --transport tcp --traddr 10.0.195.85 --trsvcid 5001 --output-format json on 10.0.195.96 timeout 600
2023-10-03 06:40:34,383 (cephci.test_ceph_nvmeof_gateway) [DEBUG] - cephci.Sanity.69.cephci.tests.nvmeof.test_ceph_nvmeof_gateway.py:85 -
2023-10-03 06:40:34,384 (cephci.test_ceph_nvmeof_gateway) [ERROR] - cephci.Sanity.69.cephci.ceph.parallel.py:93 - Exception in parallel execution
Traceback (most recent call last):
  File "/home/jenkins/ceph-builds/18.2.0-70/Sanity/69/cephci/ceph/parallel.py", line 88, in __exit__
    for result in self:
  File "/home/jenkins/ceph-builds/18.2.0-70/Sanity/69/cephci/ceph/parallel.py", line 106, in __next__
    resurrect_traceback(result)
  File "/home/jenkins/ceph-builds/18.2.0-70/Sanity/69/cephci/ceph/parallel.py", line 35, in resurrect_traceback
    raise exc_info[0](exc_info[1]).with_traceback(exc_info[2])
TypeError: JSONDecodeError.__init__() missing 2 required positional arguments: 'doc' and 'pos'

Logs - http://magna002.ceph.redhat.com/cephci-jenkins/test-runs/18.2.0-70/Sanity/69/tier-0_nvmeof_sanity/Basic_E2ETest_Ceph_NVMEoF_GW_sanity_test_0.log
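As a side note on the harness failure above, a minimal shell illustration (an assumption about the failure mode, not the cephci code): when discovery fails, nvme-cli reports the error and emits no JSON on stdout, so parsing the captured output raises json.JSONDecodeError, which cephci's resurrect_traceback then cannot rebuild because that exception type requires the extra 'doc' and 'pos' constructor arguments (the TypeError seen in the traceback).

# assumption: same address/port as in the failing test log above
nvme discover --transport tcp --traddr 10.0.195.85 --trsvcid 5001 --output-format json | python3 -m json.tool
# expected: nvme prints "failed to add controller ..." to stderr, stdout stays empty,
# and json.tool fails with "Expecting value: line 1 column 1 (char 0)" (a JSONDecodeError)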
FYI that "nvme connect" works provided user/customer has to have prior knowledge of subsystem nqn to which he needs to connect. But according to documentation and procedure that customer follows this BZ still remains a blocker as correct procedure is discover(failing currently) followed by connection to subsystem [root@cali012 ~]# nvme connect --transport tcp --traddr 10.8.130.10 --trsvcid 5001 -n nqn.2016-06.io.spdk:cnode1 [root@cali012 ~]# nvme list Node Generic SN Model Namespace Usage Format FW Rev --------------------- --------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- -------- /dev/nvme1n1 /dev/ng1n1 1 SPDK bdev Controller 1 1.10 TB / 1.10 TB 4 KiB + 0 B 23.01.1 /dev/nvme0n1 /dev/ng0n1 X1N0A10VTC88 Dell Ent NVMe CM6 MU 1.6TB 1 257.30 GB / 1.60 TB 512 B + 0 B 2.1.8
Can it be fixed by changing cephadm (and made available for 7.0)? The change is simply to set "enable_discovery_controller = true" in the ceph-nvmeof.conf file. I'm checking with the cephadm team.
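A minimal sketch of where this setting lands in the gateway's ceph-nvmeof.conf (the full cephadm-generated file is shown in the verification comment below; other [gateway] keys omitted here):

[gateway]
enable_discovery_controller = true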
I am moving this to the cephadm component to make this change in the ceph-nvmeof.conf file.
(In reply to Aviv Caro from comment #3)
> Can it be fixed by changing cephadm (and made available for 7.0)? The change
> is simply to set "enable_discovery_controller = true" in the
> ceph-nvmeof.conf file. I'm checking with the cephadm team.

Patch for this change: https://gitlab.cee.redhat.com/ceph/ceph/-/commit/fd0956847f66c98c9c4a45f39e86ff6922a47262
Able to discover the subsystems from the initiator node.

[root@ceph-2sunilkumar-0kbpac-node7 cephuser]# nvme discover -t tcp -a 10.0.154.146 -s 5001
Discovery Log Number of Records 1, Generation counter 2
=====Discovery Log Entry 0======
trtype:  tcp
adrfam:  ipv4
subtype: nvme subsystem
treq:    not required
portid:  0
trsvcid: 5001
subnqn:  nqn.2016-06.io.spdk:cnode_test
traddr:  10.0.154.146
eflags:  not specified
sectype: none

sh-5.1# cat ceph-nvmeof.conf
# This file is generated by cephadm.
[gateway]
name = client.nvmeof.rbd.ceph-2sunilkumar-0kbpac-node6.wickjk
group = None
addr = 10.0.154.146
port = 5500
enable_auth = False
state_update_notify = True
state_update_interval_sec = 5
enable_discovery_controller = true

[ceph]
pool = rbd
config_file = /etc/ceph/ceph.conf
id = nvmeof.rbd.ceph-2sunilkumar-0kbpac-node6.wickjk

[mtls]
server_key = ./server.key
client_key = ./client.key
server_cert = ./server.crt
client_cert = ./client.crt

[spdk]
tgt_path = /usr/local/bin/nvmf_tgt
rpc_socket = /var/tmp/spdk.sock
timeout = 60
log_level = WARN
conn_retries = 10
transports = tcp
transport_tcp_options = {"in_capsule_data_size": 8192, "max_io_qpairs_per_ctrlr": 7}

ceph version: ceph version 18.2.0-86.el9cp (fd0956847f66c98c9c4a45f39e86ff6922a47262) reef (stable)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 7.0 Bug Fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:7780