Bug 2241346
| Summary: | Unable to discover subsystems from Initiator with ceph-nvmeof version 0.0.4-1 | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Rahul Lepakshi <rlepaksh> |
| Component: | Cephadm | Assignee: | Adam King <adking> |
| Status: | CLOSED ERRATA | QA Contact: | Sunil Kumar Nagaraju <sunnagar> |
| Severity: | urgent | Docs Contact: | Rivka Pollack <rpollack> |
| Priority: | unspecified | | |
| Version: | 7.0 | CC: | adking, akraj, cephqe-warriors, jcaratza, owasserm, sunnagar, tserlin |
| Target Milestone: | --- | Keywords: | Automation, TestBlocker |
| Target Release: | 7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | ceph-18.2.0-86.el9cp | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-12-13 15:24:08 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Automated test runs are also failing because the "nvme discover" command fails on the initiator node; as a result, the host cannot connect and run I/O.
2023-10-03 06:40:33,526 (cephci.test_ceph_nvmeof_gateway) [INFO] - cephci.Sanity.69.cephci.ceph.ceph.py:1558 - Running command nvme discover --transport tcp --traddr 10.0.195.85 --trsvcid 5001 --output-format json on 10.0.195.96 timeout 600
2023-10-03 06:40:34,383 (cephci.test_ceph_nvmeof_gateway) [DEBUG] - cephci.Sanity.69.cephci.tests.nvmeof.test_ceph_nvmeof_gateway.py:85 -
2023-10-03 06:40:34,384 (cephci.test_ceph_nvmeof_gateway) [ERROR] - cephci.Sanity.69.cephci.ceph.parallel.py:93 - Exception in parallel execution
Traceback (most recent call last):
File "/home/jenkins/ceph-builds/18.2.0-70/Sanity/69/cephci/ceph/parallel.py", line 88, in __exit__
for result in self:
File "/home/jenkins/ceph-builds/18.2.0-70/Sanity/69/cephci/ceph/parallel.py", line 106, in __next__
resurrect_traceback(result)
File "/home/jenkins/ceph-builds/18.2.0-70/Sanity/69/cephci/ceph/parallel.py", line 35, in resurrect_traceback
raise exc_info[0](exc_info[1]).with_traceback(exc_info[2])
TypeError: JSONDecodeError.__init__() missing 2 required positional arguments: 'doc' and 'pos'
Logs - http://magna002.ceph.redhat.com/cephci-jenkins/test-runs/18.2.0-70/Sanity/69/tier-0_nvmeof_sanity/Basic_E2ETest_Ceph_NVMEoF_GW_sanity_test_0.log
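For context, the secondary TypeError is a side effect of how the test harness re-raises exceptions: parallel.py rebuilds the exception with exc_info[0](exc_info[1]), which breaks for classes such as json.JSONDecodeError whose constructor also needs 'doc' and 'pos' (the underlying error here is that the failed "nvme discover" produced no JSON to parse). A minimal sketch of that failure mode, with resurrect_traceback simplified for illustration and a safer re-raise shown alongside:

import json
import sys

def resurrect_traceback_naive(exc_info):
    # Same pattern as line 35 of cephci's parallel.py in the traceback above:
    # rebuilding the exception from its class and message breaks for
    # JSONDecodeError, whose __init__ also requires 'doc' and 'pos'.
    raise exc_info[0](exc_info[1]).with_traceback(exc_info[2])

def resurrect_traceback_safe(exc_info):
    # Re-raise the original exception object instead of constructing a new one.
    raise exc_info[1].with_traceback(exc_info[2])

try:
    json.loads("")  # parsing the empty "nvme discover" output amounts to this
except json.JSONDecodeError:
    try:
        resurrect_traceback_naive(sys.exc_info())
    except TypeError as err:
        print("naive re-raise fails:", err)  # missing 'doc' and 'pos'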
FYI that "nvme connect" works provided user/customer has to have prior knowledge of subsystem nqn to which he needs to connect. But according to documentation and procedure that customer follows this BZ still remains a blocker as correct procedure is discover(failing currently) followed by connection to subsystem
[root@cali012 ~]# nvme connect --transport tcp --traddr 10.8.130.10 --trsvcid 5001 -n nqn.2016-06.io.spdk:cnode1
[root@cali012 ~]# nvme list
Node                  Generic               SN                    Model                                     Namespace  Usage                       Format            FW Rev
--------------------- --------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme1n1 /dev/ng1n1 1 SPDK bdev Controller 1 1.10 TB / 1.10 TB 4 KiB + 0 B 23.01.1
/dev/nvme0n1 /dev/ng0n1 X1N0A10VTC88 Dell Ent NVMe CM6 MU 1.6TB 1 257.30 GB / 1.60 TB 512 B + 0 B 2.1.8
It can be fixed by changing cephadm (and making it available for 7.0)? The change is simply to set "enable_discovery_controller = true" in the ceph-nvmeof.conf file. I'm checking with the cephadm team.

I am moving this to cephadm to make this change in the ceph-nvmeof.conf file.

(In reply to Aviv Caro from comment #3)
> It can be fixed by changing cephadm (and making it available for 7.0)? The
> change is simply to set "enable_discovery_controller = true" in the
> ceph-nvmeof.conf file. I'm checking with the cephadm team.

Patch for this change: https://gitlab.cee.redhat.com/ceph/ceph/-/commit/fd0956847f66c98c9c4a45f39e86ff6922a47262

Able to discover the subsystems from the initiator node.
[root@ceph-2sunilkumar-0kbpac-node7 cephuser]# nvme discover -t tcp -a 10.0.154.146 -s 5001
Discovery Log Number of Records 1, Generation counter 2
=====Discovery Log Entry 0======
trtype: tcp
adrfam: ipv4
subtype: nvme subsystem
treq: not required
portid: 0
trsvcid: 5001
subnqn: nqn.2016-06.io.spdk:cnode_test
traddr: 10.0.154.146
eflags: not specified
sectype: none
sh-5.1# cat ceph-nvmeof.conf
# This file is generated by cephadm.
[gateway]
name = client.nvmeof.rbd.ceph-2sunilkumar-0kbpac-node6.wickjk
group = None
addr = 10.0.154.146
port = 5500
enable_auth = False
state_update_notify = True
state_update_interval_sec = 5
enable_discovery_controller = true
[ceph]
pool = rbd
config_file = /etc/ceph/ceph.conf
id = nvmeof.rbd.ceph-2sunilkumar-0kbpac-node6.wickjk
[mtls]
server_key = ./server.key
client_key = ./client.key
server_cert = ./server.crt
client_cert = ./client.crt
[spdk]
tgt_path = /usr/local/bin/nvmf_tgt
rpc_socket = /var/tmp/spdk.sock
timeout = 60
log_level = WARN
conn_retries = 10
transports = tcp
transport_tcp_options = {"in_capsule_data_size": 8192, "max_io_qpairs_per_ctrlr": 7}
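As an illustration only, a quick check that a rendered ceph-nvmeof.conf like the one above has the discovery controller enabled; the conf_path default is a placeholder, and the section and option names follow the file shown above:

import configparser

def discovery_enabled(conf_path="ceph-nvmeof.conf"):
    # Placeholder path; on a cephadm-deployed gateway the file is typically
    # under the daemon's data directory. Returns True when [gateway] sets
    # enable_discovery_controller.
    cfg = configparser.ConfigParser()
    cfg.read(conf_path)
    return cfg.getboolean("gateway", "enable_discovery_controller", fallback=False)

print(discovery_enabled())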
ceph version : ceph version 18.2.0-86.el9cp (fd0956847f66c98c9c4a45f39e86ff6922a47262) reef (stable)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 7.0 Bug Fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:7780
Description of problem:
Unable to discover subsystems from Initiator cali012 (10.8.130.12) on GW node cali010 (10.8.130.10), creds: (root/passwd)

[ceph: root@cali001 /]# ceph version
ceph version 18.2.0-65.el9cp (93750e88ac3c3ae3750b3ab90c8b2d48c387fb5f) reef (stable)

[ceph: root@cali001 /]# ceph orch host ls
HOST     ADDR         LABELS                        STATUS
cali001  10.8.130.1   _admin,osd,mgr,installer,mon
cali004  10.8.130.4   osd,mgr,mon
cali005  10.8.130.5   osd,mon
cali008  10.8.130.8   osd
cali010  10.8.130.10  osd,nvmeof-gw
5 hosts in cluster

[root@cali012 ~]# nvme discover --transport tcp --traddr 10.8.130.10 --trsvcid 5001
failed to add controller, error Unknown error -1

[root@cali010 log]# podman run registry-proxy.engineering.redhat.com/rh-osbs/ceph-nvmeof-cli:0.0.3-1 --server-address 10.8.130.10 --server-port 5500 get_subsystems
INFO:__main__:Get subsystems:
[
    {
        "nqn": "nqn.2016-06.io.spdk:cnode1",
        "subtype": "NVMe",
        "listen_addresses": [
            {
                "transport": "TCP",
                "trtype": "TCP",
                "adrfam": "IPv4",
                "traddr": "10.8.130.10",
                "trsvcid": "5001"
            }
        ],
        "allow_any_host": true,
        "hosts": [],
        "serial_number": "1",
        "model_number": "SPDK bdev Controller",
        "max_namespaces": 256,
        "min_cntlid": 1,
        "max_cntlid": 65519,
        "namespaces": []
    }
]

From /var/log/messages on the initiator node:
Sep 29 12:10:00 cali012 kernel: nvme nvme1: Connect Invalid Data Parameter, subsysnqn "nqn.2014-08.org.nvmexpress.discovery"
Sep 29 12:10:00 cali012 kernel: nvme nvme1: failed to connect queue: 0 ret=386

Version-Release number of selected component (if applicable):

But when I try discovery on a cluster with ceph-nvmeof version 0.0.3, the kernel messages below appear and discovery is successful -
Sep 29 12:11:14 cali012 kernel: nvme nvme1: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 10.0.206.117:5001
Sep 29 12:11:14 cali012 kernel: nvme nvme1: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"

[root@cali012 ~]# nvme discover -t tcp -a 10.0.206.117 -s 5001
Discovery Log Number of Records 1, Generation counter 2
=====Discovery Log Entry 0======
trtype:  tcp
adrfam:  ipv4
subtype: nvme subsystem
treq:    not required
portid:  0
trsvcid: 5001
subnqn:  nqn.2016-06.io.spdk:cnode1
traddr:  10.0.206.117
eflags:  not specified
sectype: none

creds - (cephuser/cephuser)
# ceph orch host ls
HOST                                  ADDR          LABELS                    STATUS
ceph-rlepaksh-64mst0-node1-installer  10.0.207.98   _admin,mon,installer,mgr
ceph-rlepaksh-64mst0-node2            10.0.206.111  mon,mgr
ceph-rlepaksh-64mst0-node3            10.0.209.219  osd,mon
ceph-rlepaksh-64mst0-node4            10.0.208.84   osd,mds
ceph-rlepaksh-64mst0-node5            10.0.206.117  osd,rgw,mds,nvmeof-gw

How reproducible:

Steps to Reproduce:
1. Deploy the nvmeof service -
   ceph config set mgr mgr/cephadm/container_image_nvmeof registry-proxy.engineering.redhat.com/rh-osbs/ceph-nvmeof:latest --> pulled version 0.0.4-1
   ceph orch apply nvmeof rbd --placement="cali010"
2. Create a subsystem and a listener port, along with open host access.
3. Install nvme-cli, modprobe nvme-fabrics, and try to discover. Discovery fails here, but it succeeds for subsystems deployed with ceph-nvmeof:0.0.3 on ceph version 18.2.0-57.el9cp.

Actual results:
Discovery is unsuccessful

Expected results:
Discovery should be successful

Additional info:
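For illustration, a minimal scripted form of the failing discovery in step 3, assuming nvme-cli is installed on the initiator; the gateway address and port are the listener values from this environment:

import json
import subprocess

GATEWAY_ADDR = "10.8.130.10"   # gateway listener from this environment
GATEWAY_PORT = "5001"

result = subprocess.run(
    ["nvme", "discover", "--transport", "tcp",
     "--traddr", GATEWAY_ADDR, "--trsvcid", GATEWAY_PORT,
     "--output-format", "json"],
    capture_output=True, text=True,
)

if result.returncode != 0 or not result.stdout.strip():
    # With ceph-nvmeof 0.0.4-1 this branch is hit: the command fails with
    # "failed to add controller, error Unknown error -1" and emits no JSON,
    # which is also what tripped the automated run's JSON parsing.
    print(f"discovery failed: rc={result.returncode} stderr={result.stderr.strip()}")
else:
    print(json.loads(result.stdout))  # discovery log entries on success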