
Bug 2240588

Summary: NVMeoF GW deployment failed with rados.PermissionDeniedError error
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Sunil Kumar Nagaraju <sunnagar>
Component: Cephadm
Assignee: Adam King <adking>
Status: CLOSED ERRATA
QA Contact: Sunil Kumar Nagaraju <sunnagar>
Severity: urgent
Docs Contact: Rivka Pollack <rpollack>
Priority: unspecified
Version: 7.0
CC: acaro, adking, akraj, cephqe-warriors, tserlin, vereddy
Target Milestone: ---
Keywords: Automation
Target Release: 7.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: ceph-18.2.0-56.el9cp
Doc Type: No Doc Update
Last Closed: 2023-12-13 15:24:03 UTC
Type: Bug

Description Sunil Kumar Nagaraju 2023-09-25 11:28:45 UTC
Created attachment 1990434: service journalctl file

Description of problem:

NVMe-oF gateway deployment failed with a RADOS permission error, as shown in the journal excerpt below.

Gateway container image:
registry-proxy.engineering.redhat.com/rh-osbs/ceph-nvmeof:0.0.3-1


>>>Sep 25 07:13:14 ceph-2sunilkumar-lxztyv-node2 ceph-c6963904-5b8d-11ee-924e-fa163e19eae4-nvmeof-rbd-ceph-2sunilkumar-lxztyv-node2-huaunh[15284]: INFO:control.server:SPDK Target Path: /usr/local/bin/nvmf_tgt
>>>Sep 25 07:13:14 ceph-2sunilkumar-lxztyv-node2 ceph-c6963904-5b8d-11ee-924e-fa163e19eae4-nvmeof-rbd-ceph-2sunilkumar-lxztyv-node2-huaunh[15284]: INFO:control.server:SPDK Socket: /var/tmp/spdk.sock
>>>Sep 25 07:13:14 ceph-2sunilkumar-lxztyv-node2 ceph-c6963904-5b8d-11ee-924e-fa163e19eae4-nvmeof-rbd-ceph-2sunilkumar-lxztyv-node2-huaunh[15284]: INFO:control.server:Starting /usr/local/bin/nvmf_tgt -u -r /var/tmp/spdk.sock
>>>Sep 25 07:13:14 ceph-2sunilkumar-lxztyv-node2 ceph-c6963904-5b8d-11ee-924e-fa163e19eae4-nvmeof-rbd-ceph-2sunilkumar-lxztyv-node2-huaunh[15284]: INFO:control.server:Attempting to initialize SPDK: rpc_socket: /var/tmp/spdk.sock, conn_retries: 300, timeout: 60.0
>>>Sep 25 07:13:14 ceph-2sunilkumar-lxztyv-node2 ceph-c6963904-5b8d-11ee-924e-fa163e19eae4-nvmeof-rbd-ceph-2sunilkumar-lxztyv-node2-huaunh[15284]: INFO: Setting log level to WARN
>>>Sep 25 07:13:14 ceph-2sunilkumar-lxztyv-node2 ceph-c6963904-5b8d-11ee-924e-fa163e19eae4-nvmeof-rbd-ceph-2sunilkumar-lxztyv-node2-huaunh[15284]: INFO:JSONRPCClient(/var/tmp/spdk.sock):Setting log level to WARN
>>>Sep 25 07:13:14 ceph-2sunilkumar-lxztyv-node2 ceph-c6963904-5b8d-11ee-924e-fa163e19eae4-nvmeof-rbd-ceph-2sunilkumar-lxztyv-node2-huaunh[15284]: [2023-09-25 11:13:14.788580] Starting SPDK v23.01.1 / DPDK 22.11.0 initialization...
>>>Sep 25 07:13:14 ceph-2sunilkumar-lxztyv-node2 ceph-c6963904-5b8d-11ee-924e-fa163e19eae4-nvmeof-rbd-ceph-2sunilkumar-lxztyv-node2-huaunh[15284]: [2023-09-25 11:13:14.788679] [ DPDK EAL parameters: nvmf --no-shconf -c 0x1 --no-pci --huge-unlink --log-level=lib.eal:6 --log-level=lib.cryptodev:5 --log-level=user1:6 --iova-mode=pa --base-virtaddr=0x200000000000 --match-allocations --file-prefix=spdk_pid3 ]
>>>Sep 25 07:13:14 ceph-2sunilkumar-lxztyv-node2 ceph-c6963904-5b8d-11ee-924e-fa163e19eae4-nvmeof-rbd-ceph-2sunilkumar-lxztyv-node2-huaunh[15284]: TELEMETRY: No legacy callbacks, legacy socket not created
>>>Sep 25 07:13:14 ceph-2sunilkumar-lxztyv-node2 ceph-c6963904-5b8d-11ee-924e-fa163e19eae4-nvmeof-rbd-ceph-2sunilkumar-lxztyv-node2-huaunh[15284]: [2023-09-25 11:13:14.921881] app.c: 712:spdk_app_start: *NOTICE*: Total cores available: 1
>>>Sep 25 07:13:14 ceph-2sunilkumar-lxztyv-node2 ceph-c6963904-5b8d-11ee-924e-fa163e19eae4-nvmeof-rbd-ceph-2sunilkumar-lxztyv-node2-huaunh[15284]: [2023-09-25 11:13:14.967586] reactor.c: 926:reactor_run: *NOTICE*: Reactor started on core 0
>>>Sep 25 07:13:15 ceph-2sunilkumar-lxztyv-node2 ceph-c6963904-5b8d-11ee-924e-fa163e19eae4-nvmeof-rbd-ceph-2sunilkumar-lxztyv-node2-huaunh[15284]: [2023-09-25 11:13:15.004660] accel_sw.c: 681:sw_accel_module_init: *NOTICE*: Accel framework software module initialized.
>>>Sep 25 07:13:15 ceph-2sunilkumar-lxztyv-node2 ceph-c6963904-5b8d-11ee-924e-fa163e19eae4-nvmeof-rbd-ceph-2sunilkumar-lxztyv-node2-huaunh[15284]: DEBUG:control.server:create_transport: tcp options: {"in_capsule_data_size" : 8192, "max_io_qpairs_per_ctrlr" : 7}
>>>Sep 25 07:13:15 ceph-2sunilkumar-lxztyv-node2 ceph-c6963904-5b8d-11ee-924e-fa163e19eae4-nvmeof-rbd-ceph-2sunilkumar-lxztyv-node2-huaunh[15284]: [2023-09-25 11:13:15.149482] tcp.c: 629:nvmf_tcp_create: *NOTICE*: *** TCP Transport Init ***
>>>Sep 25 07:13:15 ceph-2sunilkumar-lxztyv-node2 ceph-c6963904-5b8d-11ee-924e-fa163e19eae4-nvmeof-rbd-ceph-2sunilkumar-lxztyv-node2-huaunh[15284]: ERROR:control.state:Unable to create omap: [errno 13] RADOS permission denied (error connecting to the cluster). Exiting!
>>>Sep 25 07:13:15 ceph-2sunilkumar-lxztyv-node2 ceph-c6963904-5b8d-11ee-924e-fa163e19eae4-nvmeof-rbd-ceph-2sunilkumar-lxztyv-node2-huaunh[15284]: ERROR:control.server:GatewayServer exception occurred:
>>>Sep 25 07:13:15 ceph-2sunilkumar-lxztyv-node2 ceph-c6963904-5b8d-11ee-924e-fa163e19eae4-nvmeof-rbd-ceph-2sunilkumar-lxztyv-node2-huaunh[15284]: Traceback (most recent call last):
>>>Sep 25 07:13:15 ceph-2sunilkumar-lxztyv-node2 ceph-c6963904-5b8d-11ee-924e-fa163e19eae4-nvmeof-rbd-ceph-2sunilkumar-lxztyv-node2-huaunh[15284]:   File "/remote-source/ceph-nvmeof/app/control/__main__.py", line 35, in <module>
>>>Sep 25 07:13:15 ceph-2sunilkumar-lxztyv-node2 ceph-c6963904-5b8d-11ee-924e-fa163e19eae4-nvmeof-rbd-ceph-2sunilkumar-lxztyv-node2-huaunh[15284]:     gateway.serve()
>>>Sep 25 07:13:15 ceph-2sunilkumar-lxztyv-node2 ceph-c6963904-5b8d-11ee-924e-fa163e19eae4-nvmeof-rbd-ceph-2sunilkumar-lxztyv-node2-huaunh[15284]:   File "/remote-source/ceph-nvmeof/app/control/server.py", line 98, in serve
>>>Sep 25 07:13:15 ceph-2sunilkumar-lxztyv-node2 ceph-c6963904-5b8d-11ee-924e-fa163e19eae4-nvmeof-rbd-ceph-2sunilkumar-lxztyv-node2-huaunh[15284]:     omap_state = OmapGatewayState(self.config)
>>>Sep 25 07:13:15 ceph-2sunilkumar-lxztyv-node2 ceph-c6963904-5b8d-11ee-924e-fa163e19eae4-nvmeof-rbd-ceph-2sunilkumar-lxztyv-node2-huaunh[15284]:   File "/remote-source/ceph-nvmeof/app/control/state.py", line 179, in __init__
>>>Sep 25 07:13:15 ceph-2sunilkumar-lxztyv-node2 ceph-c6963904-5b8d-11ee-924e-fa163e19eae4-nvmeof-rbd-ceph-2sunilkumar-lxztyv-node2-huaunh[15284]:     conn.connect()
>>>Sep 25 07:13:15 ceph-2sunilkumar-lxztyv-node2 ceph-c6963904-5b8d-11ee-924e-fa163e19eae4-nvmeof-rbd-ceph-2sunilkumar-lxztyv-node2-huaunh[15284]:   File "rados.pyx", line 690, in rados.Rados.connect
>>>Sep 25 07:13:15 ceph-2sunilkumar-lxztyv-node2 ceph-c6963904-5b8d-11ee-924e-fa163e19eae4-nvmeof-rbd-ceph-2sunilkumar-lxztyv-node2-huaunh[15284]: rados.PermissionDeniedError: [errno 13] RADOS permission denied (error connecting to the cluster)
>>>Sep 25 07:13:15 ceph-2sunilkumar-lxztyv-node2 ceph-c6963904-5b8d-11ee-924e-fa163e19eae4-nvmeof-rbd-ceph-2sunilkumar-lxztyv-node2-huaunh[15284]: INFO:control.server:Terminating SPDK(ceph-2sunilkumar-lxztyv-node2) pid 3...
>>>

Version-Release number of selected component (if applicable):

# ceph version 
ceph version 18.2.0-47.el9cp (4b54338fe0c515edb779f25acdcc35dcee28c821) reef (stable)
NVMe image: registry-proxy.engineering.redhat.com/rh-osbs/ceph-nvmeof:0.0.3-1


How reproducible: always


Steps to Reproduce:
1. Deploy a cluster with the latest downstream Ceph image.

2. Set the nvmeof container image configuration on the cluster with the command below:

ceph config set mgr mgr/cephadm/container_image_nvmeof registry-proxy.engineering.redhat.com/rh-osbs/ceph-nvmeof:0.0.3-1

3. Deploy the NVMe-oF gateway using the orchestrator:

ceph orch apply nvmeof rbd --placement="ceph-2sunilkumar-lxztyv-node2"
Scheduled nvmeof.rbd update...
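
To watch the scheduled service converge, the standard orchestrator queries can be used (a minimal sketch):

# ceph orch ls nvmeof                  # service-level view of the applied spec
# ceph orch ps --daemon-type nvmeof    # per-daemon state on the placement host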


Actual results:
# ceph orch ps | grep nvm 
nvmeof.rbd.ceph-2sunilkumar-lxztyv-node2.huaunh        ceph-2sunilkumar-lxztyv-node2            *:5500,4420,8009  error                  -  10m        -        -  <unknown>        <unknown>     <unknown>   


Expected results:
Gateway should be deployed

Comment 1 Aviv Caro 2023-09-26 09:03:18 UTC
Sunil, please provide the ceph-nvmeof.conf file.

Comment 2 Sunil Kumar Nagaraju 2023-09-26 10:34:32 UTC
Hi Aviv,

Please find the ceph-nvmeof.conf file contents below.


# cat /var/lib/ceph/c6963904-5b8d-11ee-924e-fa163e19eae4/nvmeof.rbd.ceph-2sunilkumar-lxztyv-node2.cdtsax/ceph-nvmeof.conf 
# This file is generated by cephadm.
[gateway]
name = client.nvmeof.rbd.ceph-2sunilkumar-lxztyv-node2.cdtsax
group = None
addr = 10.0.209.240
port = 5500
enable_auth = False
state_update_notify = True
state_update_interval_sec = 5

[ceph]
pool = rbd
config_file = /etc/ceph/ceph.conf
id = nvmeof.rbd.ceph-2sunilkumar-lxztyv-node2.cdtsax

[mtls]
server_key = ./server.key
client_key = ./client.key
server_cert = ./server.crt
client_cert = ./client.crt

[spdk]
tgt_path = /usr/local/bin/nvmf_tgt
rpc_socket = /var/tmp/spdk.sock
timeout = 60
log_level = WARN
conn_retries = 10
transports = tcp
transport_tcp_options = {"in_capsule_data_size": 8192, "max_io_qpairs_per_ctrlr": 7}
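
For context, errno 13 from rados.Rados.connect() typically points at the cephx side (missing entity, wrong key, or insufficient caps) rather than at networking. A hedged way to sanity-check the entity named in this conf from an admin node:

# ceph auth get client.nvmeof.rbd.ceph-2sunilkumar-lxztyv-node2.cdtsax   # does the key exist, and what caps does it carry?
# ceph auth ls | grep nvmeof                                             # any other nvmeof entities cephadm created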

Comment 3 Alexander Indenbaum 2023-09-26 13:19:47 UTC
Hello Sunil,

Could you also provide the contents of the following files:
- /etc/ceph/ceph.conf
- /etc/ceph/keyring

It would also be interesting to know whether you can access the Ceph cluster using the credentials from the above files.
Additionally, could you please check which nvmeof container image is in use?
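
For reference, these checks could look like the following (a hedged sketch; the config key is the one set in the reproduction steps, and podman runs on the gateway host):

# ceph -s --conf /etc/ceph/ceph.conf --keyring /etc/ceph/keyring   # can these credentials reach the cluster?
# ceph config get mgr mgr/cephadm/container_image_nvmeof           # the image cephadm is configured to deploy
# podman ps | grep nvmeof                                          # the image the gateway container is actually running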

Thank you!

Comment 5 Alexander Indenbaum 2023-09-27 10:07:34 UTC
Hello Sunil,

Thank you for all the info. I am able to connect to the bootstrap node using SSH and the credentials above. How do I access "10.0.209.240 ceph-2sunilkumar-lxztyv-node2"?


Judging by the gateway group definition in https://bugzilla.redhat.com/show_bug.cgi?id=2240588#c2, it seems the Ceph image lacks the following change: https://github.com/ceph/ceph/pull/53408
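
One hedged way to confirm whether the running build predates that change is to compare daemon versions against the fixed build (ceph-18.2.0-56.el9cp, per the Fixed In Version field above):

# ceph versions                                   # versions reported by the running daemons
# ceph orch ps --format yaml | grep -i version    # per-daemon versions as tracked by cephadm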

Comment 7 Aviv Caro 2023-09-27 15:23:51 UTC
Sunil, can you also include some RADOS logs? Maybe we can see there why we get this permission error.
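
A hedged sketch for collecting those (the daemon name is taken from the `ceph orch ps` output above; the debug levels are illustrative and should be reverted afterwards):

# cephadm logs --name nvmeof.rbd.ceph-2sunilkumar-lxztyv-node2.huaunh   # daemon journal on the gateway host
# ceph config set client debug_rados 20   # raise librados client logging
# ceph config set client debug_ms 1       # messenger-level detail for the connect attempt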

Comment 12 Sunil Kumar Nagaraju 2023-10-04 05:41:14 UTC
Deployment is working fine with the latest builds.

ceph-version: 18.2.0-72

Comment 16 errata-xmlrpc 2023-12-13 15:24:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 7.0 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:7780