Bug 2277699

Summary: NVMe Deployment failed with Ceph 18.2.1-155 and NVMeoF 1.2.4-1
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Sunil Kumar Nagaraju <sunnagar>
Component: NVMeOF
Assignee: Aviv Caro <acaro>
Status: CLOSED ERRATA
QA Contact: Manohar Murthy <mmurthy>
Severity: high
Docs Contact: ceph-doc-bot <ceph-doc-bugzilla>
Priority: unspecified
Version: 7.1
CC: cephqe-warriors, tserlin
Target Milestone: ---
Keywords: TestBlocker
Target Release: 7.1
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: ceph-nvmeof-container-1.2.5-3
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2024-06-13 14:32:20 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Sunil Kumar Nagaraju 2024-04-29 09:14:40 UTC
Created attachment 2029954 [details]
NVMe service log

Description of problem:

NVMe Deployment failed with Ceph 18.2.1-155 and NVMeoF 1.2.4-1 versions.


Apr 28 02:37:59 ceph-sunilkumar-00-pvlfdn-node7 ceph-3c4aaa88-0528-11ef-a216-fa163e4f1077-nvmeof-rbd-ceph-sunilkumar-00-pvlfdn-node7-xxjgor[14769]: [2024-04-28 06:37:59.596636] app.c: 712:spdk_app_start: *NOTICE*: Total cores available: 4
Apr 28 02:37:59 ceph-sunilkumar-00-pvlfdn-node7 ceph-3c4aaa88-0528-11ef-a216-fa163e4f1077-nvmeof-rbd-ceph-sunilkumar-00-pvlfdn-node7-xxjgor[14769]: [2024-04-28 06:37:59.650074] reactor.c: 926:reactor_run: *NOTICE*: Reactor started on core 1
Apr 28 02:37:59 ceph-sunilkumar-00-pvlfdn-node7 ceph-3c4aaa88-0528-11ef-a216-fa163e4f1077-nvmeof-rbd-ceph-sunilkumar-00-pvlfdn-node7-xxjgor[14769]: [2024-04-28 06:37:59.650143] reactor.c: 926:reactor_run: *NOTICE*: Reactor started on core 2
Apr 28 02:37:59 ceph-sunilkumar-00-pvlfdn-node7 ceph-3c4aaa88-0528-11ef-a216-fa163e4f1077-nvmeof-rbd-ceph-sunilkumar-00-pvlfdn-node7-xxjgor[14769]: [2024-04-28 06:37:59.650203] reactor.c: 926:reactor_run: *NOTICE*: Reactor started on core 3
Apr 28 02:37:59 ceph-sunilkumar-00-pvlfdn-node7 ceph-3c4aaa88-0528-11ef-a216-fa163e4f1077-nvmeof-rbd-ceph-sunilkumar-00-pvlfdn-node7-xxjgor[14769]: [2024-04-28 06:37:59.650207] reactor.c: 926:reactor_run: *NOTICE*: Reactor started on core 0
Apr 28 02:37:59 ceph-sunilkumar-00-pvlfdn-node7 ceph-3c4aaa88-0528-11ef-a216-fa163e4f1077-nvmeof-rbd-ceph-sunilkumar-00-pvlfdn-node7-xxjgor[14769]: [2024-04-28 06:37:59.692980] accel_sw.c: 681:sw_accel_module_init: *NOTICE*: Accel framework software module initialized.
Apr 28 02:37:59 ceph-sunilkumar-00-pvlfdn-node7 ceph-3c4aaa88-0528-11ef-a216-fa163e4f1077-nvmeof-rbd-ceph-sunilkumar-00-pvlfdn-node7-xxjgor[14769]: [2024-04-28 06:37:59.838632] tcp.c: 629:nvmf_tcp_create: *NOTICE*: *** TCP Transport Init ***
Apr 28 02:37:59 ceph-sunilkumar-00-pvlfdn-node7 ceph-3c4aaa88-0528-11ef-a216-fa163e4f1077-nvmeof-rbd-ceph-sunilkumar-00-pvlfdn-node7-xxjgor[14769]: [28-Apr-2024 06:37:59] INFO server.py:249: Discovery service process id: 63
Apr 28 02:37:59 ceph-sunilkumar-00-pvlfdn-node7 ceph-3c4aaa88-0528-11ef-a216-fa163e4f1077-nvmeof-rbd-ceph-sunilkumar-00-pvlfdn-node7-xxjgor[14769]: [28-Apr-2024 06:37:59] INFO server.py:245: Starting ceph nvmeof discovery service
Apr 28 02:37:59 ceph-sunilkumar-00-pvlfdn-node7 ceph-3c4aaa88-0528-11ef-a216-fa163e4f1077-nvmeof-rbd-ceph-sunilkumar-00-pvlfdn-node7-xxjgor[14769]: [28-Apr-2024 06:37:59] ERROR server.py:108: GatewayServer exception occurred:
Apr 28 02:37:59 ceph-sunilkumar-00-pvlfdn-node7 ceph-3c4aaa88-0528-11ef-a216-fa163e4f1077-nvmeof-rbd-ceph-sunilkumar-00-pvlfdn-node7-xxjgor[14769]: Traceback (most recent call last):
Apr 28 02:37:59 ceph-sunilkumar-00-pvlfdn-node7 ceph-3c4aaa88-0528-11ef-a216-fa163e4f1077-nvmeof-rbd-ceph-sunilkumar-00-pvlfdn-node7-xxjgor[14769]:   File "/remote-source/ceph-nvmeof/app/control/__main__.py", line 43, in <module>
Apr 28 02:37:59 ceph-sunilkumar-00-pvlfdn-node7 ceph-3c4aaa88-0528-11ef-a216-fa163e4f1077-nvmeof-rbd-ceph-sunilkumar-00-pvlfdn-node7-xxjgor[14769]:     gateway.serve()
Apr 28 02:37:59 ceph-sunilkumar-00-pvlfdn-node7 ceph-3c4aaa88-0528-11ef-a216-fa163e4f1077-nvmeof-rbd-ceph-sunilkumar-00-pvlfdn-node7-xxjgor[14769]:   File "/remote-source/ceph-nvmeof/app/control/server.py", line 177, in serve
Apr 28 02:37:59 ceph-sunilkumar-00-pvlfdn-node7 ceph-3c4aaa88-0528-11ef-a216-fa163e4f1077-nvmeof-rbd-ceph-sunilkumar-00-pvlfdn-node7-xxjgor[14769]:     omap_lock = OmapLock(omap_state, gateway_state)
Apr 28 02:37:59 ceph-sunilkumar-00-pvlfdn-node7 ceph-3c4aaa88-0528-11ef-a216-fa163e4f1077-nvmeof-rbd-ceph-sunilkumar-00-pvlfdn-node7-xxjgor[14769]:   File "/remote-source/ceph-nvmeof/app/control/state.py", line 201, in __init__
Apr 28 02:37:59 ceph-sunilkumar-00-pvlfdn-node7 ceph-3c4aaa88-0528-11ef-a216-fa163e4f1077-nvmeof-rbd-ceph-sunilkumar-00-pvlfdn-node7-xxjgor[14769]:     self.omap_file_lock_retry_sleep_interval = self.omap_state.config.getint_with_default("gateway",
Apr 28 02:37:59 ceph-sunilkumar-00-pvlfdn-node7 ceph-3c4aaa88-0528-11ef-a216-fa163e4f1077-nvmeof-rbd-ceph-sunilkumar-00-pvlfdn-node7-xxjgor[14769]:   File "/remote-source/ceph-nvmeof/app/control/config.py", line 47, in getint_with_default
Apr 28 02:37:59 ceph-sunilkumar-00-pvlfdn-node7 ceph-3c4aaa88-0528-11ef-a216-fa163e4f1077-nvmeof-rbd-ceph-sunilkumar-00-pvlfdn-node7-xxjgor[14769]:     return self.config.getint(section, param, fallback=value)
Apr 28 02:37:59 ceph-sunilkumar-00-pvlfdn-node7 ceph-3c4aaa88-0528-11ef-a216-fa163e4f1077-nvmeof-rbd-ceph-sunilkumar-00-pvlfdn-node7-xxjgor[14769]:   File "/usr/lib64/python3.9/configparser.py", line 818, in getint
Apr 28 02:37:59 ceph-sunilkumar-00-pvlfdn-node7 ceph-3c4aaa88-0528-11ef-a216-fa163e4f1077-nvmeof-rbd-ceph-sunilkumar-00-pvlfdn-node7-xxjgor[14769]:     return self._get_conv(section, option, int, raw=raw, vars=vars,
Apr 28 02:37:59 ceph-sunilkumar-00-pvlfdn-node7 ceph-3c4aaa88-0528-11ef-a216-fa163e4f1077-nvmeof-rbd-ceph-sunilkumar-00-pvlfdn-node7-xxjgor[14769]:   File "/usr/lib64/python3.9/configparser.py", line 808, in _get_conv
Apr 28 02:37:59 ceph-sunilkumar-00-pvlfdn-node7 ceph-3c4aaa88-0528-11ef-a216-fa163e4f1077-nvmeof-rbd-ceph-sunilkumar-00-pvlfdn-node7-xxjgor[14769]:     return self._get(section, conv, option, raw=raw, vars=vars,
Apr 28 02:37:59 ceph-sunilkumar-00-pvlfdn-node7 ceph-3c4aaa88-0528-11ef-a216-fa163e4f1077-nvmeof-rbd-ceph-sunilkumar-00-pvlfdn-node7-xxjgor[14769]:   File "/usr/lib64/python3.9/configparser.py", line 803, in _get
Apr 28 02:37:59 ceph-sunilkumar-00-pvlfdn-node7 ceph-3c4aaa88-0528-11ef-a216-fa163e4f1077-nvmeof-rbd-ceph-sunilkumar-00-pvlfdn-node7-xxjgor[14769]:     return conv(self.get(section, option, **kwargs))
Apr 28 02:37:59 ceph-sunilkumar-00-pvlfdn-node7 ceph-3c4aaa88-0528-11ef-a216-fa163e4f1077-nvmeof-rbd-ceph-sunilkumar-00-pvlfdn-node7-xxjgor[14769]: ValueError: invalid literal for int() with base 10: '1.0'
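The root cause is visible in the last traceback frames: `configparser.getint()` applies `int()` to the raw string value, and `int('1.0')` raises `ValueError` because a float-like string is not a valid integer literal. A minimal sketch of the failure (the section and option names are taken from the traceback; the `1.0` value is an assumption consistent with the error message):

```python
import configparser

# Reproduce the crash: getint() converts the raw string with int().
# The "1.0" value is assumed from the ValueError message above.
cfg = configparser.ConfigParser()
cfg.read_string("""\
[gateway]
omap_file_lock_retry_sleep_interval = 1.0
""")

try:
    # Note: the fallback only applies when the option is *missing*,
    # not when conversion of a present value fails, so this raises.
    cfg.getint("gateway", "omap_file_lock_retry_sleep_interval", fallback=1)
except ValueError as err:
    print(err)  # invalid literal for int() with base 10: '1.0'

# getfloat() parses the same value cleanly, which is one way a fixed
# build could tolerate a float-valued sleep interval.
print(cfg.getfloat("gateway", "omap_file_lock_retry_sleep_interval"))
```

This also explains why `getint_with_default` did not save the gateway: `fallback=` in `configparser` is not a conversion-error handler, it only covers the missing-option case.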


Version-Release number of selected component (if applicable):

nvmeof_image=registry-proxy.engineering.redhat.com/rh-osbs/ceph-nvmeof:1.2.4-1 
nvmeof_cli_image=registry-proxy.engineering.redhat.com/rh-osbs/ceph-nvmeof-cli:1.2.4-1 
ceph-repo http://download.devel.redhat.com/rhel-9/composes/auto/ceph-7.1-rhel-9/RHCEPH-7.1-RHEL-9-20240424.ci.3 
Ceph-image= registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-7.1-rhel-9-containers-candidate-86483-20240424220941


How reproducible:


Steps to Reproduce:
1. Bootstrap a Ceph cluster and add the core daemons (MON, MGR, OSD).
2. Create an RBD pool and deploy the NVMe-oF service.
3. The nvmeof daemon fails to start with the ValueError shown in the gateway log above.

Comment 6 Sunil Kumar Nagaraju 2024-05-03 10:54:19 UTC
NVMe-oF deployment works with the new build,
Ceph 18.2.1-159 and NVMeoF 1.2.5-2.

Attaching HA sanity logs for reference.

Comment 8 Sunil Kumar Nagaraju 2024-05-03 10:55:41 UTC
[ceph: root@ceph-sunilkumar-00-bjcvqj-node1-installer /]# ceph orch ps --daemon_type nvmeof
NAME                                               HOST                             PORTS             STATUS        REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
nvmeof.rbd.ceph-sunilkumar-00-bjcvqj-node6.vtwjfa  ceph-sunilkumar-00-bjcvqj-node6  *:5500,4420,8009  running (8m)     8m ago  20h     116M        -           fe96956aabcd  d1c6bed9582f
nvmeof.rbd.ceph-sunilkumar-00-bjcvqj-node7.vglfty  ceph-sunilkumar-00-bjcvqj-node7  *:5500,4420,8009  running (8m)     8m ago  20h     117M        -           fe96956aabcd  a2a276907e3f
nvmeof.rbd.ceph-sunilkumar-00-bjcvqj-node8.hnmfps  ceph-sunilkumar-00-bjcvqj-node8  *:5500,4420,8009  running (8m)     8m ago  20h     116M        -           fe96956aabcd  c7dad16fe4b1
nvmeof.rbd.ceph-sunilkumar-00-bjcvqj-node9.guwzdw  ceph-sunilkumar-00-bjcvqj-node9  *:5500,4420,8009  running (8m)     8m ago  20h    48.0M        -           fe96956aabcd  ef3345747f1a
[ceph: root@ceph-sunilkumar-00-bjcvqj-node1-installer /]# ceph orch ls --service_type nvmeof
NAME        PORTS             RUNNING  REFRESHED  AGE  PLACEMENT
nvmeof.rbd  ?:4420,5500,8009      4/4  8m ago     20h  ceph-sunilkumar-00-bjcvqj-node6;ceph-sunilkumar-00-bjcvqj-node7;ceph-sunilkumar-00-bjcvqj-node8;ceph-sunilkumar-00-bjcvqj-node9

Comment 9 errata-xmlrpc 2024-06-13 14:32:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Critical: Red Hat Ceph Storage 7.1 security, enhancements, and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:3925