Bug 2274704 - Namespace count mismatch between two nvmeof gateways from same ceph cluster and pool
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: NVMeOF
Version: 7.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 7.1
Assignee: Aviv Caro
QA Contact: Manohar Murthy
Docs Contact: ceph-doc-bot
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2024-04-12 10:35 UTC by Rahul Lepakshi
Modified: 2024-06-13 14:31 UTC (History)

Fixed In Version: ceph-18.2.1-149.el9cp
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-06-13 14:31:37 UTC
Embargoed:
rlepaksh: needinfo-


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHCEPH-8801 0 None None None 2024-04-12 10:37:21 UTC
Red Hat Product Errata RHSA-2024:3925 0 None None None 2024-06-13 14:31:40 UTC

Description Rahul Lepakshi 2024-04-12 10:35:43 UTC
Description of problem:
The two gateways report different namespace counts for the same subsystem, as shown below:

```
# ceph orch ps | grep nvmeof
nvmeof.nvmeof_pool.argo023.xkyblu  argo023  *:5500,4420,8009  running (105m)   105s ago  105m     550M        -                    b09894a2fc25  fede1b63c50e
nvmeof.nvmeof_pool.argo024.xrmciw  argo024  *:5500,4420,8009  running (105m)   105s ago  105m     663M        -                    b09894a2fc25  9d5c575f2be9


# podman run --rm cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.0-1  --server-address 10.8.128.223 --server-port 5500 subsystem list
Subsystems:
╒═══════════╤════════════════════════════╤════════════╤══════════╤══════════════════╤═════════════╤══════════════╕
│ Subtype   │ NQN                        │ HA State   │   Serial │ Controller IDs   │   Namespace │          Max │
│           │                            │            │   Number │                  │       Count │   Namespaces │
╞═══════════╪════════════════════════════╪════════════╪══════════╪══════════════════╪═════════════╪══════════════╡
│ NVMe      │ nqn.2016-06.io.spdk:cnode1 │ enabled    │        1 │ 1-2040           │           0 │         2048 │
├───────────┼────────────────────────────┼────────────┼──────────┼──────────────────┼─────────────┼──────────────┤
│ NVMe      │ nqn.2016-06.io.spdk:cnode2 │ enabled    │        2 │ 1-2040           │         200 │         2048 │
╘═══════════╧════════════════════════════╧════════════╧══════════╧══════════════════╧═════════════╧══════════════╛

# podman run --rm cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.0-1  --server-address 10.8.128.224  --server-port 5500 subsystem list
Subsystems:
╒═══════════╤════════════════════════════╤════════════╤══════════╤══════════════════╤═════════════╤══════════════╕
│ Subtype   │ NQN                        │ HA State   │   Serial │ Controller IDs   │   Namespace │          Max │
│           │                            │            │   Number │                  │       Count │   Namespaces │
╞═══════════╪════════════════════════════╪════════════╪══════════╪══════════════════╪═════════════╪══════════════╡
│ NVMe      │ nqn.2016-06.io.spdk:cnode1 │ enabled    │        1 │ 2041-4080        │         299 │         2048 │
├───────────┼────────────────────────────┼────────────┼──────────┼──────────────────┼─────────────┼──────────────┤
│ NVMe      │ nqn.2016-06.io.spdk:cnode2 │ enabled    │        2 │ 2041-4080        │         200 │         2048 │
╘═══════════╧════════════════════════════╧════════════╧══════════╧══════════════════╧═════════════╧══════════════╛
```
```
[root@argo023]# podman run --rm cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.0-1  --server-address 10.8.128.223 --server-port 5500 namespace list -n nqn.2016-06.io.spdk:cnode1
No namespaces in subsystem nqn.2016-06.io.spdk:cnode1

[root@argo024 ~]# podman run --rm cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.0-1  --server-address 10.8.128.224 --server-port 5500 namespace list -n nqn.2016-06.io.spdk:cnode1
Namespaces in subsystem nqn.2016-06.io.spdk:cnode1:
╒════════╤════════════════════════╤════════╤═══════════════╤═════════╤═════════╤═════════════════════╤═════════════╤═══════════╤═══════════╤════════════╤═════════════╕
│   NSID │ Bdev                   │ RBD    │ RBD           │ Image   │ Block   │ UUID                │        Load │ R/W IOs   │ R/W MBs   │ Read MBs   │ Write MBs   │
│        │ Name                   │ Pool   │ Image         │ Size    │ Size    │                     │   Balancing │ per       │ per       │ per        │ per         │
│        │                        │        │               │         │         │                     │       Group │ second    │ second    │ second     │ second      │
╞════════╪════════════════════════╪════════╪═══════════════╪═════════╪═════════╪═════════════════════╪═════════════╪═══════════╪═══════════╪════════════╪═════════════╡
│      1 │ bdev_f2783d15-41bc-    │ rbd    │ ZVWD-image1   │ 1 TiB   │ 512 B   │ f2783d15-41bc-4bb5- │           1 │ unlimited │ unlimited │ unlimited  │ unlimited   │
│        │ 4bb5-896b-3925e28e44dd │        │               │         │         │ 896b-3925e28e44dd   │             │           │           │            │             │
├────────┼────────────────────────┼────────┼───────────────┼─────────┼─────────┼─────────────────────┼─────────────┼───────────┼───────────┼────────────┼─────────────┤
│      2 │ bdev_4f6c0f8d-2ef1-    │ rbd    │ ZVWD-image2   │ 1 TiB   │ 512 B   │ 4f6c0f8d-2ef1-4d82- │           1 │ unlimited │ unlimited │ unlimited  │ unlimited   │
│        │ 4d82-9435-cd03668e26dc │        │               │         │         │ 9435-cd03668e26dc   │             │           │           │            │             │
.
.
.
│    298 │ bdev_58eff0f6-d273-    │ rbd    │ OHN3-image98  │ 1 TiB   │ 512 B   │ 58eff0f6-d273-4b99- │           1 │ unlimited │ unlimited │ unlimited  │ unlimited   │
│        │ 4b99-95c4-ee2e6a5cb96c │        │               │         │         │ 95c4-ee2e6a5cb96c   │             │           │           │            │             │
├────────┼────────────────────────┼────────┼───────────────┼─────────┼─────────┼─────────────────────┼─────────────┼───────────┼───────────┼────────────┼─────────────┤
│    299 │ bdev_d13690e8-ece5-    │ rbd    │ OHN3-image99  │ 1 TiB   │ 512 B   │ d13690e8-ece5-4bd5- │           1 │ unlimited │ unlimited │ unlimited  │ unlimited   │
│        │ 4bd5-bf59-eddef0a6bc72 │        │               │         │         │ bf59-eddef0a6bc72   │             │           │           │            │             │
╘════════╧════════════════════════╧════════╧═══════════════╧═════════╧═════════╧═════════════════════╧═════════════╧═══════════╧═══════════╧════════════╧═════════════╛

```
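The mismatch above can also be checked programmatically. The following is a minimal sketch, assuming each gateway's `namespace list` reply is available as JSON in the shape that appears in the Comment 6 log (a top-level `namespaces` array whose entries carry an `nsid`). The helper names `nsids` and `diff_namespaces` and the sample replies are hypothetical and are not part of the nvmeof CLI.

```python
from __future__ import annotations

import json


def nsids(reply_json: str) -> set[int]:
    """Return the set of NSIDs in one gateway's `namespace list` JSON reply."""
    reply = json.loads(reply_json)
    return {ns["nsid"] for ns in reply.get("namespaces", [])}


def diff_namespaces(reply_a: str, reply_b: str) -> dict[str, list[int]]:
    """Report NSIDs that one gateway lists and the other does not."""
    a, b = nsids(reply_a), nsids(reply_b)
    return {"only_on_a": sorted(a - b), "only_on_b": sorted(b - a)}


# Hypothetical replies: gateway A lists nothing, gateway B lists NSIDs 1-3.
gw_a = json.dumps({"status": 0, "namespaces": []})
gw_b = json.dumps({"status": 0, "namespaces": [{"nsid": n} for n in (1, 2, 3)]})

result = diff_namespaces(gw_a, gw_b)
print(result)
```

An empty list under both keys would indicate that the two gateways agree on the namespace set for that subsystem.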


Version-Release number of selected component (if applicable):
# ceph version
ceph version 18.2.1-136.el9cp (e7edde2b655d0dd9f860dda675f9d7954f07e6e3) reef (stable)

cp.stg.icr.io/cp/ibm-ceph/nvmeof-rhel9:1.2.0-1

How reproducible:
Once so far

Steps to Reproduce:
1. Deploy the nvmeof service with cp.stg.icr.io/cp/ibm-ceph/nvmeof-rhel9:1.2.0-1
2. Configure 2 subsystems and scale to 400 namespaces (200 per subsystem). This succeeds.
3. With IO running against the existing namespaces, scale subsystem1 by a further 100 namespaces. Adding the 299th namespace overall on that subsystem fails.
4. Query the subsystem list on each gateway and observe the mismatch in namespace count.
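The scale-up in step 3 can be sketched as a command generator. This is only an illustration under stated assumptions: the `namespace add` flags and the CLI image tag are copied from the cephci log in Comment 6 (CLI 1.2.4-1, not the 1.2.0-1 image used in this report), while the helper names, the image prefix `demo-image`, and the NSID range 201-300 are hypothetical.

```python
from __future__ import annotations

# Image tag taken from the Comment 6 log; the original report used 1.2.0-1.
CLI_IMAGE = "cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.4-1"


def namespace_add_cmd(gateway: str, nqn: str, nsid: int, image: str,
                      pool: str = "rbd", port: int = 5500) -> list[str]:
    """Build the argv for one `namespace add` call, mirroring the logged command."""
    return [
        "podman", "run", "--quiet", "--rm", CLI_IMAGE,
        "--server-address", gateway, "--server-port", str(port),
        "namespace", "add",
        "--rbd-image", image, "--nsid", str(nsid),
        "--rbd-pool", pool, "--subsystem", nqn,
    ]


def scale_cmds(gateway: str, nqn: str, first_nsid: int, count: int,
               image_prefix: str) -> list[list[str]]:
    """Build `count` consecutive `namespace add` commands starting at first_nsid."""
    return [
        namespace_add_cmd(gateway, nqn, n, f"{image_prefix}{n}")
        for n in range(first_nsid, first_nsid + count)
    ]


# Hypothetical: 100 more namespaces on cnode1, after the first 200.
cmds = scale_cmds("10.8.128.223", "nqn.2016-06.io.spdk:cnode1", 201, 100, "demo-image")
print(" ".join(cmds[0]))
```

Each argv list could be passed to `subprocess.run` on a host that can reach the gateway; the sketch only builds the commands and does not execute them.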

Actual results: Querying the subsystem list on each gateway shows mismatched namespace counts for the same subsystem.


Expected results: Both gateways should report the same, up-to-date state.


Additional info:

Comment 1 Aviv Caro 2024-04-20 15:57:28 UTC
Rahul, is this still happening on the latest downstream build?

Comment 2 Aviv Caro 2024-04-22 13:25:31 UTC
Fixed in Ceph 7.1 Build (IBM-CEPH-7.1-202404190257.ci.0).

Comment 6 Rahul Lepakshi 2024-04-29 06:07:38 UTC
Closing this BZ, as the issue was not seen with the latest builds.
Pass logs:
http://magna002.ceph.redhat.com/cephci-jenkins/test-runs/openstack/IBM/7.1/rhel-9/Regression/18.2.1-149/nvmeotcp/105/tier-3_2-nvmeof-gw_8-sub_ns/
http://magna002.ceph.redhat.com/cephci-jenkins/test-runs/openstack/IBM/7.1/rhel-9/Regression/18.2.1-149/nvmeotcp/105/tier-3_2-nvmeof-gw_2-sub_ns/

2024-04-25 13:46:22,406 (cephci.test_ceph_nvmeof_gateway_sub_scale) [INFO] - cephci.IBM.7.1.rhel-9.Regression.18.2.1-149.nvmeotcp.105.cephci.ceph.nvmegw_cli.execute.py:16 - NVMe CLI command : namespace add
2024-04-25 13:46:22,407 (cephci.test_ceph_nvmeof_gateway_sub_scale) [INFO] - cephci.IBM.7.1.rhel-9.Regression.18.2.1-149.nvmeotcp.105.cephci.ceph.ceph.py:1568 - Running command podman run --quiet --rm cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.4-1  --server-address 10.0.195.98 --server-port 5500 namespace add  --rbd-image L5N6-image200 --nsid 200 --rbd-pool rbd --subsystem nqn.2016-06.io.spdk:cnode2 on 10.0.195.98 timeout 600
2024-04-25 13:46:23,540 (cephci.test_ceph_nvmeof_gateway_sub_scale) [INFO] - cephci.IBM.7.1.rhel-9.Regression.18.2.1-149.nvmeotcp.105.cephci.ceph.ceph.py:1602 - Command completed successfully
2024-04-25 13:46:23,548 (cephci.test_ceph_nvmeof_gateway_sub_scale) [INFO] - cephci.IBM.7.1.rhel-9.Regression.18.2.1-149.nvmeotcp.105.cephci.ceph.nvmegw_cli.execute.py:36 - ('', 'Adding namespace 200 to nqn.2016-06.io.spdk:cnode2, load balancing group 0: Successful\n')
2024-04-25 13:46:23,549 (cephci.test_ceph_nvmeof_gateway_sub_scale) [INFO] - cephci.IBM.7.1.rhel-9.Regression.18.2.1-149.nvmeotcp.105.cephci.ceph.nvmegw_cli.execute.py:16 - NVMe CLI command : namespace list
2024-04-25 13:46:23,550 (cephci.test_ceph_nvmeof_gateway_sub_scale) [INFO] - cephci.IBM.7.1.rhel-9.Regression.18.2.1-149.nvmeotcp.105.cephci.ceph.ceph.py:1568 - Running command podman run --quiet --rm cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.4-1  --format json --server-address 10.0.195.98 --server-port 5500 namespace list  --nsid 200 --subsystem nqn.2016-06.io.spdk:cnode2 on 10.0.195.98 timeout 600
2024-04-25 13:46:24,895 (cephci.test_ceph_nvmeof_gateway_sub_scale) [INFO] - cephci.IBM.7.1.rhel-9.Regression.18.2.1-149.nvmeotcp.105.cephci.ceph.ceph.py:1602 - Command completed successfully
2024-04-25 13:46:24,896 (cephci.test_ceph_nvmeof_gateway_sub_scale) [INFO] - cephci.IBM.7.1.rhel-9.Regression.18.2.1-149.nvmeotcp.105.cephci.ceph.nvmegw_cli.execute.py:36 - ('', '{\n    "error_message": "Success",\n    "subsystem_nqn": "nqn.2016-06.io.spdk:cnode2",\n    "namespaces": [\n        {\n            "nsid": 200,\n            "bdev_name": "bdev_84e30207-7a60-4657-b126-b2a59d036b76",\n            "rbd_image_name": "L5N6-image200",\n            "rbd_pool_name": "rbd",\n            "load_balancing_group": 1,\n            "block_size": 512,\n            "rbd_image_size": "1099511627776",\n            "uuid": "84e30207-7a60-4657-b126-b2a59d036b76",\n            "rw_ios_per_second": "0",\n            "rw_mbytes_per_second": "0",\n            "r_mbytes_per_second": "0",\n            "w_mbytes_per_second": "0"\n        }\n    ],\n    "status": 0\n}\n')
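
The add-then-list check in the log above can be sketched as follows. This is a minimal illustration, assuming the JSON reply has the fields visible in the log (`status`, `subsystem_nqn`, `namespaces[].nsid`, `namespaces[].rbd_image_name`, and `rbd_image_size` as a decimal string). The helper names `confirm_namespace` and `image_size_tib` are hypothetical, and the sample reply is trimmed to the fields the check uses.

```python
from __future__ import annotations

import json


def confirm_namespace(reply_text: str, nqn: str, nsid: int, image: str) -> bool:
    """Return True if the reply reports success and lists the expected namespace."""
    reply = json.loads(reply_text)
    if reply.get("status") != 0 or reply.get("subsystem_nqn") != nqn:
        return False
    return any(
        ns.get("nsid") == nsid and ns.get("rbd_image_name") == image
        for ns in reply.get("namespaces", [])
    )


def image_size_tib(ns: dict) -> float:
    """Convert the string-encoded byte count to TiB."""
    return int(ns["rbd_image_size"]) / 2**40


# Trimmed copy of the reply from the log above.
reply_text = json.dumps({
    "error_message": "Success",
    "subsystem_nqn": "nqn.2016-06.io.spdk:cnode2",
    "namespaces": [{
        "nsid": 200,
        "rbd_image_name": "L5N6-image200",
        "rbd_pool_name": "rbd",
        "rbd_image_size": "1099511627776",
    }],
    "status": 0,
})

ok = confirm_namespace(reply_text, "nqn.2016-06.io.spdk:cnode2", 200, "L5N6-image200")
size = image_size_tib(json.loads(reply_text)["namespaces"][0])
print(ok, size)
```

Running the same check against every gateway in the group, rather than only the one that served the `namespace add`, is what would surface the kind of divergence described in this bug.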

Comment 7 errata-xmlrpc 2024-06-13 14:31:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Critical: Red Hat Ceph Storage 7.1 security, enhancements, and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:3925