Bug 2274704

Summary: Namespace count mismatch between two nvmeof gateways from same ceph cluster and pool
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Rahul Lepakshi <rlepaksh>
Component: NVMeOF
Assignee: Aviv Caro <acaro>
Status: CLOSED ERRATA
QA Contact: Manohar Murthy <mmurthy>
Severity: urgent
Docs Contact: ceph-doc-bot <ceph-doc-bugzilla>
Priority: unspecified
Version: 7.1
CC: akraj, cephqe-warriors, tserlin
Target Milestone: ---
Flags: rlepaksh: needinfo-
Target Release: 7.1
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: ceph-18.2.1-149.el9cp
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2024-06-13 14:31:37 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Rahul Lepakshi 2024-04-12 10:35:43 UTC
Description of problem:
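The two gateways report different namespace counts for the same subsystem (cnode1 shows 0 namespaces on 10.8.128.223 and 299 on 10.8.128.224), as shown below: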

```
# ceph orch ps | grep nvmeof
nvmeof.nvmeof_pool.argo023.xkyblu  argo023  *:5500,4420,8009  running (105m)   105s ago  105m     550M        -                    b09894a2fc25  fede1b63c50e
nvmeof.nvmeof_pool.argo024.xrmciw  argo024  *:5500,4420,8009  running (105m)   105s ago  105m     663M        -                    b09894a2fc25  9d5c575f2be9


# podman run --rm cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.0-1  --server-address 10.8.128.223 --server-port 5500 subsystem list
Subsystems:
╒═══════════╤════════════════════════════╤════════════╤══════════╤══════════════════╤═════════════╤══════════════╕
│ Subtype   │ NQN                        │ HA State   │   Serial │ Controller IDs   │   Namespace │          Max │
│           │                            │            │   Number │                  │       Count │   Namespaces │
╞═══════════╪════════════════════════════╪════════════╪══════════╪══════════════════╪═════════════╪══════════════╡
│ NVMe      │ nqn.2016-06.io.spdk:cnode1 │ enabled    │        1 │ 1-2040           │           0 │         2048 │
├───────────┼────────────────────────────┼────────────┼──────────┼──────────────────┼─────────────┼──────────────┤
│ NVMe      │ nqn.2016-06.io.spdk:cnode2 │ enabled    │        2 │ 1-2040           │         200 │         2048 │
╘═══════════╧════════════════════════════╧════════════╧══════════╧══════════════════╧═════════════╧══════════════╛

# podman run --rm cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.0-1  --server-address 10.8.128.224  --server-port 5500 subsystem list
Subsystems:
╒═══════════╤════════════════════════════╤════════════╤══════════╤══════════════════╤═════════════╤══════════════╕
│ Subtype   │ NQN                        │ HA State   │   Serial │ Controller IDs   │   Namespace │          Max │
│           │                            │            │   Number │                  │       Count │   Namespaces │
╞═══════════╪════════════════════════════╪════════════╪══════════╪══════════════════╪═════════════╪══════════════╡
│ NVMe      │ nqn.2016-06.io.spdk:cnode1 │ enabled    │        1 │ 2041-4080        │         299 │         2048 │
├───────────┼────────────────────────────┼────────────┼──────────┼──────────────────┼─────────────┼──────────────┤
│ NVMe      │ nqn.2016-06.io.spdk:cnode2 │ enabled    │        2 │ 2041-4080        │         200 │         2048 │
╘═══════════╧════════════════════════════╧════════════╧══════════╧══════════════════╧═════════════╧══════════════╛
```
```
[root@argo023]# podman run --rm cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.0-1  --server-address 10.8.128.223 --server-port 5500 namespace list -n nqn.2016-06.io.spdk:cnode1
No namespaces in subsystem nqn.2016-06.io.spdk:cnode1

[root@argo024 ~]# podman run --rm cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.0-1  --server-address 10.8.128.224 --server-port 5500 namespace list -n nqn.2016-06.io.spdk:cnode1
Namespaces in subsystem nqn.2016-06.io.spdk:cnode1:
╒════════╤════════════════════════╤════════╤═══════════════╤═════════╤═════════╤═════════════════════╤═════════════╤═══════════╤═══════════╤════════════╤═════════════╕
│   NSID │ Bdev                   │ RBD    │ RBD           │ Image   │ Block   │ UUID                │        Load │ R/W IOs   │ R/W MBs   │ Read MBs   │ Write MBs   │
│        │ Name                   │ Pool   │ Image         │ Size    │ Size    │                     │   Balancing │ per       │ per       │ per        │ per         │
│        │                        │        │               │         │         │                     │       Group │ second    │ second    │ second     │ second      │
╞════════╪════════════════════════╪════════╪═══════════════╪═════════╪═════════╪═════════════════════╪═════════════╪═══════════╪═══════════╪════════════╪═════════════╡
│      1 │ bdev_f2783d15-41bc-    │ rbd    │ ZVWD-image1   │ 1 TiB   │ 512 B   │ f2783d15-41bc-4bb5- │           1 │ unlimited │ unlimited │ unlimited  │ unlimited   │
│        │ 4bb5-896b-3925e28e44dd │        │               │         │         │ 896b-3925e28e44dd   │             │           │           │            │             │
├────────┼────────────────────────┼────────┼───────────────┼─────────┼─────────┼─────────────────────┼─────────────┼───────────┼───────────┼────────────┼─────────────┤
│      2 │ bdev_4f6c0f8d-2ef1-    │ rbd    │ ZVWD-image2   │ 1 TiB   │ 512 B   │ 4f6c0f8d-2ef1-4d82- │           1 │ unlimited │ unlimited │ unlimited  │ unlimited   │
│        │ 4d82-9435-cd03668e26dc │        │               │         │         │ 9435-cd03668e26dc   │             │           │           │            │             │
.
.
.
│    298 │ bdev_58eff0f6-d273-    │ rbd    │ OHN3-image98  │ 1 TiB   │ 512 B   │ 58eff0f6-d273-4b99- │           1 │ unlimited │ unlimited │ unlimited  │ unlimited   │
│        │ 4b99-95c4-ee2e6a5cb96c │        │               │         │         │ 95c4-ee2e6a5cb96c   │             │           │           │            │             │
├────────┼────────────────────────┼────────┼───────────────┼─────────┼─────────┼─────────────────────┼─────────────┼───────────┼───────────┼────────────┼─────────────┤
│    299 │ bdev_d13690e8-ece5-    │ rbd    │ OHN3-image99  │ 1 TiB   │ 512 B   │ d13690e8-ece5-4bd5- │           1 │ unlimited │ unlimited │ unlimited  │ unlimited   │
│        │ 4bd5-bf59-eddef0a6bc72 │        │               │         │         │ bf59-eddef0a6bc72   │             │           │           │            │             │
╘════════╧════════════════════════╧════════╧═══════════════╧═════════╧═════════╧═════════════════════╧═════════════╧═══════════╧═══════════╧════════════╧═════════════╛

```


Version-Release number of selected component (if applicable):
# ceph version
ceph version 18.2.1-136.el9cp (e7edde2b655d0dd9f860dda675f9d7954f07e6e3) reef (stable)

cp.stg.icr.io/cp/ibm-ceph/nvmeof-rhel9:1.2.0-1

How reproducible:
Seen once so far.

Steps to Reproduce:
1. Deploy the nvmeof service with cp.stg.icr.io/cp/ibm-ceph/nvmeof-rhel9:1.2.0-1.
2. Configure 2 subsystems and scale to 400 namespaces (200 per subsystem). This succeeds.
3. With I/O running on the existing namespaces, add a further 100 namespaces to subsystem1. Adding the 299th namespace overall on that subsystem fails.
4. Run subsystem list against each gateway. The namespace counts do not match.
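
The scale-up in steps 2 and 3 can be sketched as a generator of `namespace add` command lines. This is only an illustration: the flags are copied from the 1.2.4-1 log line in comment 6, and it is an assumption that the 1.2.0-1 CLI accepts the same ones. The function name, the `image_prefix` argument, and the image naming scheme (`<prefix>-image<nsid>`) are hypothetical; the real test used its own image names. Nothing is executed here; the caller decides how to run the commands.

```python
from typing import Iterator, List

# CLI image taken from the log in comment 6; adjust to the build under test.
CLI_IMAGE = "cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.4-1"


def namespace_add_commands(
    gateway: str,
    subsystem: str,
    image_prefix: str,  # hypothetical naming scheme, not the one used by cephci
    first_nsid: int,
    count: int,
    pool: str = "rbd",
    port: int = 5500,
) -> Iterator[List[str]]:
    """Yield one `namespace add` argv per namespace; nothing is executed."""
    for nsid in range(first_nsid, first_nsid + count):
        yield [
            "podman", "run", "--rm", CLI_IMAGE,
            "--server-address", gateway,
            "--server-port", str(port),
            "namespace", "add",
            "--rbd-image", f"{image_prefix}-image{nsid}",
            "--nsid", str(nsid),
            "--rbd-pool", pool,
            "--subsystem", subsystem,
        ]
```

For step 3, a call such as `namespace_add_commands("10.8.128.223", "nqn.2016-06.io.spdk:cnode1", "demo", 201, 100)` would yield the 100 additional commands, assuming NSIDs 1 to 200 are already in use on that subsystem.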

Actual results: subsystem list shows different namespace counts for the same subsystem on the two gateways.


Expected results: Both gateways should report the same, up-to-date state.
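
A check for this expectation could look like the sketch below, which lists the namespaces of one subsystem on each gateway and reports the NSIDs a gateway is missing. It assumes the CLI's `--format json` output has the shape shown in the comment 6 log (a top-level `namespaces` list whose entries carry `nsid`) and that `namespace list` without `--nsid` returns every namespace; neither is confirmed here for the 1.2.0-1 image. The function names are hypothetical.

```python
import json
import subprocess
from typing import Dict, List, Set

# CLI image taken from the log in comment 6; adjust to the build under test.
CLI_IMAGE = "cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.4-1"


def list_namespaces_json(gateway: str, subsystem: str, port: int = 5500) -> dict:
    """Run `namespace list` against one gateway and return the parsed JSON."""
    argv = [
        "podman", "run", "--quiet", "--rm", CLI_IMAGE,
        "--format", "json",
        "--server-address", gateway,
        "--server-port", str(port),
        "namespace", "list",
        "--subsystem", subsystem,
    ]
    result = subprocess.run(argv, check=True, capture_output=True, text=True)
    return json.loads(result.stdout)


def nsids(payload: dict) -> Set[int]:
    """Extract the set of NSIDs from one `namespace list` JSON payload."""
    return {int(ns["nsid"]) for ns in payload.get("namespaces", [])}


def diff_gateways(payloads: Dict[str, dict]) -> Dict[str, List[int]]:
    """Map each gateway to the NSIDs it lacks relative to the union of all gateways.

    Gateways that are missing nothing are left out, so an empty dict
    means every gateway reports the same namespace set.
    """
    per_gateway = {gw: nsids(payload) for gw, payload in payloads.items()}
    union: Set[int] = set().union(*per_gateway.values())
    return {
        gw: sorted(union - have)
        for gw, have in per_gateway.items()
        if union - have
    }
```

With the state from this report, feeding the payloads for 10.8.128.223 and 10.8.128.224 on cnode1 into `diff_gateways` would be expected to flag the first gateway as missing the NSIDs that only the second one lists.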


Additional info:

Comment 1 Aviv Caro 2024-04-20 15:57:28 UTC
Rahul, is it still happening on the latest downstream build?

Comment 2 Aviv Caro 2024-04-22 13:25:31 UTC
Fixed in Ceph 7.1 Build (IBM-CEPH-7.1-202404190257.ci.0).

Comment 6 Rahul Lepakshi 2024-04-29 06:07:38 UTC
Closing this BZ as the issue was not seen with the latest builds.
Pass logs:
http://magna002.ceph.redhat.com/cephci-jenkins/test-runs/openstack/IBM/7.1/rhel-9/Regression/18.2.1-149/nvmeotcp/105/tier-3_2-nvmeof-gw_8-sub_ns/
http://magna002.ceph.redhat.com/cephci-jenkins/test-runs/openstack/IBM/7.1/rhel-9/Regression/18.2.1-149/nvmeotcp/105/tier-3_2-nvmeof-gw_2-sub_ns/

2024-04-25 13:46:22,406 (cephci.test_ceph_nvmeof_gateway_sub_scale) [INFO] - cephci.IBM.7.1.rhel-9.Regression.18.2.1-149.nvmeotcp.105.cephci.ceph.nvmegw_cli.execute.py:16 - NVMe CLI command : namespace add
2024-04-25 13:46:22,407 (cephci.test_ceph_nvmeof_gateway_sub_scale) [INFO] - cephci.IBM.7.1.rhel-9.Regression.18.2.1-149.nvmeotcp.105.cephci.ceph.ceph.py:1568 - Running command podman run --quiet --rm cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.4-1  --server-address 10.0.195.98 --server-port 5500 namespace add  --rbd-image L5N6-image200 --nsid 200 --rbd-pool rbd --subsystem nqn.2016-06.io.spdk:cnode2 on 10.0.195.98 timeout 600
2024-04-25 13:46:23,540 (cephci.test_ceph_nvmeof_gateway_sub_scale) [INFO] - cephci.IBM.7.1.rhel-9.Regression.18.2.1-149.nvmeotcp.105.cephci.ceph.ceph.py:1602 - Command completed successfully
2024-04-25 13:46:23,548 (cephci.test_ceph_nvmeof_gateway_sub_scale) [INFO] - cephci.IBM.7.1.rhel-9.Regression.18.2.1-149.nvmeotcp.105.cephci.ceph.nvmegw_cli.execute.py:36 - ('', 'Adding namespace 200 to nqn.2016-06.io.spdk:cnode2, load balancing group 0: Successful\n')
2024-04-25 13:46:23,549 (cephci.test_ceph_nvmeof_gateway_sub_scale) [INFO] - cephci.IBM.7.1.rhel-9.Regression.18.2.1-149.nvmeotcp.105.cephci.ceph.nvmegw_cli.execute.py:16 - NVMe CLI command : namespace list
2024-04-25 13:46:23,550 (cephci.test_ceph_nvmeof_gateway_sub_scale) [INFO] - cephci.IBM.7.1.rhel-9.Regression.18.2.1-149.nvmeotcp.105.cephci.ceph.ceph.py:1568 - Running command podman run --quiet --rm cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.4-1  --format json --server-address 10.0.195.98 --server-port 5500 namespace list  --nsid 200 --subsystem nqn.2016-06.io.spdk:cnode2 on 10.0.195.98 timeout 600
2024-04-25 13:46:24,895 (cephci.test_ceph_nvmeof_gateway_sub_scale) [INFO] - cephci.IBM.7.1.rhel-9.Regression.18.2.1-149.nvmeotcp.105.cephci.ceph.ceph.py:1602 - Command completed successfully
2024-04-25 13:46:24,896 (cephci.test_ceph_nvmeof_gateway_sub_scale) [INFO] - cephci.IBM.7.1.rhel-9.Regression.18.2.1-149.nvmeotcp.105.cephci.ceph.nvmegw_cli.execute.py:36 - ('', '{\n    "error_message": "Success",\n    "subsystem_nqn": "nqn.2016-06.io.spdk:cnode2",\n    "namespaces": [\n        {\n            "nsid": 200,\n            "bdev_name": "bdev_84e30207-7a60-4657-b126-b2a59d036b76",\n            "rbd_image_name": "L5N6-image200",\n            "rbd_pool_name": "rbd",\n            "load_balancing_group": 1,\n            "block_size": 512,\n            "rbd_image_size": "1099511627776",\n            "uuid": "84e30207-7a60-4657-b126-b2a59d036b76",\n            "rw_ios_per_second": "0",\n            "rw_mbytes_per_second": "0",\n            "r_mbytes_per_second": "0",\n            "w_mbytes_per_second": "0"\n        }\n    ],\n    "status": 0\n}\n')

Comment 7 errata-xmlrpc 2024-06-13 14:31:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Critical: Red Hat Ceph Storage 7.1 security, enhancements, and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:3925