Bug 2246306 - Unable to delete block devices from GW
Summary: Unable to delete block devices from GW
Keywords:
Status: CLOSED COMPLETED
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: NVMeOF
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 7.0z3
Assignee: Aviv Caro
QA Contact: Sunil Kumar Nagaraju
Docs Contact: ceph-doc-bot
URL:
Whiteboard:
Depends On:
Blocks: 2237662
Reported: 2023-10-26 09:26 UTC by Sunil Kumar Nagaraju
Modified: 2024-06-13 12:59 UTC
CC List: 9 users

Fixed In Version: ceph-nvmeof-container-0.0.5-1
Doc Type: Known Issue
Doc Text:
.When using Ceph NVMe-oF gateway, `bdevs` are not deleted during service removal
In the Ceph NVMe-oF gateway, the `podman run -it cp.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:latest --server-address GATEWAY_IP --server-port 5500 delete_bdev` command fails to delete block devices. As a workaround, skip this step during the NVMe-oF service removal.
Clone Of:
Environment:
Last Closed: 2024-06-13 12:59:31 UTC
Embargoed:


Attachments: Journal logs of NVMe GW (attachment 1995535)


Links:
Red Hat Issue Tracker RHCEPH-7802 (last updated 2023-10-26 09:31:40 UTC)

Description Sunil Kumar Nagaraju 2023-10-26 09:26:07 UTC
Created attachment 1995535 [details]
Journal logs of NVMe GW


Description of problem:

Block devices are not being removed from the GW with the nvmeof-cli command; the error below is returned when the following command is executed.

>>>[root@ceph-1sunilkumar-0qegc7-node6 ~]# podman run registry-proxy.engineering.redhat.com/rh-osbs/ceph-nvmeof-cli:0.0.4-1 --server-address 10.0.207.37 --server-port 5500 delete_bdev -b bdev_bdev1
>>>usage: python3 -m control.cli [-h] [--server-address SERVER_ADDRESS]
>>>                              [--server-port SERVER_PORT]
>>>                              [--client-key CLIENT_KEY]
>>>                              [--client-cert CLIENT_CERT]
>>>                              [--server-cert SERVER_CERT]
                              {create_bdev,delete_bdev,create_subsystem,delete_subsystem,add_namespace,remove_namespace,add_host,remove_host,create_listener,delete_listener,get_subsystems}
                              ...
python3 -m control.cli: error: delete_bdev failed: code=StatusCode.UNKNOWN message=Exception calling application: 'namespaces'


Journalctl logs:
--------------------
>>>Oct 26 04:50:31 ceph-1sunilkumar-0qegc7-node6 ceph-a6b38f84-73d8-11ee-9371-fa163e50870a-nvmeof-rbd-ceph-1sunilkumar-0qegc7-node6-avgbih[9529]: INFO:control.grpc:Received request to delete bdev bdev1
>>>Oct 26 04:50:31 ceph-1sunilkumar-0qegc7-node6 ceph-a6b38f84-73d8-11ee-9371-fa163e50870a-nvmeof-rbd-ceph-1sunilkumar-0qegc7-node6-avgbih[9529]: INFO:control.grpc:Received request to get subsystems
>>>Oct 26 04:50:31 ceph-1sunilkumar-0qegc7-node6 ceph-a6b38f84-73d8-11ee-9371-fa163e50870a-nvmeof-rbd-ceph-1sunilkumar-0qegc7-node6-avgbih[9529]: INFO:control.grpc:get_subsystems: [{'nqn': 'nqn.2014-08.org.nvmexpress.discovery', 'subtype': 'Discovery', 'listen_addresses': [], 'allow_any_host': True, 'hosts': []}]
>>>Oct 26 04:50:31 ceph-1sunilkumar-0qegc7-node6 ceph-a6b38f84-73d8-11ee-9371-fa163e50870a-nvmeof-rbd-ceph-1sunilkumar-0qegc7-node6-avgbih[9529]: ERROR:grpc._server:Exception calling application: 'namespaces'
>>>Oct 26 04:50:31 ceph-1sunilkumar-0qegc7-node6 ceph-a6b38f84-73d8-11ee-9371-fa163e50870a-nvmeof-rbd-ceph-1sunilkumar-0qegc7-node6-avgbih[9529]: Traceback (most recent call last):
>>>Oct 26 04:50:31 ceph-1sunilkumar-0qegc7-node6 ceph-a6b38f84-73d8-11ee-9371-fa163e50870a-nvmeof-rbd-ceph-1sunilkumar-0qegc7-node6-avgbih[9529]:   File "/usr/local/lib64/python3.9/site-packages/grpc/_server.py", line 494, in _call_behavior
>>>Oct 26 04:50:31 ceph-1sunilkumar-0qegc7-node6 ceph-a6b38f84-73d8-11ee-9371-fa163e50870a-nvmeof-rbd-ceph-1sunilkumar-0qegc7-node6-avgbih[9529]:     response_or_iterator = behavior(argument, context)
>>>Oct 26 04:50:31 ceph-1sunilkumar-0qegc7-node6 ceph-a6b38f84-73d8-11ee-9371-fa163e50870a-nvmeof-rbd-ceph-1sunilkumar-0qegc7-node6-avgbih[9529]:   File "/remote-source/ceph-nvmeof/app/control/grpc.py", line 139, in delete_bdev
>>>Oct 26 04:50:31 ceph-1sunilkumar-0qegc7-node6 ceph-a6b38f84-73d8-11ee-9371-fa163e50870a-nvmeof-rbd-ceph-1sunilkumar-0qegc7-node6-avgbih[9529]:     for namespace in subsystem['namespaces']:
>>>Oct 26 04:50:31 ceph-1sunilkumar-0qegc7-node6 ceph-a6b38f84-73d8-11ee-9371-fa163e50870a-nvmeof-rbd-ceph-1sunilkumar-0qegc7-node6-avgbih[9529]: KeyError: 'namespaces'
>>>


Version-Release number of selected component (if applicable):
# ceph version 
ceph version 18.2.0-100.el9cp (387454a835bec56b46114d4beea978d75ed354eb) reef (stable)

# ceph config dump 
WHO     MASK  LEVEL     OPTION                                 VALUE                                                                                                                         RO                                                            * 
mgr           advanced  mgr/cephadm/container_image_nvmeof     registry-proxy.engineering.redhat.com/rh-osbs/ceph-nvmeof:0.0.4-1  



How reproducible: always


Steps to Reproduce:
1. Deploy the latest 7.0 cluster.
2. Deploy the NVMe GW and add all entities: subsystem, listener, host, bdevs, and namespaces.
3. Delete the GW entities (listener, host, namespace, subsystem) and then the block device; the issue described above appears.

Actual results:
Deletion of the block device fails.


Expected results:
Deletion of the block device should succeed once it is not part of any subsystem.
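
Illustration only (not gateway or test-suite code): a minimal Python sketch of the expected check, i.e. that a bdev may be deleted once no subsystem namespace references it. The bdev_name field inside a namespace entry is an assumption based on SPDK's nvmf subsystem listing and does not appear in the logs attached here.

import json

def bdev_in_use(subsystems_json: str, bdev_name: str) -> bool:
    """Return True if any subsystem still exposes a namespace backed by bdev_name."""
    for subsystem in json.loads(subsystems_json):
        # A discovery subsystem may carry no "namespaces" key at all, hence .get().
        for namespace in subsystem.get("namespaces", []):
            if namespace.get("bdev_name") == bdev_name:
                return True
    return False

# With a discovery-only get_subsystems output (as captured later in this bug),
# nothing references bdev1, so deleting it should be allowed to succeed.
subsystems = """[
    {"nqn": "nqn.2014-08.org.nvmexpress.discovery", "subtype": "Discovery",
     "listen_addresses": [], "allow_any_host": true, "hosts": []}
]"""
print(bdev_in_use(subsystems, "bdev1"))   # False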


Additional info:
http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-0QEGC7/Manage_nvmeof_gateway_entities_0.log

Attaching journal logs for reference.

Comment 1 Sunil Kumar Nagaraju 2023-10-26 09:37:43 UTC
Marking this as a regression, as it worked in 18.2.0-70: http://magna002.ceph.redhat.com/cephci-jenkins/test-runs/18.2.0-70/Sanity/69/tier-0_nvmeof_sanity/Manage_nvmeof_gateway_entities_0.log


>>>
>>>[root@ceph-1sunilkumar-0qegc7-node6 ~]# podman run registry-proxy.engineering.redhat.com/rh-osbs/ceph-nvmeof-cli:0.0.4-1 --server-address 10.0.207.37 --server-port 5500 get_subsystems
>>>INFO:__main__:Get subsystems:
>>>[
>>>    {
>>>        "nqn": "nqn.2014-08.org.nvmexpress.discovery",
>>>        "subtype": "Discovery",
>>>        "listen_addresses": [],
>>>        "allow_any_host": true,
>>>        "hosts": []
>>>    }
>>>]
>>>


>>>[ceph: root@ceph-1sunilkumar-0qegc7-node1-installer /]# rados -p rbd listomapvals nvmeof.None.state
>>>bdev_bdev1
>>>value (153 bytes) :
>>>00000000  7b 0a 20 20 22 62 64 65  76 5f 6e 61 6d 65 22 3a  |{.  "bdev_name":|
>>>00000010  20 22 62 64 65 76 31 22  2c 0a 20 20 22 72 62 64  | "bdev1",.  "rbd|
>>>00000020  5f 70 6f 6f 6c 5f 6e 61  6d 65 22 3a 20 22 72 62  |_pool_name": "rb|
>>>00000030  64 22 2c 0a 20 20 22 72  62 64 5f 69 6d 61 67 65  |d",.  "rbd_image|
>>>00000040  5f 6e 61 6d 65 22 3a 20  22 69 6d 61 67 65 31 22  |_name": "image1"|
>>>00000050  2c 0a 20 20 22 62 6c 6f  63 6b 5f 73 69 7a 65 22  |,.  "block_size"|
>>>00000060  3a 20 35 31 32 2c 0a 20  20 22 75 75 69 64 22 3a  |: 512,.  "uuid":|
>>>00000070  20 22 30 62 62 65 61 65  37 39 2d 62 30 63 36 2d  | "0bbeae79-b0c6-|
>>>00000080  34 63 35 39 2d 38 39 66  62 2d 31 66 33 32 32 37  |4c59-89fb-1f3227|
>>>00000090  61 65 33 61 65 39 22 0a  7d                       |ae3ae9".}|
>>>00000099
>>>
>>>omap_version
>>>value (2 bytes) :
>>>00000000  31 30                                             |10|
>>>00000002
>>>
>>>

Comment 3 Gil Bregman 2023-10-30 17:18:52 UTC
This seems like a problem with some old code that is no longer there; it was rewritten in PR 270.

Comment 4 Gil Bregman 2023-10-30 21:35:11 UTC
Looking at the log, I see that get_subsystems returned:

[{'nqn': 'nqn.2014-08.org.nvmexpress.discovery', 'subtype': 'Discovery', 'listen_addresses': [], 'allow_any_host': True, 'hosts': []}]

There is no "namespaces" section, which caused a KeyError exception in the code when it tried to iterate through the namespaces. The only subsystem present here is a discovery subsystem, which we no longer show. I'll try to go back to commit 5b936c613571209c5d28b920eaccb82abff6ac7c, the one before we removed the discovery subsystem from the get_subsystems output, and see if I can reproduce the issue.
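
To make the failure mode concrete, here is a minimal sketch (not the actual gateway code, which was rewritten in PR 270) built from the subsystem entry logged above and the traceback in the description: delete_bdev iterated over subsystem['namespaces'] (grpc.py line 139), but the discovery entry carries no 'namespaces' key, so the loop raises KeyError.

# Subsystem entry exactly as returned by get_subsystems in the log above:
subsystem = {
    "nqn": "nqn.2014-08.org.nvmexpress.discovery",
    "subtype": "Discovery",
    "listen_addresses": [],
    "allow_any_host": True,
    "hosts": [],
}

# Failing pattern from the traceback: direct indexing raises KeyError
# because the discovery entry has no "namespaces" key.
try:
    for namespace in subsystem["namespaces"]:
        pass
except KeyError as err:
    print(f"KeyError: {err}")   # -> KeyError: 'namespaces'

# Illustrative guard (not the upstream fix): treat a missing "namespaces"
# key as an empty list, so the loop simply skips the discovery entry.
for namespace in subsystem.get("namespaces", []):
    print(namespace)

The guarded loop at the end only illustrates why the old code failed on discovery-only output; the real fix is the rewrite referenced in comment 3.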

Comment 5 Gil Bregman 2023-10-31 10:06:28 UTC
I made sure that the current code no longer has this issue, with "enable_spdk_discovery_controller" set to either True or False.

Comment 6 Aviv Caro 2023-10-31 13:44:31 UTC
Sunil, this is fixed in 0.0.5.

Comment 7 Aviv Caro 2023-11-01 11:08:26 UTC
Fixed in 0.0.5. Please verify.

Comment 13 Sunil Kumar Nagaraju 2023-11-07 08:53:42 UTC
The issue still exists with the newer versions of the ceph and nvmeof images.

2023-11-07 13:20:12,015 (cephci.test_nvme_cli) [INFO] - cephci.ceph.ceph.py:1591 - Command completed successfully
2023-11-07 13:20:12,016 (cephci.test_nvme_cli) [DEBUG] - cephci.ceph.nvmeof.nvmeof_gwcli.py:54 - ('', 'INFO:__main__:Deleted subsystem nqn.2016-06.io.spdk:cnode1: True\n')
2023-11-07 13:20:12,018 (cephci.test_nvme_cli) [INFO] - cephci.ceph.ceph.py:1557 - Running command podman run registry-proxy.engineering.redhat.com/rh-osbs/ceph-nvmeof-cli:0.0.5-1 --server-address 10.0.211.237 --server-port 5500 delete_bdev  --bdev bdev1 on 10.0.211.237 timeout 600
2023-11-07 13:20:13,705 (cephci.test_nvme_cli) [ERROR] - cephci.ceph.ceph.py:1593 - Error 2 during cmd, timeout 600
2023-11-07 13:20:13,707 (cephci.test_nvme_cli) [ERROR] - cephci.ceph.ceph.py:1594 - usage: python3 -m control.cli [-h] [--server-address SERVER_ADDRESS]
                              [--server-port SERVER_PORT]
                              [--client-key CLIENT_KEY]
                              [--client-cert CLIENT_CERT]
                              [--server-cert SERVER_CERT]
                              {create_bdev,delete_bdev,create_subsystem,delete_subsystem,add_namespace,remove_namespace,add_host,remove_host,create_listener,delete_listener,get_subsystems}
                              ...
python3 -m control.cli: error: delete_bdev failed: code=StatusCode.UNKNOWN message=Exception calling application: 'namespaces'

2023-11-07 13:20:13,709 (cephci.test_nvme_cli) [ERROR] - cephci.tests.nvmeof.test_nvme_cli.py:100 - podman run registry-proxy.engineering.redhat.com/rh-osbs/ceph-nvmeof-cli:0.0.5-1 --server-address 10.0.211.237 --server-port 5500 delete_bdev  --bdev bdev1 Error:  usage: python3 -m control.cli [-h] [--server-address SERVER_ADDRESS]
                              [--server-port SERVER_PORT]
                              [--client-key CLIENT_KEY]
                              [--client-cert CLIENT_CERT]
                              [--server-cert SERVER_CERT]
                              {create_bdev,delete_bdev,create_subsystem,delete_subsystem,add_namespace,remove_namespace,add_host,remove_host,create_listener,delete_listener,get_subsystems}
                              ...
python3 -m control.cli: error: delete_bdev failed: code=StatusCode.UNKNOWN message=Exception calling application: 'namespaces'
 10.0.211.237
Traceback (most recent call last):
  File "/home/sunilkumar/workspace/cephci/tests/nvmeof/test_nvme_cli.py", line 98, in run
    func(**cfg["args"])
  File "/home/sunilkumar/workspace/cephci/ceph/nvmeof/nvmeof_gwcli.py", line 71, in delete_block_device
    return self.run_control_cli("delete_bdev", **args)
  File "/home/sunilkumar/workspace/cephci/ceph/nvmeof/nvmeof_gwcli.py", line 50, in run_control_cli
    out = self.node.exec_command(
  File "/home/sunilkumar/workspace/cephci/ceph/ceph.py", line 1595, in exec_command
    raise CommandFailed(
ceph.ceph.CommandFailed: podman run registry-proxy.engineering.redhat.com/rh-osbs/ceph-nvmeof-cli:0.0.5-1 --server-address 10.0.211.237 --server-port 5500 delete_bdev  --bdev bdev1 Error:  usage: python3 -m control.cli [-h] [--server-address SERVER_ADDRESS]
                              [--server-port SERVER_PORT]
                              [--client-key CLIENT_KEY]
                              [--client-cert CLIENT_CERT]
                              [--server-cert SERVER_CERT]
                              {create_bdev,delete_bdev,create_subsystem,delete_subsystem,add_namespace,remove_namespace,add_host,remove_host,create_listener,delete_listener,get_subsystems}
                              ...
python3 -m control.cli: error: delete_bdev failed: code=StatusCode.UNKNOWN message=Exception calling application: 'namespaces'



[ceph: root@ceph-1sunilkumar-4q4o0k-node1-installer /]# ceph version 
ceph version 18.2.0-117.el9cp (7e71aaeb77dd63a7bf8cc3f39dd69b7d151298b0) reef (stable)


[ceph: root@ceph-1sunilkumar-4q4o0k-node1-installer /]# ceph config dump | grep nvme
mgr           advanced  mgr/cephadm/container_image_nvmeof     registry-proxy.engineering.redhat.com/rh-osbs/ceph-nvmeof:0.0.5-1

Comment 14 Gil Bregman 2023-11-08 15:49:47 UTC
(In reply to Sunil Kumar Nagaraju from comment #13)
This code was removed; you are still using an old version. Can you send us the contents of the log you see when you start the system? We should see the exact version there.

Comment 15 Gil Bregman 2023-11-08 19:15:34 UTC
@sunnagar, looking at the history, I see that the version was changed to 0.0.5 on 18-Oct, but the change for PR #270, which should fix this issue, was made on 19-Oct. So it's not enough to use version 0.0.5; it has to be a build that includes the PR #270 fix. As I said above, the log file should show the exact version of the files used: not only the 0.0.5 version, but also the exact changes included in that code.

Comment 16 Sunil Kumar Nagaraju 2023-11-13 08:38:57 UTC
Hi Gil,

As we discussed, let the BZ stay in the ASSIGNED state until the PR is merged and verifiable in downstream builds.

-Thanks
Sunil

Comment 17 Scott Ostapovicz 2023-11-22 05:15:56 UTC
It looks like we may have missed the window for merging this PR and still getting it into 7.0. Is this even a blocker for 7.0?

Comment 18 Aviv Caro 2023-11-22 08:06:32 UTC
It should not be a blocker because this is a TP (Tech Preview). We should fix it in 7.0z1, but we should include it in the 7.0 release notes. Who is taking care of that?

Comment 22 Scott Ostapovicz 2024-04-10 12:59:59 UTC
And again, moving to the next z-stream, 7.0z3.

