Bug 2351038

Summary: [NFS-Ganesha] Ceph cluster enters "HEALTH_ERR" state while attempting to delete the NFS-Ganesha cluster, and the NFS service remains stuck in the "deleting" state.
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Manisha Saini <msaini>
Component: Cephadm
Assignee: Adam King <adking>
Status: CLOSED ERRATA
QA Contact: Manisha Saini <msaini>
Severity: high
Docs Contact: Rivka Pollack <rpollack>
Priority: unspecified
Version: 8.0
CC: adking, cephqe-warriors, rpollack, tserlin
Target Release: 8.0z3
Hardware: Unspecified
OS: Unspecified
Fixed In Version: ceph-19.2.0-113.el9cp
Doc Type: Bug Fix
Doc Text:
.The Cephadm module no longer fails during the removal of an NFS service
Previously, in some cases, Cephadm did not recognize that a deleted NFS grace file had already been removed. As a result, the NFS service remained in a `deleting` state, causing the Cephadm module to crash. With this fix, NFS services recognize deleted grace files and no longer get stuck in the `deleting` state. As a result, the Cephadm module no longer crashes when removing an NFS service.
Last Closed: 2025-04-07 15:27:13 UTC
Type: Bug
Bug Depends On: 2331703

Description Manisha Saini 2025-03-10 07:06:04 UTC
Description of problem:
=======================

Before deleting the NFS cluster
--------------

[ceph: root@argo016 /]# ceph -s
  cluster:
    id:     4df85576-edeb-11ef-b573-ac1f6b0a1844
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum argo016,argo019,argo021,argo020,argo018 (age 79m)
    mgr: argo016.omdkzd(active, since 8h), standbys: argo019.jjivxt
    mds: 1/1 daemons up, 1 standby
    osd: 14 osds: 14 up (since 78m), 14 in (since 2w)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   9 pools, 721 pgs
    objects: 117.06k objects, 4.1 GiB
    usage:   25 GiB used, 21 TiB / 21 TiB avail
    pgs:     721 active+clean

  io:
    client:   85 B/s rd, 401 KiB/s wr, 0 op/s rd, 113 op/s wr


Delete the NFS Ganesha cluster
--------------

[ceph: root@argo016 /]# ceph nfs cluster ls
[
  "nfsganesha"
]
[ceph: root@argo016 /]# ceph nfs cluster delete nfsganesha
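
Per the failure later in this report, the teardown includes an "rm grace" call against the "nfsganesha" namespace of the ".nfs" pool. To watch the state of that namespace around the deletion, the rados CLI and the ganesha-rados-grace tool (from the nfs-ganesha-rados-grace package) can be pointed at it directly; the pool and namespace names below are taken from the error message:

# rados -p .nfs --namespace nfsganesha ls
# ganesha-rados-grace --pool .nfs --ns nfsganesha dump

The first command lists the objects Ganesha keeps in the cluster's namespace (including the "grace" object); the second dumps the grace database's epochs and node membership.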


Check whether the NFS service is deleted --> it is stuck in the "deleting" state
----------------

[ceph: root@argo016 /]# ceph orch ls
NAME                       PORTS                   RUNNING  REFRESHED   AGE  PLACEMENT
alertmanager               ?:9093,9094                 1/1  18m ago     2w   count:1
ceph-exporter                                          5/5  18m ago     2w   *
crash                                                  5/5  18m ago     2w   *
grafana                    ?:3000                      1/1  18m ago     2w   count:1
ingress.nfs.nfsganesha     10.8.128.200:2049,9049      0/2  <deleting>  8h   argo018;argo019;count:1
mds.cephfs                                             2/2  18m ago     8h   count:2
mgr                                                    2/2  18m ago     2w   count:2
mon                                                    5/5  18m ago     2w   count:5
nfs.nfsganesha             ?:12049                     0/1  <deleting>  8h   argo018;argo019;count:1
node-exporter              ?:9100                      5/5  18m ago     2w   *
osd.all-available-devices                               14  18m ago     2w   *
prometheus                 ?:9095                      1/1  18m ago     2w   count:1
rgw.rgw.1                  ?:80                        1/1  17m ago     2w   label:rgw
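
To see what cephadm thinks about the stuck services, the specs and daemons can be inspected directly. A hedged sketch using standard orchestrator commands:

# ceph orch ls nfs --format yaml
# ceph orch ls ingress --format yaml
# ceph orch ps --daemon-type nfs
# ceph log last 50 info cephadm

The YAML output carries the service spec and status for each service cephadm is trying to remove, and the cephadm log channel records the removal attempts.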


Check ceph status
-----------

[ceph: root@argo016 /]# ceph -s
  cluster:
    id:     4df85576-edeb-11ef-b573-ac1f6b0a1844
    health: HEALTH_ERR
            Module 'cephadm' has failed: Command '['rados', '-n', 'mgr.argo016.omdkzd', '-k', '/var/lib/ceph/mgr/ceph-argo016.omdkzd/keyring', '-p', '.nfs', '--namespace', 'nfsganesha', 'rm', 'grace']' timed out after 10 seconds
            35 slow ops, oldest one blocked for 155 sec, daemons [osd.7,osd.9,mon.argo016] have slow ops.

  services:
    mon: 5 daemons, quorum argo016,argo019,argo021,argo020,argo018 (age 85m)
    mgr: argo016.omdkzd(active, since 8h), standbys: argo019.jjivxt
    mds: 1/1 daemons up, 1 standby
    osd: 14 osds: 14 up (since 85m), 14 in (since 2w)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   9 pools, 721 pgs
    objects: 522 objects, 208 MiB
    usage:   12 GiB used, 21 TiB / 21 TiB avail
    pgs:     719 active+clean
             2   active+clean+laggy
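
The SLOW_OPS warning and laggy PGs are consistent with the 10-second timeout above: the mgr's "rm grace" write against the .nfs pool can itself queue behind ops blocked on the same OSDs. A hedged way to inspect what the named daemons are waiting on, run on the host carrying each daemon since it uses the admin socket:

# ceph daemon osd.7 dump_blocked_ops
# ceph daemon osd.9 dump_ops_in_flight
# ceph daemon mon.argo016 ops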

Check ceph -s status again after a few minutes
--------------------

# ceph -s
  cluster:
    id:     4df85576-edeb-11ef-b573-ac1f6b0a1844
    health: HEALTH_ERR
            Module 'cephadm' has failed: Command '['rados', '-n', 'mgr.argo016.omdkzd', '-k', '/var/lib/ceph/mgr/ceph-argo016.omdkzd/keyring', '-p', '.nfs', '--namespace', 'nfsganesha', 'rm', 'grace']' timed out after 10 seconds

  services:
    mon: 5 daemons, quorum argo016,argo019,argo021,argo020,argo018 (age 99m)
    mgr: argo016.omdkzd(active, since 9h), standbys: argo019.jjivxt
    mds: 1/1 daemons up, 1 standby
    osd: 14 osds: 14 up (since 99m), 14 in (since 2w)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   9 pools, 721 pgs
    objects: 320 objects, 208 MiB
    usage:   12 GiB used, 21 TiB / 21 TiB avail
    pgs:     721 active+clean
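
Note that the slow ops have cleared here, but MGR_MODULE_ERROR persists: a failed mgr module stays failed until the mgr process restarts. A common workaround, which only restarts the module and does not fix the underlying bug, is to fail over to the standby mgr:

# ceph mgr fail argo016.omdkzd

Once the standby (argo019.jjivxt here) takes over, it reloads cephadm, which retries removing the services still marked <deleting>.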


Check health report
---------
[ceph: root@argo016 /]# ceph health detail
HEALTH_ERR 1 MDSs report slow metadata IOs; Module 'cephadm' has failed: Command '['rados', '-n', 'mgr.argo016.omdkzd', '-k', '/var/lib/ceph/mgr/ceph-argo016.omdkzd/keyring', '-p', '.nfs', '--namespace', 'nfsganesha', 'rm', 'grace']' timed out after 10 seconds; 37 slow ops, oldest one blocked for 601 sec, daemons [osd.13,osd.7,osd.9,mon.argo016] have slow ops.
[WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
    mds.cephfs.argo020.zdunkj(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 324 secs
[ERR] MGR_MODULE_ERROR: Module 'cephadm' has failed: Command '['rados', '-n', 'mgr.argo016.omdkzd', '-k', '/var/lib/ceph/mgr/ceph-argo016.omdkzd/keyring', '-p', '.nfs', '--namespace', 'nfsganesha', 'rm', 'grace']' timed out after 10 seconds
    Module 'cephadm' has failed: Command '['rados', '-n', 'mgr.argo016.omdkzd', '-k', '/var/lib/ceph/mgr/ceph-argo016.omdkzd/keyring', '-p', '.nfs', '--namespace', 'nfsganesha', 'rm', 'grace']' timed out after 10 seconds
[WRN] SLOW_OPS: 37 slow ops, oldest one blocked for 601 sec, daemons [osd.13,osd.7,osd.9,mon.argo016] have slow ops.
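
Per the Doc Text above, the underlying defect is that cephadm did not treat an already-removed grace file as success, so the "rm grace" step could block and crash the module instead of completing as a no-op. A minimal sketch of the intended idempotent behavior, expressed with the rados CLI for illustration only (the actual fix lives in cephadm's Python code):

# Treat an already-deleted grace object as success: only issue the rm
# when the object still exists.
if rados -p .nfs --namespace nfsganesha stat grace >/dev/null 2>&1; then
    rados -p .nfs --namespace nfsganesha rm grace
fi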


Version-Release number of selected component (if applicable):
=================

[ceph: root@argo016 /]# ceph --version
ceph version 19.2.0-103.el9cp (4327a2ffce2321e11239a4b0b6f6f96221864b61) squid (stable)

[ceph: root@argo016 /]# rpm -qa | grep nfs
libnfsidmap-2.5.4-27.el9.x86_64
nfs-utils-2.5.4-27.el9.x86_64
nfs-ganesha-selinux-6.5-4.el9cp.noarch
nfs-ganesha-6.5-4.el9cp.x86_64
nfs-ganesha-ceph-6.5-4.el9cp.x86_64
nfs-ganesha-rados-grace-6.5-4.el9cp.x86_64
nfs-ganesha-rados-urls-6.5-4.el9cp.x86_64
nfs-ganesha-rgw-6.5-4.el9cp.x86_64
nfs-ganesha-utils-6.5-4.el9cp.x86_64


How reproducible:
============
1/1


Steps to Reproduce:
==============
1. Create an NFS Ganesha cluster with HA (a hedged setup sketch follows the output below)
2. Create NFS exports and mount them on a client
3. Trigger I/O on the mount point
4. Perform failover and failback operations
5. Delete the Ganesha cluster
# ceph nfs cluster delete nfsganesha
6. Check whether the process and service are deleted

# ceph orch ps | grep nfs

# ceph orch ls | grep nfs
ingress.nfs.nfsganesha     10.8.128.200:2049,9049      0/2  <deleting>  8h   argo018;argo019;count:1
nfs.nfsganesha             ?:12049                     0/1  <deleting>  8h   argo018;argo019;count:1
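
For step 1, a hedged sketch of the HA cluster and export setup: the placement hosts, virtual IP, and fs name mirror this report, while the CIDR suffix, pseudo-path, and client mount point are illustrative assumptions.

# ceph nfs cluster create nfsganesha "argo018,argo019" --ingress --virtual_ip 10.8.128.200/23
# ceph nfs export create cephfs --cluster-id nfsganesha --pseudo-path /export1 --fsname cephfs
# mount -t nfs4 -o port=2049 10.8.128.200:/export1 /mnt/nfs    (run on the client)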

Actual results:
=============
The cluster goes into the "HEALTH_ERR" state

    health: HEALTH_ERR
            1 MDSs report slow metadata IOs
            Module 'cephadm' has failed: Command '['rados', '-n', 'mgr.argo016.omdkzd', '-k', '/var/lib/ceph/mgr/ceph-argo016.omdkzd/keyring', '-p', '.nfs', '--namespace', 'nfsganesha', 'rm', 'grace']' timed out after 10 seconds
            4 slow ops, oldest one blocked for 870 sec, daemons [osd.13,osd.7,osd.9] have slow ops.


Expected results:
===============
The NFS Ganesha cluster should be deleted successfully, and "ceph orch ls" should not show NFS services stuck in the "deleting" state
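
On a build containing the fix (ceph-19.2.0-113.el9cp or later, per "Fixed In Version"), verification might look like the following; the expected outcomes are inferred from this report:

# ceph nfs cluster ls            (expected: "nfsganesha" no longer listed)
# ceph orch ls | grep nfs        (expected: no output once removal completes)
# ceph -s                        (expected: HEALTH_OK, with no MGR_MODULE_ERROR)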


Additional info:

Comment 10 errata-xmlrpc 2025-04-07 15:27:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 8.0 security, bug fix, and enhancement updates), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2025:3635