Description of problem:
=======================

Before deleting the NFS cluster
-------------------------------
[ceph: root@argo016 /]# ceph -s
  cluster:
    id:     4df85576-edeb-11ef-b573-ac1f6b0a1844
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum argo016,argo019,argo021,argo020,argo018 (age 79m)
    mgr: argo016.omdkzd(active, since 8h), standbys: argo019.jjivxt
    mds: 1/1 daemons up, 1 standby
    osd: 14 osds: 14 up (since 78m), 14 in (since 2w)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   9 pools, 721 pgs
    objects: 117.06k objects, 4.1 GiB
    usage:   25 GiB used, 21 TiB / 21 TiB avail
    pgs:     721 active+clean

  io:
    client:   85 B/s rd, 401 KiB/s wr, 0 op/s rd, 113 op/s wr

Delete the NFS Ganesha cluster
------------------------------
[ceph: root@argo016 /]# ceph nfs cluster ls
[
  "nfsganesha"
]
[ceph: root@argo016 /]# ceph nfs cluster delete nfsganesha

Check if the NFS service is deleted --> it is stuck in the "deleting" state
---------------------------------------------------------------------------
[ceph: root@argo016 /]# ceph orch ls
NAME                       PORTS                   RUNNING  REFRESHED   AGE  PLACEMENT
alertmanager               ?:9093,9094                 1/1  18m ago     2w   count:1
ceph-exporter                                          5/5  18m ago     2w   *
crash                                                  5/5  18m ago     2w   *
grafana                    ?:3000                      1/1  18m ago     2w   count:1
ingress.nfs.nfsganesha     10.8.128.200:2049,9049      0/2  <deleting>  8h   argo018;argo019;count:1
mds.cephfs                                             2/2  18m ago     8h   count:2
mgr                                                    2/2  18m ago     2w   count:2
mon                                                    5/5  18m ago     2w   count:5
nfs.nfsganesha             ?:12049                     0/1  <deleting>  8h   argo018;argo019;count:1
node-exporter              ?:9100                      5/5  18m ago     2w   *
osd.all-available-devices                               14  18m ago     2w   *
prometheus                 ?:9095                      1/1  18m ago     2w   count:1
rgw.rgw.1                  ?:80                        1/1  17m ago     2w   label:rgw
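For context, the services stuck in <deleting> can also be inspected from the cephadm side; a minimal diagnostic sketch (commands assumed to be run from the same cephadm shell, output not captured for this report):

# Recent cephadm log entries around the deletion attempt
ceph log last 200 info cephadm

# Spec and daemon view of the two services stuck in <deleting>
ceph orch ls --service_name nfs.nfsganesha --export
ceph orch ps --daemon_type nfs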
Check ceph status
-----------------
[ceph: root@argo016 /]# ceph -s
  cluster:
    id:     4df85576-edeb-11ef-b573-ac1f6b0a1844
    health: HEALTH_ERR
            Module 'cephadm' has failed: Command '['rados', '-n', 'mgr.argo016.omdkzd', '-k', '/var/lib/ceph/mgr/ceph-argo016.omdkzd/keyring', '-p', '.nfs', '--namespace', 'nfsganesha', 'rm', 'grace']' timed out after 10 seconds
            35 slow ops, oldest one blocked for 155 sec, daemons [osd.7,osd.9,mon.argo016] have slow ops.

  services:
    mon: 5 daemons, quorum argo016,argo019,argo021,argo020,argo018 (age 85m)
    mgr: argo016.omdkzd(active, since 8h), standbys: argo019.jjivxt
    mds: 1/1 daemons up, 1 standby
    osd: 14 osds: 14 up (since 85m), 14 in (since 2w)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   9 pools, 721 pgs
    objects: 522 objects, 208 MiB
    usage:   12 GiB used, 21 TiB / 21 TiB avail
    pgs:     719 active+clean
             2   active+clean+laggy

Check ceph -s status again after a few minutes
----------------------------------------------
# ceph -s
  cluster:
    id:     4df85576-edeb-11ef-b573-ac1f6b0a1844
    health: HEALTH_ERR
            Module 'cephadm' has failed: Command '['rados', '-n', 'mgr.argo016.omdkzd', '-k', '/var/lib/ceph/mgr/ceph-argo016.omdkzd/keyring', '-p', '.nfs', '--namespace', 'nfsganesha', 'rm', 'grace']' timed out after 10 seconds

  services:
    mon: 5 daemons, quorum argo016,argo019,argo021,argo020,argo018 (age 99m)
    mgr: argo016.omdkzd(active, since 9h), standbys: argo019.jjivxt
    mds: 1/1 daemons up, 1 standby
    osd: 14 osds: 14 up (since 99m), 14 in (since 2w)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   9 pools, 721 pgs
    objects: 320 objects, 208 MiB
    usage:   12 GiB used, 21 TiB / 21 TiB avail
    pgs:     721 active+clean

Check health report
-------------------
[ceph: root@argo016 /]# ceph health detail
HEALTH_ERR 1 MDSs report slow metadata IOs; Module 'cephadm' has failed: Command '['rados', '-n', 'mgr.argo016.omdkzd', '-k', '/var/lib/ceph/mgr/ceph-argo016.omdkzd/keyring', '-p', '.nfs', '--namespace', 'nfsganesha', 'rm', 'grace']' timed out after 10 seconds; 37 slow ops, oldest one blocked for 601 sec, daemons [osd.13,osd.7,osd.9,mon.argo016] have slow ops.
[WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
    mds.cephfs.argo020.zdunkj(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 324 secs
[ERR] MGR_MODULE_ERROR: Module 'cephadm' has failed: Command '['rados', '-n', 'mgr.argo016.omdkzd', '-k', '/var/lib/ceph/mgr/ceph-argo016.omdkzd/keyring', '-p', '.nfs', '--namespace', 'nfsganesha', 'rm', 'grace']' timed out after 10 seconds
    Module 'cephadm' has failed: Command '['rados', '-n', 'mgr.argo016.omdkzd', '-k', '/var/lib/ceph/mgr/ceph-argo016.omdkzd/keyring', '-p', '.nfs', '--namespace', 'nfsganesha', 'rm', 'grace']' timed out after 10 seconds
[WRN] SLOW_OPS: 37 slow ops, oldest one blocked for 601 sec, daemons [osd.13,osd.7,osd.9,mon.argo016] have slow ops.

Version-Release number of selected component (if applicable):
=============================================================
[ceph: root@argo016 /]# ceph --version
ceph version 19.2.0-103.el9cp (4327a2ffce2321e11239a4b0b6f6f96221864b61) squid (stable)

[ceph: root@argo016 /]# rpm -qa | grep nfs
libnfsidmap-2.5.4-27.el9.x86_64
nfs-utils-2.5.4-27.el9.x86_64
nfs-ganesha-selinux-6.5-4.el9cp.noarch
nfs-ganesha-6.5-4.el9cp.x86_64
nfs-ganesha-ceph-6.5-4.el9cp.x86_64
nfs-ganesha-rados-grace-6.5-4.el9cp.x86_64
nfs-ganesha-rados-urls-6.5-4.el9cp.x86_64
nfs-ganesha-rgw-6.5-4.el9cp.x86_64
nfs-ganesha-utils-6.5-4.el9cp.x86_64

How reproducible:
=================
1/1

Steps to Reproduce:
===================
1. Create an NFS Ganesha cluster with HA (see the command sketch after this list)
2. Create NFS exports and mount them on a client
3. Trigger IOs on the mount point
4. Perform failover and failback operations
5. Delete the Ganesha cluster:
   # ceph nfs cluster delete nfsganesha
6. Check whether the nfs processes and services are deleted:
   # ceph orch ps | grep nfs
   # ceph orch ls | grep nfs
   ingress.nfs.nfsganesha  10.8.128.200:2049,9049  0/2  <deleting>  8h  argo018;argo019;count:1
   nfs.nfsganesha          ?:12049                 0/1  <deleting>  8h  argo018;argo019;count:1
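For reference, a minimal sketch of the commands behind steps 1, 2, 3 and 5. The placement hosts and virtual IP are taken from the "ceph orch ls" output above; the /24 prefix, pseudo path (/export1), filesystem name (cephfs) and client mount point (/mnt/nfs) are illustrative assumptions, not the exact values used in this run:

# 1. NFS Ganesha cluster with an HA ingress (virtual IP as reported by ceph orch ls; prefix length assumed)
ceph nfs cluster create nfsganesha "argo018,argo019" --ingress --virtual_ip 10.8.128.200/24

# 2. CephFS-backed export (pseudo path and fsname are assumptions)
ceph nfs export create cephfs --cluster-id nfsganesha --pseudo-path /export1 --fsname cephfs

# 2./3. Mount on a client and drive IO against the mount point
mount -t nfs -o vers=4.1,port=2049 10.8.128.200:/export1 /mnt/nfs

# 5. Delete the Ganesha cluster
ceph nfs cluster delete nfsganesha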
Actual results:
===============
The cluster goes into the "HEALTH_ERR" state:

    health: HEALTH_ERR
            1 MDSs report slow metadata IOs
            Module 'cephadm' has failed: Command '['rados', '-n', 'mgr.argo016.omdkzd', '-k', '/var/lib/ceph/mgr/ceph-argo016.omdkzd/keyring', '-p', '.nfs', '--namespace', 'nfsganesha', 'rm', 'grace']' timed out after 10 seconds
            4 slow ops, oldest one blocked for 870 sec, daemons [osd.13,osd.7,osd.9] have slow ops.

Expected results:
=================
The NFS Ganesha cluster should be deleted successfully, and "ceph orch ls" should not show the nfs services stuck in the "deleting" state.

Additional info:
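The failing call is the cephadm module removing the Ganesha grace object from the .nfs pool (rados ... -p .nfs --namespace nfsganesha rm grace). A possible diagnostic/recovery sketch, not verified on this cluster: retry that removal manually once the slow ops clear, then reset the failed-module state by failing over the mgr.

# Inspect what is left in the cluster's RADOS namespace, including the grace DB
rados -p .nfs --namespace nfsganesha ls
ganesha-rados-grace --pool .nfs --ns nfsganesha dump

# Retry the removal that timed out inside the cephadm module
rados -p .nfs --namespace nfsganesha rm grace

# Restart the active mgr to clear the failed-module state and let cephadm retry the deletion
ceph mgr fail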
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat Ceph Storage 8.0 security, bug fix, and enhancement updates), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2025:3635