Description of problem:
=======================
Test setup details:
- 2000 NFS exports mapped to 2000 subvolumes in the backend.
- The exports were mounted on 100 clients (20 mounts per client) over NFS v4.1.
- FIO was run in parallel on all exports from the 100 clients until the mount points were 100% full.
- After FIO, cleanup was performed with rm -rf /mnt/<nfs_mount_point>/* on each mount point.

While the cleanup was running, the rm -rf operation hung and has remained in that state for over 12 hours. Given this situation, what potential solutions can be implemented to resolve the issue, aside from adding additional OSDs? (One possible mitigation is sketched after the health output below.)

On checking ganesha.log, we observed that the NFS container died and restarted twice while the cleanup was running (see the log-collection sketch at the end of this report).

Ceph health status:
===================
[ceph: root@cali013 /]# ceph -s
  cluster:
    id:     4e687a60-638e-11ee-8772-b49691cee574
    health: HEALTH_WARN
            19 backfillfull osd(s)
            11 nearfull osd(s)
            9 pool(s) backfillfull

  services:
    mon: 1 daemons, quorum cali013 (age 4d)
    mgr: cali013.qakwdk(active, since 4d), standbys: cali016.rhribl, cali015.hvvbwh
    mds: 1/1 daemons up, 1 standby
    osd: 35 osds: 35 up (since 4d), 35 in (since 6w)
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   9 pools, 1233 pgs
    objects: 6.74M objects, 26 TiB
    usage:   77 TiB used, 9.1 TiB / 86 TiB avail
    pgs:     1233 active+clean

  io:
    client:   170 B/s rd, 0 op/s rd, 0 op/s wr

[ceph: root@cali013 /]# ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL    USED  RAW USED  %RAW USED
hdd     44 TiB  4.7 TiB  39 TiB    39 TiB      89.29
ssd     42 TiB  4.4 TiB  38 TiB    38 TiB      89.60
TOTAL   86 TiB  9.1 TiB  77 TiB    77 TiB      89.44

--- POOLS ---
POOL                 ID   PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
.mgr                  1     1  7.1 MiB        3   21 MiB      0     91 GiB
.rgw.root             2    32  2.6 KiB        6   72 KiB      0     91 GiB
default.rgw.log       3    32  3.6 KiB      209  408 KiB      0     91 GiB
default.rgw.control   4    32      0 B        8      0 B      0     91 GiB
default.rgw.meta      5    32  1.6 KiB        3   28 KiB      0     91 GiB
rbd                   8    32     19 B        1   12 KiB      0     91 GiB
.nfs                  9    32  1.1 MiB    2.00k   24 MiB      0     91 GiB
cephfs.cephfs.meta   94    16  263 MiB    4.09k  789 MiB   0.28     91 GiB
cephfs.cephfs.data   95  1024   26 TiB    6.73M   77 TiB  99.65     91 GiB

[ceph: root@cali013 /]# ceph health detail
HEALTH_WARN 19 backfillfull osd(s); 11 nearfull osd(s); 9 pool(s) backfillfull
[WRN] OSD_BACKFILLFULL: 19 backfillfull osd(s)
    osd.0 is backfill full
    osd.1 is backfill full
    osd.2 is backfill full
    osd.5 is backfill full
    osd.7 is backfill full
    osd.9 is backfill full
    osd.10 is backfill full
    osd.14 is backfill full
    osd.16 is backfill full
    osd.17 is backfill full
    osd.18 is backfill full
    osd.19 is backfill full
    osd.22 is backfill full
    osd.27 is backfill full
    osd.28 is backfill full
    osd.30 is backfill full
    osd.31 is backfill full
    osd.32 is backfill full
    osd.34 is backfill full
[WRN] OSD_NEARFULL: 11 nearfull osd(s)
    osd.6 is near full
    osd.8 is near full
    osd.11 is near full
    osd.12 is near full
    osd.13 is near full
    osd.15 is near full
    osd.20 is near full
    osd.21 is near full
    osd.23 is near full
    osd.25 is near full
    osd.26 is near full
[WRN] POOL_BACKFILLFULL: 9 pool(s) backfillfull
    pool '.mgr' is backfillfull
    pool '.rgw.root' is backfillfull
    pool 'default.rgw.log' is backfillfull
    pool 'default.rgw.control' is backfillfull
    pool 'default.rgw.meta' is backfillfull
    pool 'rbd' is backfillfull
    pool '.nfs' is backfillfull
    pool 'cephfs.cephfs.meta' is backfillfull
    pool 'cephfs.cephfs.data' is backfillfull
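With raw utilization at ~89% and 19 OSDs already past the backfillfull threshold, one common mitigation besides adding OSDs is to temporarily raise the OSD fullness ratios so cleanup I/O can proceed, then revert them once space has been reclaimed. A minimal sketch; the raised values below are illustrative (the defaults are nearfull 0.85, backfillfull 0.90, full 0.95), and nearfull < backfillfull < full must be preserved:

# Temporarily raise the fullness thresholds (illustrative values).
ceph osd set-nearfull-ratio 0.90
ceph osd set-backfillfull-ratio 0.93
ceph osd set-full-ratio 0.96

# Confirm the new ratios took effect.
ceph osd dump | grep -i ratio

# Revert to the defaults after the cleanup has freed space.
ceph osd set-nearfull-ratio 0.85
ceph osd set-backfillfull-ratio 0.90
ceph osd set-full-ratio 0.95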
Version-Release number of selected component (if applicable):
=============================================================
[ceph: root@cali013 /]# ceph --version
ceph version 18.2.1-194.el9cp (04a992766839cd3207877e518a1238cdbac3787e) reef (stable)

[ceph: root@cali013 /]# rpm -qa | grep nfs
libnfsidmap-2.5.4-25.el9.x86_64
nfs-utils-2.5.4-25.el9.x86_64
nfs-ganesha-selinux-5.7-5.el9cp.noarch
nfs-ganesha-5.7-5.el9cp.x86_64
nfs-ganesha-rgw-5.7-5.el9cp.x86_64
nfs-ganesha-ceph-5.7-5.el9cp.x86_64
nfs-ganesha-rados-grace-5.7-5.el9cp.x86_64
nfs-ganesha-rados-urls-5.7-5.el9cp.x86_64

How reproducible:
=================
1/1

Steps to Reproduce:
===================
1. Create an NFS cluster on 2 nodes.
2. Create 1 subvolume group and 2000 subvolumes.
3. Mount the 2000 exports across 100 clients (20 per client).
4. Run fio on all 2000 exports in parallel.
5. Stop fio and perform rm -rf * on each mount point, one by one.
(A scripted sketch of steps 2 and 3 follows the Expected results section.)

Actual results:
===============
The rm operations hang indefinitely.

Expected results:
=================
rm should complete and the space should be freed, bringing the cluster back to a HEALTHY state.
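For reference, a rough scripted sketch of steps 2 and 3, assuming a cephadm NFS cluster named "nfsganesha", a filesystem named "cephfs", and a subvolume group "svg1" (all names hypothetical; the reef-style keyword syntax for ceph nfs export create is assumed):

# Step 2: one subvolume group with 2000 subvolumes.
ceph fs subvolumegroup create cephfs svg1
for i in $(seq 1 2000); do
    ceph fs subvolume create cephfs sv_$i --group_name svg1
done

# Step 3: one export per subvolume, rooted at the subvolume's backing path.
for i in $(seq 1 2000); do
    path=$(ceph fs subvolume getpath cephfs sv_$i --group_name svg1)
    ceph nfs export create cephfs --cluster-id nfsganesha \
        --pseudo-path /export_$i --fsname cephfs --path "$path"
done

# On each client, mount 20 of the exports over v4.1, e.g.:
mount -t nfs -o vers=4.1 10.8.130.236:/export_621 /mnt/nfs_scale_fio_621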
Additional info:
================
Client mount shows 100% filled:

[root@ceph-nfs-client-ymkppj-node16 ~]# df -hT
Filesystem                Type      Size  Used  Avail Use% Mounted on
devtmpfs                  devtmpfs  4.0M     0   4.0M   0% /dev
tmpfs                     tmpfs     1.8G     0   1.8G   0% /dev/shm
tmpfs                     tmpfs     732M   13M   720M   2% /run
/dev/vda4                 xfs        40G  2.8G    37G   8% /
/dev/vda3                 xfs       495M  287M   209M  58% /boot
/dev/vda2                 vfat      200M  7.1M   193M   4% /boot/efi
10.8.130.236:/export_621  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_621
10.8.130.236:/export_622  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_622
10.8.130.236:/export_623  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_623
10.8.130.236:/export_624  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_624
10.8.130.236:/export_625  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_625
10.8.130.236:/export_626  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_626
10.8.130.236:/export_627  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_627
10.8.130.236:/export_628  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_628
10.8.130.236:/export_629  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_629
10.8.130.236:/export_630  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_630
10.8.130.236:/export_631  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_631
10.8.130.236:/export_632  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_632
10.8.130.236:/export_633  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_633
10.8.130.236:/export_634  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_634
10.8.130.236:/export_635  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_635
10.8.130.236:/export_636  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_636
10.8.130.236:/export_637  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_637
10.8.130.236:/export_638  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_638
10.8.130.236:/export_639  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_639
10.8.130.236:/export_640  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_640
10.8.130.236:/export_661  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_661
10.8.130.236:/export_662  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_662
10.8.130.236:/export_663  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_663
10.8.130.236:/export_664  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_664
10.8.130.236:/export_665  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_665
10.8.130.236:/export_666  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_666
10.8.130.236:/export_667  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_667
10.8.130.236:/export_668  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_668
10.8.130.236:/export_669  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_669
10.8.130.236:/export_670  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_670
10.8.130.236:/export_671  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_671
10.8.130.236:/export_672  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_672
10.8.130.236:/export_673  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_673
10.8.130.236:/export_674  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_674
10.8.130.236:/export_675  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_675
10.8.130.236:/export_676  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_676
10.8.130.236:/export_677  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_677
10.8.130.236:/export_678  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_678
10.8.130.236:/export_679  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_679
10.8.130.236:/export_680  nfs4       26T   26T    91G 100% /mnt/nfs_scale_fio_680
tmpfs                     tmpfs     366M     0   366M   0% /run/user/0
tmpfs                     tmpfs     366M     0   366M   0% /run/user/1000
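Note that CephFS frees space asynchronously: rm unlinks the files, and the actual space is reclaimed later by the MDS purge queue, so it is worth confirming whether purging is progressing at all while rm appears hung. A minimal sketch, assuming the filesystem is named "cephfs" with rank 0 active (if ceph tell does not expose these daemon commands in your build, run ceph daemon mds.<name> perf dump on the MDS host instead):

# Sample the purge-queue counters twice; 'pq_executed' should keep
# increasing if deletions are actually being processed.
ceph tell mds.cephfs:0 perf dump purge_queue
sleep 60
ceph tell mds.cephfs:0 perf dump purge_queue

# Count of stray (unlinked but not yet purged) entries held by the MDS.
ceph tell mds.cephfs:0 perf dump mds_cache | grep -i stray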
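Regarding the NFS container dying and restarting twice during cleanup: the ganesha daemon logs and restart history can be pulled from the host running the NFS service. A sketch; the daemon name below is hypothetical and should be taken from the ceph orch ps output:

# List the cephadm-managed NFS daemons to get the exact daemon name.
ceph orch ps --daemon_type nfs

# Pull the container logs for that daemon (name here is hypothetical).
cephadm logs --name nfs.nfsganesha.0.0.cali013.abcdef

# Equivalent journald query on the host (fsid from 'ceph -s' above).
journalctl -u 'ceph-4e687a60-638e-11ee-8772-b49691cee574@nfs.*'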