Bug 2322352 - On the scale cluster, rm -rf on the NFS mount point is stuck indefinitely when cluster is filled around 90%
Summary: On the scale cluster, rm -rf on the NFS mount point is stuck indefinitely when cluster is filled around 90%
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 8.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 8.0z3
Assignee: Kotresh HR
QA Contact: Manisha Saini
Docs Contact: Rivka Pollack
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-10-29 07:27 UTC by Kotresh HR
Modified: 2025-04-28 17:07 UTC
CC: 7 users

Fixed In Version: ceph-19.2.0-108.el9cp
Doc Type: Bug Fix
Doc Text:
.Async I/O operations no longer cause a deadlock in scenarios where OSDs are full (`full osd`)

Previously, running async I/O operations, such as I/O through an NFS mount, while the cluster was in the `full osd` state would cause a deadlock. As a result, the async I/O operation would hang indefinitely. With this fix, running async I/O operations in a `full osd` state no longer causes a deadlock, and the async I/O operation completes as expected.
Clone Of:
Environment:
Last Closed: 2025-04-07 15:25:50 UTC
Embargoed:
khiremat: needinfo-
gfarnum: needinfo-
rpollack: needinfo+




Links
System                    ID              Private  Priority  Status  Summary  Last Updated
Ceph Project Bug Tracker  68641           0        None      None    None     2024-10-29 07:27:19 UTC
Red Hat Issue Tracker     RHCEPH-10139    0        None      None    None     2024-10-29 07:27:50 UTC
Red Hat Product Errata    RHSA-2025:3635  0        None      None    None     2025-04-07 15:25:52 UTC

Description Kotresh HR 2024-10-29 07:27:20 UTC
This is a copy of Bug #2291163.

I am copying this bug because:
This BZ tracks the deadlock issue discussed in comment https://bugzilla.redhat.com/show_bug.cgi?id=2291163#c21, while the original Bug #2291163 tracks the rm -rf hang issue, which could be a different problem. The steps to reproduce remain the same.


Description of problem:
============
Copied from comment https://bugzilla.redhat.com/show_bug.cgi?id=2291163#c13

Scenario:

1. Create an NFS cluster.
2. Create 2000 exports mapped to 2000 subvolumes.
3. Mount the 2000 exports on 100 clients via NFS v4.2.
4. Run fio from the 100 clients in parallel on all 2000 exports.
5. Perform rm -rf on the mount points (a rough command sketch follows this list).
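
A rough command sketch of the scenario, assuming a cephadm-deployed cluster: the nfsganesha cluster name, ganesha hosts, and virtual IP come from the cluster details further below, while the subvolume/export names, mount paths, and fio parameters are illustrative placeholders, not the exact values used on the scale cluster.

Create the NFS cluster and, for each of the 2000 subvolumes, an export (shown for one):

# ceph nfs cluster create nfsganesha "cali015,cali016" --ingress --virtual_ip 10.8.130.236
# ceph fs subvolume create cephfs subvol_0001
# ceph fs subvolume getpath cephfs subvol_0001
# ceph nfs export create cephfs --cluster-id nfsganesha --pseudo-path /export_0001 --fsname cephfs --path <path returned by getpath>

On each of the 100 clients, mount the exports via NFS v4.2 through the virtual IP, run fio, then remove:

# mount -t nfs -o vers=4.2,port=2049 10.8.130.236:/export_0001 /mnt/export_0001
# fio --name=fill --directory=/mnt/export_0001 --rw=write --bs=1M --size=50G --numjobs=4
# rm -rf /mnt/export_0001/*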

Observation:

The rm -rf operation is stuck indefinitely and the cluster is not freeing up space.

# ceph -s
  cluster:
    id:     4e687a60-638e-11ee-8772-b49691cee574
    health: HEALTH_ERR
            8 backfillfull osd(s)
            1 full osd(s)
            13 nearfull osd(s)
            9 pool(s) full

  services:
    mon: 1 daemons, quorum cali013 (age 2d)
    mgr: cali013.qakwdk(active, since 2d), standbys: cali016.rhribl, cali015.hvvbwh
    mds: 1/1 daemons up, 1 standby
    osd: 35 osds: 35 up (since 2d), 35 in (since 4w)
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   9 pools, 1233 pgs
    objects: 6.53M objects, 25 TiB
    usage:   75 TiB used, 11 TiB / 86 TiB avail
    pgs:     1232 active+clean
             1    active+clean+scrubbing
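
For reference, which OSDs and pools hit the full thresholds, and the ratios currently in effect, can be checked with standard commands (output omitted here):

# ceph health detail              # lists the full/backfillfull/nearfull OSDs and full pools
# ceph osd df tree                # per-OSD utilization (%USE) and capacity
# ceph osd dump | grep ratio      # full_ratio / backfillfull_ratio / nearfull_ratio
# ceph df                         # per-pool usage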

=======================

Cluster details:
================
Credentials - root/passwd

Installer Node - 10.8.130.13 

NFS is running on 2 nodes -

# ceph nfs cluster info nfsganesha
{
  "nfsganesha": {
    "backend": [
      {
        "hostname": "cali015",
        "ip": "10.8.130.15",
        "port": 12049
      },
      {
        "hostname": "cali016",
        "ip": "10.8.130.16",
        "port": 12049
      }
    ],
    "monitor_port": 9049,
    "port": 2049,
    "virtual_ip": "10.8.130.236"
  }
}
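
The exports behind this cluster can be listed and inspected with the standard nfs export commands (the pseudo path /export_0001 below is illustrative):

# ceph nfs export ls nfsganesha
# ceph nfs export info nfsganesha /export_0001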

One of the clients on which the rm operation is stuck - 10.0.209.10 (root/passwd)

Please take a look at the cluster as it is still in the same state; reproducing the issue again would be time consuming.

Comment 1 Storage PM bot 2024-10-29 07:27:30 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 19 errata-xmlrpc 2025-04-07 15:25:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 8.0 security, bug fix, and enhancement updates), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2025:3635

