Bug 2322352 - On the scale cluster, rm -rf on the NFS mount point is stuck indefinitely when cluster is filled around 90%
Summary: On the scale cluster, rm -rf on the NFS mount point is stuck indefinitely when cluster is filled around 90%
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 8.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 8.0z3
Assignee: Kotresh HR
QA Contact: Manisha Saini
Docs Contact: Rivka Pollack
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-10-29 07:27 UTC by Kotresh HR
Modified: 2025-04-28 17:07 UTC
CC: 7 users

Fixed In Version: ceph-19.2.0-108.el9cp
Doc Type: Bug Fix
Doc Text:
.Async I/O operations no longer cause a deadlock in scenarios where OSDs are full (`full osd`)

Previously, running async I/O operations, such as I/O through an NFS mount, while the cluster was in the `full osd` state would cause a deadlock. As a result, the async I/O operation would hang indefinitely. With this fix, running async I/O operations in a `full osd` state no longer causes a deadlock, and the async I/O operation completes as expected.
Clone Of:
Environment:
Last Closed: 2025-04-07 15:25:50 UTC
Embargoed:
khiremat: needinfo-
gfarnum: needinfo-
rpollack: needinfo+




Links
System                    ID              Private  Priority  Status  Summary  Last Updated
Ceph Project Bug Tracker  68641           0        None      None    None     2024-10-29 07:27:19 UTC
Red Hat Issue Tracker     RHCEPH-10139    0        None      None    None     2024-10-29 07:27:50 UTC
Red Hat Product Errata    RHSA-2025:3635  0        None      None    None     2025-04-07 15:25:52 UTC

Description Kotresh HR 2024-10-29 07:27:20 UTC
This is a copy of Bug #2291163.

I am copying this bug because:
This BZ tracks the deadlock issue discussed in comment https://bugzilla.redhat.com/show_bug.cgi?id=2291163#c21, while the original Bug #2291163 tracks the rm -rf hang issue, which could be a different problem. The steps to reproduce remain the same.


Description of problem:
============
Copied from comment https://bugzilla.redhat.com/show_bug.cgi?id=2291163#c13

Scenario:

1. Create an NFS cluster.
2. Create 2000 exports mapped to 2000 subvolumes.
3. Mount the 2000 exports on 100 clients via NFS v4.2.
4. Run fio from the 100 clients in parallel on all 2000 exports.
5. Perform rm -rf on the mount points (a rough command sketch follows this list).
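
A rough command sketch of the scenario, assuming a cephadm-deployed cluster: the nfsganesha cluster name, ganesha hosts, and virtual IP come from the cluster details further below, while the subvolume/export names, mount paths, and fio parameters are illustrative placeholders, not the exact values used on the scale cluster.

Create the NFS cluster and, for each of the 2000 subvolumes, an export (shown for one):

# ceph nfs cluster create nfsganesha "cali015,cali016" --ingress --virtual_ip 10.8.130.236
# ceph fs subvolume create cephfs subvol_0001
# ceph fs subvolume getpath cephfs subvol_0001
# ceph nfs export create cephfs --cluster-id nfsganesha --pseudo-path /export_0001 --fsname cephfs --path <path returned by getpath>

On each of the 100 clients, mount the exports via NFS v4.2 through the virtual IP, run fio, then remove:

# mount -t nfs -o vers=4.2,port=2049 10.8.130.236:/export_0001 /mnt/export_0001
# fio --name=fill --directory=/mnt/export_0001 --rw=write --bs=1M --size=50G --numjobs=4
# rm -rf /mnt/export_0001/*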

Observation:

The rm -rf operation is stuck indefinitely and the cluster is not freeing up space.

# ceph -s
  cluster:
    id:     4e687a60-638e-11ee-8772-b49691cee574
    health: HEALTH_ERR
            8 backfillfull osd(s)
            1 full osd(s)
            13 nearfull osd(s)
            9 pool(s) full

  services:
    mon: 1 daemons, quorum cali013 (age 2d)
    mgr: cali013.qakwdk(active, since 2d), standbys: cali016.rhribl, cali015.hvvbwh
    mds: 1/1 daemons up, 1 standby
    osd: 35 osds: 35 up (since 2d), 35 in (since 4w)
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   9 pools, 1233 pgs
    objects: 6.53M objects, 25 TiB
    usage:   75 TiB used, 11 TiB / 86 TiB avail
    pgs:     1232 active+clean
             1    active+clean+scrubbing
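
For reference, which OSDs and pools hit the full thresholds, and the ratios currently in effect, can be checked with standard commands (output omitted here):

# ceph health detail              # lists the full/backfillfull/nearfull OSDs and full pools
# ceph osd df tree                # per-OSD utilization (%USE) and capacity
# ceph osd dump | grep ratio      # full_ratio / backfillfull_ratio / nearfull_ratio
# ceph df                         # per-pool usage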

=======================

Cluster details:
================
Credentials - root/passwd

Installer Node - 10.8.130.13 

NFS is running on 2 nodes -

# ceph nfs cluster info nfsganesha
{
  "nfsganesha": {
    "backend": [
      {
        "hostname": "cali015",
        "ip": "10.8.130.15",
        "port": 12049
      },
      {
        "hostname": "cali016",
        "ip": "10.8.130.16",
        "port": 12049
      }
    ],
    "monitor_port": 9049,
    "port": 2049,
    "virtual_ip": "10.8.130.236"
  }
}
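
The exports behind this cluster can be listed and inspected with the standard nfs export commands (the pseudo path /export_0001 below is illustrative):

# ceph nfs export ls nfsganesha
# ceph nfs export info nfsganesha /export_0001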

One of the clients on which the rm operation is stuck - 10.0.209.10 (root/passwd)

Please take a look at the cluster as it is still in the same state; reproducing the issue again would be time consuming.

Comment 1 Storage PM bot 2024-10-29 07:27:30 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 19 errata-xmlrpc 2025-04-07 15:25:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 8.0 security, bug fix, and enhancement updates), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2025:3635

