Bug 2352781

Summary: [NFS-Ganesha] Unexpectedly high memory utilization (31.1 GB) is observed when enabling bandwidth and OPS control limits at both the cluster and export levels in a scaled cluster environment (2000 exports).
Product: [Red Hat Storage] Red Hat Ceph Storage
Component: NFS-Ganesha
Version: 8.0
Target Release: 8.0z3
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Fixed In Version: nfs-ganesha-6.5-8.el9cp
Doc Type: No Doc Update
Type: Bug
Last Closed: 2025-04-07 15:27:28 UTC
Reporter: Manisha Saini <msaini>
Assignee: Naresh <nchillar>
QA Contact: Manisha Saini <msaini>
Docs Contact: Rivka Pollack <rpollack>
CC: cephqe-warriors, kkeithle, nchillar, rpollack, tserlin, vdas

Description Manisha Saini 2025-03-16 17:06:51 UTC
Description of problem:
========================

On a scaled cluster with 2000 NFS exports, setting the bandwidth_control and ops_control limits at both the cluster level and the export level results in high memory utilization.

Note: Only the limits were set; no I/O was performed on these exports.

]# top -p 3279630
top - 17:01:08 up 3 days, 17:31,  1 user,  load average: 0.27, 0.31, 0.37
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.3 us,  0.0 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 127831.6 total,  79707.4 free,  44779.2 used,   5102.1 buff/cache
MiB Swap:   4096.0 total,   4096.0 free,      0.0 used.  83052.4 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
3279630 root      20   0   35.2g  31.1g  29568 S  24.7  24.9  26:12.31 ganesha.nfsd


# ps -p 3279630 -o pid,%mem,rss,vsz,cmd
    PID %MEM   RSS    VSZ CMD
3279630 24.9 32609328 36940032 /usr/bin/ganesha.nfsd -F -L STDERR -N NIV_EVENT
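
To see whether the resident set keeps growing over time, the RSS can be sampled periodically (a minimal sketch; PID 3279630 is the ganesha.nfsd process shown above and the 60-second interval is arbitrary):

# while kill -0 3279630 2>/dev/null; do echo "$(date +%F_%T) $(ps -p 3279630 -o rss=) KiB"; sleep 60; done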

Version-Release number of selected component (if applicable):
============================================================

# ceph --version
ceph version 19.2.0-108.el9cp (1762f710a9f63e0304d69ed81ad964841146c93d) squid (stable)

# rpm -qa | grep nfs
libnfsidmap-2.5.4-27.el9.x86_64
nfs-utils-2.5.4-27.el9.x86_64
nfs-ganesha-selinux-6.5-5.el9cp.noarch
nfs-ganesha-6.5-5.el9cp.x86_64
nfs-ganesha-ceph-6.5-5.el9cp.x86_64
nfs-ganesha-rados-grace-6.5-5.el9cp.x86_64
nfs-ganesha-rados-urls-6.5-5.el9cp.x86_64
nfs-ganesha-rgw-6.5-5.el9cp.x86_64
nfs-ganesha-utils-6.5-5.el9cp.x86_64


How reproducible:
================
1/1


Steps to Reproduce:
===================
1. Create NFS Ganesha cluster
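
The exact create command was not captured; a minimal sketch, assuming the backend is placed on cali015 and an ingress service provides the virtual IP shown in the info output below (the /24 prefix is an assumption), would be:

# ceph nfs cluster create nfsganesha "cali015" --ingress --virtual_ip 10.8.130.200/24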

# ceph nfs cluster info nfsganesha
{
  "nfsganesha": {
    "backend": [
      {
        "hostname": "cali015",
        "ip": "10.8.130.15",
        "port": 12049
      }
    ],
    "monitor_port": 9049,
    "port": 2049,
    "virtual_ip": "10.8.130.200"
  }
}

2. Create 2000 NFS exports on 2000 subvolumes
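
The creation commands are not shown here; a minimal sketch, assuming subvolumes named ganesha1..ganesha2000 in subvolume group ganeshagroup on the cephfs filesystem (matching the export info in step 4 below), would be:

# ceph fs subvolumegroup create cephfs ganeshagroup
# for i in $(seq 1 2000); do ceph fs subvolume create cephfs ganesha$i --group_name ganeshagroup; done
# for i in $(seq 1 2000); do ceph nfs export create cephfs --cluster-id nfsganesha --pseudo-path /ganeshavol$i --fsname cephfs --path "$(ceph fs subvolume getpath cephfs ganesha$i --group_name ganeshagroup)"; done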

3. Set the cluster limits for bandwidth_control and ops_control

Cluster level settings →
----------------------
# ceph nfs cluster qos enable bandwidth_control nfsganesha PerShare --max_export_write_bw 2GB --max_export_read_bw 2GB

# ceph nfs cluster qos enable ops_control nfsganesha PerShare --max_export_iops 10000

4. Set the limits for 2000 exports as below

 --> Enable the bandwidth_control for 2000 exports as below
-----------------------------------------------------------

1-500 exports → 
# for i in $(seq 1 500);do ceph nfs export qos enable bandwidth_control nfsganesha /ganeshavol$i --max_export_write_bw 1GB --max_export_read_bw 1GB; done 

501 -1000 exports → 
# for i in $(seq 501 1000);do ceph nfs export qos enable bandwidth_control nfsganesha /ganeshavol$i --max_export_write_bw 2GB --max_export_read_bw 2GB; done

1001 - 1500 exports →
# for i in $(seq 1001 1500);do ceph nfs export qos enable bandwidth_control nfsganesha /ganeshavol$i --max_export_write_bw 3GB --max_export_read_bw 3GB; done

1501 - 2000 exports →
# for i in $(seq 1501 2000);do ceph nfs export qos enable bandwidth_control nfsganesha /ganeshavol$i --max_export_write_bw 4GB --max_export_read_bw 4GB; done

 --> Enable the ops_control for 2000 exports as below
-----------------------------------------------------
# for i in $(seq 1 500);do ceph nfs export qos enable ops_control nfsganesha /ganeshavol$i --max_export_iops 10000; done

# for i in $(seq 501 1000);do ceph nfs export qos enable ops_control nfsganesha /ganeshavol$i --max_export_iops 12000; done

# for i in $(seq 1001 1500);do ceph nfs export qos enable ops_control nfsganesha /ganeshavol$i --max_export_iops 14000; done

# for i in $(seq 1501 2000);do ceph nfs export qos enable ops_control nfsganesha /ganeshavol$i --max_export_iops 16000; done


# ceph nfs export info nfsganesha /ganeshavol1800
{
  "access_type": "RW",
  "clients": [],
  "cluster_id": "nfsganesha",
  "export_id": 1800,
  "fsal": {
    "cmount_path": "/",
    "fs_name": "cephfs",
    "name": "CEPH",
    "user_id": "nfs.nfsganesha.cephfs.2c1043d4"
  },
  "path": "/volumes/ganeshagroup/ganesha1800/fc57d302-43b7-44cb-8461-d69f46b0323a",
  "protocols": [
    3,
    4
  ],
  "pseudo": "/ganeshavol1800",
  "qos_block": {
    "combined_rw_bw_control": false,
    "enable_bw_control": true,
    "enable_iops_control": true,
    "enable_qos": true,
    "max_export_iops": 16000,
    "max_export_read_bw": "4.0GB",
    "max_export_write_bw": "4.0GB"
  },
  "security_label": true,
  "squash": "none",
  "transports": [
    "TCP"
  ]
}

Actual results:
===============
The NFS Ganesha process (ganesha.nfsd) was observed consuming 31.1 GB of resident memory (RSS). At the time, no exports were mounted on any clients and no I/O operations were running.


Expected results:
=================
The NFS process should consume significantly less memory, especially when no exports are mounted on clients and no I/O operations are running.


Additional info:

Comment 4 Naresh 2025-03-17 12:38:59 UTC
Please test the same scenario with QoS disabled.

Also, please test with QoS enabled, using the "apply" command instead of the "ceph mgr" commands.

Please let us know whether memory usage still increases with the "apply" commands.
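
For reference, one way to drive the same QoS settings through the apply path (a minimal sketch, assuming "export apply" accepts the qos_block shown in the export info output above; export1.json is a hypothetical file name):

# ceph nfs export info nfsganesha /ganeshavol1 > export1.json
  (edit export1.json so that qos_block carries the desired enable_qos/enable_bw_control/enable_iops_control flags and the max_export_read_bw, max_export_write_bw and max_export_iops values)
# ceph nfs export apply nfsganesha -i export1.json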

Comment 12 errata-xmlrpc 2025-04-07 15:27:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 8.0 security, bug fix, and enhancement updates), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2025:3635

Comment 13 Red Hat Bugzilla 2025-08-06 04:25:17 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.