Bug 2239769

Summary: [RHCS 7.0] [NFS-Ganesha] RSS usage is not reduced even though all data has been deleted and all clients have unmounted.
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Mohit Bisht <mobisht>
Component: NFS-Ganesha
Assignee: Kaleb KEITHLEY <kkeithle>
Status: CLOSED NOTABUG
QA Contact: Manisha Saini <msaini>
Severity: high
Priority: unspecified
Version: 7.0
CC: akraj, cephqe-warriors, ffilz, gouthamr, kkeithle, mbenjamin, msaini, prprakas, rcyriac, sostapov, tserlin, vdas
Keywords: Automation
Target Release: 7.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: nfs-ganesha-5.6-3.el9cp, rhceph-container-7-113
Doc Type: No Doc Update
Last Closed: 2023-11-17 04:39:02 UTC
Type: Bug
Bug Depends On: 2246077

Comment 10 Manisha Saini 2023-10-25 07:56:49 UTC
This fix is incomplete, and as we discussed on Slack, corresponding changes are required on the cephadm side to validate it. We are moving this issue back to the ASSIGNED state until the necessary cephadm fixes are available for QA verification.

Comment 11 Manisha Saini 2023-10-25 09:46:59 UTC
I have raised a separate BZ for the cephadm-side changes: https://bugzilla.redhat.com/show_bug.cgi?id=2246077

Marking this BZ as blocked until the fix for the cephadm bug is available.

Comment 14 Manisha Saini 2023-11-06 05:47:53 UTC
Observations when testing with the fix
========================


The Smallfile tool crashed when run in a loop across the 10 exports. BZ - https://bugzilla.redhat.com/show_bug.cgi?id=2247762

Running Smallfile for a single iteration on the 10 exports, memory usage peaked at 2 GB. Even after unmounting the NFS shares from the client, deleting the exports, and cleaning up the mount points, memory consumption remained unchanged at 2 GB. Ideally it should drop, since no exports are present at that point.
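
For reference, a rough sketch of the per-export check (the mount point and the exact Smallfile invocation below are illustrative assumptions, not the commands from this run; the backend IP and export name are taken from the cluster info further down):

# Exercise one export, clean up, then check ganesha.nfsd RSS (illustrative only):
mount -t nfs 10.8.128.216:/ganesha1 /mnt/ganesha1
python smallfile_cli.py --operation create --files 10000 --top /mnt/ganesha1
rm -rf /mnt/ganesha1/*
umount /mnt/ganesha1
ceph nfs export rm nfsganesha /ganesha1
# RSS of the ganesha.nfsd process is then compared before and after cleanup:
ps -p "$(pgrep -x ganesha.nfsd)" -o pid,%mem,rss,vsz,cmd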



Tested this with

# rpm -qa | grep nfs
libnfsidmap-2.5.4-18.el9.x86_64
nfs-utils-2.5.4-18.el9.x86_64
nfs-ganesha-selinux-5.6-3.el9cp.noarch
nfs-ganesha-5.6-3.el9cp.x86_64
nfs-ganesha-rgw-5.6-3.el9cp.x86_64
nfs-ganesha-ceph-5.6-3.el9cp.x86_64
nfs-ganesha-rados-grace-5.6-3.el9cp.x86_64
nfs-ganesha-rados-urls-5.6-3.el9cp.x86_64


# ceph --version
ceph version 18.2.0-113.el9cp (32cbda69435c7145d09eeaf5b5016e5d46370a5d) reef (stable)


Logs
======

---> Memory consumption after running a single Smallfile iteration on the 10 exports:

Node 1:
************
top - 20:36:42 up 19 days,  1:08,  2 users,  load average: 5.30, 5.44, 5.40
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.3 us,  0.9 sy,  0.0 ni, 92.6 id,  4.7 wa,  0.1 hi,  0.4 si,  0.0 st
MiB Mem :  63758.3 total,    622.7 free,  14848.5 used,  50277.7 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  48909.8 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                            
3420347 root      20   0   15.1g   1.9g   8408 S  37.7   3.1 271:18.13 ganesha.nfsd                                                                                       



[root@argo016 ~]# ps -p 3420347 -o pid,%mem,rss,vsz,cmd
    PID %MEM   RSS    VSZ CMD
3420347  3.0 2000100 15809428 /usr/bin/ganesha.nfsd -F -L STDERR -N NIV_EVENT

Node 2:
*************
top - 20:40:16 up 19 days,  1:15,  1 user,  load average: 3.11, 3.04, 2.95
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.0 us,  0.5 sy,  0.0 ni, 92.2 id,  6.1 wa,  0.1 hi,  0.2 si,  0.0 st
MiB Mem :  63770.3 total,    362.6 free,  12475.0 used,  52922.9 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  51295.4 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                            
2791842 root      20   0 7550652 762672   8288 S  11.7   1.2 109:57.75 ganesha.nfsd                                                                                       



[root@argo018 ~]# ps -p 2791842 -o pid,%mem,rss,vsz,cmd
    PID %MEM   RSS    VSZ CMD
2791842  1.1 762636 7550652 /usr/bin/ganesha.nfsd -F -L STDERR -N NIV_EVENT


==============

---> Total memory consumption after deleting everything from the 10 mount points and unmounting the shares from the clients:

Node 1:
***********
[root@argo016 ~]# ps -p 3420347 -o pid,%mem,rss,vsz,cmd
    PID %MEM   RSS    VSZ CMD
3420347  3.0 1998652 15651116 /usr/bin/ganesha.nfsd -F -L STDERR -N NIV_EVENT


top - 03:13:25 up 19 days,  7:45,  1 user,  load average: 2.11, 2.09, 2.09
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.3 us,  0.0 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  63758.3 total,   1352.1 free,  13704.6 used,  50711.2 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  50053.8 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                            
3420347 root      20   0   14.9g   1.9g   9816 S   0.0   3.1 280:26.78 ganesha.nfsd     


Node 2:
*************

top - 03:13:59 up 19 days,  7:48,  1 user,  load average: 0.00, 0.02, 0.02
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.1 us,  0.1 sy,  0.0 ni, 99.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  63770.3 total,   1139.2 free,  11710.0 used,  52896.0 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  52060.3 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                            
2791842 root      20   0 7520840 762044   8080 S   0.0   1.2 111:32.48 ganesha.nfsd    



[root@argo018 ~]# ps -p 2791842 -o pid,%mem,rss,vsz,cmd
    PID %MEM   RSS    VSZ CMD
2791842  1.1 762044 7520840 /usr/bin/ganesha.nfsd -F -L STDERR -N NIV_EVENT


===========

---> Deleted all the exports of the ganesha cluster.

# ceph nfs export ls nfsganesha
[
  "/ganesha1",
  "/ganesha2",
  "/ganesha3",
  "/ganesha4",
  "/ganesha5",
  "/ganesha6",
  "/ganesha7",
  "/ganesha8",
  "/ganesha9",
  "/ganesha10",
  "/ganesha11",
  "/ganesha12",
  "/ganesha13",
  "/ganesha14",
  "/ganesha15"
]



# for i in $(seq 1 15); do ceph nfs export rm nfsganesha /ganesha$i;done

# ceph nfs export ls nfsganesha
[]


# ceph nfs cluster ls
[
  "nfsganesha"
]

# ceph nfs cluster info nfsganesha
{
  "nfsganesha": {
    "backend": [
      {
        "hostname": "argo016",
        "ip": "10.8.128.216",
        "port": 2049
      },
      {
        "hostname": "argo018",
        "ip": "10.8.128.218",
        "port": 2049
      }
    ],
    "virtual_ip": null
  }
}

============

---> Total memory consumption after deleting the exports:

Node 1
******************
[root@argo016 ~]# top -p 3420347

top - 03:20:37 up 19 days,  7:52,  2 users,  load average: 2.08, 2.08, 2.09
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.2 us,  0.1 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  63758.3 total,   1281.4 free,  13762.4 used,  50724.5 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  49995.9 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                            
3420347 root      20   0   14.9g   1.9g  10224 S   0.3   3.1 280:27.15 ganesha.nfsd  

[root@argo016 ~]# ps -p 3420347 -o pid,%mem,rss,vsz,cmd
    PID %MEM   RSS    VSZ CMD
3420347  3.0 1999152 15651116 /usr/bin/ganesha.nfsd -F -L STDERR -N NIV_EVENT

Node 2
********************
[root@argo018 ~]# ps -p 2791842 -o pid,%mem,rss,vsz,cmd
    PID %MEM   RSS    VSZ CMD
2791842  1.1 762164 7520840 /usr/bin/ganesha.nfsd -F -L STDERR -N NIV_EVENT




The memory usage remains constant and does not decrease as expected.

Comment 15 Kaleb KEITHLEY 2023-11-06 10:06:18 UTC
(In reply to Manisha Saini from comment #14)
> 
> 
> The memory usage remains constant and does not decrease as expected.

The malloc_trim(3) call runs on a 30-minute timer that starts when ganesha starts. Depending on how long the test takes to run, you may need to wait up to 30 minutes to see any memory reduction.

You can also watch the ganesha log file (with logging at DEBUG level) for a "malloc_trim() released some memory" or "malloc_trim() was not able to release memory" message.
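
A minimal sketch of one way to follow those messages on a cephadm deployment, assuming the daemon logs to journald (the unit name below is a placeholder; the actual name is cluster-specific):

# Find the cephadm-managed NFS daemon unit, then follow its log for the trim messages:
systemctl list-units 'ceph-*@nfs*'
journalctl -fu 'ceph-<fsid>@nfs.nfsganesha.<id>.service' | grep 'malloc_trim'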

Comment 16 Manisha Saini 2023-11-07 04:53:11 UTC
(In reply to Kaleb KEITHLEY from comment #15)
> (In reply to Manisha Saini from comment #14)
> > 
> > 
> > The memory usage remains constant and does not decrease as expected.
> 
> The malloc_trim(3) call is run on a 30min timer that starts when ganesha
> starts. Depending on how long the test takes to run you may need to wait up
> to 30min to see any memory reduction.

The cluster has been in the same state since yesterday morning. The memory usage remains constant, i.e. 2 GB.

[root@argo016 ~]# date
Tue Nov  7 04:52:29 AM UTC 2023

[root@argo016 ~]# ps -p 3420347 -o pid,%mem,rss,vsz,cmd
    PID %MEM   RSS    VSZ CMD
3420347  3.0 1999836 15651116 /usr/bin/ganesha.nfsd -F -L STDERR -N NIV_EVENT


top - 04:52:56 up 20 days,  9:24,  1 user,  load average: 2.09, 2.14, 2.13
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.1 us,  0.0 sy,  0.0 ni, 99.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  63758.3 total,   1626.0 free,  13457.2 used,  50679.6 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  50301.1 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                             
3420347 root      20   0   14.9g   1.9g  10240 S   0.0   3.1 281:26.46 ganesha.nfsd   
> 
> You can also watch the ganesha log file (logging at DEBUG level) and watch
> for a  "malloc_trim() released some memory" or "malloc_trim() was not able
> to release memory" message.

Comment 17 Kaleb KEITHLEY 2023-11-07 12:59:25 UTC
(In reply to Manisha Saini from comment #16)
> (In reply to Kaleb KEITHLEY from comment #15)
> > (In reply to Manisha Saini from comment #14)
> > > 
> > > 
> > > The memory usage remains constant and does not decrease as expected.
> > 
> > The malloc_trim(3) call is run on a 30min timer that starts when ganesha
> > starts. Depending on how long the test takes to run you may need to wait up
> > to 30min to see any memory reduction.
> 
> The cluster was in same state since yesterday morning.The memory usage
> remains constant i.e 2GB
> 

It's never going to go to zero. That's a pretty good steady-state number, and there's probably nothing left to release. Check the logs for the debug-level messages I mentioned.

Comment 20 Manisha Saini 2023-11-09 06:42:20 UTC
(In reply to Kaleb KEITHLEY from comment #15)
> (In reply to Manisha Saini from comment #14)
> > 
> > 
> > The memory usage remains constant and does not decrease as expected.
> 
> The malloc_trim(3) call is run on a 30min timer that starts when ganesha
> starts. Depending on how long the test takes to run you may need to wait up
> to 30min to see any memory reduction.
> 
> You can also watch the ganesha log file (logging at DEBUG level) and watch
> for a  "malloc_trim() released some memory" or "malloc_trim() was not able
> to release memory" message.

Unable to enable debug logging. Raised BZ - https://bugzilla.redhat.com/show_bug.cgi?id=2248812

Comment 21 Frank Filz 2023-11-14 00:42:32 UTC
I'm not sure what the minimum RSS for Ganesha actually is.

We just discovered and fixed a leak, so it is worth rerunning with the latest build to see what effect that fix has on the minimum RSS.

Comment 25 Scott Ostapovicz 2023-11-17 04:39:02 UTC
As noted in the comments above, this BZ describes a healthy process reaching a stable and sustainable plateau of allocated memory.  I am closing this issue as NOTABUG.

Comment 26 Frank Filz 2023-11-21 18:54:16 UTC
Is this the right BZ to attach doc text about the cmount_path option?

Comment 27 Frank Filz 2023-11-21 18:59:08 UTC
No, this is the malloc trim BZ...

Comment 28 Kaleb KEITHLEY 2023-11-21 19:12:58 UTC
this one? https://bugzilla.redhat.com/show_bug.cgi?id=2248084

Comment 32 Red Hat Bugzilla 2024-03-29 04:25:09 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days