Bug 2231151 - [perf] nfs-ganesha container OOM killed during perf testing
Summary: [perf] nfs-ganesha container OOM killed during perf testing
Keywords:
Status: NEW
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.13
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Blaine Gardner
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-08-10 18:24 UTC by Elvir Kuric
Modified: 2023-08-15 17:47 UTC
CC: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:



Description Elvir Kuric 2023-08-10 18:24:45 UTC
Created attachment 1982841: dmesg with oom kill

Description of problem (please be as detailed as possible and provide log
snippets):

[I could not find a CephNFS component in the list to file this BZ under that section]

The nfs-ganesha container is OOM-killed during performance testing.
Test description:

One fio pod writes in --client mode to multiple pods, each of which mounts a PVC from the "ocs-storagecluster-ceph-nfs" storage class.

fio --client <list of clients>
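
For reference, a concrete invocation of this style might look like the sketch below. The mount point, host list file, and job parameters are illustrative assumptions, not the exact values used in this test:

# on each target pod that mounts the NFS-backed PVC (assumed mounted at /mnt/nfs)
fio --server

# on the driver pod; hosts.list holds one target pod IP per line
fio --client=hosts.list nfs-write.fio

# nfs-write.fio (illustrative job file)
[nfs-write]
directory=/mnt/nfs
rw=write
bs=1M
size=10G
numjobs=4
ioengine=libaio
direct=1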

Shortly after the test starts, the pod rook-ceph-nfs-ocs-storagecluster-cephnfs-a-584c957ff-cdbmj ends up in CrashLoopBackOff, caused by the OOM kill of the nfs-ganesha container:
rook-ceph-nfs-ocs-storagecluster-cephnfs-a-584c957ff-cdbmj        1/2     CrashLoopBackOff   5 (67s ago)      2d9h

--- 
[Thu Aug 10 15:03:57 2023] Tasks state (memory values in pages):
[Thu Aug 10 15:03:57 2023] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[Thu Aug 10 15:03:57 2023] [ 124130]     0 124130 13217790  2062728 24743936        0          -997 ganesha.nfsd
[Thu Aug 10 15:03:57 2023] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=crio-3d9ac230a9f61c6d2058c6de8d3d9bfae661b610057fd486d6d767e8bf9021b4.scope,mems_allowed=0-1,oom_memcg=/kubepods.slice/kubepods-podf30594d4_60a0_4447_97be_cad161008265.slice/crio-3d9ac230a9f61c6d2058c6de8d3d9bfae661b610057fd486d6d767e8bf9021b4.scope,task_memcg=/kubepods.slice/kubepods-podf30594d4_60a0_4447_97be_cad161008265.slice/crio-3d9ac230a9f61c6d2058c6de8d3d9bfae661b610057fd486d6d767e8bf9021b4.scope,task=ganesha.nfsd,pid=124130,uid=0
[Thu Aug 10 15:03:57 2023] Memory cgroup out of memory: Killed process 124130 (ganesha.nfsd) total-vm:52871160kB, anon-rss:8229748kB, file-rss:21164kB, shmem-rss:0kB, UID:0 pgtables:24164kB oom_score_adj:-997
--- 

We tested different setups and configurations; with up to 12 client pods everything worked fine, but with more pods the issue persisted.
A possible workaround was to increase the memory limits for nfs-ganesha in the "rook-ceph-nfs-ocs-storagecluster-cephnfs-a" deployment; this mitigated the issue for 24 pods and 50 pods, for which we increased the
memory limits to 32 GB and 40 GB respectively.
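
For illustration, the limit can be raised with something along these lines (the nfs-ganesha container name follows the description above; the exact value, and whether the rook operator later reconciles the deployment back to its defaults, should be verified):

oc -n openshift-storage set resources deployment/rook-ceph-nfs-ocs-storagecluster-cephnfs-a \
  --containers=nfs-ganesha --limits=memory=32Gi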


Version of all relevant components (if applicable):

ceph : ceph version 17.2.6-70.el9cp (fe62dcdbb2c6e05782a3e2b67d025b84ff5047cc) quincy (stable)

oc get storagecluster -n openshift-storage
NAME                 AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   24d   Ready              2023-07-17T09:34:59Z   4.13.1

OCP v4.13


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
Yes, always. 

Can this issue be reproduced from the UI?

NA
If this is a regression, please provide more details to justify this:
NA

Steps to Reproduce:
1. Enable NFS on top of ODF.
2. Create a fio pod and direct it to write to pods that mount PVCs from the NFS storage class.
3. Monitor the "rook-ceph-nfs-ocs-storagecluster-cephnfs-a-" pod in the openshift-storage namespace and, once it crashes, check the logs on the node where it is scheduled (dmesg -T). Example commands are sketched below.
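
A rough sketch of these steps from the CLI; the spec.nfs.enable field and the node name placeholder are assumptions to be checked against the ODF documentation:

# 1. enable the NFS service on the storage cluster (assumed StorageCluster field)
oc patch storagecluster ocs-storagecluster -n openshift-storage --type merge \
  -p '{"spec":{"nfs":{"enable":true}}}'

# 3. watch the ganesha pod and, after it crashes, inspect the OOM killer output on its node
oc get pods -n openshift-storage -w | grep rook-ceph-nfs
oc debug node/<node-name> -- chroot /host dmesg -T | grep -i oom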


Actual results:
The rook-ceph-nfs-ocs-storagecluster-cephnfs-a- pod crashes constantly due to OOM.



Additional info:
The same test with the ODF CephFS storage class does not show this issue.

Comment 2 Blaine Gardner 2023-08-15 17:47:50 UTC
Short update: I'm loosely aware of and working to track down a known NFS-Ganesha memory footprint issue. Ideally, the fix could be made in RHCS. However, we have a short term option of raising the default allocations and/or updating ODF docs to reflect the client connection limitations if the NFS-Ganesha issue can't be fixed for 4.14.

