Bug 2264146

Summary: [NFS Ganesha] Ganesha crashed and dumped core in ino_release_cb while running the scale test with 1500 exports and 100 clients
Product: [Red Hat Storage] Red Hat Ceph Storage
Component: NFS-Ganesha
Version: 7.1
Target Release: 8.1z3
Status: ASSIGNED
Severity: urgent
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Type: Bug
Keywords: Automation
Reporter: Manisha Saini <msaini>
Assignee: Sachin Punadikar <spunadik>
QA Contact: Manisha Saini <msaini>
CC: cephqe-warriors, gouthamr, kkeithle, vdas
Bug Depends On: 2365869

Description Manisha Saini 2024-02-14 07:26:29 UTC
Description of problem:
================

While running a scale test with 1500 exports mounted across 100 clients, the Ganesha process crashed.

Tool used: smallfile
Exports per client: 15
NFS version: v4.1


Core
==================
Core was generated by `/usr/bin/ganesha.nfsd -F -L STDERR -N NIV_EVENT'.
Program terminated with signal SIGSEGV, Segmentation fault.

#0  0x00007f21b9e3a948 in ino_release_cb (handle=0x7f217001ee20, vino=...) at /usr/src/debug/nfs-ganesha-5.6-4.el9cp.x86_64/src/FSAL/FSAL_CEPH/main.c:421
421		cm->cm_export->export.up_ops->try_release(
[Current thread is 1 (Thread 0x7f21217fa640 (LWP 2407))]
(gdb) 
(gdb) bt
#0  0x00007f21b9e3a948 in ino_release_cb (handle=0x7f217001ee20, vino=...) at /usr/src/debug/nfs-ganesha-5.6-4.el9cp.x86_64/src/FSAL/FSAL_CEPH/main.c:421
#1  0x00007f21946cca58 in C_Client_CacheInvalidate::finish (this=<optimized out>, r=<optimized out>)
    at /usr/src/debug/ceph-18.2.0-122.el9cp.x86_64/src/client/Client.cc:4259
#2  0x00007f21ba406ef5 in boost::wrapexcept<boost::bad_function_call>::clone() const [clone .localalias] [clone .lto_priv.0] ()
   from /usr/lib64/ceph/libceph-common.so.2
#3  0x00007f21bbbd4e5d in syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:37
#4  0x00007f2170087c28 in ?? ()
#5  0x0000000000000000 in ?? ()



=============
/var/log/messages (shows the Ganesha process crashed)
==============

Feb 13 18:14:32 cali015 systemd-coredump[2562514]: /etc/systemd/coredump.conf:84: Unknown key name 'DefaultLimitCORE' in section 'Coredump', ignoring.
Feb 13 18:14:32 cali015 systemd-coredump[2562514]: /etc/systemd/coredump.conf:86: Unknown key name 'DefaultLimitCORE' in section 'Coredump', ignoring.
Feb 13 18:14:32 cali015 systemd-coredump[2562514]: /etc/systemd/coredump.conf:88: Unknown key name 'DefaultLimitCORE' in section 'Coredump', ignoring.
Feb 13 18:14:40 cali015 systemd-coredump[2562514]: Process 1295300 (ganesha.nfsd) of user 0 dumped core.#012#012Stack trace of thread 2407:#012#0  0x00007f21b9e3a948 n/a (/usr/lib64/ganesha/libfsalceph.so + 0x5948)#012#1  0x0000000000000000 n/a (n/a + 0x0)#012ELF object binary architecture: AMD x86-64
Feb 13 18:14:40 cali015 systemd[1]: systemd-coredump: Deactivated successfully.
Feb 13 18:14:40 cali015 systemd[1]: systemd-coredump: Consumed 7.949s CPU time.
Feb 13 18:14:40 cali015 podman[2562537]: 2024-02-13 18:14:40.771625364 +0000 UTC m=+0.026349856 container died 0059dc90faab577439606913810508047a172f7f8c0c67b9487d747d702e023f (image=registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:46f9e0f44df507d93b089763ece7fcc55207f6b7fa0c5288bd3bee4200cd13e4, name=ceph-4e687a60-638e-11ee-8772-b49691cee574-nfs-cephfs-nfs-0-0-cali015-koeyxp, GIT_REPO=https://github.com/ceph/ceph-container.git, description=Red Hat Ceph Storage 7, ceph=True, architecture=x86_64, vcs-ref=6a3109234de1e767361375a550322ef998fe07ed, release=160, GIT_COMMIT=54fe819971d3d2dbde321203c5644c08d10742d5, com.redhat.license_terms=https://www.redhat.com/agreements, io.buildah.version=1.29.0, url=https://access.redhat.com/containers/#/registry.access.redhat.com/rhceph/images/7-160, build-date=2024-02-09T08:20:41, com.redhat.component=rhceph-container, io.k8s.display-name=Red Hat Ceph Storage 7 on RHEL 9, vcs-type=git, name=rhceph, CEPH_POINT_RELEASE=, RELEASE=main, version=7, GIT_CLEAN=True, GIT_BRANCH=main, io.openshift.expose-services=, vendor=Red Hat, Inc., maintainer=Guillaume Abrioux <gabrioux>, io.k8s.description=Red Hat Ceph Storage 7, distribution-scope=public, io.openshift.tags=rhceph ceph, summary=Provides the latest Red Hat Ceph Storage 7 on RHEL 9 in a fully featured and supported base image.)
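
For reference, the core noted above (PID 1295300, taken from the log) can typically be pulled from the journal on the host with coredumpctl; on a containerized deployment the matching ganesha.nfsd binary and debuginfo live inside the container image, so the paths below are illustrative:

coredumpctl list ganesha.nfsd
coredumpctl dump 1295300 --output=/tmp/ganesha.core
gdb /usr/bin/ganesha.nfsd /tmp/ganesha.core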



Version-Release number of selected component (if applicable):
==========

[ceph: root@cali013 /]# rpm -qa | grep nfs
libnfsidmap-2.5.4-20.el9.x86_64
nfs-utils-2.5.4-20.el9.x86_64
nfs-ganesha-selinux-5.6-4.el9cp.noarch
nfs-ganesha-5.6-4.el9cp.x86_64
nfs-ganesha-rgw-5.6-4.el9cp.x86_64
nfs-ganesha-ceph-5.6-4.el9cp.x86_64
nfs-ganesha-rados-grace-5.6-4.el9cp.x86_64
nfs-ganesha-rados-urls-5.6-4.el9cp.x86_64

[ceph: root@cali013 /]# ceph --version
ceph version 18.2.1-11.el9cp (97b964affece001761ade86aa09c96242b8ff651) reef (stable)


How reproducible:
=============
1/1


Steps to Reproduce:
===============
1. Deploy NFS Ganesha with HA (a deployment command sketch follows the cluster info below)

[ceph: root@cali013 /]# ceph nfs cluster info cephfs-nfs
{
  "cephfs-nfs": {
    "backend": [
      {
        "hostname": "cali015",
        "ip": "10.8.130.15",
        "port": 12049
      },
      {
        "hostname": "cali019",
        "ip": "10.8.130.19",
        "port": 12049
      }
    ],
    "monitor_port": 9049,
    "port": 2049,
    "virtual_ip": "10.8.130.236"
  }
}
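
For context, an HA (ingress) cluster of this shape is typically created along these lines; the placement list and the virtual-IP CIDR here are assumptions inferred from the cluster info above:

ceph nfs cluster create cephfs-nfs "cali015,cali019" --ingress --virtual_ip 10.8.130.236/24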


2. Create an FS volume (creation command shown after the listing)

[ceph: root@cali013 /]# ceph fs volume ls
[
    {
        "name": "cephfs"
    }
]
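
The volume itself is created with the standard command, shown for completeness:

ceph fs volume create cephfs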


3. Create 1500 exports on the cephfs volume (a scripted sketch follows the listing below)

[ceph: root@cali013 /]# ceph nfs export ls cephfs-nfs
[
  "/export_1",
  "/export_2",
  "/export_3",  .........> till 1500 (/export_1500)

4. Mount the exports on 100 clients with vers=4.1; each client mounts 15 exports.
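
A representative mount from one client (the mount-point path is hypothetical; the address is the virtual IP from the cluster info above):

mount -t nfs -o vers=4.1 10.8.130.236:/export_1 /mnt/export_1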

5. Run the smallfile I/O tool in parallel on all 1500 exports from the 100 clients.
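
A typical per-mount smallfile invocation looks like the following (a sketch; the operation, thread, and file counts are assumptions, not the values used in this run):

python3 smallfile_cli.py --operation create --threads 4 \
    --files 1024 --file-size 64 --top /mnt/export_1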


Actual results:
===========
The Ganesha process crashed partway through the test.


Expected results:
==========
The Ganesha process should not crash.


Additional info:
=============

Automated run logs for smallfile - http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-YQWKSL/Test_nfs_scale_with_SpecStorage_0.log

Comment 1 Sachin Punadikar 2024-02-16 05:01:22 UTC
Checked the core dump:

(gdb) bt
#0  0x00007f21b9e3a948 in ino_release_cb (handle=0x7f217001ee20, vino=...) at /usr/src/debug/nfs-ganesha-5.6-4.el9cp.x86_64/src/FSAL/FSAL_CEPH/main.c:421
#1  0x00007f21946cca58 in Client::_async_inode_release (ino=..., this=0x7f2170087360) at /usr/src/debug/ceph-18.2.1-11.el9cp.x86_64/src/client/Client.cc:4770
#2  C_Client_CacheRelease::finish (this=<optimized out>, r=<optimized out>) at /usr/src/debug/ceph-18.2.1-11.el9cp.x86_64/src/client/Client.cc:4759
#3  0x00007f2194660ced in Context::complete (this=0x7f20e8022430, r=<optimized out>) at /usr/src/debug/ceph-18.2.1-11.el9cp.x86_64/src/include/Context.h:99
#4  0x00007f21ba406ef5 in Finisher::finisher_thread_entry() () from /usr/lib64/ceph/libceph-common.so.2
#5  0x00007f21bbc35802 in start_thread (arg=<optimized out>) at pthread_create.c:443
#6  0x00007f21bbbd5450 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) f 0
#0  0x00007f21b9e3a948 in ino_release_cb (handle=0x7f217001ee20, vino=...) at /usr/src/debug/nfs-ganesha-5.6-4.el9cp.x86_64/src/FSAL/FSAL_CEPH/main.c:421
421		cm->cm_export->export.up_ops->try_release(
(gdb) p *cm
$1 = {
  cm_avl_mount = {
    left = 0x0,
    right = 0x0,
    parent = 2
  },
  cm_exports = {
    next = 0x7f2170034d18,
    prev = 0x7f2173739488
  },
  cm_refcnt = 3001,
  cmount = 0x7f217000ef80,
  cm_fs_name = 0x7f217001eea0 "cephfs",
  cm_mount_path = 0x7f217001eec0 "/",
  cm_user_id = 0x7f217001eee0 "nfs.cephfs-nfs.cephfs",
  cm_secret_key = 0x7f217000ab00 "AQD0hENlFTMaNBAAU3YIxz6Gbm8QvXAxLphO4g==",
  cm_fscid = 37,
  cm_export_id = 1,
  cm_export = 0x0    <= cm_export is NULL, which causes the segfault
}

The finisher thread is part of the Ceph client code; it uses the stored mount information (the callback handle) to execute the callback in Ganesha code. Because cm_export is NULL, dereferencing it in ino_release_cb segfaults. This suggests a race: the export was detached (clearing cm_export) while a cache-release callback for the same mount was still queued on the finisher thread.