Description of problem:
================
While running a scale test with 1500 exports mounted on 100 clients, the Ganesha process crashed.

Tool used : Smallfile
Exports per client : 15
NFS version : v4.1

Core
==================
Core was generated by `/usr/bin/ganesha.nfsd -F -L STDERR -N NIV_EVENT'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f21b9e3a948 in ino_release_cb (handle=0x7f217001ee20, vino=...) at /usr/src/debug/nfs-ganesha-5.6-4.el9cp.x86_64/src/FSAL/FSAL_CEPH/main.c:421
421         cm->cm_export->export.up_ops->try_release(
[Current thread is 1 (Thread 0x7f21217fa640 (LWP 2407))]
(gdb) bt
#0  0x00007f21b9e3a948 in ino_release_cb (handle=0x7f217001ee20, vino=...) at /usr/src/debug/nfs-ganesha-5.6-4.el9cp.x86_64/src/FSAL/FSAL_CEPH/main.c:421
#1  0x00007f21946cca58 in C_Client_CacheInvalidate::finish (this=<optimized out>, r=<optimized out>) at /usr/src/debug/ceph-18.2.0-122.el9cp.x86_64/src/client/Client.cc:4259
#2  0x00007f21ba406ef5 in boost::wrapexcept<boost::bad_function_call>::clone() const [clone .localalias] [clone .lto_priv.0] () from /usr/lib64/ceph/libceph-common.so.2
#3  0x00007f21bbbd4e5d in syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:37
#4  0x00007f2170087c28 in ?? ()
#5  0x0000000000000000 in ?? ()

/var/log/messages --> shows the ganesha process crashed
==============
Feb 13 18:14:32 cali015 systemd-coredump[2562514]: /etc/systemd/coredump.conf:84: Unknown key name 'DefaultLimitCORE' in section 'Coredump', ignoring.
Feb 13 18:14:32 cali015 systemd-coredump[2562514]: /etc/systemd/coredump.conf:86: Unknown key name 'DefaultLimitCORE' in section 'Coredump', ignoring.
Feb 13 18:14:32 cali015 systemd-coredump[2562514]: /etc/systemd/coredump.conf:88: Unknown key name 'DefaultLimitCORE' in section 'Coredump', ignoring.
Feb 13 18:14:40 cali015 systemd-coredump[2562514]: Process 1295300 (ganesha.nfsd) of user 0 dumped core.#012#012Stack trace of thread 2407:#012#0 0x00007f21b9e3a948 n/a (/usr/lib64/ganesha/libfsalceph.so + 0x5948)#012#1 0x0000000000000000 n/a (n/a + 0x0)#012ELF object binary architecture: AMD x86-64
Feb 13 18:14:40 cali015 systemd[1]: systemd-coredump: Deactivated successfully.
Feb 13 18:14:40 cali015 systemd[1]: systemd-coredump: Consumed 7.949s CPU time.
Feb 13 18:14:40 cali015 podman[2562537]: 2024-02-13 18:14:40.771625364 +0000 UTC m=+0.026349856 container died 0059dc90faab577439606913810508047a172f7f8c0c67b9487d747d702e023f (image=registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:46f9e0f44df507d93b089763ece7fcc55207f6b7fa0c5288bd3bee4200cd13e4, name=ceph-4e687a60-638e-11ee-8772-b49691cee574-nfs-cephfs-nfs-0-0-cali015-koeyxp, GIT_REPO=https://github.com/ceph/ceph-container.git, description=Red Hat Ceph Storage 7, ceph=True, architecture=x86_64, vcs-ref=6a3109234de1e767361375a550322ef998fe07ed, release=160, GIT_COMMIT=54fe819971d3d2dbde321203c5644c08d10742d5, com.redhat.license_terms=https://www.redhat.com/agreements, io.buildah.version=1.29.0, url=https://access.redhat.com/containers/#/registry.access.redhat.com/rhceph/images/7-160, build-date=2024-02-09T08:20:41, com.redhat.component=rhceph-container, io.k8s.display-name=Red Hat Ceph Storage 7 on RHEL 9, vcs-type=git, name=rhceph, CEPH_POINT_RELEASE=, RELEASE=main, version=7, GIT_CLEAN=True, GIT_BRANCH=main, io.openshift.expose-services=, vendor=Red Hat, Inc., maintainer=Guillaume Abrioux <gabrioux>, io.k8s.description=Red Hat Ceph Storage 7, distribution-scope=public, io.openshift.tags=rhceph ceph, summary=Provides the latest Red Hat Ceph Storage 7 on RHEL 9 in a fully featured and supported base image.)

Version-Release number of selected component (if applicable):
==========
[ceph: root@cali013 /]# rpm -qa | grep nfs
libnfsidmap-2.5.4-20.el9.x86_64
nfs-utils-2.5.4-20.el9.x86_64
nfs-ganesha-selinux-5.6-4.el9cp.noarch
nfs-ganesha-5.6-4.el9cp.x86_64
nfs-ganesha-rgw-5.6-4.el9cp.x86_64
nfs-ganesha-ceph-5.6-4.el9cp.x86_64
nfs-ganesha-rados-grace-5.6-4.el9cp.x86_64
nfs-ganesha-rados-urls-5.6-4.el9cp.x86_64

[ceph: root@cali013 /]# ceph --version
ceph version 18.2.1-11.el9cp (97b964affece001761ade86aa09c96242b8ff651) reef (stable)

How reproducible:
=============
1/1

Steps to Reproduce:
===============
1. Deploy NFS Ganesha with HA:
[ceph: root@cali013 /]# ceph nfs cluster info cephfs-nfs
{
  "cephfs-nfs": {
    "backend": [
      {
        "hostname": "cali015",
        "ip": "10.8.130.15",
        "port": 12049
      },
      {
        "hostname": "cali019",
        "ip": "10.8.130.19",
        "port": 12049
      }
    ],
    "monitor_port": 9049,
    "port": 2049,
    "virtual_ip": "10.8.130.236"
  }
}
2. Create an fs volume:
[ceph: root@cali013 /]# ceph fs volume ls
[
  {
    "name": "cephfs"
  }
]
3. Create 1500 exports on the cephfs volume:
[ceph: root@cali013 /]# ceph nfs export ls cephfs-nfs
[
  "/export_1",
  "/export_2",
  "/export_3",
  .........> till 1500 (/export_1500)
]
4. Mount the exports on 100 clients with vers=4.1, 15 exports per client.
5. Run the smallfile IO tool in parallel on all 1500 exports from the 100 clients.

Actual results:
===========
The Ganesha process crashed partway through the test.

Expected results:
==========
Ganesha should not crash.

Additional info:
=============
Automated run logs for smallfile: http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-YQWKSL/Test_nfs_scale_with_SpecStorage_0.log
Checked the core dump:

(gdb) bt
#0  0x00007f21b9e3a948 in ino_release_cb (handle=0x7f217001ee20, vino=...) at /usr/src/debug/nfs-ganesha-5.6-4.el9cp.x86_64/src/FSAL/FSAL_CEPH/main.c:421
#1  0x00007f21946cca58 in Client::_async_inode_release (ino=..., this=0x7f2170087360) at /usr/src/debug/ceph-18.2.1-11.el9cp.x86_64/src/client/Client.cc:4770
#2  C_Client_CacheRelease::finish (this=<optimized out>, r=<optimized out>) at /usr/src/debug/ceph-18.2.1-11.el9cp.x86_64/src/client/Client.cc:4759
#3  0x00007f2194660ced in Context::complete (this=0x7f20e8022430, r=<optimized out>) at /usr/src/debug/ceph-18.2.1-11.el9cp.x86_64/src/include/Context.h:99
#4  0x00007f21ba406ef5 in Finisher::finisher_thread_entry() () from /usr/lib64/ceph/libceph-common.so.2
#5  0x00007f21bbc35802 in start_thread (arg=<optimized out>) at pthread_create.c:443
#6  0x00007f21bbbd5450 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

(gdb) f 0
#0  0x00007f21b9e3a948 in ino_release_cb (handle=0x7f217001ee20, vino=...) at /usr/src/debug/nfs-ganesha-5.6-4.el9cp.x86_64/src/FSAL/FSAL_CEPH/main.c:421
421         cm->cm_export->export.up_ops->try_release(

(gdb) p *cm
$1 = {
  cm_avl_mount = {
    left = 0x0,
    right = 0x0,
    parent = 2
  },
  cm_exports = {
    next = 0x7f2170034d18,
    prev = 0x7f2173739488
  },
  cm_refcnt = 3001,
  cmount = 0x7f217000ef80,
  cm_fs_name = 0x7f217001eea0 "cephfs",
  cm_mount_path = 0x7f217001eec0 "/",
  cm_user_id = 0x7f217001eee0 "nfs.cephfs-nfs.cephfs",
  cm_secret_key = 0x7f217000ab00 "AQD0hENlFTMaNBAAU3YIxz6Gbm8QvXAxLphO4g==",
  cm_fscid = 37,
  cm_export_id = 1,
  cm_export = 0x0    <= cm_export is NULL, which is what causes the segfault
}

The finisher thread belongs to the Ceph client code; it uses the stored handle (the struct ceph_mount above) to run a callback back in Ganesha code. Because cm_export is NULL when line 421 dereferences it to reach the up-ops vector, the process segfaults.
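For illustration, a minimal sketch of a defensive guard at the crash site is below. It assumes this is the callback FSAL_CEPH registers with libcephfs (via ceph_ll_register_callbacks) so the client can ask Ganesha to release cached inodes, and that cm_export gets cleared when the export detaches from the shared cmount while the Ceph finisher thread still has queued releases. The key construction and the try_release() argument list are illustrative assumptions (the actual call at line 421 is truncated in the gdb output); this is not the shipped nfs-ganesha fix.

/* Hypothetical sketch of a guarded ino_release_cb for
 * src/FSAL/FSAL_CEPH/main.c; field names follow the "p *cm" dump above. */
static void ino_release_cb(void *handle, vinodeno_t vino)
{
	struct ceph_mount *cm = handle;
	struct gsh_buffdesc key;

	/* The Ceph client's finisher thread can still drain queued
	 * release callbacks after the export using this shared cmount
	 * has been torn down, at which point cm_export has been reset
	 * to NULL -- exactly the state captured in the core dump. */
	if (cm->cm_export == NULL)
		return;

	/* Argument list is an assumption for illustration only. */
	key.addr = &vino;
	key.len = sizeof(vino);
	cm->cm_export->export.up_ops->try_release(
		cm->cm_export->export.up_ops, &key, 0);
}

Note that a NULL check alone only narrows the window rather than closing the race: without additional synchronization (for example, holding a lock across the check and the up-call, or unregistering the callback and flushing the finisher queue before clearing cm_export), the export could still be destroyed between the check and the dereference.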