Bug 2264146 - [NFS Ganesha] Ganesha crashed and dumped core in ino_release_cb while running the scale test with 1500 exports and 100 clients
Status: ASSIGNED
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: NFS-Ganesha
Version: 7.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 8.1z3
Assignee: Sachin Punadikar
QA Contact: Manisha Saini
Depends On: 2365869
 
Reported: 2024-02-14 07:26 UTC by Manisha Saini
Modified: 2025-06-23 15:22 UTC
CC List: 4 users





Links:
Red Hat Issue Tracker RHCEPH-8306 (last updated 2024-02-14 07:39:00 UTC)

Description Manisha Saini 2024-02-14 07:26:29 UTC
Description of problem:
================

While running a scale test with 1500 exports mounted across 100 clients, the Ganesha process crashed.

Tool used: smallfile
Exports per client: 15 (100 clients x 15 exports = 1500 exports)
NFS version: 4.1


Core
==================
Core was generated by `/usr/bin/ganesha.nfsd -F -L STDERR -N NIV_EVENT'.
Program terminated with signal SIGSEGV, Segmentation fault.

#0  0x00007f21b9e3a948 in ino_release_cb (handle=0x7f217001ee20, vino=...) at /usr/src/debug/nfs-ganesha-5.6-4.el9cp.x86_64/src/FSAL/FSAL_CEPH/main.c:421
421		cm->cm_export->export.up_ops->try_release(
[Current thread is 1 (Thread 0x7f21217fa640 (LWP 2407))]
(gdb) 
(gdb) bt
#0  0x00007f21b9e3a948 in ino_release_cb (handle=0x7f217001ee20, vino=...) at /usr/src/debug/nfs-ganesha-5.6-4.el9cp.x86_64/src/FSAL/FSAL_CEPH/main.c:421
#1  0x00007f21946cca58 in C_Client_CacheInvalidate::finish (this=<optimized out>, r=<optimized out>)
    at /usr/src/debug/ceph-18.2.0-122.el9cp.x86_64/src/client/Client.cc:4259
#2  0x00007f21ba406ef5 in boost::wrapexcept<boost::bad_function_call>::clone() const [clone .localalias] [clone .lto_priv.0] ()
   from /usr/lib64/ceph/libceph-common.so.2
#3  0x00007f21bbbd4e5d in syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:37
#4  0x00007f2170087c28 in ?? ()
#5  0x0000000000000000 in ?? ()



=============
/var/log/messages --> shows that the Ganesha process crashed
==============

Feb 13 18:14:32 cali015 systemd-coredump[2562514]: /etc/systemd/coredump.conf:84: Unknown key name 'DefaultLimitCORE' in section 'Coredump', ignoring.
Feb 13 18:14:32 cali015 systemd-coredump[2562514]: /etc/systemd/coredump.conf:86: Unknown key name 'DefaultLimitCORE' in section 'Coredump', ignoring.
Feb 13 18:14:32 cali015 systemd-coredump[2562514]: /etc/systemd/coredump.conf:88: Unknown key name 'DefaultLimitCORE' in section 'Coredump', ignoring.
Feb 13 18:14:40 cali015 systemd-coredump[2562514]: Process 1295300 (ganesha.nfsd) of user 0 dumped core.#012#012Stack trace of thread 2407:#012#0  0x00007f21b9e3a948 n/a (/usr/lib64/ganesha/libfsalceph.so + 0x5948)#012#1  0x0000000000000000 n/a (n/a + 0x0)#012ELF object binary architecture: AMD x86-64
Feb 13 18:14:40 cali015 systemd[1]: systemd-coredump: Deactivated successfully.
Feb 13 18:14:40 cali015 systemd[1]: systemd-coredump: Consumed 7.949s CPU time.
Feb 13 18:14:40 cali015 podman[2562537]: 2024-02-13 18:14:40.771625364 +0000 UTC m=+0.026349856 container died 0059dc90faab577439606913810508047a172f7f8c0c67b9487d747d702e023f (image=registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:46f9e0f44df507d93b089763ece7fcc55207f6b7fa0c5288bd3bee4200cd13e4, name=ceph-4e687a60-638e-11ee-8772-b49691cee574-nfs-cephfs-nfs-0-0-cali015-koeyxp, GIT_REPO=https://github.com/ceph/ceph-container.git, description=Red Hat Ceph Storage 7, ceph=True, architecture=x86_64, vcs-ref=6a3109234de1e767361375a550322ef998fe07ed, release=160, GIT_COMMIT=54fe819971d3d2dbde321203c5644c08d10742d5, com.redhat.license_terms=https://www.redhat.com/agreements, io.buildah.version=1.29.0, url=https://access.redhat.com/containers/#/registry.access.redhat.com/rhceph/images/7-160, build-date=2024-02-09T08:20:41, com.redhat.component=rhceph-container, io.k8s.display-name=Red Hat Ceph Storage 7 on RHEL 9, vcs-type=git, name=rhceph, CEPH_POINT_RELEASE=, RELEASE=main, version=7, GIT_CLEAN=True, GIT_BRANCH=main, io.openshift.expose-services=, vendor=Red Hat, Inc., maintainer=Guillaume Abrioux <gabrioux>, io.k8s.description=Red Hat Ceph Storage 7, distribution-scope=public, io.openshift.tags=rhceph ceph, summary=Provides the latest Red Hat Ceph Storage 7 on RHEL 9 in a fully featured and supported base image.)



Version-Release number of selected component (if applicable):
==========

[ceph: root@cali013 /]# rpm -qa | grep nfs
libnfsidmap-2.5.4-20.el9.x86_64
nfs-utils-2.5.4-20.el9.x86_64
nfs-ganesha-selinux-5.6-4.el9cp.noarch
nfs-ganesha-5.6-4.el9cp.x86_64
nfs-ganesha-rgw-5.6-4.el9cp.x86_64
nfs-ganesha-ceph-5.6-4.el9cp.x86_64
nfs-ganesha-rados-grace-5.6-4.el9cp.x86_64
nfs-ganesha-rados-urls-5.6-4.el9cp.x86_64

[ceph: root@cali013 /]# ceph --version
ceph version 18.2.1-11.el9cp (97b964affece001761ade86aa09c96242b8ff651) reef (stable)


How reproducible:
=============
1/1


Steps to Reproduce:
===============
1. Deploy NFS Ganesha with HA

[ceph: root@cali013 /]# ceph nfs cluster info cephfs-nfs
{
  "cephfs-nfs": {
    "backend": [
      {
        "hostname": "cali015",
        "ip": "10.8.130.15",
        "port": 12049
      },
      {
        "hostname": "cali019",
        "ip": "10.8.130.19",
        "port": 12049
      }
    ],
    "monitor_port": 9049,
    "port": 2049,
    "virtual_ip": "10.8.130.236"
  }
}


2. Create a CephFS volume

[ceph: root@cali013 /]# ceph fs volume ls
[
    {
        "name": "cephfs"
    }
]


3. Create 1500 exports on the cephfs volume

[ceph: root@cali013 /]# ceph nfs export ls cephfs-nfs
[
  "/export_1",
  "/export_2",
  "/export_3",  .........> till 1500 (/export_1500)

4. Mount the exports on 100 clients with vers=4.1; each client mounts 15 exports

5. Run the smallfile I/O tool in parallel against all 1500 exports from the 100 clients


Actual results:
===========
The Ganesha process crashed partway through the test.


Expected results:
==========
Ganesha should not crash.


Additional info:
=============

Automated run logs for smallfile - http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-YQWKSL/Test_nfs_scale_with_SpecStorage_0.log

Comment 1 Sachin Punadikar 2024-02-16 05:01:22 UTC
Checked the core dump

(gdb) bt
#0  0x00007f21b9e3a948 in ino_release_cb (handle=0x7f217001ee20, vino=...) at /usr/src/debug/nfs-ganesha-5.6-4.el9cp.x86_64/src/FSAL/FSAL_CEPH/main.c:421
#1  0x00007f21946cca58 in Client::_async_inode_release (ino=..., this=0x7f2170087360) at /usr/src/debug/ceph-18.2.1-11.el9cp.x86_64/src/client/Client.cc:4770
#2  C_Client_CacheRelease::finish (this=<optimized out>, r=<optimized out>) at /usr/src/debug/ceph-18.2.1-11.el9cp.x86_64/src/client/Client.cc:4759
#3  0x00007f2194660ced in Context::complete (this=0x7f20e8022430, r=<optimized out>) at /usr/src/debug/ceph-18.2.1-11.el9cp.x86_64/src/include/Context.h:99
#4  0x00007f21ba406ef5 in Finisher::finisher_thread_entry() () from /usr/lib64/ceph/libceph-common.so.2
#5  0x00007f21bbc35802 in start_thread (arg=<optimized out>) at pthread_create.c:443
#6  0x00007f21bbbd5450 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) f 0
#0  0x00007f21b9e3a948 in ino_release_cb (handle=0x7f217001ee20, vino=...) at /usr/src/debug/nfs-ganesha-5.6-4.el9cp.x86_64/src/FSAL/FSAL_CEPH/main.c:421
421		cm->cm_export->export.up_ops->try_release(
(gdb) p *cm
$1 = {
  cm_avl_mount = {
    left = 0x0,
    right = 0x0,
    parent = 2
  },
  cm_exports = {
    next = 0x7f2170034d18,
    prev = 0x7f2173739488
  },
  cm_refcnt = 3001,
  cmount = 0x7f217000ef80,
  cm_fs_name = 0x7f217001eea0 "cephfs",
  cm_mount_path = 0x7f217001eec0 "/",
  cm_user_id = 0x7f217001eee0 "nfs.cephfs-nfs.cephfs",
  cm_secret_key = 0x7f217000ab00 "AQD0hENlFTMaNBAAU3YIxz6Gbm8QvXAxLphO4g==",
  cm_fscid = 37,
  cm_export_id = 1,
  cm_export = 0x0    <= cm_export is NULL, hence the segfault
}

The finisher thread belongs to the Ceph client code; it uses the stored mount-node information to execute a callback into Ganesha code. Because "cm_export" is NULL at that point, dereferencing it in ino_release_cb segfaults.

