Bug 2247762 - NFS Ganesha crashed with "FAILED ceph_assert(cur->_list)" while running smallfile tool in loop [NEEDINFO]
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: NFS-Ganesha
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 8.1z2
Assignee: Sachin Punadikar
QA Contact: Manisha Saini
URL:
Whiteboard:
Depends On: 2365869
Blocks:
 
Reported: 2023-11-03 08:46 UTC by Manisha Saini
Modified: 2025-06-23 15:19 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:
vshankar: needinfo? (spunadik)




Links:
Red Hat Issue Tracker: RHCEPH-7850 (last updated 2023-11-03 08:50:01 UTC)

Description Manisha Saini 2023-11-03 08:46:52 UTC
Description of problem:
=============

The NFS-Ganesha daemon crashed while running the smallfile tool in a loop from multiple clients against multiple exports.

ganesha.log
===========
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]: /builddir/build/BUILD/ceph-18.2.0/src/include/xlist.h: In function 'xlist<T>::const_iterator& xlist<T>::const_iterator::operator++() [with T = Dentry*]' thread 7fa9796eb640 time 2023-11-03T00:08:12.895127+0000
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]: /builddir/build/BUILD/ceph-18.2.0/src/include/xlist.h: 205: FAILED ceph_assert(cur->_list)
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]:  ceph version 18.2.0-113.el9cp (32cbda69435c7145d09eeaf5b5016e5d46370a5d) reef (stable)
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x12e) [0x7faa050e2f57]
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]:  2: /usr/lib64/ceph/libceph-common.so.2(+0x161115) [0x7faa050e3115]
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]:  3: /lib64/libcephfs.so.2(+0x1058be) [0x7faa001ad8be]
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]:  4: ceph_ll_sync_inode()
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]:  5: /usr/lib64/ganesha/libfsalceph.so(+0xc571) [0x7faa04bd3571]
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]:  6: /lib64/libganesha_nfsd.so.5.6(+0x106c16) [0x7faa06c34c16]
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]:  7: fsal_setattr()
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]:  8: /lib64/libganesha_nfsd.so.5.6(+0xddeed) [0x7faa06c0beed]
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]:  9: /lib64/libganesha_nfsd.so.5.6(+0xbedfd) [0x7faa06becdfd]
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]:  10: /lib64/libganesha_nfsd.so.5.6(+0xc3048) [0x7faa06bf1048]
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]:  11: /lib64/libganesha_nfsd.so.5.6(+0x57b66) [0x7faa06b85b66]
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]:  12: /lib64/libntirpc.so.5.0(+0x1d387) [0x7faa068fd387]
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]:  13: /lib64/libntirpc.so.5.0(+0x2178a) [0x7faa0690178a]
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]:  14: /lib64/libntirpc.so.5.0(+0x24fcb) [0x7faa06904fcb]
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]:  15: /lib64/libntirpc.so.5.0(+0x2e95c) [0x7faa0690e95c]
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]:  16: /lib64/libc.so.6(+0x9f802) [0x7faa069c4802]
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]:  17: /lib64/libc.so.6(+0x3f450) [0x7faa06964450]
Nov 03 00:08:19 argo016 podman[2175299]: 2023-11-03 00:08:19.201051869 +0000 UTC m=+0.032517602 container died 65c57a779e96586a6950480f2be4cb3733916b83a52f76d04e46b4e2c3d27d79 (image=registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:c4f1cf9ac78d873050a567493eb3a4956c661a1fdb52cf58ba05c995936f5ba0, name=ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd, distribution-scope=public, architecture=x86_64, vcs-type=git, GIT_CLEAN=True, release=120, description=Red Hat Ceph Storage 7, io.buildah.version=1.29.0, GIT_BRANCH=main, maintainer=Guillaume Abrioux <gabrioux>, name=rhceph, RELEASE=main, vcs-ref=6a3109234de1e767361375a550322ef998fe07ed, com.redhat.component=rhceph-container, io.k8s.display-name=Red Hat Ceph Storage 7 on RHEL 9, build-date=2023-11-01T22:27:47, version=7, summary=Provides the latest Red Hat Ceph Storage 7 on RHEL 9 in a fully featured and supported base image., io.openshift.expose-services=, vendor=Red Hat, Inc., GIT_REPO=https://github.com/ceph/ceph-container.git, GIT_COMMIT=54fe819971d3d2dbde321203c5644c08d10742d5, com.redhat.license_terms=https://www.redhat.com/agreements, io.openshift.tags=rhceph ceph, CEPH_POINT_RELEASE=, ceph=True, io.k8s.description=Red Hat Ceph Storage 7, url=https://access.redhat.com/containers/#/registry.access.redhat.com/rhceph/images/7-120)
Nov 03 00:08:19 argo016 podman[2175299]: 2023-11-03 00:08:19.21832443 +0000 UTC m=+0.049790163 container remove 65c57a779e96586a6950480f2be4cb3733916b83a52f76d04e46b4e2c3d27d79 (image=registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:c4f1cf9ac78d873050a567493eb3a4956c661a1fdb52cf58ba05c995936f5ba0, name=ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd, distribution-scope=public, GIT_BRANCH=main, RELEASE=main, description=Red Hat Ceph Storage 7, ceph=True, maintainer=Guillaume Abrioux <gabrioux>, name=rhceph, GIT_REPO=https://github.com/ceph/ceph-container.git, io.k8s.display-name=Red Hat Ceph Storage 7 on RHEL 9, io.openshift.expose-services=, version=7, io.k8s.description=Red Hat Ceph Storage 7, com.redhat.component=rhceph-container, vcs-type=git, vendor=Red Hat, Inc., com.redhat.license_terms=https://www.redhat.com/agreements, CEPH_POINT_RELEASE=, summary=Provides the latest Red Hat Ceph Storage 7 on RHEL 9 in a fully featured and supported base image., GIT_COMMIT=54fe819971d3d2dbde321203c5644c08d10742d5, url=https://access.redhat.com/containers/#/registry.access.redhat.com/rhceph/images/7-120, build-date=2023-11-01T22:27:47, vcs-ref=6a3109234de1e767361375a550322ef998fe07ed, architecture=x86_64, io.openshift.tags=rhceph ceph, release=120, io.buildah.version=1.29.0, GIT_CLEAN=True)
Nov 03 00:08:19 argo016 systemd[1]: ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844.0.0.argo016.faezmd.service: Main process exited, code=exited, status=134/n/a
Nov 03 00:08:19 argo016 systemd[1]: ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844.0.0.argo016.faezmd.service: Failed with result 'exit-code'.
Nov 03 00:08:19 argo016 systemd[1]: ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844.0.0.argo016.faezmd.service: Consumed 1h 31min 3.649s CPU time.
Nov 03 00:08:29 argo016 systemd[1]: ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844.0.0.argo016.faezmd.service: Scheduled restart job, restart counter is at 1.
Nov 03 00:08:29 argo016 systemd[1]: Stopped Ceph nfs.nfsganesha.0.0.argo016.faezmd for 348978a2-72b5-11ee-88f8-ac1f6b0a1844.
Nov 03 00:08:29 argo016 systemd[1]: ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844.0.0.argo016.faezmd.service: Consumed 1h 31min 3.649s CPU time.
Nov 03 00:08:29 argo016 systemd[1]: Starting Ceph nfs.nfsganesha.0.0.argo016.faezmd for 348978a2-72b5-11ee-88f8-ac1f6b0a1844...



Version-Release number of selected component (if applicable):
=====================
# rpm -qa | grep nfs
libnfsidmap-2.5.4-18.el9.x86_64
nfs-utils-2.5.4-18.el9.x86_64
nfs-ganesha-selinux-5.6-3.el9cp.noarch
nfs-ganesha-5.6-3.el9cp.x86_64
nfs-ganesha-rgw-5.6-3.el9cp.x86_64
nfs-ganesha-ceph-5.6-3.el9cp.x86_64
nfs-ganesha-rados-grace-5.6-3.el9cp.x86_64
nfs-ganesha-rados-urls-5.6-3.el9cp.x86_64

# ceph --version
ceph version 18.2.0-113.el9cp (32cbda69435c7145d09eeaf5b5016e5d46370a5d) reef (stable)


How reproducible:
================
1/1


Steps to Reproduce:
=============

1. Create an NFS Ganesha cluster
# ceph nfs cluster info nfsganesha
{
  "nfsganesha": {
    "backend": [
      {
        "hostname": "argo016",
        "ip": "10.8.128.216",
        "port": 2049
      },
      {
        "hostname": "argo018",
        "ip": "10.8.128.218",
        "port": 2049
      }
    ],
    "virtual_ip": null
  }
}

2. Create a CephFS file system, 1 subvolume group, and 15 subvolumes

# ceph fs subvolume ls cephfs --group_name ganeshagroup
[
    {
        "name": "ganesha4"
    },
    {
        "name": "ganesha8"
    },
    {
        "name": "ganesha15"
    },
    {
        "name": "ganesha1"
    },
    {
        "name": "ganesha2"
    },
    {
        "name": "ganesha13"
    },
    {
        "name": "ganesha6"
    },
    {
        "name": "ganesha11"
    },
    {
        "name": "ganesha14"
    },
    {
        "name": "ganesha9"
    },
    {
        "name": "ganesha10"
    },
    {
        "name": "ganesha5"
    },
    {
        "name": "ganesha12"
    },
    {
        "name": "ganesha7"
    },
    {
        "name": "ganesha3"
    }
]

3. Create Ganesha exports from these subvolumes
# ceph nfs export ls nfsganesha           
[
  "/ganesha1",
  "/ganesha2",
  "/ganesha3",
  "/ganesha4",
  "/ganesha5",
  "/ganesha6",
  "/ganesha7",
  "/ganesha8",
  "/ganesha9",
  "/ganesha10",
  "/ganesha11",
  "/ganesha12",
  "/ganesha13",
  "/ganesha14",
  "/ganesha15"
]

4. Mount 10 exports on 10 clients and run the smallfile tool in a loop

# for i in $(seq 1 10); do python3 smallfile_cli.py --operation create --threads 4 --file-size 4194 --files 1024 --files-per-dir 10 --dirs-per-dir 2 --record-size 128 --top /mnt/ganesha/smallfile$i --output-json=create.json;done

Actual results:
=======
NFS-Ganesha daemon crashed


Expected results:
======
No crashes should be observed


Additional info:
========
Core file generated
---------------
[root@argo016 coredump]# ls
'core.ganesha\x2enfsd.0.80243653daa041879ce857169c3a9343.2072834.1698970092000000'  'core.ganesha\x2enfsd.0.80243653daa041879ce857169c3a9343.2072834.1698970092000000.zst'
[root@argo016 coredump]# 


Unable to get a usable backtrace from the coredump file (missing debuginfo)
----
Core was generated by `/usr/bin/ganesha.nfsd -F -L STDERR -N NIV_EVENT'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007faa069c654c in ?? ()
[Current thread is 1 (LWP 3173)]
(gdb) bt
#0  0x00007faa069c654c in ?? ()
#1  0x0000000000000000 in ?? ()

Comment 1 Manisha Saini 2023-11-04 20:20:06 UTC
logs and coredump - http://magna002.ceph.redhat.com/ceph-qe-logs/msaini/BZ_2247762/

Comment 2 Sachin Punadikar 2023-11-06 13:17:30 UTC
Attempted to recreate the issue, but was not able to.
Ran the smallfile tool with 3 clients and 3 exports in a loop, but the crash did not reproduce.
Next step: set up a debug environment for analysing the collected core dump.

Comment 3 Manisha Saini 2023-11-06 14:44:54 UTC

warning: .dynamic section for "/lib64/libcurl.so.4" is not at the expected address (wrong library or version mismatch?)
Missing separate debuginfo for /lib64/libwbclient.so.0
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/8b/15f192620998eafe7d402ab162a0d5fba9ed50.debug
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Missing separate debuginfos, use: dnf debuginfo-install dbus-libs-1.12.20-7.el9_2.1.x86_64 glibc-2.34-60.el9_2.7.x86_64 gssproxy-0.8.4-5.el9_2.x86_64 keyutils-libs-1.6.3-1.el9.x86_64 krb5-libs-1.20.1-9.el9_2.x86_64 libacl-2.3.1-3.el9.x86_64 libattr-2.5.1-3.el9.x86_64 libblkid-2.37.4-11.el9_2.x86_64 libcap-2.48-9.el9_2.x86_64 libcephfs2-18.2.0-113.el9cp.x86_64 libcurl-7.76.1-23.el9_2.4.x86_64 libgcc-11.3.1-4.3.el9.x86_64 libgcrypt-1.10.0-10.el9_2.x86_64 libibverbs-44.0-2.el9.x86_64 libnfsidmap-2.5.4-18.el9.x86_64 libnl3-3.7.0-1.el9.x86_64 librados2-18.2.0-113.el9cp.x86_64 librdmacm-44.0-2.el9.x86_64 libselinux-3.5-1.el9.x86_64 libstdc++-11.3.1-4.3.el9.x86_64 libzstd-1.5.1-2.el9.x86_64 lttng-ust-2.12.0-6.el9.x86_64 lz4-libs-1.9.3-5.el9.x86_64 numactl-libs-2.0.14-9.el9.x86_64 openssl-libs-3.0.7-17.el9_2.x86_64 pcre2-10.40-2.el9.x86_64 sssd-client-2.8.2-3.el9_2.x86_64 systemd-libs-252-14.el9_2.3.x86_64 thrift-0.15.0-2.el9cp.x86_64 userspace-rcu-0.12.1-6.el9.x86_64 zlib-1.2.11-39.el9.x86_64
Core was generated by `/usr/bin/ganesha.nfsd -F -L STDERR -N NIV_EVENT'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007faa069c654c in __pthread_kill_implementation () from /lib64/libc.so.6
[Current thread is 1 (Thread 0x7fa9796eb640 (LWP 3173))]
(gdb) bt
#0  0x00007faa069c654c in __pthread_kill_implementation () from /lib64/libc.so.6
#1  0x00007faa06979d46 in raise () from /lib64/libc.so.6
#2  0x00007faa0694d7f3 in abort () from /lib64/libc.so.6
#3  0x00007faa050e2fb1 in ceph::__ceph_assert_fail(char const*, char const*, int, char const*) () from /usr/lib64/ceph/libceph-common.so.2
#4  0x00007faa050e3115 in ceph::__ceph_assert_fail(ceph::assert_data const&) () from /usr/lib64/ceph/libceph-common.so.2
#5  0x00007faa001ad8be in operator<<(std::ostream&, Inode const&) () from /lib64/libcephfs.so.2
#6  0x00007faa0010bd35 in ceph_ll_sync_inode () from /lib64/libcephfs.so.2
#7  0x00007faa04bd3571 in fsal_ceph_ll_setattr (creds=<optimized out>, mask=408, stx=0x7fa9796e9230, i=0x7fa97d8442b0, cmount=0x7fa990002df0)
    at /usr/src/debug/nfs-ganesha-5.6-3.el9cp.x86_64/src/FSAL/FSAL_CEPH/statx_compat.h:210
#8  ceph_fsal_setattr2 (obj_hdl=0x7fa82823a660, bypass=<optimized out>, state=<optimized out>, attrib_set=0x7fa9796e94b0)
    at /usr/src/debug/nfs-ganesha-5.6-3.el9cp.x86_64/src/FSAL/FSAL_CEPH/handle.c:2827
#9  0x00007faa06c34c16 in mdcache_setattr2 (obj_hdl=0x7fa9240fe088, bypass=<optimized out>, state=0x0, attrs=0x7fa9796e94b0)
    at /usr/src/debug/nfs-ganesha-5.6-3.el9cp.x86_64/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:1049
#10 0x00007faa06b6540b in fsal_setattr (obj=0x7fa9240fe088, bypass=<optimized out>, state=0x0, attr=0x7fa9796e94b0)
    at /usr/src/debug/nfs-ganesha-5.6-3.el9cp.x86_64/src/FSAL/fsal_helper.c:596
#11 0x00007faa06c0beed in nfs4_op_setattr (op=0x7fa8282c6060, data=0x7fa8280a8200, resp=0x7fa8282d1fd0)
    at /usr/src/debug/nfs-ganesha-5.6-3.el9cp.x86_64/src/Protocols/NFS/nfs4_op_setattr.c:219
#12 0x00007faa06becdfd in process_one_op (data=0x7fa8280a8200, status=0x7fa9796e968c)
    at /usr/src/debug/nfs-ganesha-5.6-3.el9cp.x86_64/src/Protocols/NFS/nfs4_Compound.c:912
#13 0x00007faa06bf1048 in nfs4_Compound (arg=<optimized out>, req=0x7fa82809f210, res=0x7fa8282cf8f0)
    at /usr/src/debug/nfs-ganesha-5.6-3.el9cp.x86_64/src/Protocols/NFS/nfs4_Compound.c:1376
#14 0x00007faa06b85b66 in nfs_rpc_process_request (reqdata=0x7fa82809f210, retry=<optimized out>)
    at /usr/src/debug/nfs-ganesha-5.6-3.el9cp.x86_64/src/MainNFSD/nfs_worker_thread.c:1517
#15 0x00007faa068fd387 in svc_request (xprt=0x7fa95800fbb0, xdrs=<optimized out>)
    at /usr/src/debug/libntirpc-5.0-1.el9cp.x86_64/src/svc_rqst.c:1202
#16 0x00007faa0690178a in svc_rqst_xprt_task_recv (wpe=<optimized out>) at /usr/src/debug/libntirpc-5.0-1.el9cp.x86_64/src/svc_rqst.c:1183
#17 0x00007faa06904fcb in svc_rqst_epoll_loop (wpe=0x556273e63498) at /usr/src/debug/libntirpc-5.0-1.el9cp.x86_64/src/svc_rqst.c:1564
#18 0x00007faa0690e95c in work_pool_thread (arg=0x7fa8f807a3a0) at /usr/src/debug/libntirpc-5.0-1.el9cp.x86_64/src/work_pool.c:183
#19 0x00007faa069c4802 in start_thread () from /lib64/libc.so.6
#20 0x00007faa06964450 in clone3 () from /lib64/libc.so.6
(gdb)

Comment 4 Sachin Punadikar 2023-11-07 12:58:40 UTC
Looked at the core dump.
The actual issue is in the Ceph client code.

(gdb) bt
#0  0x00007faa069c654c in __pthread_kill_implementation () from /lib64/libc.so.6
#1  0x00007faa06979d46 in raise () from /lib64/libc.so.6
#2  0x00007faa0694d7f3 in abort () from /lib64/libc.so.6
#3  0x00007faa050e2fb1 in ceph::__ceph_assert_fail(char const*, char const*, int, char const*) () from /usr/lib64/ceph/libceph-common.so.2
#4  0x00007faa050e3115 in ceph::__ceph_assert_fail(ceph::assert_data const&) () from /usr/lib64/ceph/libceph-common.so.2
#5  0x00007faa001ad8be in xlist<Dentry*>::const_iterator::operator++ (this=<synthetic pointer>) at /usr/src/debug/ceph-18.2.0-113.el9cp.x86_64/src/include/xlist.h:205
#6  operator<< (list=..., list=..., oss=...) at /usr/src/debug/ceph-18.2.0-113.el9cp.x86_64/src/include/xlist.h:225
#7  operator<< (out=..., in=...) at /usr/src/debug/ceph-18.2.0-113.el9cp.x86_64/src/client/Inode.cc:84
#8  0x00007faa0010bd35 in Client::ll_sync_inode (syncdataonly=false, in=0x7fa97d8442b0, this=0x7fa990b74cb0) at /usr/src/debug/ceph-18.2.0-113.el9cp.x86_64/src/client/Client.cc:15712
#9  ceph_ll_sync_inode (cmount=cmount@entry=0x7fa990002df0, in=in@entry=0x7fa97d8442b0, syncdataonly=syncdataonly@entry=0)
    at /usr/src/debug/ceph-18.2.0-113.el9cp.x86_64/src/libcephfs.cc:1866
#10 0x00007faa04bd3571 in fsal_ceph_ll_setattr (creds=<optimized out>, mask=408, stx=0x7fa9796e9230, i=0x7fa97d8442b0, cmount=0x7fa990002df0)
    at /usr/src/debug/nfs-ganesha-5.6-3.el9cp.x86_64/src/FSAL/FSAL_CEPH/statx_compat.h:210
#11 ceph_fsal_setattr2 (obj_hdl=0x7fa82823a660, bypass=<optimized out>, state=<optimized out>, attrib_set=0x7fa9796e94b0)
    at /usr/src/debug/nfs-ganesha-5.6-3.el9cp.x86_64/src/FSAL/FSAL_CEPH/handle.c:2827
#12 0x00007faa06c34c16 in mdcache_setattr2 (obj_hdl=0x7fa9240fe088, bypass=<optimized out>, state=0x0, attrs=0x7fa9796e94b0)
    at /usr/src/debug/nfs-ganesha-5.6-3.el9cp.x86_64/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:1049
#13 0x00007faa06b6540b in fsal_setattr (obj=0x7fa9240fe088, bypass=<optimized out>, state=0x0, attr=0x7fa9796e94b0)
    at /usr/src/debug/nfs-ganesha-5.6-3.el9cp.x86_64/src/FSAL/fsal_helper.c:596
#14 0x00007faa06c0beed in nfs4_op_setattr (op=0x7fa8282c6060, data=0x7fa8280a8200, resp=0x7fa8282d1fd0)
    at /usr/src/debug/nfs-ganesha-5.6-3.el9cp.x86_64/src/Protocols/NFS/nfs4_op_setattr.c:219
#15 0x00007faa06becdfd in process_one_op (data=0x7fa8280a8200, status=0x7fa9796e968c) at /usr/src/debug/nfs-ganesha-5.6-3.el9cp.x86_64/src/Protocols/NFS/nfs4_Compound.c:912
#16 0x00007faa06bf1048 in nfs4_Compound (arg=<optimized out>, req=0x7fa82809f210, res=0x7fa8282cf8f0)
    at /usr/src/debug/nfs-ganesha-5.6-3.el9cp.x86_64/src/Protocols/NFS/nfs4_Compound.c:1376
#17 0x00007faa06b85b66 in nfs_rpc_process_request (reqdata=0x7fa82809f210, retry=<optimized out>)
    at /usr/src/debug/nfs-ganesha-5.6-3.el9cp.x86_64/src/MainNFSD/nfs_worker_thread.c:1517
#18 0x00007faa068fd387 in svc_request (xprt=0x7fa95800fbb0, xdrs=<optimized out>) at /usr/src/debug/libntirpc-5.0-1.el9cp.x86_64/src/svc_rqst.c:1202
#19 0x00007faa0690178a in svc_rqst_xprt_task_recv (wpe=<optimized out>) at /usr/src/debug/libntirpc-5.0-1.el9cp.x86_64/src/svc_rqst.c:1183
#20 0x00007faa06904fcb in svc_rqst_epoll_loop (wpe=0x556273e63498) at /usr/src/debug/libntirpc-5.0-1.el9cp.x86_64/src/svc_rqst.c:1564
#21 0x00007faa0690e95c in work_pool_thread (arg=0x7fa8f807a3a0) at /usr/src/debug/libntirpc-5.0-1.el9cp.x86_64/src/work_pool.c:183
#22 0x00007faa069c4802 in start_thread () from /lib64/libc.so.6
#23 0x00007faa06964450 in clone3 () from /lib64/libc.so.6

(gdb) f 8
#8  0x00007faa0010bd35 in Client::ll_sync_inode (syncdataonly=false, in=0x7fa97d8442b0, this=0x7fa990b74cb0) at /usr/src/debug/ceph-18.2.0-113.el9cp.x86_64/src/client/Client.cc:15712
15712	  ldout(cct, 3) << "ll_sync_inode " << *in << " " << dendl;
(gdb) l
15707	{
15708	  RWRef_t mref_reader(mount_state, CLIENT_MOUNTING);
15709	  if (!mref_reader.is_state_satisfied())
15710	    return -CEPHFS_ENOTCONN;
15711
15712	  ldout(cct, 3) << "ll_sync_inode " << *in << " " << dendl;
15713	  tout(cct) << "ll_sync_inode" << std::endl;
15714	  tout(cct) << (uintptr_t)in << std::endl;
15715
15716	  std::scoped_lock lock(client_lock);
(gdb) f 7
#7  operator<< (out=..., in=...) at /usr/src/debug/ceph-18.2.0-113.el9cp.x86_64/src/client/Inode.cc:84
84	    out << " parents=" << in.dentries;

(gdb) l
79
80	  if (in.is_file())
81	    out << " " << in.oset;
82
83	  if (!in.dentries.empty())
84	    out << " parents=" << in.dentries;
85
86	  if (in.is_dir() && in.has_dir_layout())
87	    out << " has_dir_layout";
(gdb) p in.dentries
$6 = {_front = 0x0, _back = 0x0, _size = 0}

As observed, the object is a regular file and hence has no dentries, so it should not have gone on to print dentries.
Other threads in the core dump hold similar dentries values but proceeded without crashing (threads 62, 100, 104, 130).

So this is more likely a timing issue in the Ceph client code, which leads to this kind of crash.
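
To make the suspected timing window concrete, here is a minimal, self-contained C++ sketch of the failure mode. This is hypothetical toy code (a stand-in for Ceph's intrusive xlist in src/include/xlist.h, not the actual implementation), showing how an item that carries a back-pointer to its list trips an assert analogous to xlist.h:205 when a dentry is detached between the emptiness check and the iteration done while printing the inode:

// Toy model of an intrusive list whose items point back at the list they are
// on, loosely mirroring xlist<Dentry*> (names and layout are illustrative).
#include <cassert>

template <typename T>
struct tiny_xlist {
  struct item {
    T value;
    tiny_xlist* _list = nullptr;   // back-pointer, cleared when detached
    item* _next = nullptr;
    explicit item(T v) : value(v) {}
    void remove_myself() {
      if (_list) {                 // simplified single-element removal
        _list->_front = _list->_back = nullptr;
        _list->_size = 0;
      }
      _list = nullptr;
      _next = nullptr;
    }
  };

  item* _front = nullptr;
  item* _back = nullptr;
  int _size = 0;

  bool empty() const { return _front == nullptr; }

  void push_back(item* i) {
    i->_list = this;
    if (_back) _back->_next = i; else _front = i;
    _back = i;
    ++_size;
  }

  struct const_iterator {
    item* cur;
    const_iterator& operator++() {
      assert(cur);
      assert(cur->_list);          // analogue of "FAILED ceph_assert(cur->_list)"
      cur = cur->_next;
      return *this;
    }
  };
};

int main() {
  tiny_xlist<int> dentries;                  // stands in for in.dentries
  tiny_xlist<int>::item d(42);
  dentries.push_back(&d);

  // Printing side: !dentries.empty() held a moment ago, iteration begins.
  tiny_xlist<int>::const_iterator it{dentries._front};

  // A concurrent unlink detaches the dentry; the list now looks like the
  // core dump: {_front = 0x0, _back = 0x0, _size = 0}.
  d.remove_myself();

  // The next increment on the printing side hits the membership assert and
  // the program aborts, mirroring the crash.
  ++it;
  return 0;
}

In the real code the two steps run on different threads; the sketch serializes them only to make the assert deterministic.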

Comment 5 Greg Farnum 2023-11-15 02:07:15 UTC
(In reply to Sachin Punadikar from comment #4)
> Looked at the core dump.
> The actual issue is in Ceph client code.
> 
*snip*
> (gdb) l
> 79
> 80	  if (in.is_file())
> 81	    out << " " << in.oset;
> 82
> 83	  if (!in.dentries.empty())
> 84	    out << " parents=" << in.dentries;
> 85
> 86	  if (in.is_dir() && in.has_dir_layout())
> 87	    out << " has_dir_layout";
> (gdb) p in.dentries
> $6 = {_front = 0x0, _back = 0x0, _size = 0}
> 
> As observed, the object is a regular file and hence has no dentries, so it
> should not have gone on to print dentries.

As this line suggests, the "dentries" here are the "parents", i.e., the names this file is linked to. It *is* possible for that to be an empty set if these inodes were looked up by inode number rather than through a normal path-based lookup, but I'd be surprised if that's the case. And here we see that the set is empty, which contradicts the !in.dentries.empty() guard.

> Other threads in the core dump hold similar dentries values but proceeded
> without crashing (threads 62, 100, 104, 130).
> 
> So this is more likely a timing issue in the Ceph client code, which leads
> to this kind of crash.

I'd be concerned here that Ganesha sent down an invalid (probably deleted, by dropping references?) inode pointer. Admittedly, we do the debug logging in Client::ll_sync_inode() without any locking, but actually deleting inodes is covered by a lock. The only way I can see this being modified in flight is if Ganesha is doing this operation at the same time as it's telling the Ceph client library that it has forgotten about the inode...
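
To illustrate the locking gap described above, here is a simplified, hypothetical sketch (std::list and invented names standing in for xlist<Dentry*> and the real Client.cc/Ganesha paths): the log statement reads the inode's dentry list before client_lock is taken, so a teardown path that detaches dentries under the lock does not protect it:

#include <iostream>
#include <list>
#include <mutex>
#include <string>

struct Dentry { std::string name; };

struct Inode {
  std::list<Dentry*> dentries;   // stand-in for the xlist<Dentry*> "parents"
};

std::mutex client_lock;

// Modelled on the ll_sync_inode() pattern: the debug print walks
// in->dentries *before* client_lock is acquired.
void ll_sync_inode_sketch(Inode* in) {
  for (Dentry* d : in->dentries)           // unlocked read of the dentry list
    std::cout << " parents=" << d->name << "\n";

  std::scoped_lock lock(client_lock);      // lock taken only after the log
  // ... sync/flush work done under the lock ...
}

// Modelled on the unlink/forget side: the mutation happens under client_lock,
// which cannot protect the unlocked reader above.
void unlink_dentry_sketch(Inode* in, Dentry* d) {
  std::scoped_lock lock(client_lock);
  in->dentries.remove(d);                  // analogue of dn->remove_myself()
}

int main() {
  Inode in;
  Dentry d{"file1"};
  in.dentries.push_back(&d);

  // In the suspected scenario these two calls run on different threads, with
  // Ganesha's forget/put racing the sync; they are sequential here only so
  // the sketch stays deterministic.
  ll_sync_inode_sketch(&in);
  unlink_dentry_sketch(&in, &d);
  return 0;
}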

It looks like maybe this isn't reproducible, which contradicts something I was told today in a meeting. If we are reproducing the bug and you believe it's been traced to libcephfs, please gather up as much evidence as you can and set needinfo on Patrick Donnelly and Venky Shankar.

Comment 6 Frank Filz 2023-11-15 15:50:18 UTC
Hmm, maybe we have a ceph inode reference issue. We definitely will need more information to proceed on either side.

I do want to point out, though, that Ganesha can "look up an inode by number," or at least not by path lookup, when it has evicted an inode from its cache and an NFS client then presents a handle. I don't know whether the smallfile workload is big enough to evict entries from Ganesha's mdcache.

Comment 8 Amarnath 2023-11-16 10:35:26 UTC
On the same setup we ran smallfile from 85 clients and did not see any issue.

The 85 clients were emulated using containers: the file system (cephfs_io_1) was kernel-mounted and the mount point was shared into the containers.

smallfile cmd : python3 smallfile/smallfile_cli.py --operation create --threads 10 --file-size 1024 --files 1024 --top /app_test_io/dir_ceph-nfs-client-ymkppj-n

Logs : http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-JKF4B6/

[root@ceph-nfs-client-oa18za-node82 docker]# ceph fs status
cephfs - 2 clients
======
RANK  STATE            MDS              ACTIVITY     DNS    INOS   DIRS   CAPS  
 0    active  cephfs.cali015.ednfyg  Reqs:    0 /s  2864    975    967   1926   
       POOL           TYPE     USED  AVAIL  
cephfs.cephfs.meta  metadata   820M  26.7T  
cephfs.cephfs.data    data    6232k  26.7T  
cephfs_io_1 - 85 clients
===========
RANK  STATE              MDS                 ACTIVITY     DNS    INOS   DIRS   CAPS  
 0    active  cephfs_io_1.cali016.lljlbl  Reqs: 1302 /s   181k   181k  22.3k  77.8k  
          POOL             TYPE     USED  AVAIL  
cephfs.cephfs_io_1.meta  metadata   910M  26.7T  
cephfs.cephfs_io_1.data    data     451G  26.7T  
       STANDBY MDS          
  cephfs.cali018.sxquib     
  cephfs.cali016.qmwsuk     
cephfs_io_1.cali018.wpxrkq  
MDS version: ceph version 18.2.0-122.el9cp (cbcd542ab4c99bbd64f6264a3fb86e80ab335707) reef (stable)
[root@ceph-nfs-client-oa18za-node82 docker]# 

Regards,
Amarnath

Comment 9 Scott Ostapovicz 2023-11-22 05:17:59 UTC
This looks like a nearly unreproducible issue.  As such, can we move this to 7.0 z1 so we can continue to track, but not hold up the release while we continue to search for a reproducible case?

Comment 10 Scott Ostapovicz 2023-11-22 16:14:20 UTC
Bumped to 7.0 z1

Comment 11 Manisha Saini 2023-11-22 19:03:30 UTC
(In reply to Scott Ostapovicz from comment #9)
> This looks like a nearly unreproducible issue.  As such, can we move this to
> 7.0 z1 so we can continue to track, but not hold up the release while we
> continue to search for a reproducible case?

This issue is consistently reproducible in the QE environment; QE was able to reproduce it 3 times.

QE has also shared the cluster with DEV for debugging --> see comment #7.

Comment 15 Scott Ostapovicz 2024-04-10 12:56:58 UTC
Retargeting to 7.0 z3.

