Description of problem:
=============
NFS-Ganesha daemon crashed while running the smallfile tool in a loop from multiple clients against multiple exports.

ganesha.log
===========
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]: /builddir/build/BUILD/ceph-18.2.0/src/include/xlist.h: In function 'xlist<T>::const_iterator& xlist<T>::const_iterator::operator++() [with T = Dentry*]' thread 7fa9796eb640 time 2023-11-03T00:08:12.895127+0000
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]: /builddir/build/BUILD/ceph-18.2.0/src/include/xlist.h: 205: FAILED ceph_assert(cur->_list)
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]: ceph version 18.2.0-113.el9cp (32cbda69435c7145d09eeaf5b5016e5d46370a5d) reef (stable)
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x12e) [0x7faa050e2f57]
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]: 2: /usr/lib64/ceph/libceph-common.so.2(+0x161115) [0x7faa050e3115]
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]: 3: /lib64/libcephfs.so.2(+0x1058be) [0x7faa001ad8be]
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]: 4: ceph_ll_sync_inode()
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]: 5: /usr/lib64/ganesha/libfsalceph.so(+0xc571) [0x7faa04bd3571]
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]: 6: /lib64/libganesha_nfsd.so.5.6(+0x106c16) [0x7faa06c34c16]
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]: 7: fsal_setattr()
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]: 8: /lib64/libganesha_nfsd.so.5.6(+0xddeed) [0x7faa06c0beed]
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]: 9: /lib64/libganesha_nfsd.so.5.6(+0xbedfd) [0x7faa06becdfd]
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]: 10: /lib64/libganesha_nfsd.so.5.6(+0xc3048) [0x7faa06bf1048]
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]: 11: /lib64/libganesha_nfsd.so.5.6(+0x57b66) [0x7faa06b85b66]
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]: 12: /lib64/libntirpc.so.5.0(+0x1d387) [0x7faa068fd387]
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]: 13: /lib64/libntirpc.so.5.0(+0x2178a) [0x7faa0690178a]
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]: 14: /lib64/libntirpc.so.5.0(+0x24fcb) [0x7faa06904fcb]
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]: 15: /lib64/libntirpc.so.5.0(+0x2e95c) [0x7faa0690e95c]
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]: 16: /lib64/libc.so.6(+0x9f802) [0x7faa069c4802]
Nov 03 00:08:12 argo016 ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd[2072828]: 17: /lib64/libc.so.6(+0x3f450) [0x7faa06964450]
Nov 03 00:08:19 argo016 podman[2175299]: 2023-11-03 00:08:19.201051869 +0000 UTC m=+0.032517602 container died 65c57a779e96586a6950480f2be4cb3733916b83a52f76d04e46b4e2c3d27d79 (image=registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:c4f1cf9ac78d873050a567493eb3a4956c661a1fdb52cf58ba05c995936f5ba0, name=ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd, distribution-scope=public, architecture=x86_64, vcs-type=git, GIT_CLEAN=True, release=120, description=Red Hat Ceph Storage 7, io.buildah.version=1.29.0, GIT_BRANCH=main, maintainer=Guillaume Abrioux <gabrioux>, name=rhceph, RELEASE=main, vcs-ref=6a3109234de1e767361375a550322ef998fe07ed, com.redhat.component=rhceph-container, io.k8s.display-name=Red Hat Ceph Storage 7 on RHEL 9, build-date=2023-11-01T22:27:47, version=7, summary=Provides the latest Red Hat Ceph Storage 7 on RHEL 9 in a fully featured and supported base image., io.openshift.expose-services=, vendor=Red Hat, Inc., GIT_REPO=https://github.com/ceph/ceph-container.git, GIT_COMMIT=54fe819971d3d2dbde321203c5644c08d10742d5, com.redhat.license_terms=https://www.redhat.com/agreements, io.openshift.tags=rhceph ceph, CEPH_POINT_RELEASE=, ceph=True, io.k8s.description=Red Hat Ceph Storage 7, url=https://access.redhat.com/containers/#/registry.access.redhat.com/rhceph/images/7-120)
Nov 03 00:08:19 argo016 podman[2175299]: 2023-11-03 00:08:19.21832443 +0000 UTC m=+0.049790163 container remove 65c57a779e96586a6950480f2be4cb3733916b83a52f76d04e46b4e2c3d27d79 (image=registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:c4f1cf9ac78d873050a567493eb3a4956c661a1fdb52cf58ba05c995936f5ba0, name=ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844-nfs-nfsganesha-0-0-argo016-faezmd, distribution-scope=public, GIT_BRANCH=main, RELEASE=main, description=Red Hat Ceph Storage 7, ceph=True, maintainer=Guillaume Abrioux <gabrioux>, name=rhceph, GIT_REPO=https://github.com/ceph/ceph-container.git, io.k8s.display-name=Red Hat Ceph Storage 7 on RHEL 9, io.openshift.expose-services=, version=7, io.k8s.description=Red Hat Ceph Storage 7, com.redhat.component=rhceph-container, vcs-type=git, vendor=Red Hat, Inc., com.redhat.license_terms=https://www.redhat.com/agreements, CEPH_POINT_RELEASE=, summary=Provides the latest Red Hat Ceph Storage 7 on RHEL 9 in a fully featured and supported base image., GIT_COMMIT=54fe819971d3d2dbde321203c5644c08d10742d5, url=https://access.redhat.com/containers/#/registry.access.redhat.com/rhceph/images/7-120, build-date=2023-11-01T22:27:47, vcs-ref=6a3109234de1e767361375a550322ef998fe07ed, architecture=x86_64, io.openshift.tags=rhceph ceph, release=120, io.buildah.version=1.29.0, GIT_CLEAN=True)
Nov 03 00:08:19 argo016 systemd[1]: ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844.0.0.argo016.faezmd.service: Main process exited, code=exited, status=134/n/a
Nov 03 00:08:19 argo016 systemd[1]: ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844.0.0.argo016.faezmd.service: Failed with result 'exit-code'.
Nov 03 00:08:19 argo016 systemd[1]: ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844.0.0.argo016.faezmd.service: Consumed 1h 31min 3.649s CPU time.
Nov 03 00:08:29 argo016 systemd[1]: ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844.0.0.argo016.faezmd.service: Scheduled restart job, restart counter is at 1.
Nov 03 00:08:29 argo016 systemd[1]: Stopped Ceph nfs.nfsganesha.0.0.argo016.faezmd for 348978a2-72b5-11ee-88f8-ac1f6b0a1844.
Nov 03 00:08:29 argo016 systemd[1]: ceph-348978a2-72b5-11ee-88f8-ac1f6b0a1844.0.0.argo016.faezmd.service: Consumed 1h 31min 3.649s CPU time.
Nov 03 00:08:29 argo016 systemd[1]: Starting Ceph nfs.nfsganesha.0.0.argo016.faezmd for 348978a2-72b5-11ee-88f8-ac1f6b0a1844...

Version-Release number of selected component (if applicable):
=====================
# rpm -qa | grep nfs
libnfsidmap-2.5.4-18.el9.x86_64
nfs-utils-2.5.4-18.el9.x86_64
nfs-ganesha-selinux-5.6-3.el9cp.noarch
nfs-ganesha-5.6-3.el9cp.x86_64
nfs-ganesha-rgw-5.6-3.el9cp.x86_64
nfs-ganesha-ceph-5.6-3.el9cp.x86_64
nfs-ganesha-rados-grace-5.6-3.el9cp.x86_64
nfs-ganesha-rados-urls-5.6-3.el9cp.x86_64

# ceph --version
ceph version 18.2.0-113.el9cp (32cbda69435c7145d09eeaf5b5016e5d46370a5d) reef (stable)

How reproducible:
================
1/1

Steps to Reproduce:
=============
1. Create an NFS Ganesha cluster
# ceph nfs cluster info nfsganesha
{
    "nfsganesha": {
        "backend": [
            {
                "hostname": "argo016",
                "ip": "10.8.128.216",
                "port": 2049
            },
            {
                "hostname": "argo018",
                "ip": "10.8.128.218",
                "port": 2049
            }
        ],
        "virtual_ip": null
    }
}

2. Create a CephFS filesystem, 1 subvolume group and 15 subvolumes
# ceph fs subvolume ls cephfs --group_name ganeshagroup
[
    { "name": "ganesha4" },
    { "name": "ganesha8" },
    { "name": "ganesha15" },
    { "name": "ganesha1" },
    { "name": "ganesha2" },
    { "name": "ganesha13" },
    { "name": "ganesha6" },
    { "name": "ganesha11" },
    { "name": "ganesha14" },
    { "name": "ganesha9" },
    { "name": "ganesha10" },
    { "name": "ganesha5" },
    { "name": "ganesha12" },
    { "name": "ganesha7" },
    { "name": "ganesha3" }
]

3. Create Ganesha exports from these subvolumes
# ceph nfs export ls nfsganesha
[
    "/ganesha1",
    "/ganesha2",
    "/ganesha3",
    "/ganesha4",
    "/ganesha5",
    "/ganesha6",
    "/ganesha7",
    "/ganesha8",
    "/ganesha9",
    "/ganesha10",
    "/ganesha11",
    "/ganesha12",
    "/ganesha13",
    "/ganesha14",
    "/ganesha15"
]

4. Mount 10 exports on 10 clients and run the smallfile tool in a loop
# for i in $(seq 1 10); do python3 smallfile_cli.py --operation create --threads 4 --file-size 4194 --files 1024 --files-per-dir 10 --dirs-per-dir 2 --record-size 128 --top /mnt/ganesha/smallfile$i --output-json=create.json;done

Actual results:
=======
NFS-Ganesha daemon crashed.

Expected results:
======
No crashes should be observed.

Additional info:
========
Core file generated
---------------
[root@argo016 coredump]# ls
'core.ganesha\x2enfsd.0.80243653daa041879ce857169c3a9343.2072834.1698970092000000'  'core.ganesha\x2enfsd.0.80243653daa041879ce857169c3a9343.2072834.1698970092000000.zst'
[root@argo016 coredump]#

Unable to get a usable backtrace from the core dump file
----
Core was generated by `/usr/bin/ganesha.nfsd -F -L STDERR -N NIV_EVENT'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007faa069c654c in ?? ()
[Current thread is 1 (LWP 3173)]
(gdb) bt
#0  0x00007faa069c654c in ?? ()
#1  0x0000000000000000 in ?? ()
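For reference, the assertion that fired (ceph_assert(cur->_list) at xlist.h:205) comes from Ceph's intrusive xlist iterator: each item carries a back-pointer to the list that currently owns it, and removing the item clears that back-pointer. The following is a minimal, simplified C++ sketch of that invariant (illustrative only, not the actual Ceph xlist source); it shows what the failed assert guards against, namely advancing an iterator that is standing on an item which has already been unlinked from its list.

#include <cassert>
#include <iostream>

struct List;

struct Item {
    List *owner = nullptr;   // analogous to xlist<T>::item::_list
    Item *next = nullptr;
};

struct List {
    Item *front = nullptr;

    void push_front(Item *i) {
        i->owner = this;
        i->next = front;
        front = i;
    }

    void remove(Item *i) {
        if (front == i)
            front = i->next;
        i->owner = nullptr;   // the back-pointer is cleared on removal
        i->next = nullptr;
    }

    struct const_iterator {
        const Item *cur;
        const_iterator &operator++() {
            // This is the invariant ceph_assert(cur->_list) checks:
            // the item we are standing on must still belong to a list.
            assert(cur->owner != nullptr);
            cur = cur->next;
            return *this;
        }
    };
};

int main() {
    List parents;
    Item dentry;
    parents.push_front(&dentry);

    List::const_iterator it{parents.front};   // iterator positioned on the item

    // If another path removes the item between obtaining the iterator and
    // advancing it, the back-pointer is gone and the assert fires:
    parents.remove(&dentry);
    // ++it;   // uncommenting this reproduces the aborted assertion

    std::cout << "owner after remove: " << dentry.owner << "\n";
    return 0;
}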
Logs and coredump: http://magna002.ceph.redhat.com/ceph-qe-logs/msaini/BZ_2247762/
Attempted to recreate the issue but was not able to. Ran the smallfile tool with 3 clients and 3 exports in a loop, but did not see the crash. Next step: set up a debug environment for analysing the collected core dump.
warning: .dynamic section for "/lib64/libcurl.so.4" is not at the expected address (wrong library or version mismatch?)
Missing separate debuginfo for /lib64/libwbclient.so.0
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/8b/15f192620998eafe7d402ab162a0d5fba9ed50.debug
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Missing separate debuginfos, use: dnf debuginfo-install dbus-libs-1.12.20-7.el9_2.1.x86_64 glibc-2.34-60.el9_2.7.x86_64 gssproxy-0.8.4-5.el9_2.x86_64 keyutils-libs-1.6.3-1.el9.x86_64 krb5-libs-1.20.1-9.el9_2.x86_64 libacl-2.3.1-3.el9.x86_64 libattr-2.5.1-3.el9.x86_64 libblkid-2.37.4-11.el9_2.x86_64 libcap-2.48-9.el9_2.x86_64 libcephfs2-18.2.0-113.el9cp.x86_64 libcurl-7.76.1-23.el9_2.4.x86_64 libgcc-11.3.1-4.3.el9.x86_64 libgcrypt-1.10.0-10.el9_2.x86_64 libibverbs-44.0-2.el9.x86_64 libnfsidmap-2.5.4-18.el9.x86_64 libnl3-3.7.0-1.el9.x86_64 librados2-18.2.0-113.el9cp.x86_64 librdmacm-44.0-2.el9.x86_64 libselinux-3.5-1.el9.x86_64 libstdc++-11.3.1-4.3.el9.x86_64 libzstd-1.5.1-2.el9.x86_64 lttng-ust-2.12.0-6.el9.x86_64 lz4-libs-1.9.3-5.el9.x86_64 numactl-libs-2.0.14-9.el9.x86_64 openssl-libs-3.0.7-17.el9_2.x86_64 pcre2-10.40-2.el9.x86_64 sssd-client-2.8.2-3.el9_2.x86_64 systemd-libs-252-14.el9_2.3.x86_64 thrift-0.15.0-2.el9cp.x86_64 userspace-rcu-0.12.1-6.el9.x86_64 zlib-1.2.11-39.el9.x86_64
--Type <RET> for more, q to quit, c to continue without paging--
Core was generated by `/usr/bin/ganesha.nfsd -F -L STDERR -N NIV_EVENT'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007faa069c654c in __pthread_kill_implementation () from /lib64/libc.so.6
[Current thread is 1 (Thread 0x7fa9796eb640 (LWP 3173))]
(gdb) bt
#0  0x00007faa069c654c in __pthread_kill_implementation () from /lib64/libc.so.6
#1  0x00007faa06979d46 in raise () from /lib64/libc.so.6
#2  0x00007faa0694d7f3 in abort () from /lib64/libc.so.6
#3  0x00007faa050e2fb1 in ceph::__ceph_assert_fail(char const*, char const*, int, char const*) () from /usr/lib64/ceph/libceph-common.so.2
#4  0x00007faa050e3115 in ceph::__ceph_assert_fail(ceph::assert_data const&) () from /usr/lib64/ceph/libceph-common.so.2
#5  0x00007faa001ad8be in operator<<(std::ostream&, Inode const&) () from /lib64/libcephfs.so.2
#6  0x00007faa0010bd35 in ceph_ll_sync_inode () from /lib64/libcephfs.so.2
#7  0x00007faa04bd3571 in fsal_ceph_ll_setattr (creds=<optimized out>, mask=408, stx=0x7fa9796e9230, i=0x7fa97d8442b0, cmount=0x7fa990002df0) at /usr/src/debug/nfs-ganesha-5.6-3.el9cp.x86_64/src/FSAL/FSAL_CEPH/statx_compat.h:210
#8  ceph_fsal_setattr2 (obj_hdl=0x7fa82823a660, bypass=<optimized out>, state=<optimized out>, attrib_set=0x7fa9796e94b0) at /usr/src/debug/nfs-ganesha-5.6-3.el9cp.x86_64/src/FSAL/FSAL_CEPH/handle.c:2827
#9  0x00007faa06c34c16 in mdcache_setattr2 (obj_hdl=0x7fa9240fe088, bypass=<optimized out>, state=0x0, attrs=0x7fa9796e94b0) at /usr/src/debug/nfs-ganesha-5.6-3.el9cp.x86_64/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:1049
#10 0x00007faa06b6540b in fsal_setattr (obj=0x7fa9240fe088, bypass=<optimized out>, state=0x0, attr=0x7fa9796e94b0) at /usr/src/debug/nfs-ganesha-5.6-3.el9cp.x86_64/src/FSAL/fsal_helper.c:596
#11 0x00007faa06c0beed in nfs4_op_setattr (op=0x7fa8282c6060, data=0x7fa8280a8200, resp=0x7fa8282d1fd0) at /usr/src/debug/nfs-ganesha-5.6-3.el9cp.x86_64/src/Protocols/NFS/nfs4_op_setattr.c:219
#12 0x00007faa06becdfd in process_one_op (data=0x7fa8280a8200, status=0x7fa9796e968c) at /usr/src/debug/nfs-ganesha-5.6-3.el9cp.x86_64/src/Protocols/NFS/nfs4_Compound.c:912
#13 0x00007faa06bf1048 in nfs4_Compound (arg=<optimized out>, req=0x7fa82809f210, res=0x7fa8282cf8f0) at /usr/src/debug/nfs-ganesha-5.6-3.el9cp.x86_64/src/Protocols/NFS/nfs4_Compound.c:1376
#14 0x00007faa06b85b66 in nfs_rpc_process_request (reqdata=0x7fa82809f210, retry=<optimized out>) at /usr/src/debug/nfs-ganesha-5.6-3.el9cp.x86_64/src/MainNFSD/nfs_worker_thread.c:1517
#15 0x00007faa068fd387 in svc_request (xprt=0x7fa95800fbb0, xdrs=<optimized out>) at /usr/src/debug/libntirpc-5.0-1.el9cp.x86_64/src/svc_rqst.c:1202
#16 0x00007faa0690178a in svc_rqst_xprt_task_recv (wpe=<optimized out>) at /usr/src/debug/libntirpc-5.0-1.el9cp.x86_64/src/svc_rqst.c:1183
#17 0x00007faa06904fcb in svc_rqst_epoll_loop (wpe=0x556273e63498) at /usr/src/debug/libntirpc-5.0-1.el9cp.x86_64/src/svc_rqst.c:1564
#18 0x00007faa0690e95c in work_pool_thread (arg=0x7fa8f807a3a0) at /usr/src/debug/libntirpc-5.0-1.el9cp.x86_64/src/work_pool.c:183
#19 0x00007faa069c4802 in start_thread () from /lib64/libc.so.6
#20 0x00007faa06964450 in clone3 () from /lib64/libc.so.6
(gdb)
Looked at the core dump. The actual issue is in the Ceph client code.

(gdb) bt
#0  0x00007faa069c654c in __pthread_kill_implementation () from /lib64/libc.so.6
#1  0x00007faa06979d46 in raise () from /lib64/libc.so.6
#2  0x00007faa0694d7f3 in abort () from /lib64/libc.so.6
#3  0x00007faa050e2fb1 in ceph::__ceph_assert_fail(char const*, char const*, int, char const*) () from /usr/lib64/ceph/libceph-common.so.2
#4  0x00007faa050e3115 in ceph::__ceph_assert_fail(ceph::assert_data const&) () from /usr/lib64/ceph/libceph-common.so.2
#5  0x00007faa001ad8be in xlist<Dentry*>::const_iterator::operator++ (this=<synthetic pointer>) at /usr/src/debug/ceph-18.2.0-113.el9cp.x86_64/src/include/xlist.h:205
#6  operator<< (list=..., list=..., oss=...) at /usr/src/debug/ceph-18.2.0-113.el9cp.x86_64/src/include/xlist.h:225
#7  operator<< (out=..., in=...) at /usr/src/debug/ceph-18.2.0-113.el9cp.x86_64/src/client/Inode.cc:84
#8  0x00007faa0010bd35 in Client::ll_sync_inode (syncdataonly=false, in=0x7fa97d8442b0, this=0x7fa990b74cb0) at /usr/src/debug/ceph-18.2.0-113.el9cp.x86_64/src/client/Client.cc:15712
#9  ceph_ll_sync_inode (cmount=cmount@entry=0x7fa990002df0, in=in@entry=0x7fa97d8442b0, syncdataonly=syncdataonly@entry=0) at /usr/src/debug/ceph-18.2.0-113.el9cp.x86_64/src/libcephfs.cc:1866
#10 0x00007faa04bd3571 in fsal_ceph_ll_setattr (creds=<optimized out>, mask=408, stx=0x7fa9796e9230, i=0x7fa97d8442b0, cmount=0x7fa990002df0) at /usr/src/debug/nfs-ganesha-5.6-3.el9cp.x86_64/src/FSAL/FSAL_CEPH/statx_compat.h:210
#11 ceph_fsal_setattr2 (obj_hdl=0x7fa82823a660, bypass=<optimized out>, state=<optimized out>, attrib_set=0x7fa9796e94b0) at /usr/src/debug/nfs-ganesha-5.6-3.el9cp.x86_64/src/FSAL/FSAL_CEPH/handle.c:2827
#12 0x00007faa06c34c16 in mdcache_setattr2 (obj_hdl=0x7fa9240fe088, bypass=<optimized out>, state=0x0, attrs=0x7fa9796e94b0) at /usr/src/debug/nfs-ganesha-5.6-3.el9cp.x86_64/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:1049
#13 0x00007faa06b6540b in fsal_setattr (obj=0x7fa9240fe088, bypass=<optimized out>, state=0x0, attr=0x7fa9796e94b0) at /usr/src/debug/nfs-ganesha-5.6-3.el9cp.x86_64/src/FSAL/fsal_helper.c:596
#14 0x00007faa06c0beed in nfs4_op_setattr (op=0x7fa8282c6060, data=0x7fa8280a8200, resp=0x7fa8282d1fd0) at /usr/src/debug/nfs-ganesha-5.6-3.el9cp.x86_64/src/Protocols/NFS/nfs4_op_setattr.c:219
#15 0x00007faa06becdfd in process_one_op (data=0x7fa8280a8200, status=0x7fa9796e968c) at /usr/src/debug/nfs-ganesha-5.6-3.el9cp.x86_64/src/Protocols/NFS/nfs4_Compound.c:912
#16 0x00007faa06bf1048 in nfs4_Compound (arg=<optimized out>, req=0x7fa82809f210, res=0x7fa8282cf8f0) at /usr/src/debug/nfs-ganesha-5.6-3.el9cp.x86_64/src/Protocols/NFS/nfs4_Compound.c:1376
#17 0x00007faa06b85b66 in nfs_rpc_process_request (reqdata=0x7fa82809f210, retry=<optimized out>) at /usr/src/debug/nfs-ganesha-5.6-3.el9cp.x86_64/src/MainNFSD/nfs_worker_thread.c:1517
#18 0x00007faa068fd387 in svc_request (xprt=0x7fa95800fbb0, xdrs=<optimized out>) at /usr/src/debug/libntirpc-5.0-1.el9cp.x86_64/src/svc_rqst.c:1202
#19 0x00007faa0690178a in svc_rqst_xprt_task_recv (wpe=<optimized out>) at /usr/src/debug/libntirpc-5.0-1.el9cp.x86_64/src/svc_rqst.c:1183
#20 0x00007faa06904fcb in svc_rqst_epoll_loop (wpe=0x556273e63498) at /usr/src/debug/libntirpc-5.0-1.el9cp.x86_64/src/svc_rqst.c:1564
#21 0x00007faa0690e95c in work_pool_thread (arg=0x7fa8f807a3a0) at /usr/src/debug/libntirpc-5.0-1.el9cp.x86_64/src/work_pool.c:183
#22 0x00007faa069c4802 in start_thread () from /lib64/libc.so.6
#23 0x00007faa06964450 in clone3 () from /lib64/libc.so.6

(gdb) f 8
#8  0x00007faa0010bd35 in Client::ll_sync_inode (syncdataonly=false, in=0x7fa97d8442b0, this=0x7fa990b74cb0) at /usr/src/debug/ceph-18.2.0-113.el9cp.x86_64/src/client/Client.cc:15712
15712       ldout(cct, 3) << "ll_sync_inode " << *in << " " << dendl;
(gdb) l
15707   {
15708     RWRef_t mref_reader(mount_state, CLIENT_MOUNTING);
15709     if (!mref_reader.is_state_satisfied())
15710       return -CEPHFS_ENOTCONN;
15711
15712     ldout(cct, 3) << "ll_sync_inode " << *in << " " << dendl;
15713     tout(cct) << "ll_sync_inode" << std::endl;
15714     tout(cct) << (uintptr_t)in << std::endl;
15715
15716     std::scoped_lock lock(client_lock);
(gdb) f 7
#7  operator<< (out=..., in=...) at /usr/src/debug/ceph-18.2.0-113.el9cp.x86_64/src/client/Inode.cc:84
84          out << " parents=" << in.dentries;
(gdb) l
79
80        if (in.is_file())
81          out << " " << in.oset;
82
83        if (!in.dentries.empty())
84          out << " parents=" << in.dentries;
85
86        if (in.is_dir() && in.has_dir_layout())
87          out << " has_dir_layout";
(gdb) p in.dentries
$6 = {_front = 0x0, _back = 0x0, _size = 0}

As observed, the object is a regular file and hence it will not have dentries, so ideally the code should not have gone on to print dentries. There are other threads in the core dump holding similar dentries values, but they went ahead in execution (threads 62, 100, 104, 130). So this is most likely a timing issue in the Ceph client code, which is leading to this kind of crash.
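To make the suspected timing issue concrete: the window is between the !in.dentries.empty() check in Inode's operator<< (Inode.cc:83) and the subsequent iteration over in.dentries (Inode.cc:84). Below is a small, self-contained C++ sketch of that interleaving; the types are simplified stand-ins rather than the real Client/Inode/xlist code, and the two "threads" are replayed sequentially so the example stays deterministic instead of being an actual data race.

#include <iostream>

struct ParentList;

struct Dentry {
    ParentList *owner = nullptr;   // stands in for the xlist item's _list back-pointer
};

struct ParentList {                // stands in for xlist<Dentry*> in.dentries
    Dentry *only = nullptr;        // a single-element list is enough for the sketch
    bool empty() const { return only == nullptr; }
    void push(Dentry *d) { only = d; d->owner = this; }
    void remove(Dentry *d) { only = nullptr; d->owner = nullptr; }
};

int main() {
    ParentList dentries;
    Dentry d;
    dentries.push(&d);

    // "Thread A": operator<<(ostream&, Inode const&), invoked from the
    // ll_sync_inode debug log before client_lock is taken.
    bool guard = !dentries.empty();   // the !in.dentries.empty() check passes
    Dentry *cur = dentries.only;      // iterator positioned on the dentry

    // "Thread B": an unlink/forget path removes the dentry.
    dentries.remove(&d);              // list now looks like {_front=0, _back=0, _size=0}

    // "Thread A" resumes printing.
    if (guard) {
        std::cout << "guard saw a non-empty list, but cur->owner is now "
                  << cur->owner
                  << "; advancing the iterator here is what ceph_assert(cur->_list) aborts on\n";
    }
    return 0;
}

The core dump shows exactly this end state: the printing thread is already past the emptiness check while in.dentries reads {_front = 0x0, _back = 0x0, _size = 0}.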
(In reply to Sachin Punadikar from comment #4)
> Looked at the core dump.
> The actual issue is in Ceph client code.
> *snip*
> (gdb) l
> 79
> 80        if (in.is_file())
> 81          out << " " << in.oset;
> 82
> 83        if (!in.dentries.empty())
> 84          out << " parents=" << in.dentries;
> 85
> 86        if (in.is_dir() && in.has_dir_layout())
> 87          out << " has_dir_layout";
> (gdb) p in.dentries
> $6 = {_front = 0x0, _back = 0x0, _size = 0}
>
> As observed, the object is regular file and hence it will not have dentries.
> So ideally it should not have gone for printing dentries.

As this line suggests, the "dentries" here are the "parents", i.e., the names this file is linked to. It *is* possible for that to be an empty set if these inodes were looked up by inode number rather than through a normal path-based lookup, but I would be surprised if that's the case. And here we see that the set is empty, which contradicts the !in.dentries.empty() guard.

> There are other threads in core dump which are holding similar dentries values, but they went ahead in execution. Threads - 62, 100, 104, 130
>
> So it is more likely a timing issue in Ceph client code, which is leading to this kind of crash.

I'd be concerned here that Ganesha sent down an invalid (probably deleted, by dropping references?) inode pointer. We admittedly are doing the debug logging in Client::ll_sync_inode() without any locking, but actually deleting them is covered by a lock. The only way I can see this being modified in-flight is if Ganesha is doing this operation at the same time as it's telling the Ceph client library that Ganesha has forgotten about the inode...

It looks like maybe this isn't reproducible, which contradicts something I was told today in a meeting. If we are reproducing the bug and you believe it's been traced to libcephfs, please gather up as much evidence as you can and set needinfo on Patrick Donnelly and Venky Shankar.
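To make the locking observation above concrete: the window can only exist because the state being formatted is read outside the lock that protects its mutation. Below is a small, self-contained C++ analog with hypothetical names (sync_racy(), sync_safe(), forget() are illustrations, not the real Client::ll_sync_inode code and not a proposed patch) showing the difference between logging before and after taking the lock that covers removal.

#include <iostream>
#include <mutex>
#include <set>
#include <string>

// Hypothetical analog of the pattern under discussion, not the real Ceph client.
class InodeLike {
    std::mutex lock_;                 // plays the role of client_lock
    std::set<std::string> parents_;   // plays the role of in.dentries
public:
    void link(const std::string &name) {
        std::scoped_lock g(lock_);
        parents_.insert(name);
    }
    // forget(): mutation happens under the lock, as deletion does in the client.
    void forget(const std::string &name) {
        std::scoped_lock g(lock_);
        parents_.erase(name);
    }
    // Racy shape: the logging reads parents_ before the lock is taken.
    void sync_racy() {
        if (!parents_.empty())                       // read outside the lock
            std::cout << "parents=" << parents_.size() << "\n";
        std::scoped_lock g(lock_);
        // ...actual sync work would happen here...
    }
    // Safe shape: the same read happens under the lock, so it cannot
    // interleave with forget().
    void sync_safe() {
        std::scoped_lock g(lock_);
        if (!parents_.empty())
            std::cout << "parents=" << parents_.size() << "\n";
        // ...actual sync work would happen here...
    }
};

int main() {
    InodeLike in;
    in.link("fileA");
    in.sync_safe();    // reads and prints consistently under the lock
    in.forget("fileA");
    in.sync_safe();    // prints nothing: the set is empty and read consistently
    return 0;
}

Whether the right fix is on the Ganesha side (inode reference handling) or on the client side (for example, only formatting *in after client_lock is held) is exactly the open question raised in this comment.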
Hmm, maybe we have a ceph inode reference issue. We definitely will need more information to proceed on either side. I do want to point out, though, that Ganesha can "look up an inode by number," or at least not by path lookup, when it has evicted an inode from its cache and an NFS client then presents a handle. I don't know if the smallfile workload is big enough to evict entries from Ganesha's mdcache.
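To illustrate the "looked up by inode number" case mentioned above: when an entry has been evicted from the mdcache and the object is later re-created from an NFS handle, the resulting inode can legitimately have no parent dentry links, because no path-based lookup ever populated them. A toy, self-contained C++ analog follows (hypothetical names, not the FSAL_CEPH or libcephfs API).

#include <cstdint>
#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Toy model: an inode found via a path lookup records its parent name;
// an inode re-created purely from its number (as from an NFS handle after
// a cache eviction) carries no parent links at all.
struct ToyInode {
    uint64_t ino;
    std::vector<std::string> parents;   // empty when re-created from a handle
};

class ToyClient {
    std::map<uint64_t, std::shared_ptr<ToyInode>> by_ino_;
public:
    std::shared_ptr<ToyInode> lookup_by_path(uint64_t ino, const std::string &name) {
        auto in = std::make_shared<ToyInode>(ToyInode{ino, {name}});
        by_ino_[ino] = in;
        return in;
    }
    void evict(uint64_t ino) { by_ino_.erase(ino); }   // mdcache-style eviction
    std::shared_ptr<ToyInode> lookup_by_number(uint64_t ino) {
        auto it = by_ino_.find(ino);
        if (it != by_ino_.end())
            return it->second;
        // Re-materialize from the number alone: no parent information is known.
        auto in = std::make_shared<ToyInode>(ToyInode{ino, {}});
        by_ino_[ino] = in;
        return in;
    }
};

int main() {
    ToyClient c;
    auto a = c.lookup_by_path(42, "fileA");
    std::cout << "path lookup:   parents=" << a->parents.size() << "\n";   // 1
    c.evict(42);
    auto b = c.lookup_by_number(42);   // e.g. an NFS client presents a handle
    std::cout << "handle lookup: parents=" << b->parents.size() << "\n";   // 0
    return 0;
}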
On the same setup we ran the smallfile tool from 85 clients and did not see any issue. The 85 clients were emulated using containers: the file system (cephfs_io_1) was mounted via a kernel mount and that mount point was added to the containers.

smallfile cmd: python3 smallfile/smallfile_cli.py --operation create --threads 10 --file-size 1024 --files 1024 --top /app_test_io/dir_ceph-nfs-client-ymkppj-n

Logs: http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-JKF4B6/

[root@ceph-nfs-client-oa18za-node82 docker]# ceph fs status
cephfs - 2 clients
======
RANK  STATE    MDS                    ACTIVITY      DNS    INOS   DIRS   CAPS
 0    active   cephfs.cali015.ednfyg  Reqs:    0 /s  2864    975    967   1926
       POOL           TYPE      USED   AVAIL
cephfs.cephfs.meta   metadata   820M   26.7T
cephfs.cephfs.data     data     6232k  26.7T
cephfs_io_1 - 85 clients
===========
RANK  STATE    MDS                         ACTIVITY       DNS    INOS   DIRS   CAPS
 0    active   cephfs_io_1.cali016.lljlbl  Reqs: 1302 /s   181k   181k  22.3k  77.8k
         POOL              TYPE      USED   AVAIL
cephfs.cephfs_io_1.meta   metadata   910M   26.7T
cephfs.cephfs_io_1.data     data     451G   26.7T
       STANDBY MDS
  cephfs.cali018.sxquib
  cephfs.cali016.qmwsuk
cephfs_io_1.cali018.wpxrkq
MDS version: ceph version 18.2.0-122.el9cp (cbcd542ab4c99bbd64f6264a3fb86e80ab335707) reef (stable)
[root@ceph-nfs-client-oa18za-node82 docker]#

Regards,
Amarnath
This looks like a nearly unreproducible issue. As such, can we move this to 7.0 z1 so we can continue to track, but not hold up the release while we continue to search for a reproducible case?
Bumped to 7.0 z1
(In reply to Scott Ostapovicz from comment #9)
> This looks like a nearly unreproducible issue. As such, can we move this to
> 7.0 z1 so we can continue to track, but not hold up the release while we
> continue to search for a reproducible case?

This issue is consistently reproducible in the QE environment; QE was able to reproduce it 3 times. QE has also shared the cluster with DEV for debugging, see comment #7.
Retargeting to 7.0 z3.