Bug 2362861 - NFS cluster Crashing while performing read operation with QoS enabled
Summary: NFS cluster Crashing while performing read operation with QoS enabled
Keywords:
Status: CLOSED DUPLICATE of bug 2362289
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 8.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 8.1
Assignee: Venky Shankar
QA Contact: Manisha Saini
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2025-04-29 08:53 UTC by hacharya
Modified: 2025-06-18 13:59 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2025-05-15 07:08:26 UTC
Embargoed:
khiremat: needinfo-




Links:
Red Hat Issue Tracker RHCEPH-11290 (last updated 2025-04-29 08:53:57 UTC)

Description hacharya 2025-04-29 08:53:26 UTC
Description of problem:
The NFS Ganesha cluster crashes while performing a read operation with QoS enabled.

1. Enabled max_client_combined_bw and max_export_combined_bw for PerShare_PerClient QoS, then performed a write operation using the dd command; the bandwidth-limited output reported a write speed of 7.7 MB/s.

2025-04-29 13:01:15,509 - cephci - ceph:1630 - INFO - Execution of dd if=/dev/urandom of=/mnt/nfs/sample.txt bs=100M count=1 on 10.0.64.131 took 14.340555 seconds
2025-04-29 13:01:15,509 - cephci - test_nfs_qos_on_cluster_level_enablement:42 - INFO - File created successfully on ceph-harish-vm-49jsl6-node4
2025-04-29 13:01:15,510 - cephci - test_nfs_qos_on_cluster_level_enablement:43 - INFO - write speed is 7.7 MB/s

2. Performed a read operation on the same file using the dd command; the output reported a read speed of 211 MB/s.

2025-04-29 13:01:17,279 - cephci - ceph:1630 - INFO - Execution of dd if=/mnt/nfs/sample.txt of=/dev/urandom on 10.0.64.131 took 1.296683 seconds
2025-04-29 13:01:17,280 - cephci - test_nfs_qos_on_cluster_level_enablement:59 - INFO - read speed is 211 MB/s

3. Dropped the page cache:
echo 3 > /proc/sys/vm/drop_caches

4. Repeated the read operation from step 2; a consolidated reproduction sketch follows below.
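
The four steps condense into the following shell sketch. Treat it as illustrative only: the QoS enablement command is an assumption modeled on the parameter names above (the exact `ceph nfs cluster qos` syntax may differ by release), and the cluster name, mount point, and bandwidth values are placeholders taken from this report. The sketch also reads into /dev/null rather than /dev/urandom as captured in the log above.

#!/bin/bash
set -e

# Step 0 (hypothetical syntax): enable combined-bandwidth QoS per share and
# per client; the parameter names come from this report, the flags are an
# assumption.
ceph nfs cluster qos enable bandwidth_control TestClusterHA PerShare_PerClient \
    --max_export_combined_bw 8MB --max_client_combined_bw 8MB

# Step 1: throttled write of a 100M file through the NFS mount.
dd if=/dev/urandom of=/mnt/nfs/sample.txt bs=100M count=1

# Step 2: first read; largely served from the client page cache, hence fast.
dd if=/mnt/nfs/sample.txt of=/dev/null bs=1M

# Step 3: drop the page cache so the next read must hit the Ganesha server.
sync
echo 3 > /proc/sys/vm/drop_caches

# Step 4: second read -- the operation during which ganesha.nfsd crashed.
dd if=/mnt/nfs/sample.txt of=/dev/null bs=1M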

Observation
===========
At step 4, the Ganesha daemon crashed and dumped multiple cores.

[root@cali016 coredump]# ls
'core.ganesha\x2enfsd.0.5abdc2e9cbaa4546825301ce4b3b9d46.2922972.1745914458000000.zst'
'core.ganesha\x2enfsd.0.5abdc2e9cbaa4546825301ce4b3b9d46.2925338.1745914479000000.zst'
'core.ganesha\x2enfsd.0.5abdc2e9cbaa4546825301ce4b3b9d46.2925629.1745914501000000.zst'
'core.ganesha\x2enfsd.0.5abdc2e9cbaa4546825301ce4b3b9d46.2926033.1745914523000000.zst'
'core.ganesha\x2enfsd.0.5abdc2e9cbaa4546825301ce4b3b9d46.2926312.1745914545000000.zst'
'core.ganesha\x2enfsd.0.5abdc2e9cbaa4546825301ce4b3b9d46.2927782.1745914813000000.zst'
'core.ganesha\x2enfsd.0.5abdc2e9cbaa4546825301ce4b3b9d46.2929508.1745914835000000.zst'
'core.ganesha\x2enfsd.0.5abdc2e9cbaa4546825301ce4b3b9d46.2929780.1745914857000000.zst'
'core.ganesha\x2enfsd.0.5abdc2e9cbaa4546825301ce4b3b9d46.2930689.1745914891000000.zst'
'core.ganesha\x2enfsd.0.5abdc2e9cbaa4546825301ce4b3b9d46.2932474.1745914913000000.zst'
'core.ganesha\x2enfsd.0.5abdc2e9cbaa4546825301ce4b3b9d46.2932801.1745914935000000.zst'
'core.ganesha\x2enfsd.0.5abdc2e9cbaa4546825301ce4b3b9d46.2933081.1745914957000000.zst'
'core.ganesha\x2enfsd.0.5abdc2e9cbaa4546825301ce4b3b9d46.2933477.1745914979000000.zst'
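
The cores are zstd-compressed (systemd-coredump). A minimal sketch for opening one, assuming the stock ganesha.nfsd binary path; install the debuginfo packages listed in the backtrace section first for usable symbols:

zstd -d 'core.ganesha\x2enfsd.0.5abdc2e9cbaa4546825301ce4b3b9d46.2922972.1745914458000000.zst'
gdb /usr/bin/ganesha.nfsd \
    'core.ganesha\x2enfsd.0.5abdc2e9cbaa4546825301ce4b3b9d46.2922972.1745914458000000' \
    -ex bt -ex quit

# Alternatively, via the systemd-coredump tooling:
coredumpctl list ganesha.nfsd
coredumpctl debug ganesha.nfsd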

ceph orch ps
============
[ceph: root@cali013 /]# ceph orch ps | grep nfs
haproxy.nfs.TestClusterHA.cali016.chkhse     cali016  *:2049,9049       running (24m)    79s ago  24m     101M        -  2.4.22-f8e3218    6c223bddea69  e05e3e48fd0c
keepalived.nfs.TestClusterHA.cali016.sxqhdd  cali016                    running (24m)    79s ago  24m    1555k        -  2.2.8             09859a486cb9  f159676f7347
nfs.TestClusterHA.0.0.cali016.tzahej         cali016  *:12049           error            79s ago  24m        -        -  <unknown>         <unknown>     <unknown>


ceph -s
===========
[ceph: root@cali013 /]# ceph -s
  cluster:
    id:     288c1062-18fb-11f0-a987-b49691cee574
    health: HEALTH_WARN
            1 failed cephadm daemon(s)
  services:
    mon: 5 daemons, quorum cali013,cali020,cali016,cali019,cali015 (age 93m)
    mgr: cali016.pslybk(active, since 95m), standbys: cali013.heutyr
    mds: 1/1 daemons up, 1 standby
    osd: 34 osds: 34 up (since 85m), 34 in (since 9d)
  data:
    volumes: 1/1 healthy
    pools:   4 pools, 1073 pgs
    objects: 56 objects, 101 MiB
    usage:   3.1 GiB used, 84 TiB / 84 TiB avail
    pgs:     1073 active+clean

Version-Release number of selected component (if applicable):
[ceph: root@cali013 /]# ceph --version
ceph version 19.2.1-154.el9cp (66ec30425949b52e06ca00d78ef0b1cc395e6a39) squid (stable)
[ceph: root@cali013 /]# rpm -qa | grep nfs
libnfsidmap-2.5.4-27.el9.x86_64
nfs-utils-2.5.4-27.el9.x86_64
nfs-ganesha-selinux-6.5-10.el9cp.noarch
nfs-ganesha-6.5-10.el9cp.x86_64
nfs-ganesha-ceph-6.5-10.el9cp.x86_64
nfs-ganesha-rados-grace-6.5-10.el9cp.x86_64
nfs-ganesha-rados-urls-6.5-10.el9cp.x86_64
nfs-ganesha-rgw-6.5-10.el9cp.x86_64


How reproducible:
always

Steps to Reproduce:
See the description above.

Actual results:
The NFS Ganesha cluster crashes while performing a read operation with QoS enabled.

Expected results:
NFS Ganesha should not crash.

Additional info:

Backtrace
==============
#0  0x00007f52bff0554c in __stpcpy_evex () from /lib64/libc.so.6
[Current thread is 1 (LWP 104)]
Missing separate debuginfos, use: dnf debuginfo-install abseil-cpp-20211102.0-4.el9.x86_64 c-ares-1.19.1-2.el9_4.x86_64 dbus-libs-1.12.20-8.el9.x86_64 glibc-2.34-125.el9_5.1.x86_64 grpc-1.46.7-10.el9.x86_64 grpc-cpp-1.46.7-10.el9.x86_64 gssproxy-0.8.4-7.el9.x86_64 keyutils-libs-1.6.3-1.el9.x86_64 krb5-libs-1.21.1-4.el9_5.x86_64 libacl-2.3.1-4.el9.x86_64 libattr-2.5.1-3.el9.x86_64 libblkid-2.37.4-20.el9.x86_64 libcom_err-1.46.5-5.el9.x86_64 libcurl-7.76.1-31.el9.x86_64 libgcc-11.5.0-5.el9_5.x86_64 libgpg-error-1.42-5.el9.x86_64 libibverbs-51.0-1.el9.x86_64 libicu-67.1-9.el9.x86_64 libnfsidmap-2.5.4-27.el9.x86_64 libnghttp2-1.43.0-6.el9.x86_64 libnl3-3.9.0-1.el9.x86_64 librdmacm-51.0-1.el9.x86_64 libselinux-3.6-1.el9.x86_64 libstdc++-11.5.0-5.el9_5.x86_64 libuuid-2.37.4-20.el9.x86_64 libzstd-1.5.1-2.el9.x86_64 lttng-ust-2.12.0-6.el9.x86_64 lz4-libs-1.9.3-5.el9.x86_64 numactl-libs-2.0.18-2.el9.x86_64 openssl-libs-3.2.2-6.el9_5.1.x86_64 pcre2-10.40-6.el9.x86_64 protobuf-3.14.0-13.el9.x86_64 sssd-client-2.9.5-4.el9_5.4.x86_64 userspace-rcu-0.12.1-6.el9.x86_64 xz-libs-5.2.5-8.el9_0.x86_64 zlib-1.2.11-40.el9.x86_64
(gdb) bt
#0  0x00007f52bff0554c in __stpcpy_evex () from /lib64/libc.so.6
#1  0x00007f52bd991e84 in ceph::buffer::v15_2_0::list::iterator_impl<false>::copy (this=0x7f51a2ffa8f0, len=<optimized out>, dest=0x0)
    at /usr/src/debug/ceph-19.2.1-159.el9cp.x86_64/src/common/buffer.cc:703
#2  0x00007f52b82e7ced in ceph_ll_read (cmount=<optimized out>, filehandle=<optimized out>, off=off@entry=0, len=<optimized out>, buf=0x0)
    at /usr/src/debug/ceph-19.2.1-159.el9cp.x86_64/src/include/buffer.h:1017
#3  0x00007f52bd2276e3 in ceph_fsal_read2 (obj_hdl=0x7f518c003a90, bypass=<optimized out>, done_cb=0x7f52c00f17e0 <mdc_read_cb>, read_arg=0x7f51852e47f8,
    caller_arg=0x7f51841d5b30) at /usr/src/debug/nfs-ganesha-6.5-10.el9cp.x86_64/src/FSAL/FSAL_CEPH/handle.c:2018
#4  0x00007f52c00f0f23 in mdcache_read2 (obj_hdl=0x7f518c01e648, bypass=<optimized out>, done_cb=<optimized out>, read_arg=0x7f51852e47f8, caller_arg=<optimized out>)
    at /usr/src/debug/nfs-ganesha-6.5-10.el9cp.x86_64/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_file.c:595
#5  0x00007f52c012cb43 in nfs4_read.constprop.0 (op=op@entry=0x7f518401cde0, data=data@entry=0x7f5184030e40, resp=resp@entry=0x7f51844d8aa0, info=info@entry=0x0,
    io=<optimized out>) at /usr/src/debug/nfs-ganesha-6.5-10.el9cp.x86_64/src/Protocols/NFS/nfs4_op_read.c:892
#6  0x00007f52c00bec32 in nfs4_op_read (op=0x7f518401cde0, data=0x7f5184030e40, resp=0x7f51844d8aa0)
    at /usr/src/debug/nfs-ganesha-6.5-10.el9cp.x86_64/src/Protocols/NFS/nfs4_op_read.c:969
#7  0x00007f52c00aa4de in process_one_op (data=data@entry=0x7f5184030e40, status=status@entry=0x7f51a2ffb54c)
    at /usr/src/debug/nfs-ganesha-6.5-10.el9cp.x86_64/src/Protocols/NFS/nfs4_Compound.c:912
#8  0x00007f52c00ac138 in nfs4_Compound (arg=<optimized out>, req=0x7f51843dd5e0, res=0x7f51843d97e0)
    at /usr/src/debug/nfs-ganesha-6.5-10.el9cp.x86_64/src/Protocols/NFS/nfs4_Compound.c:1413
#9  0x00007f52c0025485 in nfs_rpc_process_request (reqdata=<optimized out>, retry=<optimized out>)
    at /usr/src/debug/nfs-ganesha-6.5-10.el9cp.x86_64/src/MainNFSD/nfs_worker_thread.c:1479
#10 0x00007f52bfd745e7 in svc_request (xprt=0x7f516c00b940, xdrs=<optimized out>) at /usr/src/debug/libntirpc-6.3-2.el9cp.x86_64/src/svc_rqst.c:1229
#11 0x00007f52bfd78e5a in svc_rqst_xprt_task_recv (wpe=<optimized out>) at /usr/src/debug/libntirpc-6.3-2.el9cp.x86_64/src/svc_rqst.c:1210
#12 0x00007f52bfd7b91b in svc_rqst_epoll_loop (wpe=0x55f1e5f41e50) at /usr/src/debug/libntirpc-6.3-2.el9cp.x86_64/src/svc_rqst.c:1585
#13 0x00007f52bfd84cbc in work_pool_thread (arg=0x7f51d4078180) at /usr/src/debug/libntirpc-6.3-2.el9cp.x86_64/src/work_pool.c:187
#14 0x00007f52bfe247e2 in pthread_create.5 () from /lib64/libc.so.6
#15 0x0000000000000000 in ?? ()
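
Reading the trace: frame #2 shows ceph_ll_read entered with buf=0x0, frame #1 shows that NULL arriving in ceph::buffer::v15_2_0::list::iterator_impl::copy as dest=0x0 (buffer.cc:703), and frame #0 is the resulting fault in __stpcpy_evex while copying into the NULL destination. This suggests the QoS-enabled read path hands libcephfs a NULL read buffer; where the NULL originates (the Ganesha read_arg handling under QoS throttling, or the FSAL_CEPH ceph_fsal_read2 path) remains to be confirmed.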

