Description of problem:
6 node ganesha cluster. Distributed-Disperse volume, mounted to 4 different clients via 4 different server VIPs. Root-squash enabled.

IO pattern:
1st client - Linux untars (completed before the crash was hit)
2nd client - ls -lrt in a loop
3rd client - subdir mounted (uid:nfsnobody) - du -sh in a loop
4th client - Bonnie

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f167ca79700 (LWP 31185)]
mdcache_readdir_chunked (directory=directory@entry=0x7f14980a9070, whence=0, dir_state=dir_state@entry=0x7f167ca77e30, cb=cb@entry=0x55907cbc51f0 <populate_dirent>,
    attrmask=attrmask@entry=122830, eod_met=eod_met@entry=0x7f167ca77f1b)
    at /usr/src/debug/nfs-ganesha-2.5.5/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:3156
3156            if (dirent->ck == whence) {
(gdb) bt
#0  mdcache_readdir_chunked (directory=directory@entry=0x7f14980a9070, whence=0, dir_state=dir_state@entry=0x7f167ca77e30, cb=cb@entry=0x55907cbc51f0 <populate_dirent>,
    attrmask=attrmask@entry=122830, eod_met=eod_met@entry=0x7f167ca77f1b)
    at /usr/src/debug/nfs-ganesha-2.5.5/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:3156
#1  0x000055907cc98924 in mdcache_readdir (dir_hdl=0x7f14980a90a8, whence=<optimized out>, dir_state=0x7f167ca77e30, cb=0x55907cbc51f0 <populate_dirent>, attrmask=122830,
    eod_met=0x7f167ca77f1b) at /usr/src/debug/nfs-ganesha-2.5.5/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:640
#2  0x000055907cbc70e4 in fsal_readdir (directory=directory@entry=0x7f14980a90a8, cookie=cookie@entry=0, nbfound=nbfound@entry=0x7f167ca77f1c, eod_met=eod_met@entry=0x7f167ca77f1b,
    attrmask=122830, cb=cb@entry=0x55907cc037f0 <nfs4_readdir_callback>, opaque=opaque@entry=0x7f167ca77f20) at /usr/src/debug/nfs-ganesha-2.5.5/src/FSAL/fsal_helper.c:1500
#3  0x000055907cc047bb in nfs4_op_readdir (op=0x7f15d4043920, data=0x7f167ca78150, resp=0x7f1498275670) at /usr/src/debug/nfs-ganesha-2.5.5/src/Protocols/NFS/nfs4_op_readdir.c:627
#4  0x000055907cbf015f in nfs4_Compound (arg=<optimized out>, req=<optimized out>, res=0x7f14982dcc90) at /usr/src/debug/nfs-ganesha-2.5.5/src/Protocols/NFS/nfs4_Compound.c:752
#5  0x000055907cbe03cb in nfs_rpc_execute (reqdata=reqdata@entry=0x7f15d40008c0) at /usr/src/debug/nfs-ganesha-2.5.5/src/MainNFSD/nfs_worker_thread.c:1290
#6  0x000055907cbe1a2a in worker_run (ctx=0x55907d5ddd50) at /usr/src/debug/nfs-ganesha-2.5.5/src/MainNFSD/nfs_worker_thread.c:1562
#7  0x000055907cc721a9 in fridgethr_start_routine (arg=0x55907d5ddd50) at /usr/src/debug/nfs-ganesha-2.5.5/src/support/fridgethr.c:550
#8  0x00007f16b2540dd5 in start_thread () from /lib64/libpthread.so.0
#9  0x00007f16b1c0cb3d in clone () from /lib64/libc.so.6
(gdb) generate-core-file

Version-Release number of selected component (if applicable):
# rpm -qa | grep ganesha
nfs-ganesha-2.5.5-8.el7rhgs.x86_64
glusterfs-ganesha-3.12.2-14.el7rhgs.x86_64
nfs-ganesha-gluster-2.5.5-8.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.5.5-8.el7rhgs.x86_64

How reproducible:
1/1

Steps to Reproduce:
1. Create a 6 node ganesha cluster.
2. Create a 6 x (4 + 2) Distributed-Disperse volume. Export the volume via Ganesha.
3. Mount the volume on the 1st client using the 1st server VIP.
4. Change the mount point permissions: chmod 777 /mnt/mount_point
5. Enable root-squash. Run refresh-config.
6. Create a directory named "mani".
7. Mount the subdir "mani" on the 2nd client using the 2nd server VIP.
8. Mount the volume on 2 more clients using different VIPs.
9. Run IO and lookups from all 4 clients:
   1st client - Linux untars (completed before the crash was hit)
   2nd client - ls -lrt in a loop
   3rd client - subdir mounted (uid:nfsnobody) - du -sh in a loop
   4th client - Bonnie

Actual results:
The Linux untar completed successfully. ls -lrt got stuck for a while on one of the clients. Bonnie, which was running from another client, was stopped. Ganesha crashed on the server that the client running "ls -lrt" was mapped to.

Expected results:
No crash should be observed.
Additional info:
Attaching sosreport and core dump shortly.

[root@rhs-client6 test]# ls -lrt
total 12
drwxr-xr-x. 2 root      root      4096 Jul 25  2018 dir1
drwxr-xr-x. 3 nfsnobody nfsnobody 4096 Jul 25  2018 mani
drwxr-xr-x. 3 nfsnobody nfsnobody 4096 Jul 25  2018 run18512

]# df
Filesystem                          1K-blocks       Used  Available Use% Mounted on
/dev/mapper/rhel_rhs--client6-root   52403200    1771672   50631528   4% /
devtmpfs                              8105756          0    8105756   0% /dev
tmpfs                                 8117824          0    8117824   0% /dev/shm
tmpfs                                 8117824       9680    8108144   1% /run
tmpfs                                 8117824          0    8117824   0% /sys/fs/cgroup
/dev/sda1                             1038336     219484     818852  22% /boot
/dev/mapper/rhel_rhs--client6-home 1890846652     151868 1890694784   1% /home
tmpfs                                 1623568          0    1623568   0% /run/user/0
10.70.34.91:/mani1                 3251634176   34838528 3216795648   2% /mnt/test
This appears to be a use-after-free on the dirent, since there is a NULL check on the dirent immediately above that line. This code was heavily changed by the readdir_plus work, and I believe those changes may have fixed this. Can it be re-tested on the next build?
Verified this with (readdir disabled in ganesha.conf):
# rpm -qa | grep ganesha
nfs-ganesha-gluster-2.5.5-10.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.5.5-10.el7rhgs.x86_64
nfs-ganesha-2.5.5-10.el7rhgs.x86_64
glusterfs-ganesha-3.12.2-16.el7rhgs.x86_64

Steps performed for verification:
1. Create a 6 node ganesha cluster.
2. Create a 6 x (4 + 2) Distributed-Disperse volume. Export the volume via Ganesha.
3. Mount the volume on the 1st client using the 1st server VIP.
4. Change the mount point permissions: chmod 777 /mnt/mount_point
5. Enable root-squash. Run refresh-config.
6. Create a directory named "mani".
7. Mount the subdir "mani" on the 2nd client using the 2nd server VIP.
8. Mount the volume on 2 more clients using different VIPs.
9. Run IO and lookups from all 4 clients. All operations were run in the "mani" directory:
   1st client - Linux untars
   2nd client - ls -lrt in a loop
   3rd client - subdir mounted (uid:nfsnobody) - du -sh in a loop
   4th client - Bonnie

No crashes were observed. Moving this BZ to the verified state.
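A note on the "readdir disabled" setting used for verification: in upstream nfs-ganesha, chunked readdir is governed by the Dir_Chunk cache parameter, and setting it to 0 disables chunking. A sketch of what that looks like in ganesha.conf (the block and parameter names below are taken from upstream documentation; the exact knob in this downstream 2.5.5 build may differ, so check the build's own config samples):

```
# Assumed upstream syntax: Dir_Chunk = 0 disables chunked readdir.
MDCACHE {
	Dir_Chunk = 0;
}
```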
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:2610