Description of problem:
After performing Linux untars from 4 clients while simultaneously performing lookups on the mount point from all 4 clients, rm -rf * is unable to delete a few files from the mount point.

Initially, rm -rf * was run from all 4 clients simultaneously on the same directories across mount points. It failed for a few files. rm -rf * was then run again from each client, one by one. Removal of the files failed even after several attempts.

# rm -rf *
rm: cannot remove ‘dir4/linux-4.9.5/drivers/acpi/nfit’: Directory not empty
rm: cannot remove ‘dir4/linux-4.9.5/Documentation/devicetree/bindings/iio/temperature’: Directory not empty

Version-Release number of selected component (if applicable):
# rpm -qa | grep ganesha
nfs-ganesha-gluster-2.5.5-3.el7rhgs.x86_64
glusterfs-ganesha-3.12.2-5.el7rhgs.x86_64
nfs-ganesha-2.5.5-3.el7rhgs.x86_64

How reproducible:
Reporting first instance

Steps to Reproduce:
1. Create a 4-node ganesha cluster.
2. Create a 4*3 Distributed-Replicate volume. Export the volume.
3. Mount the volume on 4 different clients using 4 VIPs, with each server node's VIP mapped to one client mount.
4. Create 4 directories on the mount point.
5. Run Linux untars from the 4 clients into the 4 different directories. At the same time, perform lookups from all 4 clients on the mount point.
6. Perform rm -rf * from the 4 clients on the same mount point simultaneously while lookups are running in parallel.
   # cd /mnt/ganesha_mount
   # rm -rf *

Actual results:
Unable to delete a few files from the mount point after several attempts of rm -rf *, whether from all clients at once or from a single client.

ganesha_mount]# rm -rf *
rm: cannot remove ‘dir4/linux-4.9.5/drivers/acpi/nfit’: Directory not empty
rm: cannot remove ‘dir4/linux-4.9.5/Documentation/devicetree/bindings/iio/temperature’: Directory not empty

Expected results:
All files should be removed from the mount points.

Additional info:
On one of the nodes, tailf /var/log/ganesha/ganesha.log shows:

1/03/2018 10:09:47 : epoch 16060000 : dhcp37-103.lab.eng.blr.redhat.com : ganesha.nfsd-2431[work-50] nfs_in_grace :STATE :EVENT :NFS Server Now NOT IN GRACE
21/03/2018 11:40:15 : epoch 16060000 : dhcp37-103.lab.eng.blr.redhat.com : ganesha.nfsd-2431[work-104] mdcache_avl_qp_insert :INODE :WARN :Duplicate filename adaptec with different cookies ckey 127 chunk 0x7fb1341f9f60 don't match existing ckey 10f chunk 0x7fb13811e390
21/03/2018 11:40:15 : epoch 16060000 : dhcp37-103.lab.eng.blr.redhat.com : ganesha.nfsd-2431[work-104] mdc_readdir_chunk_object :INODE :CRIT :Collision while adding dirent for adaptec

Unable to find any specific log errors in ganesha-gfapi.log that explain the failure. Attaching tcpdumps (taken on the server and the client while performing the failing rm -rf *) and sosreports.
Ah, you've managed to find a test that hits what I suspected could be an issue...

The problem is the way directory cookies are generated. With the POSIX readdir system call (actually getdents now), the d_off value that we use as the cookie is NOT the "address" of the entry in question; it's the "address" of the NEXT dirent. With a single readdir going on and no files being removed, that works just fine. With a lot of churn in the directory, there is the possibility that a file gets added between the dirent that has a particular d_off and the next dirent that is actually at that d_off. Now the new file has the same d_off.

Or, with multiple readdirs, let's say the files are:

"first" (100)
"new" (200)
"xlast" (300)

(Note that d_off is probably not actually in alphabetical order; I'm just pretending it is to make the example easier to follow. The numbers in parentheses are the actual addresses of each dirent.)

So a readdir happening AFTER "new" has already been added would show:

"first" (d_off = 200)
"new" (d_off = 300)
"xlast" (d_off = 400)

A readdir happening BEFORE "new" is added would show:

"first" (d_off = 300)
"xlast" (d_off = 400)

So now if we had a BEFORE readdir followed by an AFTER readdir, see how the cookie for "first" has changed. Also see how the cookie for "new" is a duplicate of the cookie for the original instance of "first".

It happens there's a way to fix this... The FSAL readdir keeps track of the PREVIOUS d_off/cookie for each dirent and uses that one, which is the actual "address" of the dirent. Now each dirent has a deterministic cookie under most modern filesystems (it's actually a hash value of the file name rather than an offset into a directory flat file).

I used this mechanism in this patch:

https://review.gerrithub.io/#/c/354400/

That patch implements a brute-force compute_readdir_cookie operation and, to have a consistent cookie for an entry, relies on the d_off from the previous dirent.
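The "first"/"new"/"xlast" example above can be replayed with a small toy model. This is only an illustrative sketch, not nfs-ganesha or FSAL code: the function names, the END_OF_DIR value of 400, and the assumption that the directory stream starts at the first entry's address are all made up for the illustration. It compares cookies taken from the NEXT dirent's address (what d_off reports) with cookies carried over from the PREVIOUS dirent's d_off.

```python
# Toy model of the cookie example above; not nfs-ganesha code.
# END_OF_DIR and both function names are hypothetical.
END_OF_DIR = 400


def next_offset_cookies(entries):
    """Cookie = d_off as getdents reports it: the address of the NEXT dirent."""
    ordered = sorted(entries.items(), key=lambda kv: kv[1])
    cookies = {}
    for i, (name, _addr) in enumerate(ordered):
        is_last = i + 1 == len(ordered)
        cookies[name] = END_OF_DIR if is_last else ordered[i + 1][1]
    return cookies


def previous_offset_cookies(entries):
    """Cookie = d_off remembered from the PREVIOUS dirent (the entry's own address)."""
    ordered = sorted(entries.items(), key=lambda kv: kv[1])
    d_offs = next_offset_cookies(entries)
    cookies = {}
    # Assumption: the directory stream starts at the first entry's address.
    prev_d_off = ordered[0][1] if ordered else END_OF_DIR
    for name, _addr in ordered:
        cookies[name] = prev_d_off
        prev_d_off = d_offs[name]
    return cookies


# Directory contents (name -> address) before and after "new" is created.
before = {"first": 100, "xlast": 300}
after = {"first": 100, "new": 200, "xlast": 300}

before_next = next_offset_cookies(before)
after_next = next_offset_cookies(after)
before_prev = previous_offset_cookies(before)
after_prev = previous_offset_cookies(after)

print("next-offset cookies, BEFORE:", before_next)
print("next-offset cookies, AFTER: ", after_next)
print("prev-offset cookies, BEFORE:", before_prev)
print("prev-offset cookies, AFTER: ", after_prev)
```

In this model the next-offset scheme gives "first" a different cookie in the two readdirs and gives "new" the cookie that "first" had earlier, while the previous-offset scheme keeps every entry's cookie stable across both readdirs.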
So this has been run with my proposed patch? Are we still seeing the WARN and CRIT messages?
Kaleb mentioned he was able to recreate this with a single client doing untar followed by rm -Rf but could not duplicate with FSAL_VFS. This suggests to me that the issue is in FSAL_GLUSTER or libgfapi.
If you unmount and remount the client, does that fix the issue?
Observing this issue with the readdir-disabled build as well, i.e.:

# rpm -qa | grep ganesha
nfs-ganesha-gluster-2.5.5-10.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.5.5-10.el7rhgs.x86_64
nfs-ganesha-2.5.5-10.el7rhgs.x86_64
glusterfs-ganesha-3.12.2-16.el7rhgs.x86_64

-----------------
[root@rhs-client9 ganesha]# rm -rf *
rm: cannot remove ‘dir2/linux-4.9.5/tools/lib/lockdep/uinclude/linux’: Directory not empty
[root@rhs-client9 ganesha]# rm -rf *
rm: cannot remove ‘dir2/linux-4.9.5/tools/lib/lockdep/uinclude/linux’: Directory not empty
[root@rhs-client9 ganesha]# rm -rf *
rm: cannot remove ‘dir2/linux-4.9.5/tools/lib/lockdep/uinclude/linux’: Directory not empty
[root@rhs-client9 ganesha]# rm -rf *
rm: cannot remove ‘dir2/linux-4.9.5/tools/lib/lockdep/uinclude/linux’: Directory not empty
[root@rhs-client9 ganesha]# rm -rf *
rm: cannot remove ‘dir2/linux-4.9.5/tools/lib/lockdep/uinclude/linux’: Directory not empty
[root@rhs-client9 ganesha]# rm -rf *
rm: cannot remove ‘dir2/linux-4.9.5/tools/lib/lockdep/uinclude/linux’: Directory not empty
[root@rhs-client9 ganesha]# rm -rf *
rm: cannot remove ‘dir2/linux-4.9.5/tools/lib/lockdep/uinclude/linux’: Directory not empty
[root@rhs-client9 ganesha]# rm -rf *
rm: cannot remove ‘dir2/linux-4.9.5/tools/lib/lockdep/uinclude/linux’: Directory not empty
[root@rhs-client9 ganesha]# ls
dir2
[root@rhs-client9 ganesha]# rm -rf *
rm: cannot remove ‘dir2/linux-4.9.5/tools/lib/lockdep/uinclude/linux’: Directory not empty
[root@rhs-client9 ganesha]# rm -rf *
rm: cannot remove ‘dir2/linux-4.9.5/tools/lib/lockdep/uinclude/linux’: Directory not empty
[root@rhs-client9 ganesha]# rm -rf *
rm: cannot remove ‘dir2/linux-4.9.5/tools/lib/lockdep/uinclude/linux’: Directory not empty
-------------------
Given this, it is very likely that this is caused by bug #1458215.