Bug 1558974 - [Ganesha] Unable to delete few files from mount point while performing rm -rf post linux untars and lookups
Summary: [Ganesha] Unable to delete few files from mount point while performing rm -rf...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: nfs-ganesha
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Kaleb KEITHLEY
QA Contact: Manisha Saini
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-03-21 12:36 UTC by Manisha Saini
Modified: 2019-10-25 04:34 UTC (History)
11 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-05-06 12:04:02 UTC
Embargoed:


Attachments (Terms of Use)

Description Manisha Saini 2018-03-21 12:36:54 UTC
Description of problem:

After performing Linux kernel untars from 4 clients while simultaneously performing lookups on the mount point from all 4 clients, rm -rf * is unable to delete a few files from the mount point.
Initially, rm -rf * was performed from all 4 clients simultaneously on the same directories across the mount points. It failed for a few files.

rm -rf * was then performed again from each client one by one. Removal of those files from the client failed even after several attempts.

# rm -rf *
rm: cannot remove ‘dir4/linux-4.9.5/drivers/acpi/nfit’: Directory not empty
rm: cannot remove ‘dir4/linux-4.9.5/Documentation/devicetree/bindings/iio/temperature’: Directory not empty



Version-Release number of selected component (if applicable):

# rpm -qa | grep ganesha
nfs-ganesha-gluster-2.5.5-3.el7rhgs.x86_64
glusterfs-ganesha-3.12.2-5.el7rhgs.x86_64
nfs-ganesha-2.5.5-3.el7rhgs.x86_64



How reproducible:
Reporting first instance


Steps to Reproduce:
1. Create a 4-node ganesha cluster.
2. Create a 4x3 Distributed-Replicate volume. Export the volume.
3. Mount the volume on 4 different clients using 4 VIPs, with each server node's VIP mapped to a different client mount.
4. Create 4 directories on the mount point.
5. Run Linux kernel untars from the 4 clients into the 4 different directories. At the same time, perform lookups on the mount point from all 4 clients.
6. Perform rm -rf * from all 4 clients on the same mount point simultaneously while the lookups are running in parallel:
# cd /mnt/ganesha_mount
# rm -rf *

Actual results:
Unable to delete a few files from the mount point even after several attempts of rm -rf *, both from all clients at once and from a single client.

ganesha_mount]# rm -rf *
rm: cannot remove ‘dir4/linux-4.9.5/drivers/acpi/nfit’: Directory not empty
rm: cannot remove ‘dir4/linux-4.9.5/Documentation/devicetree/bindings/iio/temperature’: Directory not empty



Expected results:
All files should be removed from the mount points.


Additional info:

On one of the nodes:

# tailf /var/log/ganesha/ganesha.log

21/03/2018 10:09:47 : epoch 16060000 : dhcp37-103.lab.eng.blr.redhat.com : ganesha.nfsd-2431[work-50] nfs_in_grace :STATE :EVENT :NFS Server Now NOT IN GRACE
21/03/2018 11:40:15 : epoch 16060000 : dhcp37-103.lab.eng.blr.redhat.com : ganesha.nfsd-2431[work-104] mdcache_avl_qp_insert :INODE :WARN :Duplicate filename adaptec with different cookies ckey 127 chunk 0x7fb1341f9f60 don't match existing ckey 10f chunk 0x7fb13811e390
21/03/2018 11:40:15 : epoch 16060000 : dhcp37-103.lab.eng.blr.redhat.com : ganesha.nfsd-2431[work-104] mdc_readdir_chunk_object :INODE :CRIT :Collision while adding dirent for adaptec

Unable to find any specific log errors in ganesha-gfapi.log causing failure.

Attaching tcpdumps (taken of server and client while performing rm -rf * causing failure) and sosreports.

Comment 3 Frank Filz 2018-03-21 14:21:50 UTC
Ah, you've managed to find a test that hits what I suspected could be an issue...

The problem is the way directory cookies are generated... With the POSIX readdir system call (actually getdents now), the d_off value that we use as the cookie is NOT the "address" of the entry in question; it's the "address" of the NEXT dirent. With a single readdir going on and no files being removed, that works just fine. With a lot of churn in the directory, there is the possibility that a file gets added between the dirent that has a particular d_off and the next dirent that is actually at that d_off. Now the new file has the same d_off.

Or, with multiple readdirs, let's say the files are:

"first" (100)
"new" (200)
"xlast" (300)

(Note that d_off is probably not actually in alpha order; we're just pretending it is to make the example easier to follow. The numbers in parentheses are the actual addresses of each dirent.) So a readdir happening AFTER "new" has already been added would show:

"first" (d_off = 200)
"new" (d_off = 300)
"xlast" (d_off = 400)

A readdir happening BEFORE "new" is added would show:

"first" (d_off = 300)
"xlast" (d_off = 400)

So now, if we had a BEFORE readdir followed by an AFTER readdir, see how the cookie for "first" has changed; also see how the cookie for "new" is a duplicate of the cookie for the original instance of "first".
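The BEFORE/AFTER scenario above can be sketched as a small model (hypothetical Python for illustration, not ganesha code; the function name, the address values, and the end-of-directory offset are all made up to match the example):

```python
def readdir_cookies(entries):
    """entries: list of (name, address) in directory order.

    Returns {name: cookie}, where the cookie is the NEXT entry's
    address -- mimicking the d_off semantics described above. The
    last entry points past the end of the directory (modelled here
    as its own address + 100)."""
    cookies = {}
    for i, (name, addr) in enumerate(entries):
        next_addr = entries[i + 1][1] if i + 1 < len(entries) else addr + 100
        cookies[name] = next_addr
    return cookies

# Snapshot BEFORE "new" (at address 200) is added:
before = readdir_cookies([("first", 100), ("xlast", 300)])
# Snapshot AFTER "new" is added:
after = readdir_cookies([("first", 100), ("new", 200), ("xlast", 300)])

print(before)  # {'first': 300, 'xlast': 400}
print(after)   # {'first': 200, 'new': 300, 'xlast': 400}
# The cookie for "first" changed (300 -> 200), and "new" now carries
# the cookie that identified "first" in the earlier snapshot -- the
# duplicate that mdcache_avl_qp_insert warns about.
```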

It happens there's a way to fix this... The FSAL readdir keeps track of the PREVIOUS d_off/cookie for each dirent and uses that one, which is the actual "address" of the dirent in question. Now each dirent has a deterministic cookie under most modern filesystems (where d_off is actually a hash value of the file name rather than an offset into a directory flat file).

I used this mechanism in this patch:

https://review.gerrithub.io/#/c/354400/

That patch implements a brute-force compute_readdir_cookie operation and, to have a consistent cookie for an entry, relies on use of the d_off from the previous dirent.
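Extending the earlier toy model, the previous-d_off trick can be sketched like this (again hypothetical Python for illustration, not the actual patch; names and addresses are invented):

```python
def stable_cookies(entries):
    """entries: list of (name, address) in directory order.

    Track the previous dirent's d_off and use it as the current
    entry's cookie. Since each d_off is the address of the NEXT
    dirent, the previous entry's d_off is this entry's own address,
    which does not move when neighbours are added or removed."""
    cookies = {}
    prev_d_off = None
    for i, (name, addr) in enumerate(entries):
        d_off = entries[i + 1][1] if i + 1 < len(entries) else addr + 100
        # First entry has no predecessor; fall back to its own address.
        cookies[name] = prev_d_off if prev_d_off is not None else addr
        prev_d_off = d_off
    return cookies

stable_before = stable_cookies([("first", 100), ("xlast", 300)])
stable_after = stable_cookies([("first", 100), ("new", 200), ("xlast", 300)])

print(stable_before)  # {'first': 100, 'xlast': 300}
print(stable_after)   # {'first': 100, 'new': 200, 'xlast': 300}
# Unlike the next-d_off scheme, "first" and "xlast" keep the same
# cookie across both snapshots, so a resumed readdir lands correctly.
```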

Comment 7 Frank Filz 2018-04-04 20:51:23 UTC
So this has been run with my proposed patch? Are we still seeing the WARN and CRIT messages?

Comment 10 Frank Filz 2018-04-19 21:25:45 UTC
Kaleb mentioned he was able to recreate this with a single client doing untar followed by rm -Rf but could not duplicate with FSAL_VFS. This suggests to me that the issue is in FSAL_GLUSTER or libgfapi.

Comment 18 Daniel Gryniewicz 2018-07-12 13:02:18 UTC
If you unmount and remount the client, does that fix the issue?

Comment 21 Manisha Saini 2018-08-22 07:25:11 UTC
Observing this issue with the readdir-disabled build as well, i.e.:

# rpm -qa | grep ganesha
nfs-ganesha-gluster-2.5.5-10.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.5.5-10.el7rhgs.x86_64
nfs-ganesha-2.5.5-10.el7rhgs.x86_64
glusterfs-ganesha-3.12.2-16.el7rhgs.x86_64


-----------------
[root@rhs-client9 ganesha]# rm -rf *
rm: cannot remove ‘dir2/linux-4.9.5/tools/lib/lockdep/uinclude/linux’: Directory not empty
[root@rhs-client9 ganesha]# rm -rf *
rm: cannot remove ‘dir2/linux-4.9.5/tools/lib/lockdep/uinclude/linux’: Directory not empty
[root@rhs-client9 ganesha]# rm -rf *
rm: cannot remove ‘dir2/linux-4.9.5/tools/lib/lockdep/uinclude/linux’: Directory not empty
[root@rhs-client9 ganesha]# rm -rf *
rm: cannot remove ‘dir2/linux-4.9.5/tools/lib/lockdep/uinclude/linux’: Directory not empty
[root@rhs-client9 ganesha]# rm -rf *
rm: cannot remove ‘dir2/linux-4.9.5/tools/lib/lockdep/uinclude/linux’: Directory not empty
[root@rhs-client9 ganesha]# rm -rf *
rm: cannot remove ‘dir2/linux-4.9.5/tools/lib/lockdep/uinclude/linux’: Directory not empty
[root@rhs-client9 ganesha]# rm -rf *
rm: cannot remove ‘dir2/linux-4.9.5/tools/lib/lockdep/uinclude/linux’: Directory not empty
[root@rhs-client9 ganesha]# rm -rf *
rm: cannot remove ‘dir2/linux-4.9.5/tools/lib/lockdep/uinclude/linux’: Directory not empty
[root@rhs-client9 ganesha]# ls
dir2
[root@rhs-client9 ganesha]# rm -rf *
rm: cannot remove ‘dir2/linux-4.9.5/tools/lib/lockdep/uinclude/linux’: Directory not empty
[root@rhs-client9 ganesha]# rm -rf *
rm: cannot remove ‘dir2/linux-4.9.5/tools/lib/lockdep/uinclude/linux’: Directory not empty
[root@rhs-client9 ganesha]# rm -rf *
rm: cannot remove ‘dir2/linux-4.9.5/tools/lib/lockdep/uinclude/linux’: Directory not empty
-------------------

Comment 22 Frank Filz 2018-08-22 14:07:05 UTC
Given this, it is very likely this is caused by bug #1458215.

