Description of problem: When a huge number of directories is removed with rm -rf, the command errors out with ENOENT.

Brick log snippet:
===============
[2012-06-22 09:34:12.588948] I [server3_1-fops.c:907:server_setxattr_cbk] 0-scalability-1-server: 47933778: SETXATTR (null) (--) ==> trusted.glusterfs.dht (No such file or directory)
[2012-06-22 09:42:23.265813] E [posix.c:223:posix_stat] 0-scalability-1-posix: lstat on /home/scalability-2/dir-2/.glusterfs/91/0d/910d3e55-6d23-472f-b95e-bcfc82d8b73d failed: No such file or directory
[2012-06-22 09:42:23.265848] I [server3_1-fops.c:1707:server_stat_cbk] 0-scalability-1-server: 48331226: STAT <gfid:910d3e55-6d23-472f-b95e-bcfc82d8b73d> (910d3e55-6d23-472f-b95e-bcfc82d8b73d) ==> -1 (No such file or directory)
[2012-06-22 09:42:25.814130] E [posix.c:223:posix_stat] 0-scalability-1-posix: lstat on /home/scalability-2/dir-2/.glusterfs/10/2e/102e40ff-4498-44ae-a5cb-8003bff283c8 failed: No such file or directory
[2012-06-22 09:42:25.814173] I [server3_1-fops.c:1707:server_stat_cbk] 0-scalability-1-server: 48334848: STAT <gfid:102e40ff-4498-44ae-a5cb-8003bff283c8> (102e40ff-4498-44ae-a5cb-8003bff283c8) ==> -1 (No such file or directory)
[2012-06-22 09:42:26.098540] E [posix.c:223:posix_stat] 0-scalability-1-posix: lstat on /home/scalability-2/dir-2/.glusterfs/f9/50/f9507285-9225-4170-a507-4eac03b5963c failed: No such file or directory
[2012-06-22 09:42:26.098567] I [server3_1-fops.c:1707:server_stat_cbk] 0-scalability-1-server: 48335279: STAT <gfid:f9507285-9225-4170-a507-4eac03b5963c> (f9507285-9225-4170-a507-4eac03b5963c) ==> -1 (No such file or directory)
[2012-06-22 09:42:26.648949] E [posix.c:223:posix_stat] 0-scalability-1-posix: lstat on /home/scalability-2/dir-2/.glusterfs/f0/a3/f0a3e720-44ba-469a-a6d3-f2994699d3bf failed: No such file or directory
[2012-06-22 09:42:26.648990] I [server3_1-fops.c:1707:server_stat_cbk] 0-scalability-1-server: 48335954: STAT <gfid:ede2772b-ba8c-4a46-90f9-d8265bee6851>/fileop_L1_83/fileop_L1_83_L2_46/fileop_dir_83_46_49 (f0a3e720-44ba-469a-a6d3-f2994699d3bf) ==> -1 (No such file or directory)
[2012-06-22 09:42:30.665662] E [posix.c:223:posix_stat] 0-scalability-1-posix: lstat on /home/scalability-2/dir-2/.glusterfs/5c/09/5c09d329-244d-4030-aa4f-02386b16963d failed: No such file or directory
[2012-06-22 09:42:30.665703] I [server3_1-fops.c:1707:server_stat_cbk] 0-scalability-1-server: 48340941: STAT <gfid:5c09d329-244d-4030-aa4f-02386b16963d> (5c09d329-244d-4030-aa4f-02386b16963d) ==> -1 (No such file or directory)
===========================

Client logs:
==============================
[2012-06-22 10:00:33.046009] E [nfs3-helpers.c:3603:nfs3_fh_resolve_inode_lookup_cbk] 0-nfs-nfsv3: Lookup failed: <gfid:5d693efa-d7c3-4328-8d18-c0bd77b5401b>: Invalid argument
[2012-06-22 10:00:33.046041] E [nfs3.c:1513:nfs3_access_resume] 0-nfs-nfsv3: Unable to resolve FH: (10.16.157.39:901) scalability-1 : 5d693efa-d7c3-4328-8d18-c0bd77b5401b
[2012-06-22 10:00:33.046052] W [nfs3-helpers.c:3389:nfs3_log_common_res] 0-nfs-nfsv3: XID: 7c44ea5f, ACCESS: NFS: 22(Invalid argument for operation), POSIX: 14(Bad address)
[2012-06-22 10:00:33.505755] W [client3_1-fops.c:474:client3_1_stat_cbk] 0-scalability-1-client-1: remote operation failed: No such file or directory
[2012-06-22 10:00:35.913556] W [client3_1-fops.c:474:client3_1_stat_cbk] 0-scalability-1-client-2: remote operation failed: No such file or directory
[2012-06-22 10:00:43.086167] W [client3_1-fops.c:474:client3_1_stat_cbk] 0-scalability-1-client-0: remote operation failed: No such file or directory
[2012-06-22 10:00:43.086200] W [client3_1-fops.c:474:client3_1_stat_cbk] 0-scalability-1-client-2: remote operation failed: No such file or directory
=================================

Steps to Reproduce:
1. Run fileop such that it creates a million directories.
Command line:
# fileop -s 50K -b -w -d `pwd` -t -f 100
2. Give it a day or more, so that it creates a huge number of directories.
3. Do an rm -rf on the directories.

Attached sosreport from the server.
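For anyone without iozone's fileop handy, the directory load can be mimicked locally with a short sketch. This is only an illustrative stand-in, not the actual reproduction (the bug needs a GlusterFS mount); it assumes, based on "-f 100" producing about a million directories, that the force factor builds roughly an N x N x N tree, and it reuses the naming pattern visible in the brick log (fileop_L1_83/fileop_L1_83_L2_46/fileop_dir_83_46_49):

```python
# Local stand-in for the fileop workload (assumption: "-f N" yields roughly
# an N x N x N directory tree, consistent with "-f 100" -> ~a million dirs).
import os
import shutil
import tempfile

root = tempfile.mkdtemp()
n = 5  # keep it tiny here; the reported run effectively used ~100

for i in range(n):
    for j in range(n):
        for k in range(n):
            os.makedirs(os.path.join(root,
                                     f"fileop_L1_{i}",
                                     f"fileop_L1_{i}_L2_{j}",
                                     f"fileop_dir_{i}_{j}_{k}"))

shutil.rmtree(root)  # the "rm -rf" step of the reproduction
assert not os.path.exists(root)
```

On a plain local filesystem this always succeeds; the bug only shows up when the removal goes through the DHT translator on a mounted volume.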
Created attachment 593689 [details] SOS report
Only one client was doing the rm, so it is not possible that some other process deleted the entries.
From the logs it looks like dht detects holes in the layout and sends setxattr (layout) calls. But by then the rmdir has already succeeded, and hence the setxattr fails with ENOENT. fileop also does its own clean-up (rm), and another manual rm -rf had been triggered. I suspect this scenario:
1. readdir returns entries.
2. fileop's clean-up and the manual rm are both in progress.
3. One of them sends a lookup while the other removes the non-hashed directory; this is when a hole in the layout is detected.
4. A heal/setxattr of the layouts is triggered, which fails, as the rmdir has succeeded by then.
If rm -rf fails, a new rm -rf on the mount should clean up successfully. Please try to reproduce the bug.
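The suspected race can be simulated locally with a minimal sketch (again, an illustrative stand-in, not GlusterFS code): two removers walk the same tree, and a stat issued after the other remover's rmdir has already succeeded fails with ENOENT, mirroring the failed lstat/STAT calls in the brick log. A final check still finds the tree fully cleaned, matching the expectation that a retried rm -rf succeeds:

```python
# Two concurrent removers racing on one tree: a stat after the other
# remover's rmdir fails with ENOENT, yet the tree still ends up empty.
import errno
import os
import shutil
import tempfile
import threading

def build_tree(root, width=20):
    for i in range(width):
        os.makedirs(os.path.join(root, f"dir_{i}", "sub"))

def remover(root, enoents):
    for name in list(os.listdir(root)):
        path = os.path.join(root, name)
        try:
            os.stat(path)                       # analogous to the post-readdir lookup
            shutil.rmtree(path, ignore_errors=True)
        except OSError as e:
            if e.errno == errno.ENOENT:
                enoents.append(path)            # the other remover got there first
            else:
                raise

root = tempfile.mkdtemp()
build_tree(root)
enoents = []
threads = [threading.Thread(target=remover, args=(root, enoents))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Regardless of which remover observed ENOENT, everything is gone,
# so a retried "rm -rf" on the mount point would succeed cleanly.
assert os.listdir(root) == []
os.rmdir(root)
```

Whether any ENOENT is actually observed depends on thread timing, which is exactly why the bug is hard to hit reliably.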
Can you please try to reproduce the bug with the latest git repo?
Unable to reproduce this issue. Will re-open (with a sosreport) if I hit this issue again.