Description of problem:
DHT: on a dist-rep volume, rm -rf is failing with the error 'rm: cannot remove `<dir>': Is a directory'

Version-Release number of selected component (if applicable):
3.4.0.20rhs-2.el6rhs.x86_64

How reproducible:
Intermittent

Steps to Reproduce:
1. Had a dist-rep volume 3x2 as below:

[root@DVM1 ~]# gluster v info master1

Volume Name: master1
Type: Distributed-Replicate
Volume ID: fa11e206-d039-4606-92fa-29f29a9a8dfa
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.37.128:/rhs/brick1
Brick2: 10.70.37.110:/rhs/brick1
Brick3: 10.70.37.192:/rhs/brick1
Brick4: 10.70.37.88:/rhs/brick1
Brick5: 10.70.37.81:/rhs/brick1
Brick6: 10.70.37.88:/rhs/brick5/2
Options Reconfigured:
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
changelog.encoding: ascii
changelog.rollover-time: 15
changelog.fsync-interval: 3

[root@DVM1 ~]# gluster v status master1
Status of volume: master1
Gluster process                                 Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.128:/rhs/brick1                  49152   Y       14019
Brick 10.70.37.110:/rhs/brick1                  49152   Y       11927
Brick 10.70.37.192:/rhs/brick1                  49152   Y       11868
Brick 10.70.37.88:/rhs/brick1                   49152   Y       12242
Brick 10.70.37.81:/rhs/brick1                   49152   Y       12462
Brick 10.70.37.88:/rhs/brick5/2                 49153   Y       12253
NFS Server on localhost                         2049    Y       27478
Self-heal Daemon on localhost                   N/A     Y       14047
NFS Server on 10.70.37.81                       2049    Y       24359
Self-heal Daemon on 10.70.37.81                 N/A     Y       12481
NFS Server on 10.70.37.192                      2049    Y       24517
Self-heal Daemon on 10.70.37.192                N/A     Y       11887
NFS Server on 10.70.37.110                      2049    Y       23969
Self-heal Daemon on 10.70.37.110                N/A     Y       11946
NFS Server on 10.70.37.88                       2049    Y       24525
Self-heal Daemon on 10.70.37.88                 N/A     Y       12272

There are no active volume tasks

2. Tried to delete a directory and its contents from the mount point:

[root@rhs-client22 1]# mount | grep master1
10.70.37.128:master1 on /mnt/master1 type fuse.glusterfs (rw,default_permissions,allow_other,max_read=131072)
10.70.37.128:/master1 on /mnt/master1nfs type nfs (rw,addr=10.70.37.128)

[root@rhs-client22 nufa]# cd /mnt/master1/n1/1
[root@rhs-client22 1]# rm -rf etc4*
rm: cannot remove `etc40/httpd/conf.d': Is a directory
rm: cannot remove `etc40/polkit-1/localauthority/20-org.d': Is a directory
rm: cannot remove `etc40/polkit-1/localauthority/50-local.d': Is a directory
rm: cannot remove `etc40/polkit-1/localauthority/30-site.d': Is a directory
rm: cannot remove `etc40/polkit-1/localauthority/10-vendor.d': Is a directory
rm: cannot remove `etc40/polkit-1/localauthority.conf.d': Is a directory
rm: cannot remove `etc40/sysconfig/rhn/clientCaps.d': Is a directory
rm: cannot remove `etc40/selinux/targeted/logins': Is a directory
rm: cannot remove `etc40/selinux/targeted/policy': Is a directory
rm: cannot remove `etc40/X11/applnk': Is a directory
^C
[root@rhs-client22 1]# ls etc40/X11
[root@rhs-client22 1]# ls etc40/X11/applnk
ls: cannot access etc40/X11/applnk: No such file or directory
[root@rhs-client22 1]# ls etc40/X11/

3. Verified on all bricks:

brick1:
[root@DVM1 1]# pwd
/rhs/brick1/n1/1
[root@DVM1 1]# ls etc40/X11

brick2:
[root@DVM2 1]# pwd
/rhs/brick1/n1/1
[root@DVM2 1]# ls etc40/X11

brick3:
[root@DVM4 1]# ls etc40/X11
ls: cannot access etc40/X11: No such file or directory
[root@DVM4 1]# pwd
/rhs/brick1/n1/1

brick4:
[root@DVM4 1]# ls etc40/X11
ls: cannot access etc40/X11: No such file or directory
[root@DVM4 1]# pwd
/rhs/brick1/n1/1

brick5:
[root@DVM5 1]# pwd
/rhs/brick1/n1/1
[root@DVM5 1]# ls etc40/X11
[root@DVM5 1]#

brick6:
[root@DVM6 1]# cd /rhs/brick5/2
[root@DVM6 2]# ls etc40/X11
ls: cannot access etc40/X11: No such file or directory

Actual results:
rm -rf fails with the error 'rm: cannot remove `<dir>': Is a directory'

Expected results:
rm -rf should remove the entire directory structure and should not give the error 'Is a directory'.

Additional info:

Mount log (less mnt-master1.log | grep '2013-08-21 09:38:40' >> /tmp/log1):

<Snippet>
[2013-08-21 09:38:40.317602] D [afr-common.c:1388:afr_lookup_select_read_child] 0-master1-replicate-2: Source selected as 0 for /n1/1/etc40/X11
[2013-08-21 09:38:40.317618] D [afr-common.c:1125:afr_lookup_build_response_params] 0-master1-replicate-2: Building lookup response from 0
[2013-08-21 09:38:40.317652] T [io-cache.c:224:ioc_lookup_cbk] 0-master1-io-cache: locked inode(0x1bde9e30)
[2013-08-21 09:38:40.317669] T [io-cache.c:233:ioc_lookup_cbk] 0-master1-io-cache: unlocked inode(0x1bde9e30)
[2013-08-21 09:38:40.317682] T [io-cache.c:128:ioc_inode_flush] 0-master1-io-cache: locked inode(0x1bde9e30)
[2013-08-21 09:38:40.317708] T [io-cache.c:132:ioc_inode_flush] 0-master1-io-cache: unlocked inode(0x1bde9e30)
[2013-08-21 09:38:40.317722] T [io-cache.c:242:ioc_lookup_cbk] 0-master1-io-cache: locked table(0xac0c70)
[2013-08-21 09:38:40.317735] T [io-cache.c:247:ioc_lookup_cbk] 0-master1-io-cache: unlocked table(0xac0c70)
[2013-08-21 09:38:40.317770] T [fuse-bridge.c:516:fuse_entry_cbk] 0-glusterfs-fuse: 14998269: LOOKUP() /n1/1/etc40/X11 => -8096015626498867065
[2013-08-21 09:38:40.317888] T [fuse-resolve.c:53:fuse_resolve_loc_touchup] 0-fuse: return value inode_path 22
[2013-08-21 09:38:40.317939] T [fuse-bridge.c:650:fuse_lookup_resume] 0-glusterfs-fuse: 14998270: LOOKUP /n1/1/etc40/X11/applnk(a6b1460a-543f-4656-8f34-8585154fd0ea)
[2013-08-21 09:38:40.317992] T [dht-hashfn.c:97:dht_hash_compute] 0-master1-dht: trying regex for applnk
[2013-08-21 09:38:40.318041] D [afr-common.c:131:afr_lookup_xattr_req_prepare] 0-master1-replicate-0: /n1/1/etc40/X11/applnk: failed to get the gfid from dict
[2013-08-21 09:38:40.318081] T [rpc-clnt.c:1307:rpc_clnt_record] 0-master1-client-0: Auth Info: pid: 19729, uid: 0, gid: 0, owner: 0000000000000000
...
[2013-08-21 09:38:40.318335] D [afr-common.c:131:afr_lookup_xattr_req_prepare] 0-master1-replicate-1: /n1/1/etc40/X11/applnk: failed to get the gfid from dict
...
[2013-08-21 09:38:40.326204] T [fuse-bridge.c:567:fuse_entry_cbk] 0-glusterfs-fuse: 14998272: LOOKUP() /n1/1/etc40/X11/applnk => -1 (No such file or directory)
[2013-08-21 09:38:40.326333] T [fuse-resolve.c:53:fuse_resolve_loc_touchup] 0-fuse: return value inode_path 22
[2013-08-21 09:38:40.326378] T [fuse-bridge.c:655:fuse_lookup_resume] 0-glusterfs-fuse: 14998273: LOOKUP /n1/1/etc40/X11/applnk
[2013-08-21 09:38:40.326455] T [dht-hashfn.c:97:dht_hash_compute] 0-master1-dht: trying regex for applnk
...
[2013-08-21 09:38:40.329214] T [fuse-bridge.c:567:fuse_entry_cbk] 0-glusterfs-fuse: 14998273: LOOKUP() /n1/1/etc40/X11/applnk => -1 (No such file or directory)
Targeting for 3.0.0 (Denali) release.
From the log: after metadata self-heal completed, "No such file or directory" was seen in the log for the file.

[2013-08-21 09:34:31.884434] I [afr-self-heal-common.c:2744:afr_log_self_heal_completion_status] 0-master1-replicate-2: metadata self heal is successfully completed, entry self heal is successfully completed, on /n1/1/etc40/X11/applnk
[2013-08-21 09:34:31.884544] D [afr-common.c:1388:afr_lookup_select_read_child] 0-master1-replicate-2: Source selected as 0 for /n1/1/etc40/X11/applnk
[2013-08-21 09:34:31.884717] T [fuse-bridge.c:516:fuse_entry_cbk] 0-glusterfs-fuse: 14997409: LOOKUP() /n1/1/etc40/X11/applnk => -8127724620862205718
[2013-08-21 09:34:31.884811] T [fuse-resolve.c:53:fuse_resolve_loc_touchup] 0-fuse: return value inode_path 22
[2013-08-21 09:34:31.884843] T [fuse-bridge.c:2936:fuse_opendir_resume] 0-glusterfs-fuse: 14997410: OPENDIR /n1/1/etc40/X11/applnk
[2013-08-21 09:34:31.885937] W [client-rpc-fops.c:2523:client3_3_opendir_cbk] 0-master1-client-2: remote operation failed: No such file or directory. Path: /n1/1/etc40/X11/applnk (a6b1460a-543f-4656-8f34-8585154fd0ea)
[2013-08-21 09:34:31.886024] T [afr-dir-read.c:270:afr_opendir_cbk] 0-master1-replicate-0: reading contents of directory /n1/1/etc40/X11/applnk looking for mismatch
[2013-08-21 09:34:31.887828] D [afr-dir-read.c:126:afr_examine_dir_readdir_cbk] 0-master1-replicate-2: /n1/1/etc40/X11/applnk: no entries found in master1-client-4
[2013-08-21 09:34:31.888028] D [afr-dir-read.c:126:afr_examine_dir_readdir_cbk] 0-master1-replicate-0: /n1/1/etc40/X11/applnk: no entries found in master1-client-0
[2013-08-21 09:34:31.888141] T [rpc-clnt.c:669:rpc_clnt_reply_init] 0-master1-client-1: received rpc message (RPC XID: 0x7385201x Program: GlusterFS 3.3, ProgVers: 330, Proc: 28) from rpc-transport (master1-client-1)
[2013-08-21 09:34:31.888172] D [afr-dir-read.c:126:afr_examine_dir_readdir_cbk] 0-master1-replicate-0: /n1/1/etc40/X11/applnk: no entries found in master1-client-1
[2013-08-21 09:34:31.888437] T [rpc-clnt.c:669:rpc_clnt_reply_init] 0-master1-client-5: received rpc message (RPC XID: 0x12061146x Program: GlusterFS 3.3, ProgVers: 330, Proc: 28) from rpc-transport (master1-client-5)
[2013-08-21 09:34:31.888487] D [afr-dir-read.c:126:afr_examine_dir_readdir_cbk] 0-master1-replicate-2: /n1/1/etc40/X11/applnk: no entries found in master1-client-5
[2013-08-21 09:34:31.888532] T [fuse-bridge.c:1337:fuse_fd_cbk] 0-glusterfs-fuse: 14997410: OPENDIR() /n1/1/etc40/X11/applnk => 0xbc290c
[2013-08-21 09:34:31.892753] T [fuse-bridge.c:2037:fuse_rmdir_resume] 0-glusterfs-fuse: 14997415: RMDIR /n1/1/etc40/X11/applnk
[2013-08-21 09:34:31.893865] W [client-rpc-fops.c:2523:client3_3_opendir_cbk] 0-master1-client-2: remote operation failed: No such file or directory. Path: /n1/1/etc40/X11/applnk (a6b1460a-543f-4656-8f34-8585154fd0ea)
[2013-08-21 09:34:31.893940] D [dht-common.c:4816:dht_rmdir_opendir_cbk] 0-master1-dht: opendir on master1-replicate-1 for /n1/1/etc40/X11/applnk failed (No such file or directory)
[2013-08-21 09:34:31.901125] W [fuse-bridge.c:1688:fuse_unlink_cbk] 0-glusterfs-fuse: 14997415: RMDIR() /n1/1/etc40/X11/applnk => -1 (No such file or directory)

It is not clear how the directory entry was removed from the backend on all subvolumes. I tried to reproduce the result with plain DHT and replica, but was not able to reproduce it. (My volume had no geo-rep configuration.)

Rachana, can you try to reproduce the bug again with and without geo-rep enabled?
If it is reproducible only with geo-rep configuration enabled, should we move the component to geo-rep?
Rachana, the bug could have resulted only if an unlink call happened on a directory. Pranith and I tried different test cases on dht to reproduce the bug, but we couldn't reproduce it. Can you come up with a test case that reproduces it?
I also tried, but there is no specific test case; it is not always reproducible.
Dev ack to 3.0 RHS BZs
Sent one possible fix for the bug: http://review.gluster.org/#/c/7733/. The fix addresses the following issue.

* The posix_readdirp function fills in the stat information for all the entries present in the directory. If lstat of an entry failed, it used to fill the stat information of the current entry with that of the previous entry read. For example, say the current entry is a file and the previous entry read was a directory; if lstat of the current file fails, the stat info for the current file is filled with that of the previous directory, so the file is treated as a directory. One of the following two scenarios then happens, because dht_readdirp takes a directory entry only from the first up subvolume (a simplified sketch of this readdirp behaviour follows this comment):

1) If the file (now a directory for dht because of the wrong stat) is not present on the first up subvolume, it won't be processed for deletion.

2) Even if it is present on the first up subvolume, an rmdir call is issued for the file (because of the corrupted stat), which results in a "Not a directory" error, and we then see a "Directory not empty" error while trying to remove the parent directory.

*** This bug has been marked as a duplicate of bug 960910 ***
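For illustration only, here is a minimal C sketch of the readdirp stat-filling behaviour described above. This is not the actual GlusterFS posix xlator code; entry_t, fill_entries_buggy and fill_entries_fixed are hypothetical names, and the real fix in the review linked above may differ in detail. The point is only that a failed lstat() must not leave the previous entry's stat (and hence its file type) attached to the current entry.

#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

typedef struct {
    char        name[256];
    struct stat buf;            /* stat handed back to dht/fuse */
} entry_t;

/* Buggy pattern: stbuf lives outside the loop and the lstat() return
 * value is ignored, so a failed lstat() leaves the previous entry's
 * stat (possibly a directory's) attached to the current file. */
static int
fill_entries_buggy(const char *dirpath, DIR *dir, entry_t *ents, int max)
{
    struct dirent *de;
    struct stat    stbuf;
    char           path[4096];
    int            n = 0;

    memset(&stbuf, 0, sizeof(stbuf));
    while (n < max && (de = readdir(dir)) != NULL) {
        snprintf(path, sizeof(path), "%s/%s", dirpath, de->d_name);
        lstat(path, &stbuf);                    /* failure ignored */
        snprintf(ents[n].name, sizeof(ents[n].name), "%s", de->d_name);
        ents[n].buf = stbuf;                    /* may be a stale stat */
        n++;
    }
    return n;
}

/* Fixed pattern: check lstat() and hand back a zeroed stat on failure,
 * so the entry's file type is never inherited from the previous entry. */
static int
fill_entries_fixed(const char *dirpath, DIR *dir, entry_t *ents, int max)
{
    struct dirent *de;
    struct stat    stbuf;
    char           path[4096];
    int            n = 0;

    while (n < max && (de = readdir(dir)) != NULL) {
        snprintf(path, sizeof(path), "%s/%s", dirpath, de->d_name);
        if (lstat(path, &stbuf) != 0)
            memset(&stbuf, 0, sizeof(stbuf));   /* do not reuse old stat */
        snprintf(ents[n].name, sizeof(ents[n].name), "%s", de->d_name);
        ents[n].buf = stbuf;
        n++;
    }
    return n;
}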
Marked duplicate because the fix http://review.gluster.org/#/c/7733/ is a possible fix for both the "Directory not empty" and "Is a directory" errors. Please reopen this bug if it is reproduced in the future.
Got this error once with build 3.6.0.24-1.el6rhs.x86_64. The logs got cleared, but I will try to reproduce it again and upload the logs.
*** Bug 1115379 has been marked as a duplicate of this bug. ***
triage-update: Dev will test it out and take a call after that.