Bug 999496 - DHT- dist-rep volume - rm -rf is failing and giving error 'rm: cannot remove `<dir>': Is a directory
Summary: DHT- dist-rep volume - rm -rf is failing and giving error 'rm: cannot remove ...
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: distribute
Version: rhgs-3.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Susant Kumar Palai
QA Contact: RajeshReddy
URL:
Whiteboard: dht-rm-rf , triaged, dht-try-latest-b...
Duplicates: 1115379
Depends On:
Blocks:
 
Reported: 2013-08-21 12:20 UTC by Rachana Patel
Modified: 2016-06-08 04:47 UTC
CC List: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-06-08 04:47:26 UTC
Embargoed:



Description Rachana Patel 2013-08-21 12:20:11 UTC
Description of problem:
DHT- dist-rep volume - rm -rf is failing and giving error 'rm: cannot remove `<dir>': Is a directory

Version-Release number of selected component (if applicable):
3.4.0.20rhs-2.el6rhs.x86_64

How reproducible:
intermittent

Steps to Reproduce:
1. Had a dist-rep volume 3x2, as below:
[root@DVM1 ~]# gluster v info master1
 
Volume Name: master1
Type: Distributed-Replicate
Volume ID: fa11e206-d039-4606-92fa-29f29a9a8dfa
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.37.128:/rhs/brick1
Brick2: 10.70.37.110:/rhs/brick1
Brick3: 10.70.37.192:/rhs/brick1
Brick4: 10.70.37.88:/rhs/brick1
Brick5: 10.70.37.81:/rhs/brick1
Brick6: 10.70.37.88:/rhs/brick5/2
Options Reconfigured:
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
changelog.encoding: ascii
changelog.rollover-time: 15
changelog.fsync-interval: 3
[root@DVM1 ~]# gluster v status master1
Status of volume: master1
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick 10.70.37.128:/rhs/brick1				49152	Y	14019
Brick 10.70.37.110:/rhs/brick1				49152	Y	11927
Brick 10.70.37.192:/rhs/brick1				49152	Y	11868
Brick 10.70.37.88:/rhs/brick1				49152	Y	12242
Brick 10.70.37.81:/rhs/brick1				49152	Y	12462
Brick 10.70.37.88:/rhs/brick5/2				49153	Y	12253
NFS Server on localhost					2049	Y	27478
Self-heal Daemon on localhost				N/A	Y	14047
NFS Server on 10.70.37.81				2049	Y	24359
Self-heal Daemon on 10.70.37.81				N/A	Y	12481
NFS Server on 10.70.37.192				2049	Y	24517
Self-heal Daemon on 10.70.37.192			N/A	Y	11887
NFS Server on 10.70.37.110				2049	Y	23969
Self-heal Daemon on 10.70.37.110			N/A	Y	11946
NFS Server on 10.70.37.88				2049	Y	24525
Self-heal Daemon on 10.70.37.88				N/A	Y	12272
 
There are no active volume tasks

2. Tried to delete a dir and its contents from the mount point:
[root@rhs-client22 1]# mount | grep master1
10.70.37.128:master1 on /mnt/master1 type fuse.glusterfs (rw,default_permissions,allow_other,max_read=131072)
10.70.37.128:/master1 on /mnt/master1nfs type nfs (rw,addr=10.70.37.128)

[root@rhs-client22 nufa]# cd /mnt/master1/n1/1
[root@rhs-client22 1]# rm -rf etc4*
rm: cannot remove `etc40/httpd/conf.d': Is a directory
rm: cannot remove `etc40/polkit-1/localauthority/20-org.d': Is a directory
rm: cannot remove `etc40/polkit-1/localauthority/50-local.d': Is a directory
rm: cannot remove `etc40/polkit-1/localauthority/30-site.d': Is a directory
rm: cannot remove `etc40/polkit-1/localauthority/10-vendor.d': Is a directory
rm: cannot remove `etc40/polkit-1/localauthority.conf.d': Is a directory
rm: cannot remove `etc40/sysconfig/rhn/clientCaps.d': Is a directory
rm: cannot remove `etc40/selinux/targeted/logins': Is a directory
rm: cannot remove `etc40/selinux/targeted/policy': Is a directory
rm: cannot remove `etc40/X11/applnk': Is a directory
^C
[root@rhs-client22 1]# ls etc40/X11
[root@rhs-client22 1]# ls etc40/X11/applnk
ls: cannot access etc40/X11/applnk: No such file or directory
[root@rhs-client22 1]# ls etc40/X11/

3. Verified the directory on all bricks:

brick1 :-
[root@DVM1 1]# pwd
/rhs/brick1/n1/1
[root@DVM1 1]# ls etc40/X11

brick2:-
[root@DVM2 1]# pwd
/rhs/brick1/n1/1
[root@DVM2 1]# ls etc40/X11

brick3:-
[root@DVM4 1]# ls etc40/X11
ls: cannot access etc40/X11: No such file or directory
[root@DVM4 1]# pwd
/rhs/brick1/n1/1

brick4:-
[root@DVM4 1]# ls etc40/X11
ls: cannot access etc40/X11: No such file or directory
[root@DVM4 1]# pwd
/rhs/brick1/n1/1

brick5:-
[root@DVM5 1]# pwd
/rhs/brick1/n1/1
[root@DVM5 1]# ls etc40/X11
[root@DVM5 1]# 

brick6:-
[root@DVM6 1]# cd /rhs/brick5/2
[root@DVM6 2]# ls etc40/X11
ls: cannot access etc40/X11: No such file or directory


Actual results:
 rm -rf is failing and giving error 'rm: cannot remove `<dir>': Is a directory

Expected results:
rm -rf should remove the entire directory structure and should not give the error 'Is a directory'

Additional info:

mount log :-

less mnt-master1.log | grep '2013-08-21 09:38:40' >> /tmp/log1

<Snippet>

[2013-08-21 09:38:40.317602] D [afr-common.c:1388:afr_lookup_select_read_child] 0-master1-replicate-2: Source selected as 0 for /n1/1/etc40/X11
[2013-08-21 09:38:40.317618] D [afr-common.c:1125:afr_lookup_build_response_params] 0-master1-replicate-2: Building lookup response from 0
[2013-08-21 09:38:40.317652] T [io-cache.c:224:ioc_lookup_cbk] 0-master1-io-cache: locked inode(0x1bde9e30)
[2013-08-21 09:38:40.317669] T [io-cache.c:233:ioc_lookup_cbk] 0-master1-io-cache: unlocked inode(0x1bde9e30)
[2013-08-21 09:38:40.317682] T [io-cache.c:128:ioc_inode_flush] 0-master1-io-cache: locked inode(0x1bde9e30)
[2013-08-21 09:38:40.317708] T [io-cache.c:132:ioc_inode_flush] 0-master1-io-cache: unlocked inode(0x1bde9e30)
[2013-08-21 09:38:40.317722] T [io-cache.c:242:ioc_lookup_cbk] 0-master1-io-cache: locked table(0xac0c70)
[2013-08-21 09:38:40.317735] T [io-cache.c:247:ioc_lookup_cbk] 0-master1-io-cache: unlocked table(0xac0c70)
[2013-08-21 09:38:40.317770] T [fuse-bridge.c:516:fuse_entry_cbk] 0-glusterfs-fuse: 14998269: LOOKUP() /n1/1/etc40/X11 => -8096015626498867065
[2013-08-21 09:38:40.317888] T [fuse-resolve.c:53:fuse_resolve_loc_touchup] 0-fuse: return value inode_path 22
[2013-08-21 09:38:40.317939] T [fuse-bridge.c:650:fuse_lookup_resume] 0-glusterfs-fuse: 14998270: LOOKUP /n1/1/etc40/X11/applnk(a6b1460a-543f-4656-8f34-8585154fd0ea)
[2013-08-21 09:38:40.317992] T [dht-hashfn.c:97:dht_hash_compute] 0-master1-dht: trying regex for applnk
[2013-08-21 09:38:40.318041] D [afr-common.c:131:afr_lookup_xattr_req_prepare] 0-master1-replicate-0: /n1/1/etc40/X11/applnk: failed to get the gfid from dict
[2013-08-21 09:38:40.318081] T [rpc-clnt.c:1307:rpc_clnt_record] 0-master1-client-0: Auth Info: pid: 19729, uid: 0, gid: 0, owner: 0000000000000000

...

[2013-08-21 09:38:40.318335] D [afr-common.c:131:afr_lookup_xattr_req_prepare] 0-master1-replicate-1: /n1/1/etc40/X11/applnk: failed to get the gfid from dict
...

[2013-08-21 09:38:40.326204] T [fuse-bridge.c:567:fuse_entry_cbk] 0-glusterfs-fuse: 14998272: LOOKUP() /n1/1/etc40/X11/applnk => -1 (No such file or directory)
[2013-08-21 09:38:40.326333] T [fuse-resolve.c:53:fuse_resolve_loc_touchup] 0-fuse: return value inode_path 22
[2013-08-21 09:38:40.326378] T [fuse-bridge.c:655:fuse_lookup_resume] 0-glusterfs-fuse: 14998273: LOOKUP /n1/1/etc40/X11/applnk
[2013-08-21 09:38:40.326455] T [dht-hashfn.c:97:dht_hash_compute] 0-master1-dht: trying regex for applnk

...

[2013-08-21 09:38:40.329214] T [fuse-bridge.c:567:fuse_entry_cbk] 0-glusterfs-fuse: 14998273: LOOKUP() /n1/1/etc40/X11/applnk => -1 (No such file or directory)

Comment 3 Scott Haines 2013-09-27 17:08:03 UTC
Targeting for 3.0.0 (Denali) release.

Comment 5 Susant Kumar Palai 2014-03-25 08:18:24 UTC
From the log: after metadata self-heal completed, "No such file or directory" was seen in the log for the same path.

[2013-08-21 09:34:31.884434] I [afr-self-heal-common.c:2744:afr_log_self_heal_completion_status] 0-master1-replicate-2:  metadata self heal  is successfully completed, entry self heal  is successfully completed, on /n1/1/etc40/X11/applnk

[2013-08-21 09:34:31.884544] D [afr-common.c:1388:afr_lookup_select_read_child] 0-master1-replicate-2: Source selected as 0 for /n1/1/etc40/X11/applnk
   
[2013-08-21 09:34:31.884717] T [fuse-bridge.c:516:fuse_entry_cbk] 0-glusterfs-fuse: 14997409: LOOKUP() /n1/1/etc40/X11/applnk => -8127724620862205718

[2013-08-21 09:34:31.884811] T [fuse-resolve.c:53:fuse_resolve_loc_touchup] 0-fuse: return value inode_path 22

[2013-08-21 09:34:31.884843] T [fuse-bridge.c:2936:fuse_opendir_resume] 0-glusterfs-fuse: 14997410: OPENDIR /n1/1/etc40/X11/applnk

[2013-08-21 09:34:31.885937] W [client-rpc-fops.c:2523:client3_3_opendir_cbk] 0-master1-client-2: remote operation failed: No such file or directory. Path: /n1/1/etc40/X11/applnk (a6b1460a-543f-4656-8f34-8585154fd0ea)

[2013-08-21 09:34:31.886024] T [afr-dir-read.c:270:afr_opendir_cbk] 0-master1-replicate-0: reading contents of directory /n1/1/etc40/X11/applnk looking for mismatch

[2013-08-21 09:34:31.887828] D [afr-dir-read.c:126:afr_examine_dir_readdir_cbk] 0-master1-replicate-2: /n1/1/etc40/X11/applnk: no entries found in master1-client-4

[2013-08-21 09:34:31.888028] D [afr-dir-read.c:126:afr_examine_dir_readdir_cbk] 0-master1-replicate-0: /n1/1/etc40/X11/applnk: no entries found in master1-client-0
[2013-08-21 09:34:31.888141] T [rpc-clnt.c:669:rpc_clnt_reply_init] 0-master1-client-1: received rpc message (RPC XID: 0x7385201x Program: GlusterFS 3.3, ProgVers: 330, Proc: 28) from rpc-transport (master1-client-1)
[2013-08-21 09:34:31.888172] D [afr-dir-read.c:126:afr_examine_dir_readdir_cbk] 0-master1-replicate-0: /n1/1/etc40/X11/applnk: no entries found in master1-client-1
[2013-08-21 09:34:31.888437] T [rpc-clnt.c:669:rpc_clnt_reply_init] 0-master1-client-5: received rpc message (RPC XID: 0x12061146x Program: GlusterFS 3.3, ProgVers: 330, Proc: 28) from rpc-transport (master1-client-5)
[2013-08-21 09:34:31.888487] D [afr-dir-read.c:126:afr_examine_dir_readdir_cbk] 0-master1-replicate-2: /n1/1/etc40/X11/applnk: no entries found in master1-client-5
[2013-08-21 09:34:31.888532] T [fuse-bridge.c:1337:fuse_fd_cbk] 0-glusterfs-fuse: 14997410: OPENDIR() /n1/1/etc40/X11/applnk => 0xbc290c

[2013-08-21 09:34:31.892753] T [fuse-bridge.c:2037:fuse_rmdir_resume] 0-glusterfs-fuse: 14997415: RMDIR /n1/1/etc40/X11/applnk

[2013-08-21 09:34:31.893865] W [client-rpc-fops.c:2523:client3_3_opendir_cbk] 0-master1-client-2: remote operation failed: No such file or directory. Path: /n1/1/etc40/X11/applnk (a6b1460a-543f-4656-8f34-8585154fd0ea

[2013-08-21 09:34:31.893940] D [dht-common.c:4816:dht_rmdir_opendir_cbk] 0-master1-dht: opendir on master1-replicate-1 for /n1/1/etc40/X11/applnk failed (No such file or directory)

[2013-08-21 09:34:31.901125] W [fuse-bridge.c:1688:fuse_unlink_cbk] 0-glusterfs-fuse: 14997415: RMDIR() /n1/1/etc40/X11/applnk => -1 (No such file or directory)


 It is not clear how the directory entry was removed from the backend on all subvolumes.

 I tried to reproduce the result with plain DHT and replica volumes, but was not able to reproduce it. (My volume had no geo-rep configuration.)


Rachana,
  Can you try to reproduce the bug again, with and without geo-rep enabled?

  If it is reproducible only with the geo-rep configuration enabled, should we move the component to geo-rep?

Comment 6 Susant Kumar Palai 2014-03-28 06:35:09 UTC
Rachana,
   The bug could only have resulted from an unlink call being issued on a directory.
Pranith and I tried different test cases on DHT to reproduce the bug, but we could not reproduce it. Hence, can you come up with a test case for reproducing the bug?

Comment 7 Rachana Patel 2014-04-23 07:05:23 UTC
I also tried, but there is no specific test case; it is not always reproducible.

Comment 8 Nagaprasad Sathyanarayana 2014-05-06 11:43:42 UTC
Dev ack to 3.0 RHS BZs

Comment 9 Susant Kumar Palai 2014-06-03 08:54:19 UTC
Sent one possible fix for the bug: http://review.gluster.org/#/c/7733/.
The fix addresses the following issue.

* The POSIX_READDIRP function fills in the stat information of all the entries present in the directory. If the lstat of an entry fails, it used to fill the stat information of the current entry with that of the previous entry read.

E.g., say the current entry is a file and the previous entry read was a directory. If the lstat of the current file fails, the stat info for the current file will be filled with that of the previous directory; hence, the file will be treated as a directory.

Now one of the following two scenarios may happen, as dht_readdirp takes directory entries only from the first up subvolume.

1) If the file (now a directory for dht because of wrong stat) is not present on the first_up_subvolume, then it won't be processed for deletion.

2) Even if it is present on the first_up_subvolume, an rmdir call will be issued for the file (because of the corrupted stat), which will result in a "Not a directory" error.


And we will see a "Directory Not Empty" error while trying to unlink the parent directory.
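For illustration, here is a minimal standalone C sketch of the failure mode described above. It is not the code from the posted patch, and the names fill_entries and entry_t are made up for this sketch (they are not GlusterFS APIs). The point is that a single stat buffer is reused across directory entries, so an entry whose lstat() fails silently inherits the previous entry's st_mode unless the buffer is wiped and the entry flagged, which is the kind of guard sketched here.

/*
 * Sketch of a readdirp-style loop that stats every entry in a directory.
 * Bug being illustrated: the same 'struct stat' buffer is reused for every
 * entry, so when lstat() fails the entry keeps the previous entry's data,
 * e.g. a regular file can be reported with a directory's st_mode.
 */
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

typedef struct {
    char        name[256];
    struct stat st;        /* stat reported for this entry */
    int         st_valid;  /* fix: record whether 'st' really belongs to 'name' */
} entry_t;

static int fill_entries(const char *dirpath, entry_t *out, int max)
{
    DIR *dir = opendir(dirpath);
    if (!dir)
        return -1;

    struct stat stbuf;                  /* single buffer reused across iterations */
    memset(&stbuf, 0, sizeof(stbuf));

    struct dirent *de;
    char path[4096];
    int n = 0;

    while (n < max && (de = readdir(dir)) != NULL) {
        snprintf(path, sizeof(path), "%s/%s", dirpath, de->d_name);

        /* Buggy behaviour: on lstat() failure the loop would just keep going
         * and copy 'stbuf', which still holds the previous entry's data.
         * Fix sketched here: wipe the buffer and flag the stat as invalid. */
        if (lstat(path, &stbuf) != 0) {
            memset(&stbuf, 0, sizeof(stbuf));
            out[n].st_valid = 0;
        } else {
            out[n].st_valid = 1;
        }

        snprintf(out[n].name, sizeof(out[n].name), "%s", de->d_name);
        out[n].st = stbuf;
        n++;
    }

    closedir(dir);
    return n;
}

int main(void)
{
    entry_t entries[128];
    int n = fill_entries(".", entries, 128);

    for (int i = 0; i < n; i++)
        printf("%-20s %s\n", entries[i].name,
               !entries[i].st_valid           ? "stat unavailable" :
               S_ISDIR(entries[i].st.st_mode) ? "directory" : "other");
    return n < 0 ? 1 : 0;
}

The patch referenced above (http://review.gluster.org/#/c/7733/) targets the same problem in the POSIX readdirp path, so that a failed lstat no longer causes one entry to be reported with another entry's stat.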

*** This bug has been marked as a duplicate of bug 960910 ***

Comment 10 Susant Kumar Palai 2014-06-03 08:56:48 UTC
Marked as duplicate, as the fix http://review.gluster.org/#/c/7733/ is a possible fix for both the "Directory not empty" and "Is a directory" errors. Please reopen this bug if it is reproduced in the future.

Comment 11 Rachana Patel 2014-07-18 09:07:54 UTC
Got this error once with build 3.6.0.24-1.el6rhs.x86_64.

The logs got cleared, but I will try to reproduce it again and upload the logs.

Comment 12 Rachana Patel 2014-07-18 14:50:53 UTC
*** Bug 1115379 has been marked as a duplicate of this bug. ***

Comment 15 Susant Kumar Palai 2015-12-24 06:30:07 UTC
Triage update: Dev will test it out and take a call after that.

