Previously, a race condition in rm -rf operations could leave subdirectories behind on some bricks, causing the operation to fail with the error "Directory not empty". Subsequent rm -rf operations would continue to fail with the same error even though the directories seemed empty when listing their contents from the mount point. These invisible directories needed to be manually deleted on the bricks before the operation would succeed. With this update, the race condition is fixed.
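For reference, the manual cleanup mentioned above amounted to removing the leftover directory directly on each brick. A minimal sketch, assuming hypothetical brick paths /bricks/brick1 and /bricks/brick2 and the leftover path from the report below; this is an illustration, not a documented procedure:

    # Run on each brick server; the brick paths are illustrative assumptions
    for brick in /bricks/brick1 /bricks/brick2; do
        [ -d "$brick/thread1/level05/level15" ] && rm -rf "$brick/thread1/level05/level15"
    done
    # A subsequent rm -rf from the mount point should then succeed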
Description of problem:
=======================
While running the geo-replication automation (snapshot+geo-rep), which performs the following sequence (a minimal shell sketch follows the steps):
1. Creates geo-rep between master and slave
2. for i in {create,chmod,chown,chgrp,symlink,hardlink,truncate,rename,rm -rf} ;
2.a: $i on master
2.b: Let the sync happen to the slave
2.c: Check that the number of files is equal via "find . | wc -l" on master and slave
2.d: Once the count matches, calculate the arequal checksum
2.e: Move to the next fop
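A minimal shell sketch of this loop, assuming hypothetical mount points /mnt/master and /mnt/slave and a hypothetical run_fop helper that applies each operation on the master; the arequal-checksum invocation is the usual form of that test utility but should be verified for your environment:

    #!/bin/bash
    MASTER=/mnt/master   # hypothetical master mount
    SLAVE=/mnt/slave     # hypothetical slave mount

    for fop in create chmod chown chgrp symlink hardlink truncate rename "rm -rf"; do
        run_fop "$fop" "$MASTER"    # hypothetical helper: perform the fop on the master
        # 2.b/2.c: wait until the entry counts on master and slave match
        until [ "$(cd "$MASTER" && find . | wc -l)" -eq "$(cd "$SLAVE" && find . | wc -l)" ]; do
            sleep 10
        done
        # 2.d: once the counts match, compare the arequal checksums of both sides
        arequal-checksum -p "$MASTER"
        arequal-checksum -p "$SLAVE"
    done    # 2.e: move to the next fop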
After the rm -rf, the slave count does not match the master, and "Directory not empty" errors are reported:
[2017-06-01 14:28:02.448498] W [resource(slave):733:entry_ops] <top>: Recursive remove 9d197476-ed88-4bf4-8060-414d3a481599 => .gfid/59b8b057-fb3d-4d90-9fbc-8ef205dc1101/level05failed: Directory not empty
[2017-06-01 14:28:02.449425] W [syncdutils(slave):506:errno_wrap] <top>: reached maximum retries (['9d197476-ed88-4bf4-8060-414d3a481599', '.gfid/59b8b057-fb3d-4d90-9fbc-8ef205dc1101/level05', '.gfid/59b8b057-fb3d-4d90-9fbc-8ef205dc1101/level05'])...[Errno 39] Directory not empty: '.gfid/59b8b057-fb3d-4d90-9fbc-8ef205dc1101/level05/level15'
[2017-06-01 14:28:02.449795] W [resource(slave):733:entry_ops] <top>: Recursive remove 9d197476-ed88-4bf4-8060-414d3a481599 => .gfid/59b8b057-fb3d-4d90-9fbc-8ef205dc1101/level05failed: Directory not empty
[2017-06-01 14:28:55.316672] W [syncdutils(slave):506:errno_wrap] <top>: reached maximum retries (['59b8b057-fb3d-4d90-9fbc-8ef205dc1101', '.gfid/00000000-0000-0000-0000-000000000001/thread1', '.gfid/00000000-0000-0000-0000-000000000001/thread1'])...[Errno 39] Directory not empty: '.gfid/00000000-0000-0000-0000-000000000001/thread1/level05/level15'
[2017-06-01 14:28:55.317033] W [resource(slave):733:entry_ops] <top>: Recursive remove 59b8b057-fb3d-4d90-9fbc-8ef205dc1101 => .gfid/00000000-0000-0000-0000-000000000001/thread1failed: Directory not empty
[2017-06-01 14:28:55.331442] W [syncdutils(slave):506:errno_wrap] <top>: reached maximum retries (['59b8b057-fb3d-4d90-9fbc-8ef205dc1101', '.gfid/00000000-0000-0000-0000-000000000001/thread1', '.gfid/00000000-0000-0000-0000-000000000001/thread1'])...[Errno 39] Directory not empty: '.gfid/00000000-0000-0000-0000-000000000001/thread1/level05/level15'
[2017-06-01 14:28:55.331787] W [resource(slave):733:entry_ops] <top>: Recursive remove 59b8b057-fb3d-4d90-9fbc-8ef205dc1101 => .gfid/00000000-0000-0000-0000-000000000001/thread1failed: Directory not empty
The directory structure on the slave is:
[root@dhcp42-10 slave]# ls -lR
.:
total 4
drwxr-xr-x. 3 root root 4096 Jun 1 19:58 thread1
./thread1:
total 4
drwxr-xr-x. 3 root root 4096 Jun 1 19:57 level05
./thread1/level05:
total 4
drwx-wxr-x. 2 42131 16284 4096 Jun 1 19:57 level15
./thread1/level05/level15:
total 0
[root@dhcp42-10 slave]#
Running ls on the absolute path and then retrying the removal resolves the issue, as sketched below.
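A sketch of that workaround, assuming a hypothetical slave mount point /mnt/slave (the directory path is the one from the listing above):

    # Explicit lookup on the absolute path, then retry the removal
    ls -l /mnt/slave/thread1/level05/level15
    rm -rf /mnt/slave/thread1

The explicit lookup appears to refresh the directory entries on the bricks, after which the recursive removal goes through.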
Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.8.4-18.4.el7rhgs.x86_64
How reproducible:
=================
Rare; seen once in the entire 3.2.0 cycle and again in 3.2.0_async, out of more than 30 total executions of this case.
Comment 4 by Nithya Balachandran, 2017-06-06 07:57:41 UTC
After discussing this with Rahul, I am moving this to 3.3.0-beyond.
Rahul will try to reproduce this during the regression cycles after enabling debug logs.
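For reference, debug logging for a geo-replication session can be enabled through the session config. The invocation below is an assumption based on the standard geo-rep CLI, with MASTER_VOL, SLAVE_HOST, and SLAVE_VOL as placeholders; verify the exact option name against the installed release:

    # Assumed geo-rep CLI form; placeholders, verify for your version
    gluster volume geo-replication MASTER_VOL SLAVE_HOST::SLAVE_VOL config log-level DEBUG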
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0658