Bug 1458215

Summary: Slave reports ENOTEMPTY when rmdir is executed on master
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Rahul Hinduja <rhinduja>
Component: distribute
Assignee: Nithya Balachandran <nbalacha>
Status: CLOSED ERRATA
QA Contact: Rochelle <rallan>
Severity: high
Docs Contact:
Priority: medium
Version: rhgs-3.2
CC: akrishna, atumball, bkunal, nbalacha, nchilaka, rallan, rhinduja, rhs-bugs, sheggodu, smali, spalai, srmukher, storage-qa-internal
Target Milestone: ---
Keywords: ZStream
Target Release: RHGS 3.4.z Batch Update 4
Hardware: x86_64
OS: Linux
Whiteboard: dht-rm-rf
Fixed In Version: glusterfs-3.12.2-47
Doc Type: Bug Fix
Doc Text:
Previously, a race condition in rm -rf operations could leave subdirectories behind on some bricks, causing the operation to fail with the error "Directory not empty". Subsequent rm -rf operations would continue to fail with the same error even though the directories seemed empty when listing their contents from the mount point. These invisible directories needed to be manually deleted on the bricks before the operation would succeed. With this update, the race condition is fixed.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-03-27 03:43:36 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1676400, 1677260, 1695403    
Bug Blocks: 1661258, 1678183    

Description Rahul Hinduja 2017-06-02 10:21:56 UTC
Description of problem:
=======================

While running the geo-replication automation (snapshot + geo-rep), which performs the following steps in sequence (a shell sketch of the loop follows the list):

1. Creates geo-rep between master and slave
2. for i in {create,chmod,chown,chgrp,symlink,hardlink,truncate,rename,rm -rf} ;
      2.a: $i on master
      2.b: Let the sync happen to Slave
      2.c: Check the number of files to be equal via "find . | wc -l" on master and slave
      2.d: Once the count matches, calculate the arequal checksum
      2.e: Move on to the next fop
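
The following is a minimal bash sketch of that loop, not the actual test harness. The run_fop helper, the mount points /mnt/master and /mnt/slave, and the sleep interval are hypothetical placeholders, and the arequal-checksum invocation shown here (taking the mount path via -p) is only assumed:

MASTER=/mnt/master    # master volume mount point (placeholder)
SLAVE=/mnt/slave      # slave volume mount point (placeholder)

for fop in create chmod chown chgrp symlink hardlink truncate rename "rm -rf"; do
    run_fop "$fop" "$MASTER"      # 2.a: hypothetical helper that applies the fop on the master

    # 2.b/2.c: let the sync happen, then wait until entry counts on master and slave match
    until [ "$(find "$MASTER" | wc -l)" -eq "$(find "$SLAVE" | wc -l)" ]; do
        sleep 10
    done

    # 2.d: once the counts match, compare arequal checksums of both mounts
    arequal-checksum -p "$MASTER" > /tmp/master.sum
    arequal-checksum -p "$SLAVE"  > /tmp/slave.sum
    diff -q /tmp/master.sum /tmp/slave.sum || echo "arequal mismatch after $fop"
done                              # 2.e: move on to the next fop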

After the rm -rf, the slave count does not match the master's, and the following "Directory not empty" errors are reported:

[2017-06-01 14:28:02.448498] W [resource(slave):733:entry_ops] <top>: Recursive remove 9d197476-ed88-4bf4-8060-414d3a481599 => .gfid/59b8b057-fb3d-4d90-9fbc-8ef205dc1101/level05failed: Directory not empty
[2017-06-01 14:28:02.449425] W [syncdutils(slave):506:errno_wrap] <top>: reached maximum retries (['9d197476-ed88-4bf4-8060-414d3a481599', '.gfid/59b8b057-fb3d-4d90-9fbc-8ef205dc1101/level05', '.gfid/59b8b057-fb3d-4d90-9fbc-8ef205dc1101/level05'])...[Errno 39] Directory not empty: '.gfid/59b8b057-fb3d-4d90-9fbc-8ef205dc1101/level05/level15'
[2017-06-01 14:28:02.449795] W [resource(slave):733:entry_ops] <top>: Recursive remove 9d197476-ed88-4bf4-8060-414d3a481599 => .gfid/59b8b057-fb3d-4d90-9fbc-8ef205dc1101/level05failed: Directory not empty
[2017-06-01 14:28:55.316672] W [syncdutils(slave):506:errno_wrap] <top>: reached maximum retries (['59b8b057-fb3d-4d90-9fbc-8ef205dc1101', '.gfid/00000000-0000-0000-0000-000000000001/thread1', '.gfid/00000000-0000-0000-0000-000000000001/thread1'])...[Errno 39] Directory not empty: '.gfid/00000000-0000-0000-0000-000000000001/thread1/level05/level15'
[2017-06-01 14:28:55.317033] W [resource(slave):733:entry_ops] <top>: Recursive remove 59b8b057-fb3d-4d90-9fbc-8ef205dc1101 => .gfid/00000000-0000-0000-0000-000000000001/thread1failed: Directory not empty
[2017-06-01 14:28:55.331442] W [syncdutils(slave):506:errno_wrap] <top>: reached maximum retries (['59b8b057-fb3d-4d90-9fbc-8ef205dc1101', '.gfid/00000000-0000-0000-0000-000000000001/thread1', '.gfid/00000000-0000-0000-0000-000000000001/thread1'])...[Errno 39] Directory not empty: '.gfid/00000000-0000-0000-0000-000000000001/thread1/level05/level15'
[2017-06-01 14:28:55.331787] W [resource(slave):733:entry_ops] <top>: Recursive remove 59b8b057-fb3d-4d90-9fbc-8ef205dc1101 => .gfid/00000000-0000-0000-0000-000000000001/thread1failed: Directory not empty


The directory structure left behind on the slave is:

[root@dhcp42-10 slave]# ls -lR
.:
total 4
drwxr-xr-x. 3 root root 4096 Jun  1 19:58 thread1

./thread1:
total 4
drwxr-xr-x. 3 root root 4096 Jun  1 19:57 level05

./thread1/level05:
total 4
drwx-wxr-x. 2 42131 16284 4096 Jun  1 19:57 level15

./thread1/level05/level15:
total 0
[root@dhcp42-10 slave]#

Running ls on the absolute path and then removing it resolves the issue (as sketched below).
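
A minimal sketch of that workaround, assuming /mnt/slave is a placeholder for the slave volume mount point; the exact path that was listed is not stated above, so this assumes the deepest leftover directory from the listing:

# List the leftover directory by its absolute path first, then remove the tree.
ls -lR /mnt/slave/thread1/level05/level15
rm -rf /mnt/slave/thread1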
    

Version-Release number of selected component (if applicable):
=============================================================

glusterfs-3.8.4-18.4.el7rhgs.x86_64


How reproducible:
=================

Rare; seen once during the entire 3.2.0 cycle and once again in 3.2.0_async, out of more than 30 executions of this test case in total.

Comment 4 Nithya Balachandran 2017-06-06 07:57:41 UTC
After discussing this with Rahul, I am moving this to 3.3.0-beyond. 
Rahul will try to reproduce this during the regression cycles after enabling debug logs.

Comment 30 errata-xmlrpc 2019-03-27 03:43:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0658