Description of problem:
=======================
While running the geo-replication automation (snapshot + geo-rep), which does the following in sequence (a minimal sketch of this loop is included after the report):

1. Create a geo-rep session between master and slave.
2. For each fop in {create, chmod, chown, chgrp, symlink, hardlink, truncate, rename, rm -rf}:
   2.a. Run the fop on the master.
   2.b. Let the sync happen to the slave.
   2.c. Check that the number of files is equal via "find . | wc -l" on master and slave.
   2.d. Once the counts match, calculate the arequal checksum.
   2.e. Move on to the next fop.

After the rm, the slave count does not match the master and errors are reported as "Directory not empty":

[2017-06-01 14:28:02.448498] W [resource(slave):733:entry_ops] <top>: Recursive remove 9d197476-ed88-4bf4-8060-414d3a481599 => .gfid/59b8b057-fb3d-4d90-9fbc-8ef205dc1101/level05failed: Directory not empty
[2017-06-01 14:28:02.449425] W [syncdutils(slave):506:errno_wrap] <top>: reached maximum retries (['9d197476-ed88-4bf4-8060-414d3a481599', '.gfid/59b8b057-fb3d-4d90-9fbc-8ef205dc1101/level05', '.gfid/59b8b057-fb3d-4d90-9fbc-8ef205dc1101/level05'])...[Errno 39] Directory not empty: '.gfid/59b8b057-fb3d-4d90-9fbc-8ef205dc1101/level05/level15'
[2017-06-01 14:28:02.449795] W [resource(slave):733:entry_ops] <top>: Recursive remove 9d197476-ed88-4bf4-8060-414d3a481599 => .gfid/59b8b057-fb3d-4d90-9fbc-8ef205dc1101/level05failed: Directory not empty
[2017-06-01 14:28:55.316672] W [syncdutils(slave):506:errno_wrap] <top>: reached maximum retries (['59b8b057-fb3d-4d90-9fbc-8ef205dc1101', '.gfid/00000000-0000-0000-0000-000000000001/thread1', '.gfid/00000000-0000-0000-0000-000000000001/thread1'])...[Errno 39] Directory not empty: '.gfid/00000000-0000-0000-0000-000000000001/thread1/level05/level15'
[2017-06-01 14:28:55.317033] W [resource(slave):733:entry_ops] <top>: Recursive remove 59b8b057-fb3d-4d90-9fbc-8ef205dc1101 => .gfid/00000000-0000-0000-0000-000000000001/thread1failed: Directory not empty
[2017-06-01 14:28:55.331442] W [syncdutils(slave):506:errno_wrap] <top>: reached maximum retries (['59b8b057-fb3d-4d90-9fbc-8ef205dc1101', '.gfid/00000000-0000-0000-0000-000000000001/thread1', '.gfid/00000000-0000-0000-0000-000000000001/thread1'])...[Errno 39] Directory not empty: '.gfid/00000000-0000-0000-0000-000000000001/thread1/level05/level15'
[2017-06-01 14:28:55.331787] W [resource(slave):733:entry_ops] <top>: Recursive remove 59b8b057-fb3d-4d90-9fbc-8ef205dc1101 => .gfid/00000000-0000-0000-0000-000000000001/thread1failed: Directory not empty

Directory structure on the slave is:

[root@dhcp42-10 slave]# ls -lR
.:
total 4
drwxr-xr-x. 3 root root 4096 Jun 1 19:58 thread1

./thread1:
total 4
drwxr-xr-x. 3 root root 4096 Jun 1 19:57 level05

./thread1/level05:
total 4
drwx-wxr-x. 2 42131 16284 4096 Jun 1 19:57 level15

./thread1/level05/level15:
total 0
[root@dhcp42-10 slave]#

Running ls on the absolute path and then removing the directory resolves the issue.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.8.4-18.4.el7rhgs.x86_64

How reproducible:
=================
Rare; seen once in the whole of 3.2.0 and once again in 3.2.0_async, out of more than 30 executions of this test case in total.
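For context, here is a minimal sketch of the per-fop verification loop described above. It assumes the master and slave volumes are mounted at /mnt/master and /mnt/slave and that the arequal-checksum utility is available; run_fop is a hypothetical placeholder for the harness step that performs the fop, not the actual automation code:

MASTER=/mnt/master      # assumed master volume mount point
SLAVE=/mnt/slave        # assumed slave volume mount point
for fop in create chmod chown chgrp symlink hardlink truncate rename "rm -rf"; do
    run_fop "$fop" "$MASTER"        # 2.a: hypothetical harness step that runs the fop on the master
    # 2.b/2.c: wait for geo-rep to sync until the entry counts match on master and slave
    while [ "$(find "$MASTER" | wc -l)" != "$(find "$SLAVE" | wc -l)" ]; do
        sleep 10
    done
    # 2.d: compare arequal checksums once the counts match
    arequal-checksum -p "$MASTER"
    arequal-checksum -p "$SLAVE"
done

In the failing run, the loop never gets past the "rm -rf" iteration because the slave retains the empty level05/level15 directory tree shown in the listing above.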
After discussing this with Rahul, I am moving this to 3.3.0-beyond. Rahul will try to reproduce this during the regression cycles after enabling debug logs.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0658