Description of problem:
This happens consistently: when rm -rf is performed on the Master volume (via Fuse or NFS), the slave logs the errors below and fails to remove the corresponding entries from the slave volume. Geo-Rep continues to retry the removal, and after a while the files/directories do get removed.
[2015-06-24 17:10:10.844609] W [resource(slave):692:entry_ops] <top>: Recursive remove 270bb38f-fd2e-4cad-af38-200beb35fd68 => .gfid/e32ea6ee-9f46-46f2-8816-51648960fc0f/tune-profilesfailed: Directory not empty
[2015-06-24 17:10:10.857244] W [syncdutils(slave):486:errno_wrap] <top>: reached maximum retries (['270bb38f-fd2e-4cad-af38-200beb35fd68', '.gfid/e32ea6ee-9f46-46f2-8816-51648960fc0f/tune-profiles', '.gfid/e32ea6ee-9f46-46f2-8816-51648960fc0f/tune-profiles'])...[Errno 39] Directory not empty: '.gfid/e32ea6ee-9f46-46f2-8816-51648960fc0f/tune-profiles'
[2015-06-24 17:10:10.857528] W [resource(slave):692:entry_ops] <top>: Recursive remove 270bb38f-fd2e-4cad-af38-200beb35fd68 => .gfid/e32ea6ee-9f46-46f2-8816-51648960fc0f/tune-profilesfailed: Directory not empty
[2015-06-24 17:10:13.361917] W [syncdutils(slave):486:errno_wrap] <top>: reached maximum retries (['270bb38f-fd2e-4cad-af38-200beb35fd68', '.gfid/e32ea6ee-9f46-46f2-8816-51648960fc0f/tune-profiles', '.gfid/e32ea6ee-9f46-46f2-8816-51648960fc0f/tune-profiles'])...[Errno 39] Directory not empty: '.gfid/e32ea6ee-9f46-46f2-8816-51648960fc0f/tune-profiles'
[2015-06-24 17:10:13.362207] W [resource(slave):692:entry_ops] <top>: Recursive remove 270bb38f-fd2e-4cad-af38-200beb35fd68 => .gfid/e32ea6ee-9f46-46f2-8816-51648960fc0f/tune-profilesfailed: Directory not empty
[2015-06-24 17:10:18.390331] E [repce(slave):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 685, in entry_ops
    [], [ENOTEMPTY, ESTALE, ENODATA])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 475, in errno_wrap
    return call(*arg)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 667, in recursive_rmdir
    errno_wrap(os.rmdir, [path], [ENOENT, ESTALE])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 475, in errno_wrap
    return call(*arg)
OSError: [Errno 107] Transport endpoint is not connected: '.gfid/e32ea6ee-9f46-46f2-8816-51648960fc0f/alternatives'
[2015-06-24 17:10:18.398015] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-06-24 17:10:18.398405] I [syncdutils(slave):220:finalize] <top>: exiting.
Other Errors logged are:
=========================
grep "OSError" /var/log/glusterfs/geo-replication-slaves/9c0db153-6b18-4b92-bcbd-8448fba042ce\:gluster%3A%2F%2F127.0.0.1%3Aslave.log
OSError: [Errno 107] Transport endpoint is not connected: '.gfid/00546903-6a61-4ede-a703-7a00a5f3b22f/X11/fontpath.d'
OSError: [Errno 107] Transport endpoint is not connected: '.gfid/72fc70a8-ecad-4f2e-80a6-605ab1d5681e/redhat-lsb'
raise OSError(errn, os.strerror(errn))
OSError: [Errno 117] Structure needs cleaning
OSError: [Errno 107] Transport endpoint is not connected: '.gfid/547f2de5-7971-4323-837e-6ecf308a36c9/cluster/cman-notify.d'
raise OSError(errn, os.strerror(errn))
OSError: [Errno 117] Structure needs cleaning
OSError: [Errno 117] Structure needs cleaning: '.gfid/53c7d4b5-a4cb-4b77-bac8-d9476b77dec1/rhsm/pluginconf.d'
Version-Release number of selected component (if applicable):
==============================================================
glusterfs-3.7.1-5.el6rhs.x86_64
How reproducible:
=================
Always
Steps to Reproduce:
===================
1. Create Master Cluster with 4 nodes
2. Create Slave Cluster with 2 nodes
3. Create and Start Master volume (4x2)
4. Create and Start Slave volume (2x2)
5. Create and Start Meta Volume (1x3)
6. Set up password-less SSH from node1 of the master to node1 of the slave
7. Create geo-rep session between master and slave
8. Configure the session with use_meta_volume set to true
9. Start the geo-rep session (a sketch of the setup commands for steps 3-9 follows this list)
10. Mount the master and slave volume on client (Fuse & NFS)
11. From the Fuse mount of the master volume, create data. I used:
for i in {1..10}; do cp -rf /etc etc.$i ; done
for i in {1..100}; do dd if=/dev/zero of=$i bs=10M count=1 ; done
for i in {1..10}; do cp -rf /etc r$i ; done
12. From the NFS mount of the master volume, create data. I used:
for i in {11..20}; do cp -rf /etc arm.$i ; done
for i in {1..200}; do dd if=/dev/zero of=nfs.$i bs=1M count=1 ; done
13. Wait for the files to sync to the slave. Mount the slave volume and verify with arequal, ls -lRT | wc, etc.
14. Once the files have synced successfully, do "rm -rf arm.*" from the Fuse mount and "rm -rf r*".
After a while you should start seeing a lot of errors in the Master and Slave log files.
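For reference, the volume and geo-rep setup in steps 3-9 was along the following lines. This is only an illustrative sketch: the host names (m1-m4, s1-s2), brick paths and volume names (master, slave, gluster_shared_storage) are placeholders, not the exact values from this setup.

# Master volume, 4x2 distribute-replicate (run from a master node)
gluster volume create master replica 2 m1:/rhs/brick1/b1 m2:/rhs/brick1/b2 m3:/rhs/brick1/b3 m4:/rhs/brick1/b4 m1:/rhs/brick2/b5 m2:/rhs/brick2/b6 m3:/rhs/brick2/b7 m4:/rhs/brick2/b8
gluster volume start master

# Slave volume, 2x2 distribute-replicate (run from a slave node)
gluster volume create slave replica 2 s1:/rhs/brick1/b1 s2:/rhs/brick1/b2 s1:/rhs/brick2/b3 s2:/rhs/brick2/b4
gluster volume start slave

# Meta volume, 1x3 replicate, on the master cluster
gluster volume create gluster_shared_storage replica 3 m1:/rhs/brick3/meta m2:/rhs/brick3/meta m3:/rhs/brick3/meta
gluster volume start gluster_shared_storage

# Geo-rep session (after password-less ssh from m1 to s1)
gluster system:: execute gsec_create
gluster volume geo-replication master s1::slave create push-pem
gluster volume geo-replication master s1::slave config use_meta_volume true
gluster volume geo-replication master s1::slave start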
Master Log File Location:
=========================
/var/log/glusterfs/geo-replication/master/
Slave Log File Location:
========================
/var/log/glusterfs/geo-replication-slaves/
One of the main causes of the "Directory not empty" error on the slave is a race between the changelogs applied on the slave.
E.g.: a volume has 2 subvolumes. There is a single directory dir1 containing a single file file1 that hashes to subvol2.
changelog for subvol1 has - rmdir(dir1)
changelog for subvol2 has - rm file1 followed by rmdir(dir1)
If the changelog for subvol1 is replayed before the one for subvol2, rmdir(dir1) is attempted while file1 still exists, and so the slave reports the "Directory not empty" error.
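The failing order can be illustrated locally with plain shell; dir1 and file1 are just the names from the example above, and this is only a sketch of the ordering, not the actual changelog replay code:

mkdir dir1
touch dir1/file1      # file1 hashes to subvol2 in the example
rmdir dir1            # subvol1's changelog replayed first: fails with ENOTEMPTY ("Directory not empty")
rm dir1/file1         # subvol2's changelog: remove file1 first ...
rmdir dir1            # ... and only then does the rmdir succeed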