Description of problem:
This happens consistently: when rm -rf is performed on the Master volume (via Fuse or NFS), the slave logs the errors below and fails to remove the corresponding entries from the slave volume. Geo-Rep continues to retry the removal, and after a while the files/directories do get removed.
[2015-06-24 17:10:10.844609] W [resource(slave):692:entry_ops] <top>: Recursive remove 270bb38f-fd2e-4cad-af38-200beb35fd68 => .gfid/e32ea6ee-9f46-46f2-8816-51648960fc0f/tune-profilesfailed: Directory not empty
[2015-06-24 17:10:10.857244] W [syncdutils(slave):486:errno_wrap] <top>: reached maximum retries (['270bb38f-fd2e-4cad-af38-200beb35fd68', '.gfid/e32ea6ee-9f46-46f2-8816-51648960fc0f/tune-profiles', '.gfid/e32ea6ee-9f46-46f2-8816-51648960fc0f/tune-profiles'])...[Errno 39] Directory not empty: '.gfid/e32ea6ee-9f46-46f2-8816-51648960fc0f/tune-profiles'
[2015-06-24 17:10:10.857528] W [resource(slave):692:entry_ops] <top>: Recursive remove 270bb38f-fd2e-4cad-af38-200beb35fd68 => .gfid/e32ea6ee-9f46-46f2-8816-51648960fc0f/tune-profilesfailed: Directory not empty
[2015-06-24 17:10:13.361917] W [syncdutils(slave):486:errno_wrap] <top>: reached maximum retries (['270bb38f-fd2e-4cad-af38-200beb35fd68', '.gfid/e32ea6ee-9f46-46f2-8816-51648960fc0f/tune-profiles', '.gfid/e32ea6ee-9f46-46f2-8816-51648960fc0f/tune-profiles'])...[Errno 39] Directory not empty: '.gfid/e32ea6ee-9f46-46f2-8816-51648960fc0f/tune-profiles'
[2015-06-24 17:10:13.362207] W [resource(slave):692:entry_ops] <top>: Recursive remove 270bb38f-fd2e-4cad-af38-200beb35fd68 => .gfid/e32ea6ee-9f46-46f2-8816-51648960fc0f/tune-profilesfailed: Directory not empty
[2015-06-24 17:10:18.390331] E [repce(slave):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 685, in entry_ops
    [], [ENOTEMPTY, ESTALE, ENODATA])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 475, in errno_wrap
    return call(*arg)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 667, in recursive_rmdir
    errno_wrap(os.rmdir, [path], [ENOENT, ESTALE])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 475, in errno_wrap
    return call(*arg)
OSError: [Errno 107] Transport endpoint is not connected: '.gfid/e32ea6ee-9f46-46f2-8816-51648960fc0f/alternatives'
[2015-06-24 17:10:18.398015] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-06-24 17:10:18.398405] I [syncdutils(slave):220:finalize] <top>: exiting.
Other Errors logged are:
=========================
grep "OSError" /var/log/glusterfs/geo-replication-slaves/9c0db153-6b18-4b92-bcbd-8448fba042ce\:gluster%3A%2F%2F127.0.0.1%3Aslave.log
OSError: [Errno 107] Transport endpoint is not connected: '.gfid/00546903-6a61-4ede-a703-7a00a5f3b22f/X11/fontpath.d'
OSError: [Errno 107] Transport endpoint is not connected: '.gfid/72fc70a8-ecad-4f2e-80a6-605ab1d5681e/redhat-lsb'
raise OSError(errn, os.strerror(errn))
OSError: [Errno 117] Structure needs cleaning
OSError: [Errno 107] Transport endpoint is not connected: '.gfid/547f2de5-7971-4323-837e-6ecf308a36c9/cluster/cman-notify.d'
raise OSError(errn, os.strerror(errn))
OSError: [Errno 117] Structure needs cleaning
OSError: [Errno 117] Structure needs cleaning: '.gfid/53c7d4b5-a4cb-4b77-bac8-d9476b77dec1/rhsm/pluginconf.d'
Version-Release number of selected component (if applicable):
==============================================================
glusterfs-3.7.1-5.el6rhs.x86_64
How reproducible:
=================
Always
Steps to Reproduce:
===================
1. Create Master Cluster with 4 nodes
2. Create Slave Cluster with 2 nodes
3. Create and Start Master volume (4x2)
4. Create and Start Slave volume (2x2)
5. Create and Start Meta Volume (1x3)
6. Set up password-less SSH from node1 of the master to node1 of the slave
7. Create geo-rep session between master and slave
8. Configure the session with use_meta_volume set to true
9. Start the geo-rep session (a sketch of the setup commands for steps 3-9 follows this list)
10. Mount the master and slave volume on client (Fuse & NFS)
11. From the Fuse mount of the master volume, create data. I used:
for i in {1..10}; do cp -rf /etc etc.$i ; done
for i in {1..100}; do dd if=/dev/zero of=$i bs=10M count=1 ; done
for i in {1..10}; do cp -rf /etc r$i ; done
12. From the NFS mount of the master volume, create data. I used:
for i in {11..20}; do cp -rf /etc arm.$i ; done
for i in {1..200}; do dd if=/dev/zero of=nfs.$i bs=1M count=1 ; done
13. Wait for the files to sync to the slave. Mount the slave volume and verify with arequal, ls -lRT | wc, etc.
14. Once the files have synced successfully, do "rm -rf arm.*" from the Fuse mount and "rm -rf r*".
After a while you should start seeing a lot of errors in the Master and Slave log files.
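For reference, the volume and geo-rep setup in steps 3-9 was along the following lines. This is only an illustrative sketch: the host names (m1-m4, s1-s2), brick paths and volume names (master, slave, gluster_shared_storage) are placeholders, not the exact values from this setup.

# Master volume, 4x2 distribute-replicate (run from a master node)
gluster volume create master replica 2 m1:/rhs/brick1/b1 m2:/rhs/brick1/b2 m3:/rhs/brick1/b3 m4:/rhs/brick1/b4 m1:/rhs/brick2/b5 m2:/rhs/brick2/b6 m3:/rhs/brick2/b7 m4:/rhs/brick2/b8
gluster volume start master

# Slave volume, 2x2 distribute-replicate (run from a slave node)
gluster volume create slave replica 2 s1:/rhs/brick1/b1 s2:/rhs/brick1/b2 s1:/rhs/brick2/b3 s2:/rhs/brick2/b4
gluster volume start slave

# Meta volume, 1x3 replicate, on the master cluster
gluster volume create gluster_shared_storage replica 3 m1:/rhs/brick3/meta m2:/rhs/brick3/meta m3:/rhs/brick3/meta
gluster volume start gluster_shared_storage

# Geo-rep session (after password-less ssh from m1 to s1)
gluster system:: execute gsec_create
gluster volume geo-replication master s1::slave create push-pem
gluster volume geo-replication master s1::slave config use_meta_volume true
gluster volume geo-replication master s1::slave start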
Master Log File Location:
=========================
/var/log/glusterfs/geo-replication/master/
Slave Log File Location:
========================
/var/log/glusterfs/geo-replication-slaves/
One of the main causes of the "Directory not empty" error on the slave is a race between the changelogs applied on the slave.
E.g.: a volume has 2 subvolumes. There is a single directory dir1 containing a single file file1 that hashes to subvol2.
changelog for subvol1 has - rmdir(dir1)
changelog for subvol2 has - rm file1 followed by rmdir(dir1)
If the changelog for subvol1 is replayed before the one for subvol2, rmdir(dir1) is attempted while file1 still exists, and so the slave reports the "Directory not empty" error.
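The failing order can be illustrated locally with plain shell; dir1 and file1 are just the names from the example above, and this is only a sketch of the ordering, not the actual changelog replay code:

mkdir dir1
touch dir1/file1      # file1 hashes to subvol2 in the example
rmdir dir1            # subvol1's changelog replayed first: fails with ENOTEMPTY ("Directory not empty")
rm dir1/file1         # subvol2's changelog: remove file1 first ...
rmdir dir1            # ... and only then does the rmdir succeed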