Description of problem:
=======================
Ran automated test cases with a 3x3 master volume and a 3x3 slave volume (Rsync + Fuse).

The geo-rep status was stuck in history crawl with some workers in 'Faulty' status:

MASTER NODE     MASTER VOL    MASTER BRICK                    SLAVE USER    SLAVE                        SLAVE NODE      STATUS    CRAWL STATUS     LAST_SYNCED
---------------------------------------------------------------------------------------------------------------------------------------------------------------
10.70.43.228    master        /bricks/brick0/master_brick0    root          ssh://10.70.41.226::slave    10.70.41.228    Active    History Crawl    2018-05-07 06:31:33
10.70.43.228    master        /bricks/brick1/master_brick6    root          ssh://10.70.41.226::slave    10.70.41.228    Active    History Crawl    2018-05-07 06:31:15
10.70.41.229    master        /bricks/brick0/master_brick3    root          ssh://10.70.41.226::slave    10.70.41.228    Active    History Crawl    2018-05-07 06:31:30
10.70.41.230    master        /bricks/brick0/master_brick4    root          ssh://10.70.41.226::slave    N/A             Faulty    N/A              N/A
10.70.41.219    master        /bricks/brick0/master_brick5    root          ssh://10.70.41.226::slave    N/A             Faulty    N/A              N/A
10.70.42.174    master        /bricks/brick0/master_brick2    root          ssh://10.70.41.226::slave    N/A             Faulty    N/A              N/A
10.70.42.174    master        /bricks/brick1/master_brick8    root          ssh://10.70.41.226::slave    N/A             Faulty    N/A              N/A
10.70.43.224    master        /bricks/brick0/master_brick1    root          ssh://10.70.41.226::slave    10.70.41.227    Active    History Crawl    2018-05-07 06:31:24
10.70.43.224    master        /bricks/brick1/master_brick7    root          ssh://10.70.41.226::slave    N/A             Faulty    N/A              N/A

[root@dhcp43-228 master]# gluster v info

The worker crashed with 'Directory not empty':

Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 210, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 802, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1676, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 597, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1470, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1370, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1204, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1114, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 228, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 210, in __call__
    raise res
OSError: [Errno 39] Directory not empty: '.gfid/b6c0b18a-8a5a-408b-88ec-a01fb88c8bfe/level46'

Version-Release number of selected component (if applicable):
=============================================================
[root@dhcp43-228 master]# rpm -qa | grep gluster
glusterfs-server-3.12.2-8.el7rhgs.x86_64
glusterfs-api-3.12.2-8.el7rhgs.x86_64
glusterfs-rdma-3.12.2-8.el7rhgs.x86_64
glusterfs-cli-3.12.2-8.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-libs-3.12.2-8.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-8.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.2.x86_64
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch
glusterfs-events-3.12.2-8.el7rhgs.x86_64
glusterfs-3.12.2-8.el7rhgs.x86_64
glusterfs-fuse-3.12.2-8.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-8.el7rhgs.x86_64
python2-gluster-3.12.2-8.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64

How reproducible:
=================
1/1

Actual results:
==============
The worker crashed with 'Directory not empty' tracebacks, which flooded the logs.
Expected results:
================
There should be no crash.
Can we try to reproduce this issue with the following two options set to the values specified? Since the issue is seen on the slave, please set these options on the slave volume (example commands below):

* diagnostics.client-log-level set to TRACE
* diagnostics.brick-log-level set to TRACE

Please attach the brick and client logs to the bz.
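For reference, a minimal sketch of the commands, assuming the slave volume is named 'slave' as in the status output above (adjust the volume name if it differs):

gluster volume set slave diagnostics.client-log-level TRACE
gluster volume set slave diagnostics.brick-log-level TRACE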
Rochelle,

Is it possible to provide the debug data asked for in the previous email?

regards,
Raghavendra
Since it's a race and not much can be found from the sos reports, there is no method other than code analysis to debug this issue. I need the following information when we hit this issue (example commands are sketched at the end of this comment):

1. ls -l of the problematic directory on the mount point
2. ls -l of the problematic directory on all bricks
3. All extended attributes of the problematic directory on all bricks
4. All extended attributes of any children of the problematic directory on all bricks

Since the automation run clears everything, there is no way to get this data afterwards. So it would be of great help if we could capture the above information through instrumentation, either in the automation framework or in gsyncd.

Though I am planning to spend some cycles on analysing the related code in DHT (my hypothesis is that a deleted subdirectory is recreated due to a race and is not visible in a readdir on the parent directory issued from the mount), I am not very hopeful that it will yield any positive results. We've recently fixed such races, and my previous attempts at finding loopholes in the synchronization algorithm didn't yield any positive results.
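For items 1-4 above, a minimal sketch of the commands to capture this data. The mount point (/mnt/slave), the slave brick path (/bricks/brick0/slave_brick0) and the <path-to>/level46 path are placeholders, not taken from the report; only the gfid comes from the traceback:

# 1. On the slave FUSE mount (placeholder path):
ls -l /mnt/slave/<path-to>/level46

# 2-4. On every brick of the slave volume (placeholder brick path):
ls -l /bricks/brick0/slave_brick0/<path-to>/level46
getfattr -d -m . -e hex /bricks/brick0/slave_brick0/<path-to>/level46
getfattr -d -m . -e hex /bricks/brick0/slave_brick0/<path-to>/level46/*

# To locate the parent directory on a brick from the gfid in the traceback
# (directory gfids appear as symlinks under the brick's .glusterfs tree):
ls -l /bricks/brick0/slave_brick0/.glusterfs/b6/c0/b6c0b18a-8a5a-408b-88ec-a01fb88c8bfe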
*** This bug has been marked as a duplicate of bug 1661258 ***
Also see bz 1458215