Bug 1575553

Summary: [geo-rep]: [Errno 39] Directory not empty
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Rochelle <rallan>
Component: distribute
Assignee: Nithya Balachandran <nbalacha>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Prasad Desala <tdesala>
Severity: high
Priority: unspecified
Version: rhgs-3.4
CC: csaba, khiremat, rallan, rgowdapp, rhs-bugs, sankarshan, sheggodu, storage-qa-internal, vdas
Target Milestone: ---
Target Release: ---
Keywords: Reopened
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Cloned to: 1579615 (view as bug list)
Last Closed: 2019-04-03 04:40:33 UTC
Type: Bug
Bug Depends On: 1661258
Bug Blocks: 1579615

Description Rochelle 2018-05-07 09:44:01 UTC
Description of problem:
=======================
Ran automated test cases with a 3x3 master volume and a 3x3 slave volume (rsync + FUSE).

The geo-rep status was stuck in History Crawl, with some workers in 'Faulty' status:


MASTER NODE     MASTER VOL    MASTER BRICK                    SLAVE USER    SLAVE                        SLAVE NODE      STATUS    CRAWL STATUS     LAST_SYNCED
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
10.70.43.228    master        /bricks/brick0/master_brick0    root          ssh://10.70.41.226::slave    10.70.41.228    Active    History Crawl    2018-05-07 06:31:33          
10.70.43.228    master        /bricks/brick1/master_brick6    root          ssh://10.70.41.226::slave    10.70.41.228    Active    History Crawl    2018-05-07 06:31:15          
10.70.41.229    master        /bricks/brick0/master_brick3    root          ssh://10.70.41.226::slave    10.70.41.228    Active    History Crawl    2018-05-07 06:31:30          
10.70.41.230    master        /bricks/brick0/master_brick4    root          ssh://10.70.41.226::slave    N/A             Faulty    N/A              N/A                          
10.70.41.219    master        /bricks/brick0/master_brick5    root          ssh://10.70.41.226::slave    N/A             Faulty    N/A              N/A                          
10.70.42.174    master        /bricks/brick0/master_brick2    root          ssh://10.70.41.226::slave    N/A             Faulty    N/A              N/A                          
10.70.42.174    master        /bricks/brick1/master_brick8    root          ssh://10.70.41.226::slave    N/A             Faulty    N/A              N/A                          
10.70.43.224    master        /bricks/brick0/master_brick1    root          ssh://10.70.41.226::slave    10.70.41.227    Active    History Crawl    2018-05-07 06:31:24          
10.70.43.224    master        /bricks/brick1/master_brick7    root          ssh://10.70.41.226::slave    N/A             Faulty    N/A              N/A                          
[root@dhcp43-228 master]# gluster v info



The worker crashed with 'Directory not empty':

Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 210, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 802, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1676, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 597, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1470, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1370, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1204, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1114, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 228, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 210, in __call__
    raise res
OSError: [Errno 39] Directory not empty: '.gfid/b6c0b18a-8a5a-408b-88ec-a01fb88c8bfe/level46'
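
Errno 39 is ENOTEMPTY: the slave-side worker fails while replaying an RMDIR because the target directory still (or again) has children. A minimal, hypothetical Python illustration of that failure mode follows; this is not gsyncd code, and the 'level46' name is only borrowed from the traceback:

import errno
import os
import tempfile

# Hypothetical illustration only (not gsyncd code): an rmdir replayed on the
# slave fails with ENOTEMPTY if something recreates a child inside the
# directory before the rmdir is issued.
parent = tempfile.mkdtemp()
target = os.path.join(parent, "level46")      # stands in for .gfid/<gfid>/level46
os.mkdir(target)

# A racing create lands an entry inside the directory about to be removed.
open(os.path.join(target, "straggler"), "w").close()

try:
    os.rmdir(target)
except OSError as e:
    assert e.errno == errno.ENOTEMPTY         # [Errno 39] Directory not empty
    print("rmdir failed as in the bug:", e)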



Version-Release number of selected component (if applicable):
=============================================================
[root@dhcp43-228 master]# rpm -qa | grep gluster
glusterfs-server-3.12.2-8.el7rhgs.x86_64
glusterfs-api-3.12.2-8.el7rhgs.x86_64
glusterfs-rdma-3.12.2-8.el7rhgs.x86_64
glusterfs-cli-3.12.2-8.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-libs-3.12.2-8.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-8.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.2.x86_64
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch
glusterfs-events-3.12.2-8.el7rhgs.x86_64
glusterfs-3.12.2-8.el7rhgs.x86_64
glusterfs-fuse-3.12.2-8.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-8.el7rhgs.x86_64
python2-gluster-3.12.2-8.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64



How reproducible:
=================
1/1


Actual results:
==============
The worker crashed with 'Directory not empty' tracebacks, which flooded the logs.

Expected results:
================
There should be no crash.

Comment 7 Raghavendra G 2018-05-21 10:05:48 UTC
Can we try to reproduce this issue with the following two options set to the values specified? Since the issue is seen on the slave, please set these options on the slave volume:

* diagnostics.client-log-level to TRACE
* diagnostics.brick-log-level to TRACE

Please attach the brick and client logs to the BZ.
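
For reference, a hedged sketch of scripting these settings from a Python test helper; the slave volume name "slave" is taken from the status output above, and the commands (the standard gluster volume set form) would need to run on a node in the slave cluster:

import subprocess

SLAVE_VOLUME = "slave"  # slave volume name, per the status output above

# Raise the client and brick log levels on the slave volume to TRACE.
for option in ("diagnostics.client-log-level", "diagnostics.brick-log-level"):
    subprocess.run(
        ["gluster", "volume", "set", SLAVE_VOLUME, option, "TRACE"],
        check=True,
    )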

Comment 8 Raghavendra G 2018-06-25 03:20:03 UTC
Rochelle,

Is it possible to provide the debug data asked for in the previous email?

regards,
Raghavendra

Comment 16 Raghavendra G 2018-12-07 02:43:21 UTC
Since it's a race and not much can be found from the sosreports, there is no way to debug this issue other than code analysis.

I need the following information when we hit this issue:
1. ls -l of the problematic directory on the mount point
2. ls -l of the problematic directory on all bricks
3. all extended attributes of the problematic directory on all bricks
4. all extended attributes of the children of the problematic directory on all bricks

Since the automation run cleans everything up, there is no way to get this data after the fact. So it would be of great help if we could capture the above information through instrumentation, either in the automation framework or in gsyncd.
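
A rough sketch of such instrumentation, assuming a hypothetical helper added to the automation framework (the paths are placeholders; on the bricks it must run as root so that trusted.* xattrs are visible):

import os
import subprocess

def dump_dir_state(path):
    """Hypothetical debugging helper: capture an 'ls -l' listing plus all
    extended attributes of a directory and its immediate children."""
    report = [subprocess.run(["ls", "-l", path],
                             capture_output=True, text=True).stdout]
    for entry in [path] + [os.path.join(path, c) for c in os.listdir(path)]:
        xattrs = {name: os.getxattr(entry, name) for name in os.listxattr(entry)}
        report.append("%s -> %s" % (entry, xattrs))
    return "\n".join(report)

# Example: call this for the problematic directory on the mount point and on
# every brick before the test framework cleans up, e.g.:
#   dump_dir_state("/mnt/slave/<problem-dir>")
#   dump_dir_state("/bricks/brick0/<slave-brick>/<problem-dir>")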

Though I am planning to spend some cycles analysing the related code in DHT (my hypothesis is that a deleted subdirectory is recreated due to a race and is therefore not visible in a readdir on the parent directory issued from the mount), I am not very hopeful that it will yield any positive results. We have recently fixed such races, and my previous attempts at finding loopholes in the synchronization algorithm did not yield any positive results.

Comment 17 Raghavendra G 2019-02-12 10:58:01 UTC

*** This bug has been marked as a duplicate of bug 1661258 ***

Comment 18 Raghavendra G 2019-02-12 10:58:41 UTC
Also see bz 1458215