Bug 1575553

Summary: [geo-rep]: [Errno 39] Directory not empty
Product: Red Hat Gluster Storage
Component: distribute
Version: rhgs-3.4
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: unspecified
Reporter: Rochelle <rallan>
Assignee: Nithya Balachandran <nbalacha>
QA Contact: Prasad Desala <tdesala>
Docs Contact:
CC: csaba, khiremat, rallan, rgowdapp, rhs-bugs, sankarshan, sheggodu, storage-qa-internal, vdas
Keywords: Reopened
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Cloned To: 1579615 (view as bug list)
Environment:
Last Closed: 2019-04-03 04:40:33 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On: 1661258
Bug Blocks: 1579615

Description Rochelle 2018-05-07 09:44:01 UTC
Description of problem:
Ran automated test cases with a 3x3 master volume and a 3x3 slave volume (rsync + FUSE).

The geo-rep status was stuck in History Crawl, with some workers in 'Faulty' status:

MASTER VOL    MASTER BRICK                    SLAVE USER    SLAVE     SLAVE NODE    STATUS    CRAWL STATUS     LAST_SYNCED
master        /bricks/brick0/master_brick0    root          ssh://                  Active    History Crawl    2018-05-07 06:31:33
master        /bricks/brick1/master_brick6    root          ssh://                  Active    History Crawl    2018-05-07 06:31:15
master        /bricks/brick0/master_brick3    root          ssh://                  Active    History Crawl    2018-05-07 06:31:30
master        /bricks/brick0/master_brick4    root          ssh://    N/A           Faulty    N/A              N/A
master        /bricks/brick0/master_brick5    root          ssh://    N/A           Faulty    N/A              N/A
master        /bricks/brick0/master_brick2    root          ssh://    N/A           Faulty    N/A              N/A
master        /bricks/brick1/master_brick8    root          ssh://    N/A           Faulty    N/A              N/A
master        /bricks/brick0/master_brick1    root          ssh://                  Active    History Crawl    2018-05-07 06:31:24
master        /bricks/brick1/master_brick7    root          ssh://    N/A           Faulty    N/A              N/A
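For reference, output of this shape typically comes from the geo-replication status command; a sketch with placeholder slave host and volume names (not taken from this report):

    # <slave-host> and <slave-vol> are placeholders
    gluster volume geo-replication master <slave-host>::<slave-vol> status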
[root@dhcp43-228 master]# gluster v info

The worker crashed with 'Directory not empty':

Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 210, in main
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 802, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1676, in service_loop
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 597, in crawlwrap
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1470, in crawl
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1370, in changelogs_batch_process
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1204, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1114, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 228, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 210, in __call__
    raise res
OSError: [Errno 39] Directory not empty: '.gfid/b6c0b18a-8a5a-408b-88ec-a01fb88c8bfe/level46'
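For reference, errno 39 on Linux is ENOTEMPTY, the error rmdir returns when the target directory still has entries. A minimal, hypothetical shell illustration (the paths are placeholders, unrelated to the volumes in this bug):

    mkdir -p /tmp/parent/child
    rmdir /tmp/parent
    # rmdir: failed to remove '/tmp/parent': Directory not empty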

Version-Release number of selected component (if applicable):
[root@dhcp43-228 master]# rpm -qa | grep gluster

How reproducible:

Actual results:
The worker crashed with 'Directory not empty' tracebacks, which flooded the logs.

Expected results:
There should be no crash

Comment 7 Raghavendra G 2018-05-21 10:05:48 UTC
Can we try to reproduce this issue after setting the following two options to the values specified? (Since the issue is seen on the slave, please set these options on the slave volume.)

* diagnostics.client-log-level to TRACE
* diagnostics.brick-log-level to TRACE

Please attach the brick and client logs to the bz.
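A minimal sketch of the corresponding commands, assuming the slave volume is named 'slave' (a placeholder):

    # Run on a node of the slave cluster; 'slave' is a placeholder volume name.
    gluster volume set slave diagnostics.client-log-level TRACE
    gluster volume set slave diagnostics.brick-log-level TRACE

    # Revert once the logs are captured, since TRACE is extremely verbose.
    gluster volume reset slave diagnostics.client-log-level
    gluster volume reset slave diagnostics.brick-log-level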

Comment 8 Raghavendra G 2018-06-25 03:20:03 UTC

Is it possible to provide the debug data asked for in the previous comment?


Comment 16 Raghavendra G 2018-12-07 02:43:21 UTC
Since it's a race and not much can be found from the sosreports, there is no method other than code analysis to debug this issue.

I need the following information when we hit this issue:
1. ls -l of the problematic directory on mount point
2. ls -l of the problematic directory on all bricks
3. all extended attributes of the problematic directory on all bricks
4. all extended attributes of any children of the problematic directory on all bricks

Since the automation run clears everything, there is no way to get this data after the fact. So it would be of great help if we could capture the above information through instrumentation, either in the automation framework or in gsyncd; a sketch of such a capture follows.
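A minimal capture sketch, assuming a hypothetical problematic directory 'level46' under a FUSE mount at /mnt/master and bricks matching /bricks/brick*/master_brick* (all placeholders); 'getfattr -d -m . -e hex' dumps all extended attributes:

    #!/bin/bash
    # Placeholders: adjust the mount point, brick glob, and directory name.
    DIR=level46
    MNT=/mnt/master

    # (1) listing of the problematic directory on the mount point
    ls -l "$MNT/$DIR"

    # (2)-(4) for each local brick: listing plus all extended attributes of
    # the directory and of each of its children
    for brick in /bricks/brick*/master_brick*; do
        echo "== $brick =="
        ls -l "$brick/$DIR"
        getfattr -d -m . -e hex "$brick/$DIR"
        getfattr -d -m . -e hex "$brick/$DIR"/* 2>/dev/null
    done

Run step (1) on a client with the volume mounted, and the loop on every brick server.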

Though I am planning to spend some cycles analysing the related DHT code (my hypothesis is that a deleted subdirectory is recreated due to a race and is then not visible in a readdir on the parent directory issued from the mount), I am not very hopeful that it will yield any positive results. We have recently fixed such races, and my previous attempts at finding loopholes in the synchronization algorithm did not yield any positive results.

Comment 17 Raghavendra G 2019-02-12 10:58:01 UTC

*** This bug has been marked as a duplicate of bug 1661258 ***

Comment 18 Raghavendra G 2019-02-12 10:58:41 UTC
Also see bz 1458215