Bug 1575553 - [geo-rep]: [Errno 39] Directory not empty
Summary: [geo-rep]: [Errno 39] Directory not empty
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: distribute
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Nithya Balachandran
QA Contact: Prasad Desala
URL:
Whiteboard:
Depends On: 1661258
Blocks: 1579615
 
Reported: 2018-05-07 09:44 UTC by Rochelle
Modified: 2019-06-21 02:41 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1579615 (view as bug list)
Environment:
Last Closed: 2019-04-03 04:40:33 UTC
Embargoed:


Attachments

Description Rochelle 2018-05-07 09:44:01 UTC
Description of problem:
=======================
Ran automated test cases with a 3x3 master volume and a 3x3 slave volume (rsync + FUSE).

The geo-rep status was stuck in History Crawl, with some workers in the 'Faulty' state:


MASTER NODE     MASTER VOL    MASTER BRICK                    SLAVE USER    SLAVE                        SLAVE NODE      STATUS    CRAWL STATUS     LAST_SYNCED
----------------------------------------------------------------------------------------------------------------------------------------------------------------
10.70.43.228    master        /bricks/brick0/master_brick0    root          ssh://10.70.41.226::slave    10.70.41.228    Active    History Crawl    2018-05-07 06:31:33
10.70.43.228    master        /bricks/brick1/master_brick6    root          ssh://10.70.41.226::slave    10.70.41.228    Active    History Crawl    2018-05-07 06:31:15          
10.70.41.229    master        /bricks/brick0/master_brick3    root          ssh://10.70.41.226::slave    10.70.41.228    Active    History Crawl    2018-05-07 06:31:30          
10.70.41.230    master        /bricks/brick0/master_brick4    root          ssh://10.70.41.226::slave    N/A             Faulty    N/A              N/A                          
10.70.41.219    master        /bricks/brick0/master_brick5    root          ssh://10.70.41.226::slave    N/A             Faulty    N/A              N/A                          
10.70.42.174    master        /bricks/brick0/master_brick2    root          ssh://10.70.41.226::slave    N/A             Faulty    N/A              N/A                          
10.70.42.174    master        /bricks/brick1/master_brick8    root          ssh://10.70.41.226::slave    N/A             Faulty    N/A              N/A                          
10.70.43.224    master        /bricks/brick0/master_brick1    root          ssh://10.70.41.226::slave    10.70.41.227    Active    History Crawl    2018-05-07 06:31:24          
10.70.43.224    master        /bricks/brick1/master_brick7    root          ssh://10.70.41.226::slave    N/A             Faulty    N/A              N/A                          
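
For reference, the status output above would come from an invocation along these lines (assuming the volume names shown in the table):

gluster volume geo-replication master 10.70.41.226::slave status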
[root@dhcp43-228 master]# gluster v info



The worker crashed with 'Directory not empty':

Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 210, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 802, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1676, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 597, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1470, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1370, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1204, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1114, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 228, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 210, in __call__
    raise res
OSError: [Errno 39] Directory not empty: '.gfid/b6c0b18a-8a5a-408b-88ec-a01fb88c8bfe/level46'
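
For context, errno 39 is ENOTEMPTY: the slave-side entry_ops hit it when an entry operation (an rmdir, or a rename over a directory) targeted a directory that still contained entries. A trivial local illustration, with hypothetical paths not taken from this bug:

mkdir -p /tmp/demo/level46
touch /tmp/demo/level46/child
rmdir /tmp/demo/level46    # rmdir: failed to remove '/tmp/demo/level46': Directory not empty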



Version-Release number of selected component (if applicable):
=============================================================
[root@dhcp43-228 master]# rpm -qa | grep gluster
glusterfs-server-3.12.2-8.el7rhgs.x86_64
glusterfs-api-3.12.2-8.el7rhgs.x86_64
glusterfs-rdma-3.12.2-8.el7rhgs.x86_64
glusterfs-cli-3.12.2-8.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-libs-3.12.2-8.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-8.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.2.x86_64
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch
glusterfs-events-3.12.2-8.el7rhgs.x86_64
glusterfs-3.12.2-8.el7rhgs.x86_64
glusterfs-fuse-3.12.2-8.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-8.el7rhgs.x86_64
python2-gluster-3.12.2-8.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64



How reproducible:
=================
1/1


Actual results:
==============
The worker crashed with 'Directory not empty' tracebacks, which flooded the logs.

Expected results:
================
There should be no crash.

Comment 7 Raghavendra G 2018-05-21 10:05:48 UTC
Can we try to reproduce this issue with the following two options set to the values specified? Since the issue is seen on the slave, please set these options on the slave volume:

* diagnostics.client-log-level to TRACE
* diagnostics.brick-log-level to TRACE

Please attach the brick and client logs to the bz.
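
Setting these on the slave volume could look like the following (a sketch only; the slave volume name 'slave' is taken from the status output above, adjust to your setup and run on a node of the slave cluster):

gluster volume set slave diagnostics.client-log-level TRACE
gluster volume set slave diagnostics.brick-log-level TRACE

# revert once the logs have been captured
gluster volume reset slave diagnostics.client-log-level
gluster volume reset slave diagnostics.brick-log-level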

Comment 8 Raghavendra G 2018-06-25 03:20:03 UTC
Rochelle,

Is it possible to provide the debug data requested in the previous comment?

regards,
Raghavendra

Comment 16 Raghavendra G 2018-12-07 02:43:21 UTC
Since it's a race and not much can be found from the sosreports, there is no way to debug this issue other than code analysis.

I need the following information when we hit this issue:
1. ls -l of the problematic directory on the mount point
2. ls -l of the problematic directory on all bricks
3. all extended attributes of the problematic directory on all bricks
4. all extended attributes of any children of the problematic directory on all bricks

Since the automation run cleans everything up, there is no way to get this data after the fact. So, it would be of great help if we could capture the above information through instrumentation, either in the automation framework or in gsyncd; a sketch of such a capture script follows.
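
A rough capture script along these lines could be hooked in (a sketch only; DIR, MOUNT, and BRICKS are hypothetical placeholders the framework would fill in, and getfattr must run as root to see the trusted.* xattrs):

#!/bin/bash
# Capture the state of the problematic directory for the ENOTEMPTY analysis.
DIR="level46"                        # placeholder: path of the problematic directory, relative to the volume root
MOUNT="/mnt/slave"                   # placeholder: FUSE mount of the affected volume
BRICKS="/bricks/brick0/slave_brick"  # placeholder: space-separated local brick roots on this node

ls -l "$MOUNT/$DIR"
for b in $BRICKS; do
    ls -l "$b/$DIR"
    # dump all extended attributes of the directory itself, in hex
    getfattr -d -m . -e hex "$b/$DIR"
    # and of each of its children
    for child in "$b/$DIR"/* "$b/$DIR"/.*; do
        case "${child##*/}" in .|..) continue ;; esac
        [ -e "$child" ] || continue
        getfattr -d -m . -e hex "$child"
    done
done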

Though I plan to spend some cycles analysing the related DHT code (my hypothesis is that a deleted subdirectory is recreated due to a race and is then not visible in a readdir on the parent directory issued from the mount), I am not very hopeful that it will yield anything. We have recently fixed such races, and my previous attempts at finding loopholes in the synchronization algorithm did not turn up anything either.

Comment 17 Raghavendra G 2019-02-12 10:58:01 UTC

*** This bug has been marked as a duplicate of bug 1661258 ***

Comment 18 Raghavendra G 2019-02-12 10:58:41 UTC
Also see bz 1458215

