Bug 1201732
| Summary: | [dist-geo-rep]: Directory not empty and Stale file handle errors in geo-rep logs during deletes from master in history/changelog crawl | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | shilpa <smanjara> |
| Component: | geo-replication | Assignee: | Aravinda VK <avishwan> |
| Status: | CLOSED ERRATA | QA Contact: | Rahul Hinduja <rhinduja> |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | | |
| Version: | rhgs-3.0 | CC: | aavati, annair, avishwan, csaba, nlevinki, rhinduja, vagarwal |
| Target Milestone: | --- | | |
| Target Release: | RHGS 3.1.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-3.7.0-2.el6rhs | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1211037 (view as bug list) | Environment: | |
| Last Closed: | 2015-07-29 04:39:08 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1202842, 1211037, 1218922, 1223636 | | |
Description
shilpa 2015-03-13 11:23:46 UTC
Also saw "OSError: [Errno 61] No data available" error for hardlinks: [2015-03-13 22:47:46.456286] E [syncdutils(/bricks/brick0/master_brick0):270:log_raise_exception] <top>: FAIL: Traceback (most recent call last): File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 164, in main main_i() File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 645, in main_i local.service_loop(*[r for r in [remote] if r]) File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1329, in service_loop g3.crawlwrap(oneshot=True) File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 553, in crawlwrap self.crawl(no_stime_update=no_stime_update) File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1334, in crawl self.process(changes) File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1017, in process self.process_change(change, done, retry) File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 981, in process_change self.slave.server.entry_ops(entries) File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__ return self.ins(self.meth, *a) File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__ raise res OSError: [Errno 61] No data available: '.gfid/b8c9ca6c-cb69-4930-b4d5-c50ca7710f66/hardlink_to_files/55030214%%8W5FSWYQIO' [2015-03-13 22:47:46.459382] I [syncdutils(/bricks/brick0/master_brick0):214:finalize] <top>: exiting. [2015-03-13 22:47:46.464014] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF. [2015-03-13 22:47:46.464807] I [syncdutils(agent):214:finalize] <top>: exiting. [2015-03-13 22:47:46.887911] I [monitor(monitor):280:monitor] Monitor: worker(/bricks/brick0/master_brick0) died in startup phase Hit the same issue with just changelog crawl. Georep session was never stopped. 
Executed the following from a FUSE client and an NFS client of the master volume:

FUSE:
=====
```
[root@wingo master]# for i in {1..10}; do cp -rf /etc etc.1 ; sleep 10 ; rm -rf etc.1 ; sleep 10 ; done
```

NFS:
====
```
[root@wingo master_nfs]# for i in {1..10}; do cp -rf /etc etc.2 ; sleep 10 ; rm -rf etc.2 ; sleep 10 ; done
```

Status moved from Active to faulty:
===================================
```
[root@georep1 ~]# gluster volume geo-replication vol0 10.70.46.100::vol1 status

MASTER NODE    MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                 STATUS     CHECKPOINT STATUS    CRAWL STATUS
----------------------------------------------------------------------------------------------------------------------------------
georep1        vol0          /rhs/brick1/b1    root          10.70.46.101::vol1    Active     N/A                  Changelog Crawl
georep1        vol0          /rhs/brick2/b2    root          10.70.46.100::vol1    Active     N/A                  Changelog Crawl
georep3        vol0          /rhs/brick1/b1    root          10.70.46.101::vol1    Passive    N/A                  N/A
georep3        vol0          /rhs/brick2/b2    root          10.70.46.100::vol1    Passive    N/A                  N/A
georep2        vol0          /rhs/brick1/b1    root          10.70.46.100::vol1    Passive    N/A                  N/A
georep2        vol0          /rhs/brick2/b2    root          10.70.46.101::vol1    Passive    N/A                  N/A

[root@georep1 ~]# gluster volume geo-replication vol0 10.70.46.100::vol1 status

MASTER NODE    MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                 STATUS     CHECKPOINT STATUS    CRAWL STATUS
-------------------------------------------------------------------------------------------------------------------------------
georep1        vol0          /rhs/brick1/b1    root          10.70.46.101::vol1    faulty     N/A                  N/A
georep1        vol0          /rhs/brick2/b2    root          10.70.46.100::vol1    faulty     N/A                  N/A
georep3        vol0          /rhs/brick1/b1    root          10.70.46.101::vol1    Passive    N/A                  N/A
georep3        vol0          /rhs/brick2/b2    root          10.70.46.100::vol1    Passive    N/A                  N/A
georep2        vol0          /rhs/brick1/b1    root          10.70.46.100::vol1    Passive    N/A                  N/A
georep2        vol0          /rhs/brick2/b2    root          10.70.46.101::vol1    Passive    N/A                  N/A
```

From faulty it went back to Active and the crawl changed to History Crawl:
==========================================================================
```
[root@georep1 ~]# gluster volume geo-replication vol0 10.70.46.100::vol1 status

MASTER NODE    MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                 STATUS     CHECKPOINT STATUS    CRAWL STATUS
--------------------------------------------------------------------------------------------------------------------------------
georep1        vol0          /rhs/brick1/b1    root          10.70.46.101::vol1    Active     N/A                  History Crawl
georep1        vol0          /rhs/brick2/b2    root          10.70.46.100::vol1    Active     N/A                  History Crawl
georep3        vol0          /rhs/brick1/b1    root          10.70.46.101::vol1    Passive    N/A                  N/A
georep3        vol0          /rhs/brick2/b2    root          10.70.46.100::vol1    Passive    N/A                  N/A
georep2        vol0          /rhs/brick1/b1    root          10.70.46.100::vol1    Passive    N/A                  N/A
georep2        vol0          /rhs/brick2/b2    root          10.70.46.101::vol1    Passive    N/A                  N/A
```

Log Snippet:
============
```
[2015-03-16 18:53:29.265461] I [syncdutils(/rhs/brick1/b1):214:finalize] <top>: exiting.
[2015-03-16 18:53:29.264503] E [syncdutils(/rhs/brick2/b2):270:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 164, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 645, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1329, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 553, in crawlwrap
    self.crawl(no_stime_update=no_stime_update)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1334, in crawl
    self.process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1017, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 981, in process_change
    self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 39] Directory not empty: '.gfid/70b3b3b8-3e8d-4f32-9123-fab73574ce91/yum'
[2015-03-16 18:53:29.266381] I [syncdutils(/rhs/brick2/b2):214:finalize] <top>: exiting.
[2015-03-16 18:53:29.268383] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-03-16 18:53:29.268708] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-03-16 18:53:29.269619] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-03-16 18:53:29.270034] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-03-16 18:53:29.685185] I [monitor(monitor):280:monitor] Monitor: worker(/rhs/brick1/b1) died in startup phase
[2015-03-16 18:53:30.220942] I [monitor(monitor):280:monitor] Monitor: worker(/rhs/brick2/b2) died in startup phase
```
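During deletes, ENOTEMPTY (errno 39) and ESTALE (errno 116) on the slave are plausibly races between the two Active workers: on a distributed master, the children of a directory are spread across bricks, so one worker can replay the parent's RMDIR while the other has not yet replayed the children's UNLINKs. Killing the worker on such an error only forces a restart and a re-crawl. Below is a hedged sketch of the kind of errno filtering that would avoid the crash; it is illustrative only, not the actual fix (see the upstream patch referenced later), and `apply_entry_op` is a hypothetical helper, not the real gsyncd entry_ops signature:

```python
import errno
import os

# Errnos a worker could treat as benign while replaying entry
# operations (deletes racing across bricks, entries already gone on the
# slave). Illustrative set, not the exact list used by the real fix.
TOLERABLE = (errno.ENOTEMPTY, errno.ESTALE, errno.ENODATA, errno.ENOENT)


def apply_entry_op(op, *args):
    """Hypothetical helper: run one replayed entry operation and
    swallow benign failures instead of killing the worker."""
    try:
        op(*args)
    except OSError as e:
        if e.errno not in TOLERABLE:
            raise
        # Log and continue; a later changelog batch or the next crawl
        # will reconcile this entry.
        print("skipping benign failure: %s" % e)


# Demo: an RMDIR racing ahead of its children's UNLINKs.
os.makedirs("/tmp/georep_demo/child", exist_ok=True)
apply_entry_op(os.rmdir, "/tmp/georep_demo")  # ENOTEMPTY -> logged, not fatal
```

With this kind of tolerance in place the worker survives the replay instead of cycling through faulty and History Crawl, which matches the behaviour reported during verification below.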
Output of history log:
======================
```
[root@georep1 ssh%3A%2F%2Froot%4010.70.46.100%3Agluster%3A%2F%2F127.0.0.1%3Avol1]# find .
.
./c19b89ac45352ab8c894d210d136dd56
./c19b89ac45352ab8c894d210d136dd56/xsync
./c19b89ac45352ab8c894d210d136dd56/.history
./c19b89ac45352ab8c894d210d136dd56/.history/tracker
./c19b89ac45352ab8c894d210d136dd56/.history/.current
./c19b89ac45352ab8c894d210d136dd56/.history/.processed
./c19b89ac45352ab8c894d210d136dd56/.history/.processing
./c19b89ac45352ab8c894d210d136dd56/.history/.processing/CHANGELOG.1426512222
./c19b89ac45352ab8c894d210d136dd56/.history/.processing/CHANGELOG.1426512176
./c19b89ac45352ab8c894d210d136dd56/.history/.processing/CHANGELOG.1426512161
./c19b89ac45352ab8c894d210d136dd56/.history/.processing/CHANGELOG.1426512131
./c19b89ac45352ab8c894d210d136dd56/.history/.processing/CHANGELOG.1426512191
./c19b89ac45352ab8c894d210d136dd56/.history/.processing/CHANGELOG.1426512207
./c19b89ac45352ab8c894d210d136dd56/.history/.processing/CHANGELOG.1426512146
./c19b89ac45352ab8c894d210d136dd56/.history/.processing/CHANGELOG.1426512116
./c19b89ac45352ab8c894d210d136dd56/.history/.processing/CHANGELOG.1426512101
./c19b89ac45352ab8c894d210d136dd56/tracker
./c19b89ac45352ab8c894d210d136dd56/.current
./c19b89ac45352ab8c894d210d136dd56/.processed
./c19b89ac45352ab8c894d210d136dd56/.processed/archive_201503.tar
./c19b89ac45352ab8c894d210d136dd56/.processing
./c19b89ac45352ab8c894d210d136dd56/.processing/CHANGELOG.1426512372
./c19b89ac45352ab8c894d210d136dd56/.processing/CHANGELOG.1426512342
./c19b89ac45352ab8c894d210d136dd56/.processing/CHANGELOG.1426512282
./c19b89ac45352ab8c894d210d136dd56/.processing/CHANGELOG.1426512357
./c19b89ac45352ab8c894d210d136dd56/.processing/CHANGELOG.1426512327
./c19b89ac45352ab8c894d210d136dd56/.processing/CHANGELOG.1426512267
./c19b89ac45352ab8c894d210d136dd56/.processing/CHANGELOG.1426512387
./c19b89ac45352ab8c894d210d136dd56/.processing/CHANGELOG.1426512312
./c19b89ac45352ab8c894d210d136dd56/.processing/CHANGELOG.1426512237
./c19b89ac45352ab8c894d210d136dd56/.processing/CHANGELOG.1426512252
./c19b89ac45352ab8c894d210d136dd56/.processing/CHANGELOG.1426512297
./764586b145d7206a154a778f64bd2f50
./764586b145d7206a154a778f64bd2f50/xsync
./764586b145d7206a154a778f64bd2f50/.history
./764586b145d7206a154a778f64bd2f50/.history/tracker
./764586b145d7206a154a778f64bd2f50/.history/.current
./764586b145d7206a154a778f64bd2f50/.history/.processed
./764586b145d7206a154a778f64bd2f50/.history/.processing
./764586b145d7206a154a778f64bd2f50/.history/.processing/CHANGELOG.1426512222
./764586b145d7206a154a778f64bd2f50/.history/.processing/CHANGELOG.1426512176
./764586b145d7206a154a778f64bd2f50/.history/.processing/CHANGELOG.1426512161
./764586b145d7206a154a778f64bd2f50/.history/.processing/CHANGELOG.1426512131
./764586b145d7206a154a778f64bd2f50/.history/.processing/CHANGELOG.1426512207
./764586b145d7206a154a778f64bd2f50/.history/.processing/CHANGELOG.1426512146
./764586b145d7206a154a778f64bd2f50/.history/.processing/CHANGELOG.1426512192
./764586b145d7206a154a778f64bd2f50/.history/.processing/CHANGELOG.1426512116
./764586b145d7206a154a778f64bd2f50/.history/.processing/CHANGELOG.1426512101
./764586b145d7206a154a778f64bd2f50/tracker
./764586b145d7206a154a778f64bd2f50/.current
./764586b145d7206a154a778f64bd2f50/.processed
./764586b145d7206a154a778f64bd2f50/.processed/archive_201503.tar
./764586b145d7206a154a778f64bd2f50/.processing
./764586b145d7206a154a778f64bd2f50/.processing/CHANGELOG.1426512372
./764586b145d7206a154a778f64bd2f50/.processing/CHANGELOG.1426512342
./764586b145d7206a154a778f64bd2f50/.processing/CHANGELOG.1426512282
./764586b145d7206a154a778f64bd2f50/.processing/CHANGELOG.1426512357
./764586b145d7206a154a778f64bd2f50/.processing/CHANGELOG.1426512327
./764586b145d7206a154a778f64bd2f50/.processing/CHANGELOG.1426512267
./764586b145d7206a154a778f64bd2f50/.processing/CHANGELOG.1426512387
./764586b145d7206a154a778f64bd2f50/.processing/CHANGELOG.1426512312
./764586b145d7206a154a778f64bd2f50/.processing/CHANGELOG.1426512237
./764586b145d7206a154a778f64bd2f50/.processing/CHANGELOG.1426512252
./764586b145d7206a154a778f64bd2f50/.processing/CHANGELOG.1426512297
[root@georep1 ssh%3A%2F%2Froot%4010.70.46.100%3Agluster%3A%2F%2F127.0.0.1%3Avol1]#
```

Upstream patch sent for review: http://review.gluster.org/#/c/10204/

Verified with build: glusterfs-3.7.1-9.el6rhs.x86_64

The worker no longer crashes with "Directory not empty" or "ESTALE". However, many "directory not empty" messages are still logged during recursive removes, and the affected directories are eventually not removed from the slave; that issue is tracked via https://bugzilla.redhat.com/show_bug.cgi?id=1235633#c5. Moving this bug to the verified state.

```
[root@georep1 ~]# grep -i "OSError" /var/log/glusterfs/geo-replication/master/*
[root@georep1 ~]#

[root@georep3 ~]# grep -i "OSError" /var/log/glusterfs/geo-replication-slaves/1712d430-4ccd-4560-9f7c-d826537a9600\:gluster%3A%2F%2F127.0.0.1%3Aslave.*
[root@georep3 ~]#
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1495.html