Bug 1236093 - [geo-rep]: worker died with "ESTALE" when rm -rf was performed on a directory from a mount of the master volume
Summary: [geo-rep]: worker died with "ESTALE" when rm -rf was performed on a directory from a mount of the master volume
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: geo-replication
Version: 3.7.1
Hardware: x86_64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: ---
Assignee: Kotresh HR
QA Contact:
URL:
Whiteboard:
Depends On: 1222856 1223286 1232912
Blocks: 1202842 1223636
 
Reported: 2015-06-26 14:07 UTC by Kotresh HR
Modified: 2015-07-30 09:48 UTC
CC: 11 users

Fixed In Version: glusterfs-3.7.3
Doc Type: Bug Fix
Doc Text:
Clone Of: 1232912
Environment:
Last Closed: 2015-07-30 09:48:56 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Comment 1 Kotresh HR 2015-06-26 14:08:55 UTC
Description of problem:
=======================

Whenever rm -rf was performed on the master volume, the worker died with the following backtrace:


[2015-05-19 15:33:13.868683] E [syncdutils(/rhs/brick2/b2):276:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 165, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 659, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1440, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 580, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1150, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1059, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 946, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 902, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 116] Stale file handle
[2015-05-19 15:33:13.870326] I [syncdutils(/rhs/brick2/b2):220:finalize] <top>: exiting.
[2015-05-19 15:33:13.874784] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.

And every time the monitor tries to respawn the worker, it dies again in the startup phase.
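
The restart loop is worth spelling out: after each respawn the worker replays the same changelog batch, so the same ESTALE recurs immediately. A minimal sketch of that cycle, with hypothetical names (the real pieces live in the syncdaemon's monitor.py and master.py):

    import errno

    # Hypothetical stand-in for the slave-side entry operations (os.link,
    # os.chmod, ...) failing against an entry DHT can no longer resolve.
    def entry_ops(batch):
        raise OSError(errno.ESTALE, "Stale file handle")

    def worker(pending_batches):
        for batch in pending_batches:
            entry_ops(batch)  # no ESTALE handling: the OSError escapes

    def monitor(pending_batches, max_respawns=3):
        for _ in range(max_respawns):
            try:
                worker(pending_batches)
                return "Stable"
            except OSError:
                continue  # respawn; the same batch replays and fails again
        return "Faulty"

    print(monitor([["CREATE", "UNLINK"]]))  # -> Faulty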


Version-Release number of selected component (if applicable):
=============================================================



How reproducible:
================

Tried a couple of times and reproduced it every time.


Steps Carried:
==============
1. Created master cluster 
2. Created and started master volume
3. Created shared volume (gluster_shared_storage)
4. Mounted the shared volume on /var/run/gluster/shared_storage
5. Created Slave cluster
6. Created and started slave volume
7. Created geo-rep session between master and slave
8. Configured use_meta_volume true
9. Started geo-rep
10. Mounted the master volume on a client over FUSE and NFS
11. Copied /etc as etc.{1..10} from the FUSE mount
12. Copied /etc as etc.{11..20} from the NFS mount
13. Sync completed successfully
14. Removed the directories etc.2 from the FUSE mount and etc.12 from the NFS mount
15. Checked the geo-rep session; its status was Faulty
16. Checked the logs; they showed the traceback repeating continuously
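
Condensed, the reproduction amounts to roughly the following; the host names, volume names, and mount points below are invented, and the gluster invocations are a best-effort sketch of the steps above rather than a verified transcript:

    import subprocess

    def sh(cmd):
        subprocess.check_call(cmd, shell=True)

    # Steps 7-9: create, configure, and start the geo-rep session
    sh("gluster volume geo-replication mastervol slavehost::slavevol create push-pem")
    sh("gluster volume geo-replication mastervol slavehost::slavevol config use_meta_volume true")
    sh("gluster volume geo-replication mastervol slavehost::slavevol start")
    # Steps 10-12: mount the master volume twice and copy /etc from each mount
    sh("mount -t glusterfs masterhost:/mastervol /mnt/fuse")
    sh("mount -t nfs -o vers=3 masterhost:/mastervol /mnt/nfs")
    for i in range(1, 11):
        sh("cp -rp /etc /mnt/fuse/etc.%d" % i)
    for i in range(11, 21):
        sh("cp -rp /etc /mnt/nfs/etc.%d" % i)
    # Step 14: the removals that trigger the ESTALE traceback
    sh("rm -rf /mnt/fuse/etc.2 /mnt/nfs/etc.12")
    # Step 15: the session now shows up as Faulty
    sh("gluster volume geo-replication mastervol slavehost::slavevol status")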

Actual results:
===============

The worker crashes, comes back with crawl type History, and then crashes again.


Expected results:
=================

The worker should not crash; it should handle ESTALE gracefully.

Comment 2 Anand Avati 2015-06-26 14:10:51 UTC
REVIEW: http://review.gluster.org/11430 (geo-rep: ignore ESTALE as ENOENT) posted (#1) for review on release-3.7 by Kotresh HR (khiremat)

Comment 3 Anand Avati 2015-06-26 14:13:06 UTC
REVIEW: http://review.gluster.org/11430 (geo-rep: ignore ESTALE as ENOENT) posted (#2) for review on release-3.7 by Kotresh HR (khiremat)

Comment 4 Anand Avati 2015-06-28 14:47:46 UTC
COMMIT: http://review.gluster.org/11430 committed in release-3.7 by Vijay Bellur (vbellur) 
------
commit 088711acdaaf8935718171bd79ae053ae8fc3d75
Author: Aravinda VK <avishwan>
Date:   Wed Jun 17 15:46:01 2015 -0400

    geo-rep: ignore ESTALE as ENOENT
    
    When DHT can't resolve a file, it raises ESTALE. Ignore ESTALE
    errors the same as ENOENT after a retry.
    
    Affected places:
        Xattr.lgetxattr
        os.listdir
        os.link
        Xattr.lsetxattr
        os.chmod
        os.chown
        os.utime
        os.readlink
    
    BUG: 1236093
    Change-Id: I53f8dfa47911da93e0dcc20213afcbb47a14ccd8
    Reviewed-On: http://review.gluster.org/11296
    Original-Author: Aravinda VK <avishwan>
    Signed-off-by: Kotresh HR <khiremat>
    Reviewed-on: http://review.gluster.org/11430
    Tested-by: NetBSD Build System <jenkins.org>
    Reviewed-by: Milind Changire <mchangir>
    Reviewed-by: Vijay Bellur <vbellur>
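
For illustration, a minimal sketch of the retry-then-ignore idea the commit describes; ignore_estale and its signature are invented here for clarity, not the helper names the actual patch uses:

    import errno
    import os
    import time

    # Hypothetical helper: retry once on ESTALE (a fresh lookup may let DHT
    # resolve the entry), then treat a persistent ESTALE like ENOENT, i.e.
    # the entry is gone and the operation is skipped instead of fatal.
    def ignore_estale(call, *args, retries=1, delay=1.0):
        while True:
            try:
                return call(*args)
            except OSError as e:
                if e.errno == errno.ENOENT:
                    return None  # entry already gone: ignore
                if e.errno == errno.ESTALE and retries > 0:
                    retries -= 1
                    time.sleep(delay)
                    continue  # retry the call once
                if e.errno == errno.ESTALE:
                    return None  # still stale: ignore as ENOENT
                raise  # anything else remains fatal

    # Applied to the affected places listed above, a call site would look like:
    print(ignore_estale(os.listdir, "/no/such/dir"))  # -> None, not a crash

The actual patch applies the equivalent treatment inside the syncdaemon's existing wrappers around the calls listed in the commit message.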

Comment 5 Kaushal 2015-07-30 09:48:56 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.3, please open a new bug report.

glusterfs-3.7.3 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/12078
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user

