Bug 1222856
| Summary: | [geo-rep]: worker died with "ESTALE" when performed rm -rf on a directory from mount of master volume | |||
|---|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Rahul Hinduja <rhinduja> | |
| Component: | geo-replication | Assignee: | Aravinda VK <avishwan> | |
| Status: | CLOSED ERRATA | QA Contact: | Rahul Hinduja <rhinduja> | |
| Severity: | urgent | Docs Contact: | ||
| Priority: | high | |||
| Version: | rhgs-3.1 | CC: | aavati, annair, asrivast, avishwan, bmohanra, csaba, khiremat, nlevinki, nsathyan, vagarwal | |
| Target Milestone: | --- | |||
| Target Release: | RHGS 3.1.0 | |||
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | glusterfs-3.7.1-6 | Doc Type: | Bug Fix | |
| Doc Text: | Previously, when DHT could not resolve a GFID or path, it could raise an ESTALE error in addition to the usual ENOENT error. Because the ESTALE exception was unhandled, the geo-replication worker crashed and tracebacks were printed in the log files. With this release, ESTALE errors are handled by the geo-replication worker in the same way as ENOENT errors, so the worker no longer crashes (a minimal sketch of this handling follows the table). | Story Points: | --- | |
| Clone Of: | ||||
| : | 1223280 1232912 (view as bug list) | Environment: | ||
| Last Closed: | 2015-07-29 04:43:38 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | 1223286 | |||
| Bug Blocks: | 1202842, 1223636, 1232912, 1236093 | |||
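
The fix described in the Doc Text amounts to treating ESTALE the same way as ENOENT when the worker applies entry operations that may race with an `rm -rf` on the master. The snippet below is a minimal, self-contained sketch of that idea in plain Python; the helper name `apply_entry_op` and its arguments are hypothetical and are not taken from the gsyncd source.

```python
import errno
import os


def apply_entry_op(op, *args):
    """Run a single filesystem entry operation, tolerating races.

    Hypothetical helper: if the entry was already removed by a
    concurrent rm -rf on the master, the underlying call may fail
    with ENOENT or, when DHT cannot resolve the GFID/path, with
    ESTALE. Both are treated as "nothing left to do" instead of
    letting the exception escape and kill the worker.
    """
    try:
        return op(*args)
    except OSError as e:
        if e.errno in (errno.ENOENT, errno.ESTALE):
            return None  # entry already gone; safe to ignore
        raise  # any other error is still propagated


# Example: unlinking a path that a parallel rm -rf may have removed.
apply_entry_op(os.unlink, "/tmp/possibly-removed-file")
```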
Patches:

- master: http://review.gluster.org/#/c/10837/
- release-3.7: http://review.gluster.org/10913
- downstream: https://code.engineering.redhat.com/gerrit/#/c/49674/

I still see the issue with build glusterfs-3.7.1-1. Moving bug back to assigned state.

```
[root@georep1 scripts]# rpm -qa | grep gluster
glusterfs-client-xlators-3.7.1-1.el6rhs.x86_64
glusterfs-server-3.7.1-1.el6rhs.x86_64
glusterfs-3.7.1-1.el6rhs.x86_64
glusterfs-api-3.7.1-1.el6rhs.x86_64
glusterfs-cli-3.7.1-1.el6rhs.x86_64
glusterfs-geo-replication-3.7.1-1.el6rhs.x86_64
glusterfs-libs-3.7.1-1.el6rhs.x86_64
glusterfs-fuse-3.7.1-1.el6rhs.x86_64
glusterfs-debuginfo-3.7.1-1.el6rhs.x86_64

[root@georep1 scripts]# cat /var/log/glusterfs/geo-replication/master/ssh%3A%2F%2Froot%4010.70.46.154%3Agluster%3A%2F%2F127.0.0.1%3Aslave.log | grep "OSError"
[2015-06-11 22:34:23.111248] E [repce(/rhs/brick2/b2):207:__call__] RepceClient: call 20852:140282122651392:1434042220.8 (entry_ops) failed on peer with OSError
[2015-06-11 22:34:46.175925] E [repce(/rhs/brick2/b2):207:__call__] RepceClient: call 21689:140594955093760:1434042280.85 (entry_ops) failed on peer with OSError
OSError: [Errno 116] Stale file handle
[2015-06-11 22:35:08.149015] E [repce(/rhs/brick2/b2):207:__call__] RepceClient: call 21766:140460004030208:1434042303.43 (entry_ops) failed on peer with OSError
OSError: [Errno 116] Stale file handle
[root@georep1 scripts]#
```

Upstream Patch (Master): http://review.gluster.org/#/c/11296/
Upstream Patch (3.7): http://review.gluster.org/#/c/11430/
Downstream Patch: https://code.engineering.redhat.com/gerrit/#/c/51709/

Hi Aravinda,

The doc text is updated. Please review it and share your technical review comments. If it looks OK, then sign off on it.

Regards,
Bhavana

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1495.html
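
The verification above greps the master-side geo-replication log for OSError entries. A hedged equivalent in Python is sketched below; the log path comes from the comment above, and the helper name `count_estale_errors` is hypothetical rather than part of any gluster tooling.

```python
import re
import sys


def count_estale_errors(log_path):
    """Count log lines that report a stale file handle (errno 116).

    Hypothetical helper mirroring the `grep "OSError"` check above:
    it scans a geo-replication log and reports how many entries show
    the ESTALE failure that used to crash the worker.
    """
    pattern = re.compile(r"OSError: \[Errno 116\] Stale file handle")
    count = 0
    with open(log_path, errors="replace") as f:
        for line in f:
            if pattern.search(line):
                count += 1
    return count


if __name__ == "__main__":
    # Usage: python check_estale.py <geo-rep master log file>
    print(count_estale_errors(sys.argv[1]))
```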
Description of problem:
=======================
Whenever rm -rf was performed on the master volume, the worker died with the following backtrace:

```
[2015-05-19 15:33:13.868683] E [syncdutils(/rhs/brick2/b2):276:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 165, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 659, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1440, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 580, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1150, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1059, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 946, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 902, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 116] Stale file handle
[2015-05-19 15:33:13.870326] I [syncdutils(/rhs/brick2/b2):220:finalize] <top>: exiting.
[2015-05-19 15:33:13.874784] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
```

Every time the monitor tries to spawn the worker process again, it dies in the startup phase.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.7.0-2.el6rhs.x86_64

How reproducible:
=================
Tried a couple of times and reproduced it every time.

Steps Carried:
==============
1. Created master cluster
2. Created and started master volume
3. Created shared volume (gluster_shared_storage)
4. Mounted the shared volume on /var/run/gluster/shared_storage
5. Created slave cluster
6. Created and started slave volume
7. Created geo-rep session between master and slave
8. Configured use_meta_volume true
9. Started geo-rep
10. Mounted master volume over FUSE and NFS on the client
11. Copied files /etc{1..10} from the FUSE mount
12. Copied files /etc{11..20} from the NFS mount
13. Sync completed successfully
14. Removed the files etc.2 from FUSE and etc.12 from NFS
15. Looked into the geo-rep session; it was Faulty
16. Looked into the logs; they showed a continuous traceback

Actual results:
===============
The worker crashed and comes back with crawl type as history.

Expected results:
=================
The worker should not crash; it should handle ESTALE gracefully.
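
The traceback above ends in `raise res` inside repce.py: the slave side returns the exception it hit, and the client proxy re-raises it on the master, so a slave-side ESTALE from entry_ops surfaces as an uncaught OSError in the worker's crawl loop. The sketch below is a simplified, self-contained illustration of that propagation pattern, not the actual repce implementation; the class and function names are hypothetical.

```python
import errno


class RemoteProxy:
    """Toy stand-in for an RPC proxy in the style of RepceClient.

    The remote side returns either a result or the exception it hit;
    the local side re-raises the exception, which is why a slave-side
    OSError shows up in the master worker's traceback.
    """

    def __init__(self, func):
        self.func = func

    def call(self, *args):
        res = self._remote(*args)
        if isinstance(res, Exception):
            raise res          # mirrors the `raise res` seen in repce.py
        return res

    def _remote(self, *args):
        # Pretend this runs on the slave and ships the outcome back.
        try:
            return self.func(*args)
        except Exception as e:
            return e


def entry_ops(path):
    # Simulate DHT failing to resolve a GFID/path during rm -rf.
    raise OSError(errno.ESTALE, "Stale file handle", path)


proxy = RemoteProxy(entry_ops)
try:
    proxy.call("/rhs/brick2/b2/some/dir")
except OSError as e:
    # Without ESTALE handling in the caller, this exception would
    # escape the crawl loop and the worker process would exit.
    print("worker would have died with:", e)
```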