Bug 1249547 - [geo-rep]: rename followed by deletes causes ESTALE
Status: CLOSED CURRENTRELEASE
Product: GlusterFS
Classification: Community
Component: geo-replication
Version: 3.7.0
Hardware: x86_64 Linux
Priority: unspecified Severity: high
Assigned To: Kotresh HR
Keywords: ZStream
Depends On: 1239075 1247529
Blocks:
Reported: 2015-08-03 05:44 EDT by Kotresh HR
Modified: 2015-09-09 05:38 EDT
CC List: 11 users

See Also:
Fixed In Version: glusterfs-3.7.4
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1247529
Environment:
Last Closed: 2015-09-09 05:38:54 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Kotresh HR 2015-08-03 05:44:02 EDT
+++ This bug was initially created as a clone of Bug #1247529 +++

+++ This bug was initially created as a clone of Bug #1239075 +++

Description of problem:
=======================
Ran the tests which perform the following FOPs in order:

Create, chmod, chown, chgrp, symlink, hardlink, truncate, rename, remove. 

The above FOPs complete successfully and are synced to the slave, but the Master and Slave logs show the following errors:

Master:
=======

[2015-07-03 13:36:43.154763] E [syncdutils(/bricks/brick0/master_brick0):276:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 165, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 659, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1438, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 580, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1161, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1070, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 948, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 903, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 116] Stale file handle: '.gfid/fece7967-616b-4d13-add7-96f6a4022e11/55958721%%BO54CXD7RN'
[2015-07-03 13:36:43.156742] I [syncdutils(/bricks/brick0/master_brick0):220:finalize] <top>: exiting.
[2015-07-03 13:36:43.159702] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.



Slave:
======

[2015-07-03 13:36:38.359909] I [resource(slave):844:service_loop] GLUSTER: slave listening
[2015-07-03 13:36:43.149735] E [repce(slave):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 717, in entry_ops
    st = lstat(entry)
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 493, in lstat
    return os.lstat(e)
OSError: [Errno 116] Stale file handle: '.gfid/fece7967-616b-4d13-add7-96f6a4022e11/55958721%%BO54CXD7RN'
[2015-07-03 13:36:43.158221] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-07-03 13:36:43.158576] I [syncdutils(slave):220:finalize] <top>: exiting.
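
The traceback above suggests the lstat() wrapper in syncdutils.py does not treat ESTALE as a tolerable error, so errno 116 propagates out of entry_ops() and kills the worker. A minimal sketch of that failure pattern (simplified and assumed, not the verbatim source):

    import os
    import errno

    def lstat(e):
        """Simplified sketch of the slave-side lstat wrapper."""
        try:
            return os.lstat(e)
        except OSError as ex:
            if ex.errno == errno.ENOENT:
                # A missing entry is a normal condition; hand back the errno.
                return ex.errno
            # Anything else, including ESTALE (116) on a stale gfid path
            # after a rename-plus-delete race, escapes here and crashes
            # the geo-rep worker, as seen in the traceback above.
            raise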



Version-Release number of selected component (if applicable):
=============================================================


How reproducible:
=================
2/2

Steps to Reproduce:
===================
1. Create geo-rep session between Master (3x2) and Slave (3x2)
2. Run the following FOPs in sequential order and check the arequal after each FOP:

2015-07-03 13:03:31,870 INFO run Executing crefi --multi -n 5 -b 5 -d 5 --max=10k --min=5k --random -T 5 -t text --fop=create /mnt/glusterfs 1>/dev/null 2>&1 on wingo.lab.eng.blr.redhat.com
2015-07-03 13:05:53,581 INFO run Executing crefi --multi -n 5 -b 5 -d 5 --max=10k --min=5k --random -T 5 -t text --fop=chmod /mnt/glusterfs 1>/dev/null 2>&1 on wingo.lab.eng.blr.redhat.com
2015-07-03 13:08:17,690 INFO run Executing crefi --multi -n 5 -b 5 -d 5 --max=10k --min=5k --random -T 5 -t text --fop=chown /mnt/glusterfs 1>/dev/null 2>&1 on wingo.lab.eng.blr.redhat.com
2015-07-03 13:10:41,876 INFO run Executing crefi --multi -n 5 -b 5 -d 5 --max=10k --min=5k --random -T 5 -t text --fop=chgrp /mnt/glusterfs 1>/dev/null 2>&1 on wingo.lab.eng.blr.redhat.com
2015-07-03 13:13:06,050 INFO run Executing crefi --multi -n 5 -b 5 -d 5 --max=10k --min=5k --random -T 5 -t text --fop=symlink /mnt/glusterfs 1>/dev/null 2>&1 on wingo.lab.eng.blr.redhat.com
2015-07-03 13:15:37,194 INFO run Executing crefi --multi -n 5 -b 5 -d 5 --max=10k --min=5k --random -T 5 -t text --fop=hardlink /mnt/glusterfs 1>/dev/null 2>&1 on wingo.lab.eng.blr.redhat.com
2015-07-03 13:18:16,751 INFO run Executing crefi --multi -n 5 -b 5 -d 5 --max=10k --min=5k --random -T 5 -t text --fop=truncate /mnt/glusterfs 1>/dev/null 2>&1 on wingo.lab.eng.blr.redhat.com
2015-07-03 13:21:06,530 INFO run Executing crefi --multi -n 5 -b 5 -d 5 --max=10k --min=5k --random -T 5 -t text --fop=rename /mnt/glusterfs 1>/dev/null 2>&1 on wingo.lab.eng.blr.redhat.com
Comment 1 Anand Avati 2015-08-06 02:17:28 EDT
COMMIT: http://review.gluster.org/11820 committed in release-3.7 by Venky Shankar (vshankar@redhat.com) 
------
commit 2fd4ae628c7940ef5b505666afbf5eae0cc655b9
Author: Kotresh HR <khiremat@redhat.com>
Date:   Tue Jul 28 14:37:47 2015 +0530

    geo-rep: Do not crash worker on ESTALE
    
    Handle ESTALE returned by lstat gracefully
    by retrying it. Do not crash the worker.
    
    BUG: 1249547
    Change-Id: I57fb9933900153ab41c3d9b73748b1cdaa8d89ca
    Reviewed-on: http://review.gluster.org/11772
    Tested-by: Gluster Build System <jenkins@build.gluster.com>
    Tested-by: NetBSD Build System <jenkins@build.gluster.org>
    Reviewed-by: Aravinda VK <avishwan@redhat.com>
    Reviewed-by: Venky Shankar <vshankar@redhat.com>
    Signed-off-by: Kotresh HR <khiremat@redhat.com>
    Reviewed-on: http://review.gluster.org/11820
    Reviewed-by: Milind Changire <mchangir@redhat.com>
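
For illustration, the retry approach the commit message describes could look like the sketch below. The helper name, retry count, and delay are assumptions made for this sketch, not the exact patch:

    import os
    import errno
    import time

    def lstat_with_estale_retry(path, retries=5, delay=1):
        """Hypothetical helper: retry lstat() when the file handle is
        stale (ESTALE) instead of letting the exception crash the
        geo-rep worker; a missing entry (ENOENT) is still returned as
        an errno so callers can treat it as a normal condition."""
        for attempt in range(retries):
            try:
                return os.lstat(path)
            except OSError as ex:
                if ex.errno == errno.ENOENT:
                    return ex.errno
                if ex.errno == errno.ESTALE and attempt < retries - 1:
                    time.sleep(delay)  # let the rename/unlink race settle
                    continue
                raise

With a bounded retry like this, a transient ESTALE hit during the rename-then-delete window is absorbed, and only a persistent error is surfaced to the caller.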
Comment 2 Kaushal 2015-09-09 05:38:54 EDT
This bug is being closed because a release that should address the reported issue has been made available. If the problem is still not fixed with glusterfs-3.7.4, please open a new bug report.

glusterfs-3.7.4 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and on the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/12496
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user
