Bug 1249547 - [geo-rep]: rename followed by deletes causes ESTALE
Status: CLOSED CURRENTRELEASE
Product: GlusterFS
Classification: Community
Component: geo-replication
Version: 3.7.0
Hardware: x86_64 Linux
Priority: unspecified Severity: high
Assigned To: Kotresh HR
Keywords: ZStream
Depends On: 1239075 1247529
Blocks:
Reported: 2015-08-03 05:44 EDT by Kotresh HR
Modified: 2015-09-09 05:38 EDT
CC List: 11 users

See Also:
Fixed In Version: glusterfs-3.7.4
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1247529
Environment:
Last Closed: 2015-09-09 05:38:54 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Kotresh HR 2015-08-03 05:44:02 EDT
+++ This bug was initially created as a clone of Bug #1247529 +++

+++ This bug was initially created as a clone of Bug #1239075 +++

Description of problem:
=======================
Ran the tests which perform the following FOPs in order:

Create, chmod, chown, chgrp, symlink, hardlink, truncate, rename, remove. 

The above FOPs complete successfully and are synced to the slave, but the Master and Slave logs show the following errors:

Master:
=======

[2015-07-03 13:36:43.154763] E [syncdutils(/bricks/brick0/master_brick0):276:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 165, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 659, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1438, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 580, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1161, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1070, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 948, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 903, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 116] Stale file handle: '.gfid/fece7967-616b-4d13-add7-96f6a4022e11/55958721%%BO54CXD7RN'
[2015-07-03 13:36:43.156742] I [syncdutils(/bricks/brick0/master_brick0):220:finalize] <top>: exiting.
[2015-07-03 13:36:43.159702] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.



Slave:
======

[2015-07-03 13:36:38.359909] I [resource(slave):844:service_loop] GLUSTER: slave listening
[2015-07-03 13:36:43.149735] E [repce(slave):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 717, in entry_ops
    st = lstat(entry)
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 493, in lstat
    return os.lstat(e)
OSError: [Errno 116] Stale file handle: '.gfid/fece7967-616b-4d13-add7-96f6a4022e11/55958721%%BO54CXD7RN'
[2015-07-03 13:36:43.158221] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-07-03 13:36:43.158576] I [syncdutils(slave):220:finalize] <top>: exiting.
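
The traceback above suggests the lstat() wrapper in syncdutils.py does not treat ESTALE as a tolerable error, so errno 116 propagates out of entry_ops() and kills the worker. A minimal sketch of that failure pattern (simplified and assumed, not the verbatim source):

    import os
    import errno

    def lstat(e):
        """Simplified sketch of the slave-side lstat wrapper."""
        try:
            return os.lstat(e)
        except OSError as ex:
            if ex.errno == errno.ENOENT:
                # A missing entry is a normal condition; hand back the errno.
                return ex.errno
            # Anything else, including ESTALE (116) on a stale gfid path
            # after a rename-plus-delete race, escapes here and crashes
            # the geo-rep worker, as seen in the traceback above.
            raise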



Version-Release number of selected component (if applicable):
=============================================================


How reproducible:
=================
2/2

Steps to Reproduce:
===================
1. Create geo-rep session between Master (3x2) and Slave (3x2)
2. Run the following FOPs in sequential order and check the arequal after each FOP:

2015-07-03 13:03:31,870 INFO run Executing crefi --multi -n 5 -b 5 -d 5 --max=10k --min=5k --random -T 5 -t text --fop=create /mnt/glusterfs 1>/dev/null 2>&1 on wingo.lab.eng.blr.redhat.com
2015-07-03 13:05:53,581 INFO run Executing crefi --multi -n 5 -b 5 -d 5 --max=10k --min=5k --random -T 5 -t text --fop=chmod /mnt/glusterfs 1>/dev/null 2>&1 on wingo.lab.eng.blr.redhat.com
2015-07-03 13:08:17,690 INFO run Executing crefi --multi -n 5 -b 5 -d 5 --max=10k --min=5k --random -T 5 -t text --fop=chown /mnt/glusterfs 1>/dev/null 2>&1 on wingo.lab.eng.blr.redhat.com
2015-07-03 13:10:41,876 INFO run Executing crefi --multi -n 5 -b 5 -d 5 --max=10k --min=5k --random -T 5 -t text --fop=chgrp /mnt/glusterfs 1>/dev/null 2>&1 on wingo.lab.eng.blr.redhat.com
2015-07-03 13:13:06,050 INFO run Executing crefi --multi -n 5 -b 5 -d 5 --max=10k --min=5k --random -T 5 -t text --fop=symlink /mnt/glusterfs 1>/dev/null 2>&1 on wingo.lab.eng.blr.redhat.com
2015-07-03 13:15:37,194 INFO run Executing crefi --multi -n 5 -b 5 -d 5 --max=10k --min=5k --random -T 5 -t text --fop=hardlink /mnt/glusterfs 1>/dev/null 2>&1 on wingo.lab.eng.blr.redhat.com
2015-07-03 13:18:16,751 INFO run Executing crefi --multi -n 5 -b 5 -d 5 --max=10k --min=5k --random -T 5 -t text --fop=truncate /mnt/glusterfs 1>/dev/null 2>&1 on wingo.lab.eng.blr.redhat.com
2015-07-03 13:21:06,530 INFO run Executing crefi --multi -n 5 -b 5 -d 5 --max=10k --min=5k --random -T 5 -t text --fop=rename /mnt/glusterfs 1>/dev/null 2>&1 on wingo.lab.eng.blr.redhat.com
Comment 1 Anand Avati 2015-08-06 02:17:28 EDT
COMMIT: http://review.gluster.org/11820 committed in release-3.7 by Venky Shankar (vshankar@redhat.com) 
------
commit 2fd4ae628c7940ef5b505666afbf5eae0cc655b9
Author: Kotresh HR <khiremat@redhat.com>
Date:   Tue Jul 28 14:37:47 2015 +0530

    geo-rep: Do not crash worker on ESTALE
    
    Handle ESTALE returned by lstat gracefully
    by retrying it. Do not crash the worker.
    
    BUG: 1249547
    Change-Id: I57fb9933900153ab41c3d9b73748b1cdaa8d89ca
    Reviewed-on: http://review.gluster.org/11772
    Tested-by: Gluster Build System <jenkins@build.gluster.com>
    Tested-by: NetBSD Build System <jenkins@build.gluster.org>
    Reviewed-by: Aravinda VK <avishwan@redhat.com>
    Reviewed-by: Venky Shankar <vshankar@redhat.com>
    Signed-off-by: Kotresh HR <khiremat@redhat.com>
    Reviewed-on: http://review.gluster.org/11820
    Reviewed-by: Milind Changire <mchangir@redhat.com>
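
For illustration, the retry approach the commit message describes could look like the sketch below. The helper name, retry count, and delay are assumptions made for this sketch, not the exact patch:

    import os
    import errno
    import time

    def lstat_with_estale_retry(path, retries=5, delay=1):
        """Hypothetical helper: retry lstat() when the file handle is
        stale (ESTALE) instead of letting the exception crash the
        geo-rep worker; a missing entry (ENOENT) is still returned as
        an errno so callers can treat it as a normal condition."""
        for attempt in range(retries):
            try:
                return os.lstat(path)
            except OSError as ex:
                if ex.errno == errno.ENOENT:
                    return ex.errno
                if ex.errno == errno.ESTALE and attempt < retries - 1:
                    time.sleep(delay)  # let the rename/unlink race settle
                    continue
                raise

With a bounded retry like this, a transient ESTALE hit during the rename-then-delete window is absorbed, and only a persistent error is surfaced to the caller.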
Comment 2 Kaushal 2015-09-09 05:38:54 EDT
This bug is being closed because a release that should address the reported issue has been made available. If the problem is still not fixed with glusterfs-3.7.4, please open a new bug report.

glusterfs-3.7.4 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and on the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/12496
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user
