Bug 1339471 - [geo-rep]: Worker died with [Errno 2] No such file or directory
Summary: [geo-rep]: Worker died with [Errno 2] No such file or directory
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: geo-replication
Version: mainline
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Aravinda VK
QA Contact:
URL:
Whiteboard:
Depends On: 1339159
Blocks: 1345882 1345883
 
Reported: 2016-05-25 06:36 UTC by Aravinda VK
Modified: 2017-03-27 18:18 UTC
CC: 5 users

Fixed In Version: glusterfs-3.9.0
Clone Of: 1339159
Clones: 1345882 1345883
Environment:
Last Closed: 2017-03-27 18:18:24 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Description Aravinda VK 2016-05-25 06:36:52 UTC
+++ This bug was initially created as a clone of Bug #1339159 +++

Description of problem:
=======================

While running the geo-rep regression cases, the following traceback was observed while an xsync changelog was being processed:

[2016-05-23 15:13:28.683130] I [resource(/bricks/brick0/master_brick0):1491:service_loop] GLUSTER: Register time: 1464016408
[2016-05-23 15:13:28.712944] I [master(/bricks/brick0/master_brick0):510:crawlwrap] _GMaster: primary master with volume id 7590ca29-59de-403a-95ff-10e229a403b6 ...
[2016-05-23 15:13:28.864242] I [master(/bricks/brick0/master_brick0):519:crawlwrap] _GMaster: crawl interval: 1 seconds
[2016-05-23 15:13:28.870495] I [master(/bricks/brick0/master_brick0):466:mgmt_lock] _GMaster: Got lock : /bricks/brick0/master_brick0 : Becoming ACTIVE
[2016-05-23 15:13:29.163460] I [master(/bricks/brick0/master_brick0):1163:crawl] _GMaster: starting history crawl... turns: 1, stime: (1464016374, 0), etime: 1464016409
[2016-05-23 15:13:30.165673] I [master(/bricks/brick0/master_brick0):1192:crawl] _GMaster: slave's time: (1464016374, 0)
[2016-05-23 15:13:31.970442] I [master(/bricks/brick0/master_brick0):1206:crawl] _GMaster: finished history crawl syncing, endtime: 1464016405, stime: (1464016404, 0)
[2016-05-23 15:13:34.646481] I [master(/bricks/brick1/master_brick6):1121:crawl] _GMaster: slave's time: (1464016396, 0)
[2016-05-23 15:13:43.984873] I [master(/bricks/brick0/master_brick0):1163:crawl] _GMaster: starting history crawl... turns: 2, stime: (1464016404, 0), etime: 1464016423
[2016-05-23 15:13:43.986049] I [master(/bricks/brick0/master_brick0):1206:crawl] _GMaster: finished history crawl syncing, endtime: 1464016405, stime: (1464016404, 0)
[2016-05-23 15:13:43.986222] I [resource(/bricks/brick0/master_brick0):1500:service_loop] GLUSTER: Partial history available, using xsync crawl after consuming history till 1464016405
[2016-05-23 15:13:43.993215] I [master(/bricks/brick0/master_brick0):510:crawlwrap] _GMaster: primary master with volume id 7590ca29-59de-403a-95ff-10e229a403b6 ...
[2016-05-23 15:13:44.8985] I [master(/bricks/brick0/master_brick0):519:crawlwrap] _GMaster: crawl interval: 60 seconds
[2016-05-23 15:13:44.16269] I [master(/bricks/brick0/master_brick0):1271:crawl] _GMaster: starting hybrid crawl..., stime: (1464016404, 0)
[2016-05-23 15:13:45.20493] I [master(/bricks/brick0/master_brick0):1281:crawl] _GMaster: processing xsync changelog /var/lib/misc/glusterfsd/master/ssh%3A%2F%2Froot%4010.70.37.196%3Agluster%3A%2F%2F127.0.0.1%3Aslave/4b7a065288ce3187adad4d6439fb4f75/xsync/XSYNC-CHANGELOG.1464016424
[2016-05-23 15:13:45.234045] E [syncdutils(/bricks/brick0/master_brick0):276:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 201, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 708, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1501, in service_loop
    g1.crawlwrap(oneshot=True, register_time=register_time)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 571, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1286, in crawl
    self.upd_stime(item[1][1], item[1][0])
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1069, in upd_stime
    self.sendmark(path, stime)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 641, in sendmark
    self.set_slave_xtime(path, mark)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 182, in set_slave_xtime
    self.slave.server.set_stime(path, self.uuid, mark)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1439, in <lambda>
    mark)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 327, in ff
    return f(*a)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 539, in set_stime
    struct.pack('!II', *mark))
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 79, in lsetxattr
    cls.raise_oserr()
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 37, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 2] No such file or directory
[2016-05-23 15:13:45.240242] I [syncdutils(/bricks/brick0/master_brick0):220:finalize] <top>: exiting.
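
For context, the last two frames above are gsyncd's thin ctypes wrapper around lsetxattr(2): when the C call fails, the wrapper converts errno into a Python OSError. A simplified sketch of that pattern (illustrative, not the verbatim libcxattr.py code):

    import ctypes
    import os

    # Load libc with errno tracking so ctypes.get_errno() returns the
    # errno left behind by the failed C call.
    libc = ctypes.CDLL("libc.so.6", use_errno=True)

    def lsetxattr(path, attr, value):
        # lsetxattr(2) returns 0 on success and -1 on failure with errno set.
        ret = libc.lsetxattr(path.encode(), attr.encode(), value,
                             ctypes.c_size_t(len(value)), 0)
        if ret == -1:
            errn = ctypes.get_errno()
            # The raise seen at libcxattr.py line 37 in the traceback:
            raise OSError(errn, os.strerror(errn))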




When the worker restarted, it registered at time 1464016408 and started a history crawl with stime 1464016374 and etime (the current time requested) 1464016409. The first trial returned changelogs only up to 1464016405.

In another attempt, the history changelogs were requested with stime 1464016404 and etime 1464016423. Since the changelog rollover had not yet happened, it again returned changelogs only up to 1464016405.

As a result only partial history was available, so an xsync crawl was started, which then died with "No such file or directory" (the fallback decision is sketched below).

In the next history crawl, the rollover had completed and the returned endtime (1464016457) was greater than the register time (1464016440), so the sync completed using the history crawl.
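
The fallback decision described above amounts to comparing the endtime returned by the history API against the worker's register time. A sketch of that logic (hypothetical names; history_scan stands in for the changelog history API):

    def choose_crawl(stime, register_time, history_scan):
        # Ask the history API for changelogs from stime up to "now".
        endtime = history_scan(start=stime, end=register_time)
        if endtime < register_time:
            # Rollover has not happened yet, so history stops short of
            # the register time: only partial history is available.
            # From the log: endtime 1464016405 < register time 1464016408.
            return "xsync"    # fall back to the hybrid (xsync) crawl
        return "history"      # history fully covers the window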



Version-Release number of selected component (if applicable):
=============================================================
glusterfs-geo-replication-3.7.9-5.el7rhgs.x86_64
glusterfs-3.7.9-5.el7rhgs.x86_64


How reproducible:
=================

Observed once.


Steps to Reproduce:
===================
Exact steps will be worked out and added to the BZ. In general the scenario would be:

1. Two history trials finish before the changelog rollover happens, causing a partial history crawl.
2. rmdir is run on the master before the changelog is processed and synced to the slave, causing "No such file or directory" (a standalone demonstration follows the list).
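
The failure in step 2 can be demonstrated standalone: setting an extended attribute on a path that has already been removed fails exactly as gsyncd's lsetxattr call did. A minimal sketch (the xattr name is hypothetical, and os.setxattr is used here instead of gsyncd's libcxattr wrapper):

    import errno
    import os
    import struct
    import tempfile

    # Create a directory and remove it, as an rmdir on the master would
    # before the corresponding changelog entry is synced to the slave.
    path = tempfile.mkdtemp()
    os.rmdir(path)

    # stime is packed as two 32-bit unsigned ints (seconds, nanoseconds),
    # matching the struct.pack('!II', *mark) call in the traceback.
    mark = struct.pack('!II', 1464016404, 0)
    try:
        os.setxattr(path, "user.example.stime", mark)
    except OSError as e:
        assert e.errno == errno.ENOENT  # [Errno 2] No such file or directory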

--- Additional comment from Red Hat Bugzilla Rules Engine on 2016-05-24 05:27:19 EDT ---

This bug is automatically being proposed for the current z-stream release of Red Hat Gluster Storage 3 by setting the release flag 'rhgs-3.1.z' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.

Comment 1 Vijay Bellur 2016-05-25 06:45:02 UTC
REVIEW: http://review.gluster.org/14529 (geo-rep: Handle stime/xtime set failures) posted (#1) for review on master by Aravinda VK (avishwan)

Comment 2 Vijay Bellur 2016-06-13 11:27:17 UTC
COMMIT: http://review.gluster.org/14529 committed in master by Aravinda VK (avishwan) 
------
commit 1a348bfaeb9f2a50ec8ce27e5477e9b430c58b3c
Author: Aravinda VK <avishwan>
Date:   Wed May 25 11:56:56 2016 +0530

    geo-rep: Handle stime/xtime set failures
    
    While setting stime/xtime, if the file or directory is already
    deleted then Geo-rep will crash with ENOENT.
    
    With this patch, Geo-rep ignores ENOENT, since stime/xtime can't
    be applied on a deleted file/directory.
    
    Change-Id: I2d90569e51565f81ae53fcb23323e4f47c9e9672
    Signed-off-by: Aravinda VK <avishwan>
    BUG: 1339471
    Reviewed-on: http://review.gluster.org/14529
    Smoke: Gluster Build System <jenkins.com>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.com>
    Reviewed-by: Saravanakumar Arumugam <sarumuga>
    Reviewed-by: Kotresh HR <khiremat>
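
A minimal sketch of the approach the commit message describes, not the committed patch itself (set_stime_safe is a hypothetical name, and os.setxattr stands in for gsyncd's libcxattr wrapper):

    import errno
    import logging
    import os

    def set_stime_safe(path, attr, value):
        # ENOENT is not treated as an error here: if the file or directory
        # was already deleted, there is nothing left to stamp with stime.
        try:
            os.setxattr(path, attr, value)
        except OSError as e:
            if e.errno == errno.ENOENT:
                logging.warning("%s deleted before stime could be set", path)
            else:
                raise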

Comment 3 Shyamsundar 2017-03-27 18:18:24 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.9.0, please open a new bug report.

glusterfs-3.9.0 has been announced on the Gluster mailing lists [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2016-November/029281.html
[2] https://www.gluster.org/pipermail/gluster-users/

