Description of problem:
=======================
Upon running the geo-rep regression cases, the following traceback was found while the worker was processing an xsync changelog:

[2016-05-23 15:13:28.683130] I [resource(/bricks/brick0/master_brick0):1491:service_loop] GLUSTER: Register time: 1464016408
[2016-05-23 15:13:28.712944] I [master(/bricks/brick0/master_brick0):510:crawlwrap] _GMaster: primary master with volume id 7590ca29-59de-403a-95ff-10e229a403b6 ...
[2016-05-23 15:13:28.864242] I [master(/bricks/brick0/master_brick0):519:crawlwrap] _GMaster: crawl interval: 1 seconds
[2016-05-23 15:13:28.870495] I [master(/bricks/brick0/master_brick0):466:mgmt_lock] _GMaster: Got lock : /bricks/brick0/master_brick0 : Becoming ACTIVE
[2016-05-23 15:13:29.163460] I [master(/bricks/brick0/master_brick0):1163:crawl] _GMaster: starting history crawl... turns: 1, stime: (1464016374, 0), etime: 1464016409
[2016-05-23 15:13:30.165673] I [master(/bricks/brick0/master_brick0):1192:crawl] _GMaster: slave's time: (1464016374, 0)
[2016-05-23 15:13:31.970442] I [master(/bricks/brick0/master_brick0):1206:crawl] _GMaster: finished history crawl syncing, endtime: 1464016405, stime: (1464016404, 0)
[2016-05-23 15:13:34.646481] I [master(/bricks/brick1/master_brick6):1121:crawl] _GMaster: slave's time: (1464016396, 0)
[2016-05-23 15:13:43.984873] I [master(/bricks/brick0/master_brick0):1163:crawl] _GMaster: starting history crawl... turns: 2, stime: (1464016404, 0), etime: 1464016423
[2016-05-23 15:13:43.986049] I [master(/bricks/brick0/master_brick0):1206:crawl] _GMaster: finished history crawl syncing, endtime: 1464016405, stime: (1464016404, 0)
[2016-05-23 15:13:43.986222] I [resource(/bricks/brick0/master_brick0):1500:service_loop] GLUSTER: Partial history available, using xsync crawl after consuming history till 1464016405
[2016-05-23 15:13:43.993215] I [master(/bricks/brick0/master_brick0):510:crawlwrap] _GMaster: primary master with volume id 7590ca29-59de-403a-95ff-10e229a403b6 ...
[2016-05-23 15:13:44.8985] I [master(/bricks/brick0/master_brick0):519:crawlwrap] _GMaster: crawl interval: 60 seconds
[2016-05-23 15:13:44.16269] I [master(/bricks/brick0/master_brick0):1271:crawl] _GMaster: starting hybrid crawl..., stime: (1464016404, 0)
[2016-05-23 15:13:45.20493] I [master(/bricks/brick0/master_brick0):1281:crawl] _GMaster: processing xsync changelog /var/lib/misc/glusterfsd/master/ssh%3A%2F%2Froot%4010.70.37.196%3Agluster%3A%2F%2F127.0.0.1%3Aslave/4b7a065288ce3187adad4d6439fb4f75/xsync/XSYNC-CHANGELOG.1464016424
[2016-05-23 15:13:45.234045] E [syncdutils(/bricks/brick0/master_brick0):276:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 201, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 708, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1501, in service_loop
    g1.crawlwrap(oneshot=True, register_time=register_time)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 571, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1286, in crawl
    self.upd_stime(item[1][1], item[1][0])
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1069, in upd_stime
    self.sendmark(path, stime)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 641, in sendmark
    self.set_slave_xtime(path, mark)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 182, in set_slave_xtime
    self.slave.server.set_stime(path, self.uuid, mark)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1439, in <lambda>
    mark)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 327, in ff
    return f(*a)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 539, in set_stime
    struct.pack('!II', *mark))
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 79, in lsetxattr
    cls.raise_oserr()
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 37, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 2] No such file or directory
[2016-05-23 15:13:45.240242] I [syncdutils(/bricks/brick0/master_brick0):220:finalize] <top>: exiting.

When the worker restarted, it registered with time 1464016408 and started a history crawl with stime 1464016374 and etime (the current time requested) 1464016409. The first attempt returned changelogs only up to 1464016405. In the next attempt the history changelogs were requested with stime 1464016404 and etime 1464016423; since the changelog rollover had not yet happened, it again returned only up to 1464016405. As a result only partial history was available, so an xsync crawl was started, which then died with "No such file or directory". In the next history crawl the rollover had completed, the returned endtime (1464016457) was greater than the registered time (1464016440), and the sync completed using the history crawl.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-geo-replication-3.7.9-5.el7rhgs.x86_64
glusterfs-3.7.9-5.el7rhgs.x86_64

How reproducible:
=================
Observed once.

Steps to Reproduce:
===================
Will work on the exact steps and update the BZ. In general the scenario is:
1. Two history crawl attempts finish before the changelog rollover happens, causing a partial history crawl.
2. An rmdir is done on the master before the changelog is processed and synced to the slave, causing the "No such file or directory" error.
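For the record, the "Partial history available" message reflects the case where the history API returns changelogs only up to a time earlier than the worker's register time, so the worker falls back to the xsync (hybrid) crawl for the remaining window. The crash itself is a race between the stime update and the rmdir on the master: by the time the xsync changelog entry is processed, the directory is gone, so the lsetxattr done by set_stime fails with ENOENT and the worker dies. Below is a minimal sketch of how such an update could be made tolerant of that race; the wrapper name and error handling are illustrative assumptions, not the actual contents of the upstream patch:

    import errno

    def set_stime_tolerating_enoent(set_stime, path, uuid, mark):
        # Record the synced-time marker (stime) xattr on 'path'.
        # If the directory was already removed on the master (an rmdir
        # racing with xsync changelog processing), lsetxattr fails with
        # ENOENT; treat that as non-fatal so the crawl can continue.
        try:
            set_stime(path, uuid, mark)
        except OSError as e:
            if e.errno == errno.ENOENT:
                return  # directory is gone, nothing left to mark
            raise

With handling along these lines, a directory removed between the changelog being recorded and being processed is simply skipped instead of taking the whole worker down.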
Upstream Patch sent http://review.gluster.org/#/c/14529/
Upstream mainline : http://review.gluster.org/14529
Upstream 3.8 : http://review.gluster.org/14714

The fix is available in rhgs-3.2.0 as part of the rebase to GlusterFS 3.8.4.
Verified with the build: glusterfs-geo-replication-3.8.4-15.el7rhgs.x86_64

Followed the same steps as mentioned in comment 28. No worker crashed, and the files do get removed from master/slave. For such gfids the following errors are logged:

ssh%3A%2F%2Froot%4010.70.43.249%3Agluster%3A%2F%2F127.0.0.1%3Aslave.%2Frhs%2Fbrick3%2Fb10.gluster.log:[2017-02-27 18:14:29.438200] E [MSGID: 109040] [dht-helper.c:1198:dht_migration_complete_check_task] 0-master-dht: e2616f70-56b2-4e1a-8519-50f796b38b50: failed to lookup the file on master-dht [Stale file handle]

Repeated the steps twice with the 3.1.3 bits and twice with the 3.2.0 bits: with 3.1.3 the issue was reproduced 2/2, and with 3.2.0 it was reproduced 0/2.

Moving this bug to the verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html