Bug 1598884

Summary: [geo-rep]: [Errno 2] No such file or directory
Product: [Community] GlusterFS
Reporter: Kotresh HR <khiremat>
Component: geo-replication
Assignee: Kotresh HR <khiremat>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: high
Docs Contact:
Priority: unspecified
Version: mainline
CC: avishwan, bugs, csaba, rallan, rhinduja, rhs-bugs, sankarshan, storage-qa-internal
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-5.0
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1598384
: 1611114 (view as bug list)
Environment:
Last Closed: 2018-10-23 15:13:25 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1598384
Bug Blocks: 1611114

Description Kotresh HR 2018-07-06 18:01:19 UTC
+++ This bug was initially created as a clone of Bug #1598384 +++

Description of problem:
=========================
Worker crashed with the following traceback while using geo-rep scheduler:

[2018-07-04 06:35:30.242285] E [syncdutils(/rhs/brick2/b4):348:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 210, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 803, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1568, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 597, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1470, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1370, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1204, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1114, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 228, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 210, in __call__
    raise res
OSError: [Errno 2] No such file or directory: '/rhs/brick1/b1/.glusterfs/2e/94/2e9400f3-61c5-4943-bc5d-26562fc7f47d'
[2018-07-04 06:35:30.294939] I [syncdutils(/rhs/brick2/b4):288:finalize] <top>: exiting.
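The path in the OSError is the gfid backend path that GlusterFS keeps under the brick's `.glusterfs` directory. A minimal sketch of the mapping (the helper name is illustrative, not an actual syncdaemon function):

```python
import os

def gfid_backend_path(brick_root, gfid):
    """Illustrative helper: GlusterFS keeps a backend entry for every file
    (hard link) or directory (symlink) under
    <brick>/.glusterfs/<aa>/<bb>/<gfid>, where <aa> and <bb> are the first
    two character pairs of the gfid. The path in the OSError above is
    exactly this backend path."""
    return os.path.join(brick_root, ".glusterfs", gfid[0:2], gfid[2:4], gfid)

print(gfid_backend_path("/rhs/brick1/b1",
                        "2e9400f3-61c5-4943-bc5d-26562fc7f47d"))
# /rhs/brick1/b1/.glusterfs/2e/94/2e9400f3-61c5-4943-bc5d-26562fc7f47d
```

The ENOENT here means the slave's `entry_ops` tried to act on a gfid whose backend entry no longer exists on the brick.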

Version-Release number of selected component (if applicable):
============================================================
mainline


How reproducible:
================
1/1


Steps to Reproduce:
===================
1. Have a geo-replication session up.
2. Create IO on the master.
3. Run the scheduler: python /usr/share/glusterfs/scripts/schedule_georep.py master 10.70.42.164 slave

The geo-rep scheduler does the following:
1. Stop geo-replication if started.
2. Start geo-replication.
3. Set a checkpoint.
4. Check the status in a loop until the checkpoint is complete.
5. Once the checkpoint is complete, stop geo-replication.
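The steps above can be outlined as follows (a hedged sketch, not schedule_georep.py itself; the volume names are the placeholders from the reproduction steps, and the real script parses `--xml` status output rather than grepping plain text):

```python
import subprocess
import time

# Placeholders taken from the reproduction steps above.
MASTER, SLAVE_HOST, SLAVE_VOL = "master", "10.70.42.164", "slave"

def georep_cmd(*args):
    """Build a gluster geo-replication CLI command (illustrative helper)."""
    return ["gluster", "volume", "geo-replication",
            MASTER, "%s::%s" % (SLAVE_HOST, SLAVE_VOL)] + list(args)

def run_schedule(poll_interval=10):
    """Hedged outline of the scheduler loop."""
    subprocess.run(georep_cmd("stop"))                         # 1. stop if started
    subprocess.run(georep_cmd("start"))                        # 2. start
    subprocess.run(georep_cmd("config", "checkpoint", "now"))  # 3. set checkpoint
    while True:                                                # 4. poll status
        out = subprocess.run(georep_cmd("status", "detail"),
                             capture_output=True, text=True).stdout
        if "Yes" in out:  # crude completion check; the real script
            break         # inspects the checkpoint fields of the XML status
        time.sleep(poll_interval)
    subprocess.run(georep_cmd("stop"))                         # 5. stop
```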

Actual results:
===============
Worker crashed with "[Errno 2] No such file or directory".

Expected results:
=================
Worker should not crash.


Traceback on the slave:
-----------------------
[2018-07-04 06:37:14.873959] E [repce(slave):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 644, in entry_ops
    gfid[2:4], gfid))
OSError: [Errno 2] No such file or directory: '/rhs/brick1/b1/.glusterfs/e1/ab/e1ab496b-a1bb-4f1c-a543-a44ee1ee0272'
[2018-07-04 06:37:14.942406] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.

Comment 1 Worker Ant 2018-07-06 18:23:10 UTC
REVIEW: https://review.gluster.org/20473 (geo-rep: Fix issues with gfid conflict handling) posted (#1) for review on master by Kotresh HR

Comment 2 Worker Ant 2018-07-20 16:23:45 UTC
COMMIT: https://review.gluster.org/20473 committed in master by "Kotresh HR" <khiremat> with the commit message: geo-rep: Fix issues with gfid conflict handling

1. MKDIR/RMDIR is recorded on all bricks, so if one brick
   succeeds in creating a directory, the other bricks should
   ignore it. This was not happening: the earlier fix for
   renaming directories in hybrid crawl tried to rename the
   directory to itself and crashed with ENOENT if the
   directory had been removed.

2. If a file was created, deleted, and then a directory was
   created with the same name, it failed to sync. Again the
   issue was in the fix for renaming directories in hybrid
   crawl; this is fixed as well.

   The same case with a hardlink present for the file also
   failed; this patch fixes that too.

fixes: bz#1598884
Change-Id: I6f3bca44e194e415a3d4de3b9d03cc8976439284
Signed-off-by: Kotresh HR <khiremat>
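The first crash mode described above, renaming a directory onto a path whose source has already vanished, can be sketched in an ENOENT-tolerant form (an illustrative helper, not the actual patch):

```python
import errno
import os

def rename_ignoring_missing(src, dst):
    """Illustrative sketch, not the actual syncdaemon code: tolerate a
    source directory that another brick's worker already removed."""
    if src == dst:
        # Renaming a path onto itself is a no-op; attempting it after the
        # directory was removed is what raised ENOENT in the crash above.
        return
    try:
        os.rename(src, dst)
    except OSError as e:
        if e.errno != errno.ENOENT:
            raise  # only "no such file or directory" is tolerated

# A vanished source no longer raises:
rename_ignoring_missing("/no/such/dir", "/no/such/other")
```

The key design point is that on a multi-brick volume the same entry operation is replayed by several workers, so "already gone" must be treated as success rather than as a fatal error.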

Comment 3 Shyamsundar 2018-10-23 15:13:25 UTC
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-5.0, please open a new bug report.

glusterfs-5.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2018-October/000115.html
[2] https://www.gluster.org/pipermail/gluster-users/