+++ This bug was initially created as a clone of Bug #1411607 +++

Description of problem:
If MKDIR fails to sync for some reason, geo-rep logs the failure and proceeds further. Proceeding causes the sync of the entire directory tree to the slave to fail.

Version-Release number of selected component (if applicable):
mainline

How reproducible:
Always

Steps to Reproduce:
1. Set up master and slave gluster volumes
2. Set up a geo-rep session between them
3. Introduce a directory sync failure by some means. To introduce this error manually, delete a directory on the slave, then create files and directories under the deleted directory on the master.

Actual results:
Geo-rep logs the errors and proceeds further.

Expected results:
Geo-rep should not proceed if it is a directory error.

Additional info:

--- Additional comment from Worker Ant on 2017-01-10 01:22:34 EST ---

REVIEW: http://review.gluster.org/16364 (geo-rep: Handle directory sync failure as hard error) posted (#1) for review on master by Kotresh HR (khiremat)

--- Additional comment from Worker Ant on 2017-01-13 01:55:18 EST ---

REVIEW: http://review.gluster.org/16364 (geo-rep: Handle directory sync failure as hard error) posted (#2) for review on master by Kotresh HR (khiremat)

--- Additional comment from Worker Ant on 2017-01-13 08:00:08 EST ---

COMMIT: http://review.gluster.org/16364 committed in master by Aravinda VK (avishwan)

------

commit 91ad7fe0ed8e8ce8f5899bb5ebbbbe57ede7dd43
Author: Kotresh HR <khiremat>
Date:   Tue Jan 10 00:30:42 2017 -0500

    geo-rep: Handle directory sync failure as hard error

    If directory creation fails, return immediately instead of processing
    further. Allowing further processing fails the sync of the entire
    directory tree to the slave. Hence the master now logs and raises an
    exception if the failure is on a directory. Earlier, the master would
    log the failure and proceed.
    Change-Id: Iba2a8b5d3d0092e7a9c8a3c2cdf9e6e29c73ddf0
    BUG: 1411607
    Signed-off-by: Kotresh HR <khiremat>
    Reviewed-on: http://review.gluster.org/16364
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Aravinda VK <avishwan>
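The policy described in the commit message can be summarized as: log every failed entry, but abort as soon as the failed entry is a directory creation. A minimal sketch of that behavior follows; the function and exception names here are illustrative, not the actual gsyncd code.

```python
# Minimal sketch of the "directory sync failure as hard error" policy.
# Names (GsyncdError, handle_entry_failures) are illustrative only.

class GsyncdError(Exception):
    """Raised to abort the worker on an unrecoverable sync failure."""

def handle_entry_failures(failures, log=print):
    """Log each failed entry; raise on directory-creation failures.

    Each failure is an (entry_dict, errno) pair, shaped like the
    ENTRY FAILED lines in this report's logs.
    """
    for entry, err in failures:
        log("ENTRY FAILED: (%r, %d)" % (entry, err))
        if entry.get('op') == 'MKDIR':
            # A directory failure means everything beneath it would also
            # fail to sync, so stop here instead of proceeding.
            raise GsyncdError("The above directory failed to sync. "
                              "Please fix it to proceed further.")
```

With this shape, a failed file CREATE is logged and processing continues, while a failed MKDIR raises and takes the worker down, which is the before/after difference the commit describes.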
Upstream Patch: http://review.gluster.org/16364 (master)
It is included in upstream 3.10 as part of the branch-out from master.
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/101291/
Verified this bug on the build glusterfs-geo-replication-3.8.4-32.el7rhgs.x86_64. Tried the following scenario:

1. Created a geo-replication session between master and slave
2. Mounted the master volume and created directories "first" and "second"
3. Created a few files inside first and second; all synced to the slave properly
4. Deleted first on the slave
5. Created a few files under first on the master
6. geo-replication logs reported errors [1] but the session did not go to Faulty
7. Created a directory under first on the master
8. geo-replication logs reported errors [2] and the session went to Faulty

Steps 6 to 8 behave as expected, moving this bug to verified state.

[1]:
[2017-07-17 11:16:14.433830] E [master(/rhs/brick1/b1):782:log_failures] _GMaster: ENTRY FAILED: ({'uid': 0, 'gfid': '031e1a47-37d4-4a47-961f-6c43893c982e', 'gid': 0, 'mode': 33188, 'entry': '.gfid/cbf963a0-68c9-4a2d-b5fc-ecd428fdb89a/f4', 'op': 'CREATE'}, 2)
[2017-07-17 11:16:14.436542] E [master(/rhs/brick1/b1):782:log_failures] _GMaster: ENTRY FAILED: ({'uid': 0, 'gfid': '11c5ea70-c508-479c-baf2-2112ddb8cec2', 'gid': 0, 'mode': 33188, 'entry': '.gfid/cbf963a0-68c9-4a2d-b5fc-ecd428fdb89a/f6', 'op': 'CREATE'}, 2)
[2017-07-17 11:16:14.455935] E [master(/rhs/brick1/b1):782:log_failures] _GMaster: META FAILED: ({'go': '.gfid/11c5ea70-c508-479c-baf2-2112ddb8cec2', 'stat': {'atime': 1500290163.826682, 'gid': 0, 'mtime': 1500290163.826682, 'mode': 33188, 'uid': 0}, 'op': 'META'}, 2)
[2017-07-17 11:16:14.456257] E [master(/rhs/brick1/b1):782:log_failures] _GMaster: META FAILED: ({'go': '.gfid/031e1a47-37d4-4a47-961f-6c43893c982e', 'stat': {'atime': 1500290160.922627, 'gid': 0, 'mtime': 1500290160.922627, 'mode': 33188, 'uid': 0}, 'op': 'META'}, 2)
[2017-07-17 11:24:02.33834] I [master(/rhs/brick1/b1):1125:crawl] _GMaster: slave's time: (1500290173, 0)

[2]:
[2017-07-17 11:24:02.57640] E [master(/rhs/brick1/b1):782:log_failures] _GMaster: ENTRY FAILED: ({'uid': 0, 'gfid': 'e9cab0d2-e777-4089-b6fa-ec886c0c929d', 'gid': 0, 'mode': 16877, 'entry': '.gfid/cbf963a0-68c9-4a2d-b5fc-ecd428fdb89a/test', 'op': 'MKDIR'}, 2)
[2017-07-17 11:24:02.58017] E [syncdutils(/rhs/brick1/b1):264:log_raise_exception] <top>: The above directory failed to sync. Please fix it to proceed further.
[2017-07-17 11:24:02.58984] I [syncdutils(/rhs/brick1/b1):237:finalize] <top>: exiting.
[2017-07-17 11:24:02.65656] I [repce(/rhs/brick1/b1):92:service_loop] RepceServer: terminating on reaching EOF.
[2017-07-17 11:24:02.66196] I [syncdutils(/rhs/brick1/b1):237:finalize] <top>: exiting.
[2017-07-17 11:24:02.86541] I [gsyncdstatus(monitor):240:set_worker_status] GeorepStatus: Worker Status: Faulty
[2017-07-17 11:24:12.277601] I [monitor(monitor):275:monitor] Monitor: starting gsyncd worker(/rhs/brick1/b1). Slave node: ssh://root.37.105:gluster://localhost:slave
[2017-07-17 11:24:12.389444] I [resource(/rhs/brick1/b1):1676:connect_remote] SSH: Initializing SSH connection between master and slave...
[2017-07-17 11:24:12.389770] I [changelogagent(/rhs/brick1/b1):73:__init__] ChangelogAgent: Agent listining...
[2017-07-17 11:24:18.189002] I [resource(/rhs/brick1/b1):1683:connect_remote] SSH: SSH connection between master and slave established. Time taken: 5.7992 secs
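A note on reading these logs: the numeric 'mode' field in each ENTRY FAILED tuple is a standard POSIX mode, so it distinguishes the file failures in [1] (33188, octal 0100644, a regular file) from the directory failure in [2] (16877, octal 040755, a directory). A small illustrative helper (not part of gsyncd) makes the check explicit:

```python
import stat

# Illustrative helper: classify a failed entry from the 'mode' field
# seen in the ENTRY FAILED log lines above.
def is_directory_failure(entry):
    """Return True when the failed entry's mode bits denote a directory."""
    return stat.S_ISDIR(entry.get('mode', 0))

# The CREATE failures in [1] carry mode 33188 (regular file); the MKDIR
# failure in [2] carries mode 16877 (directory), which is the one that
# now takes the session to Faulty.
```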
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774