Bug 1425695

Summary: [Geo-rep] If for some reason MKDIR failed to sync, it should not proceed further.
Product: [Red Hat Storage] Red Hat Gluster Storage    Reporter: Kotresh HR <khiremat>
Component: geo-replication    Assignee: Kotresh HR <khiremat>
Status: CLOSED ERRATA    QA Contact: Rochelle <rallan>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: rhgs-3.2    CC: amukherj, asrivast, bugs, csaba, rhs-bugs, storage-qa-internal
Target Milestone: ---   
Target Release: RHGS 3.3.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: glusterfs-3.8.4-19 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1411607 Environment:
Last Closed: 2017-09-21 04:33:25 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1411607, 1441933    
Bug Blocks: 1417147    

Description Kotresh HR 2017-02-22 06:42:35 UTC
+++ This bug was initially created as a clone of Bug #1411607 +++

Description of problem:
If for some reason MKDIR fails to sync, geo-replication logs the error and proceeds further.
Allowing it to proceed causes the entire directory tree to fail to sync to the slave.


Version-Release number of selected component (if applicable):
mainline

How reproducible:
Always

Steps to Reproduce:
1. Set up master and slave gluster volumes
2. Set up a geo-rep session between them
3. Introduce a directory sync failure by some means. To introduce this error manually, delete a directory on the slave and then create files and directories under the deleted directory on the master (a minimal sketch follows).
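
As a rough illustration of step 3, here is a minimal Python sketch of the manual fault injection, assuming the master and slave volumes are FUSE-mounted at the hypothetical paths /mnt/master and /mnt/slave:

import os
import shutil
import time

MASTER_MNT = "/mnt/master"  # FUSE mount of the master volume (assumption)
SLAVE_MNT = "/mnt/slave"    # FUSE mount of the slave volume (assumption)

# Create a directory on the master and give geo-rep time to sync it.
os.makedirs(os.path.join(MASTER_MNT, "dir1"), exist_ok=True)
time.sleep(60)  # crude wait for the changelog crawl to sync "dir1" (illustrative)

# Fault injection: remove the already-synced directory directly on the slave.
shutil.rmtree(os.path.join(SLAVE_MNT, "dir1"))

# New entries created under it on the master will now fail on the slave with
# ENOENT (errno 2), which is the failure mode this bug covers.
os.makedirs(os.path.join(MASTER_MNT, "dir1", "subdir"))
with open(os.path.join(MASTER_MNT, "dir1", "file1"), "w") as f:
    f.write("data")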

Actual results:
Geo-rep logs the errors and proceeds further

Expected results:
Geo-rep should not proceed if the failure is a directory error.

Additional info:

--- Additional comment from Worker Ant on 2017-01-10 01:22:34 EST ---

REVIEW: http://review.gluster.org/16364 (geo-rep: Handle directory sync failure as hard error) posted (#1) for review on master by Kotresh HR (khiremat)

--- Additional comment from Worker Ant on 2017-01-13 01:55:18 EST ---

REVIEW: http://review.gluster.org/16364 (geo-rep: Handle directory sync failure as hard error) posted (#2) for review on master by Kotresh HR (khiremat)

--- Additional comment from Worker Ant on 2017-01-13 08:00:08 EST ---

COMMIT: http://review.gluster.org/16364 committed in master by Aravinda VK (avishwan) 
------
commit 91ad7fe0ed8e8ce8f5899bb5ebbbbe57ede7dd43
Author: Kotresh HR <khiremat>
Date:   Tue Jan 10 00:30:42 2017 -0500

    geo-rep: Handle directory sync failure as hard error
    
    If directory creation fails, return immediately instead of
    processing further. Allowing further processing causes the
    entire directory tree to fail to sync to the slave. Hence
    the master now logs and raises an exception if it is a
    directory failure. Earlier, the master used to log the
    failure and proceed.
    
    Change-Id: Iba2a8b5d3d0092e7a9c8a3c2cdf9e6e29c73ddf0
    BUG: 1411607
    Signed-off-by: Kotresh HR <khiremat>
    Reviewed-on: http://review.gluster.org/16364
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Aravinda VK <avishwan>
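
For reference, the following simplified Python sketch (illustrative only, not the actual syncdaemon code) shows the behaviour the patch introduces: file-entry failures are only logged, while a failed MKDIR is escalated to a hard error so the worker stops instead of continuing under a missing directory.

class DirectorySyncError(Exception):
    """Raised when a directory entry (MKDIR) fails to sync to the slave."""

def handle_entry_failures(failures, log):
    # 'failures' is a list of (entry, errno) tuples, e.g. ({'op': 'MKDIR', ...}, 2)
    for entry, errno_val in failures:
        log.error("ENTRY FAILED: %s", (entry, errno_val))
        if entry.get("op") == "MKDIR":
            # Before the fix: the failure was only logged and processing
            # continued, so every entry under the missing directory failed too.
            # After the fix: it is treated as a hard error and the worker stops.
            raise DirectorySyncError(
                "The above directory failed to sync. "
                "Please fix it to proceed further.")

A standard logging.Logger instance can be passed as 'log'; in the real worker the logged failures correspond to the ENTRY FAILED lines seen in the geo-rep log.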

Comment 2 Kotresh HR 2017-02-22 06:43:51 UTC
Upstream Patch:
http://review.gluster.org/16364  (master)

Comment 3 Kotresh HR 2017-02-22 06:58:24 UTC
It is in upstream 3.10 as part of the branch-out from master.

Comment 5 Atin Mukherjee 2017-03-24 08:56:21 UTC
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/101291/

Comment 7 Rochelle 2017-07-17 11:29:03 UTC
Verified this bug on the build: glusterfs-geo-replication-3.8.4-32.el7rhgs.x86_64

Tried the following scenario:

1. Created a geo-replication session between the master and slave
2. Mounted the master volume and created directories "first" and "second"
3. Created a few files inside "first" and "second"; all synced to the slave properly
4. Deleted "first" on the slave
5. Created a few files on the master under "first"
6. Geo-replication logged errors [1], but the session did not go to Faulty
7. Created a directory on the master under "first"
8. Geo-replication logged errors [2], and the session went to Faulty

Steps 6 to 8 show the expected behaviour; moving this bug to the verified state. (A small status-polling sketch follows the log excerpts below.)

[1]: 

[2017-07-17 11:16:14.433830] E [master(/rhs/brick1/b1):782:log_failures] _GMaster: ENTRY FAILED: ({'uid': 0, 'gfid': '031e1a47-37d4-4a47-961f-6c43893c982e', 'gid': 0, 'mode': 33188, 'entry': '.gfid/cbf963a0-68c9-4a2d-b5fc-ecd428fdb89a/f4', 'op': 'CREATE'}, 2)
[2017-07-17 11:16:14.436542] E [master(/rhs/brick1/b1):782:log_failures] _GMaster: ENTRY FAILED: ({'uid': 0, 'gfid': '11c5ea70-c508-479c-baf2-2112ddb8cec2', 'gid': 0, 'mode': 33188, 'entry': '.gfid/cbf963a0-68c9-4a2d-b5fc-ecd428fdb89a/f6', 'op': 'CREATE'}, 2)
[2017-07-17 11:16:14.455935] E [master(/rhs/brick1/b1):782:log_failures] _GMaster: META FAILED: ({'go': '.gfid/11c5ea70-c508-479c-baf2-2112ddb8cec2', 'stat': {'atime': 1500290163.826682, 'gid': 0, 'mtime': 1500290163.826682, 'mode': 33188, 'uid': 0}, 'op': 'META'}, 2)
[2017-07-17 11:16:14.456257] E [master(/rhs/brick1/b1):782:log_failures] _GMaster: META FAILED: ({'go': '.gfid/031e1a47-37d4-4a47-961f-6c43893c982e', 'stat': {'atime': 1500290160.922627, 'gid': 0, 'mtime': 1500290160.922627, 'mode': 33188, 'uid': 0}, 'op': 'META'}, 2)
[2017-07-17 11:24:02.33834] I [master(/rhs/brick1/b1):1125:crawl] _GMaster: slave's time: (1500290173, 0)


[2]:

[2017-07-17 11:24:02.57640] E [master(/rhs/brick1/b1):782:log_failures] _GMaster: ENTRY FAILED: ({'uid': 0, 'gfid': 'e9cab0d2-e777-4089-b6fa-ec886c0c929d', 'gid': 0, 'mode': 16877, 'entry': '.gfid/cbf963a0-68c9-4a2d-b5fc-ecd428fdb89a/test', 'op': 'MKDIR'}, 2)
[2017-07-17 11:24:02.58017] E [syncdutils(/rhs/brick1/b1):264:log_raise_exception] <top>: The above directory failed to sync. Please fix it to proceed further.
[2017-07-17 11:24:02.58984] I [syncdutils(/rhs/brick1/b1):237:finalize] <top>: exiting.
[2017-07-17 11:24:02.65656] I [repce(/rhs/brick1/b1):92:service_loop] RepceServer: terminating on reaching EOF.
[2017-07-17 11:24:02.66196] I [syncdutils(/rhs/brick1/b1):237:finalize] <top>: exiting.
[2017-07-17 11:24:02.86541] I [gsyncdstatus(monitor):240:set_worker_status] GeorepStatus: Worker Status: Faulty
[2017-07-17 11:24:12.277601] I [monitor(monitor):275:monitor] Monitor: starting gsyncd worker(/rhs/brick1/b1). Slave node: ssh://root.37.105:gluster://localhost:slave
[2017-07-17 11:24:12.389444] I [resource(/rhs/brick1/b1):1676:connect_remote] SSH: Initializing SSH connection between master and slave...
[2017-07-17 11:24:12.389770] I [changelogagent(/rhs/brick1/b1):73:__init__] ChangelogAgent: Agent listining...
[2017-07-17 11:24:18.189002] I [resource(/rhs/brick1/b1):1683:connect_remote] SSH: SSH connection between master and slave established. Time taken: 5.7992 secs
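
For completeness, a small Python sketch of how the geo-rep session status could be polled from a script during such a verification run; the volume name "master-vol" and the slave specification "slave-host::slave-vol" are placeholders:

import subprocess

def georep_status(master="master-vol", slave="slave-host::slave-vol"):
    # Runs: gluster volume geo-replication <master> <slave> status
    result = subprocess.run(
        ["gluster", "volume", "geo-replication", master, slave, "status"],
        capture_output=True, text=True, check=True)
    return result.stdout

output = georep_status()
print(output)
if "Faulty" in output:
    print("At least one worker is Faulty (expected after steps 7-8).")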

Comment 9 errata-xmlrpc 2017-09-21 04:33:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774
