+++ This bug was initially created as a clone of Bug #1411607 +++

Description of problem:
If MKDIR fails to sync for some reason, geo-rep logs the failure and proceeds further. Proceeding causes the sync of the entire directory tree to the slave to fail.

Version-Release number of selected component (if applicable):
mainline

How reproducible:
Always

Steps to Reproduce:
1. Set up master and slave gluster volumes
2. Set up a geo-rep session between them
3. Introduce a directory sync failure by some means. To introduce this error manually, delete a directory on the slave, then create files and directories under the deleted directory on the master.

Actual results:
Geo-rep logs the errors and proceeds further.

Expected results:
Geo-rep should not proceed if it is a directory error.

Additional info:

--- Additional comment from Worker Ant on 2017-01-10 01:22:34 EST ---

REVIEW: http://review.gluster.org/16364 (geo-rep: Handle directory sync failure as hard error) posted (#1) for review on master by Kotresh HR (khiremat)

--- Additional comment from Worker Ant on 2017-01-13 01:55:18 EST ---

REVIEW: http://review.gluster.org/16364 (geo-rep: Handle directory sync failure as hard error) posted (#2) for review on master by Kotresh HR (khiremat)

--- Additional comment from Worker Ant on 2017-01-13 08:00:08 EST ---

COMMIT: http://review.gluster.org/16364 committed in master by Aravinda VK (avishwan)

------

commit 91ad7fe0ed8e8ce8f5899bb5ebbbbe57ede7dd43
Author: Kotresh HR <khiremat>
Date:   Tue Jan 10 00:30:42 2017 -0500

    geo-rep: Handle directory sync failure as hard error

    If directory creation fails, return immediately instead of processing
    further. Allowing further processing fails the sync of the entire
    directory tree to the slave. Hence the master now logs and raises an
    exception if the failure is on a directory. Earlier, the master would
    log the failure and proceed.
    Change-Id: Iba2a8b5d3d0092e7a9c8a3c2cdf9e6e29c73ddf0
    BUG: 1411607
    Signed-off-by: Kotresh HR <khiremat>
    Reviewed-on: http://review.gluster.org/16364
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Aravinda VK <avishwan>
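The policy described in the commit message can be summarized as: log every failed entry, but abort as soon as the failed entry is a directory creation. A minimal sketch of that behavior follows; the function and exception names here are illustrative, not the actual gsyncd code.

```python
# Minimal sketch of the "directory sync failure as hard error" policy.
# Names (GsyncdError, handle_entry_failures) are illustrative only.

class GsyncdError(Exception):
    """Raised to abort the worker on an unrecoverable sync failure."""

def handle_entry_failures(failures, log=print):
    """Log each failed entry; raise on directory-creation failures.

    Each failure is an (entry_dict, errno) pair, shaped like the
    ENTRY FAILED lines in this report's logs.
    """
    for entry, err in failures:
        log("ENTRY FAILED: (%r, %d)" % (entry, err))
        if entry.get('op') == 'MKDIR':
            # A directory failure means everything beneath it would also
            # fail to sync, so stop here instead of proceeding.
            raise GsyncdError("The above directory failed to sync. "
                              "Please fix it to proceed further.")
```

With this shape, a failed file CREATE is logged and processing continues, while a failed MKDIR raises and takes the worker down, which is the before/after difference the commit describes.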
Upstream Patch: http://review.gluster.org/16364 (master)
It is included in upstream 3.10 as part of the branch-out from master.
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/101291/
Verified this bug on the build glusterfs-geo-replication-3.8.4-32.el7rhgs.x86_64. Tried the following scenario:

1. Created a geo-replication session between master and slave
2. Mounted the master volume and created directories "first" and "second"
3. Created a few files inside first and second; all synced to the slave properly
4. Deleted first on the slave
5. Created a few files under first on the master
6. geo-replication logs reported errors [1] but the session did not go to Faulty
7. Created a directory under first on the master
8. geo-replication logs reported errors [2] and the session went to Faulty

Steps 6 to 8 behave as expected, moving this bug to verified state.

[1]:
[2017-07-17 11:16:14.433830] E [master(/rhs/brick1/b1):782:log_failures] _GMaster: ENTRY FAILED: ({'uid': 0, 'gfid': '031e1a47-37d4-4a47-961f-6c43893c982e', 'gid': 0, 'mode': 33188, 'entry': '.gfid/cbf963a0-68c9-4a2d-b5fc-ecd428fdb89a/f4', 'op': 'CREATE'}, 2)
[2017-07-17 11:16:14.436542] E [master(/rhs/brick1/b1):782:log_failures] _GMaster: ENTRY FAILED: ({'uid': 0, 'gfid': '11c5ea70-c508-479c-baf2-2112ddb8cec2', 'gid': 0, 'mode': 33188, 'entry': '.gfid/cbf963a0-68c9-4a2d-b5fc-ecd428fdb89a/f6', 'op': 'CREATE'}, 2)
[2017-07-17 11:16:14.455935] E [master(/rhs/brick1/b1):782:log_failures] _GMaster: META FAILED: ({'go': '.gfid/11c5ea70-c508-479c-baf2-2112ddb8cec2', 'stat': {'atime': 1500290163.826682, 'gid': 0, 'mtime': 1500290163.826682, 'mode': 33188, 'uid': 0}, 'op': 'META'}, 2)
[2017-07-17 11:16:14.456257] E [master(/rhs/brick1/b1):782:log_failures] _GMaster: META FAILED: ({'go': '.gfid/031e1a47-37d4-4a47-961f-6c43893c982e', 'stat': {'atime': 1500290160.922627, 'gid': 0, 'mtime': 1500290160.922627, 'mode': 33188, 'uid': 0}, 'op': 'META'}, 2)
[2017-07-17 11:24:02.33834] I [master(/rhs/brick1/b1):1125:crawl] _GMaster: slave's time: (1500290173, 0)

[2]:
[2017-07-17 11:24:02.57640] E [master(/rhs/brick1/b1):782:log_failures] _GMaster: ENTRY FAILED: ({'uid': 0, 'gfid': 'e9cab0d2-e777-4089-b6fa-ec886c0c929d', 'gid': 0, 'mode': 16877, 'entry': '.gfid/cbf963a0-68c9-4a2d-b5fc-ecd428fdb89a/test', 'op': 'MKDIR'}, 2)
[2017-07-17 11:24:02.58017] E [syncdutils(/rhs/brick1/b1):264:log_raise_exception] <top>: The above directory failed to sync. Please fix it to proceed further.
[2017-07-17 11:24:02.58984] I [syncdutils(/rhs/brick1/b1):237:finalize] <top>: exiting.
[2017-07-17 11:24:02.65656] I [repce(/rhs/brick1/b1):92:service_loop] RepceServer: terminating on reaching EOF.
[2017-07-17 11:24:02.66196] I [syncdutils(/rhs/brick1/b1):237:finalize] <top>: exiting.
[2017-07-17 11:24:02.86541] I [gsyncdstatus(monitor):240:set_worker_status] GeorepStatus: Worker Status: Faulty
[2017-07-17 11:24:12.277601] I [monitor(monitor):275:monitor] Monitor: starting gsyncd worker(/rhs/brick1/b1). Slave node: ssh://root.37.105:gluster://localhost:slave
[2017-07-17 11:24:12.389444] I [resource(/rhs/brick1/b1):1676:connect_remote] SSH: Initializing SSH connection between master and slave...
[2017-07-17 11:24:12.389770] I [changelogagent(/rhs/brick1/b1):73:__init__] ChangelogAgent: Agent listining...
[2017-07-17 11:24:18.189002] I [resource(/rhs/brick1/b1):1683:connect_remote] SSH: SSH connection between master and slave established. Time taken: 5.7992 secs
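A note on reading these logs: the numeric 'mode' field in each ENTRY FAILED tuple is a standard POSIX mode, so it distinguishes the file failures in [1] (33188, octal 0100644, a regular file) from the directory failure in [2] (16877, octal 040755, a directory). A small illustrative helper (not part of gsyncd) makes the check explicit:

```python
import stat

# Illustrative helper: classify a failed entry from the 'mode' field
# seen in the ENTRY FAILED log lines above.
def is_directory_failure(entry):
    """Return True when the failed entry's mode bits denote a directory."""
    return stat.S_ISDIR(entry.get('mode', 0))

# The CREATE failures in [1] carry mode 33188 (regular file); the MKDIR
# failure in [2] carries mode 16877 (directory), which is the one that
# now takes the session to Faulty.
```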
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774