Description of problem:
=========================
RMDIR at master is consistently crashing the workers, so rmdirs are not synced to the slave.

Worker crash:

[2017-07-18 12:44:07.840857] I [master(/bricks/brick0/master_brick0):1130:crawl] _GMaster: slave's time: (1500381620, 0)
[2017-07-18 12:44:07.844522] E [syncdutils(/bricks/brick0/master_brick0):296:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 204, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 780, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1582, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 570, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1141, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1116, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 999, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 826, in process_change
    if ty in ['RMDIR'] and not isinstance(st, int):
UnboundLocalError: local variable 'st' referenced before assignment
[2017-07-18 12:44:07.847419] I [syncdutils(/bricks/brick0/master_brick0):237:finalize] <top>: exiting.
[2017-07-18 12:44:07.853607] I [repce(/bricks/brick0/master_brick0):92:service_loop] RepceServer: terminating on reaching EOF.
[2017-07-18 12:44:07.854031] I [syncdutils(/bricks/brick0/master_brick0):237:finalize] <top>: exiting.
[2017-07-18 12:44:07.875060] I [gsyncdstatus(monitor):240:set_worker_status] GeorepStatus: Worker Status: Faulty
[2017-07-18 12:44:18.97970] I [monitor(monitor):275:monitor] Monitor: starting gsyncd worker(/bricks/brick0/master_brick0).
Slave node: ssh://root.37.105:gluster://localhost:slave

[2017-07-18 12:44:18.224348] I [resource(/bricks/brick0/master_brick0):1676:connect_remote] SSH: Initializing SSH connection between master and slave...
[2017-07-18 12:44:18.246416] I [changelogagent(/bricks/brick0/master_brick0):73:__init__] ChangelogAgent: Agent listining...
[2017-07-18 12:44:24.149849] I [resource(/bricks/brick0/master_brick0):1683:connect_remote] SSH: SSH connection between master and slave established. Time taken: 5.9251 secs

Version-Release number of selected component (if applicable):
==============================================================
glusterfs-geo-replication-3.8.4-34.el6rhs.x86_64
glusterfs-geo-replication-3.8.4-34.el7rhgs.x86_64

How reproducible:
=============
Always

Steps to Reproduce:
====================
Seen during an automation run which performs rmdir at master.

Actual results:
==============
Worker crashed on rmdir.

Expected results:
================
Worker should not crash.

Additional info:
============
Seen on both RHEL 7 and RHEL 6.
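The UnboundLocalError in the traceback above is the classic Python pattern where a local variable is assigned only on one branch but read unconditionally afterwards. A minimal, hypothetical reduction (the function below is illustrative, not the actual master.py code):

```python
def process_change(ty, entry_present):
    # Hypothetical reduction of master.py's process_change(): 'st' is
    # assigned only inside a conditional branch, so the later RMDIR
    # check raises UnboundLocalError whenever that branch is skipped.
    if entry_present:
        st = 0  # in gsyncd this value would come from a stat call
    if ty in ['RMDIR'] and not isinstance(st, int):
        return "skip rmdir"
    return "proceed"
```

Calling `process_change('RMDIR', False)` raises UnboundLocalError, matching the crash in the worker log.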
Analysis:
=========
With bug [1], we check for the directory's presence on the master; if the directory is still present, geo-rep does not proceed with the RMDIR. Upstream, the stat was already being done as part of [2], and the patch [3] for bug [1] simply reused that existing stat information. However, patch [2] was not taken into downstream while patch [3] was, leaving 'st' unassigned and causing this crash.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1468186
[2]: http://review.gluster.org/15110
[3]: https://review.gluster.org/#/c/17695

Solution:
=========
From the above analysis, the fix for this bug is to add the missing stat, and it is DOWNSTREAM_ONLY.
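A minimal sketch of the kind of fix described above, assuming gsyncd's convention where the stat helper returns either a stat result or the errno as a plain int on failure (which is what the `isinstance(st, int)` test in the traceback distinguishes). The names `lstat_or_errno` and `should_skip_rmdir` are illustrative, not the actual downstream patch:

```python
import os

def lstat_or_errno(path):
    # Mirrors the gsyncd-style stat helper: return the stat result on
    # success, or the errno as a plain int on failure.
    try:
        return os.lstat(path)
    except OSError as e:
        return e.errno

def should_skip_rmdir(path):
    # The fix is to perform this stat before the RMDIR check, so 'st'
    # is always bound. If the directory is still present on the master
    # (stat succeeded), geo-rep skips replaying the RMDIR on the slave.
    st = lstat_or_errno(path)
    return not isinstance(st, int)
```

With the stat in place, the RMDIR branch always sees a bound `st`, whether or not the entry still exists on the master brick.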
Downstream Only Patch: https://code.engineering.redhat.com/gerrit/#/c/112794/
RMDIR cases passed with build glusterfs-geo-replication-3.8.4-35.el7rhgs.x86_64. No worker crashes have been seen. Moving this bug to Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774