Bug 1472604

Summary: [geo-rep]: RMDIR at master causing worker crash
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Rochelle <rallan>
Component: geo-replication
Assignee: Kotresh HR <khiremat>
Status: CLOSED ERRATA
QA Contact: Rochelle <rallan>
Severity: urgent
Priority: unspecified
Docs Contact:
Version: rhgs-3.3
CC: amukherj, asrivast, csaba, khiremat, rhs-bugs, storage-qa-internal
Target Milestone: ---
Target Release: RHGS 3.3.0
Keywords: Regression
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-3.8.4-35
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-09-21 05:02:13 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1417151

Description Rochelle 2017-07-19 05:56:34 UTC
Description of problem:
=========================

RMDIR at master consistently causes the geo-replication workers to crash, so rmdirs are not synced to the slave.
 
Worker Crash:

[2017-07-18 12:44:07.840857] I [master(/bricks/brick0/master_brick0):1130:crawl] _GMaster: slave's time: (1500381620, 0)
[2017-07-18 12:44:07.844522] E [syncdutils(/bricks/brick0/master_brick0):296:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 204, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 780, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1582, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 570, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1141, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1116, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 999, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 826, in process_change
    if ty in ['RMDIR'] and not isinstance(st, int):
UnboundLocalError: local variable 'st' referenced before assignment
[2017-07-18 12:44:07.847419] I [syncdutils(/bricks/brick0/master_brick0):237:finalize] <top>: exiting.
[2017-07-18 12:44:07.853607] I [repce(/bricks/brick0/master_brick0):92:service_loop] RepceServer: terminating on reaching EOF.
[2017-07-18 12:44:07.854031] I [syncdutils(/bricks/brick0/master_brick0):237:finalize] <top>: exiting.
[2017-07-18 12:44:07.875060] I [gsyncdstatus(monitor):240:set_worker_status] GeorepStatus: Worker Status: Faulty
[2017-07-18 12:44:18.97970] I [monitor(monitor):275:monitor] Monitor: starting gsyncd worker(/bricks/brick0/master_brick0). Slave node: ssh://root.37.105:gluster://localhost:slave
[2017-07-18 12:44:18.224348] I [resource(/bricks/brick0/master_brick0):1676:connect_remote] SSH: Initializing SSH connection between master and slave...
[2017-07-18 12:44:18.246416] I [changelogagent(/bricks/brick0/master_brick0):73:__init__] ChangelogAgent: Agent listining...
[2017-07-18 12:44:24.149849] I [resource(/bricks/brick0/master_brick0):1683:connect_remote] SSH: SSH connection between master and slave established. Time taken: 5.9251 secs
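
For context, this is a plain Python UnboundLocalError: 'st' is only assigned on code paths that the downstream build no longer contains, so the RMDIR check at master.py line 826 references a name that was never bound. A minimal standalone reproduction of the failure mode (illustrative only; the MKDIR branch is a hypothetical stand-in for the upstream-only code that assigned 'st', not the actual gsyncd code):

def process_change(ty):
    if ty in ['MKDIR']:
        st = 0  # 'st' is bound only on this path
    # For ty == 'RMDIR', 'st' was never assigned, so this check raises
    # the UnboundLocalError seen in the traceback above.
    if ty in ['RMDIR'] and not isinstance(st, int):
        return 'skip'

process_change('RMDIR')  # UnboundLocalError: local variable 'st' referenced before assignment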



Version-Release number of selected component (if applicable):
==============================================================

glusterfs-geo-replication-3.8.4-34.el6rhs.x86_64

glusterfs-geo-replication-3.8.4-34.el7rhgs.x86_64


How reproducible:
=============
Always

Steps to Reproduce:
====================

Seen during an automation run that performs rmdir at the master.

Actual results:
==============

The worker crashes on rmdir with an UnboundLocalError (see traceback above).


Expected results:
================

The worker should not crash, and rmdirs should be synced to the slave.

Additional info:
============

Seen on both RHEL 6 and RHEL 7.

Comment 3 Kotresh HR 2017-07-19 06:47:59 UTC
Analysis:

The fix for bug [1] checks whether the directory is still present on the master; if it is, geo-rep does not proceed with the RMDIR. Upstream, the required stat was already being performed as part of patch [2], so patch [3] for bug [1] simply reused the existing stat result. Downstream, however, patch [3] was backported without patch [2], so the stat result ('st') is never assigned, which causes this crash.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1468186
[2]: http://review.gluster.org/15110
[3]: https://review.gluster.org/#/c/17695


Solution:
Per the above analysis, the fix is to add the missing stat before the RMDIR check. The change is DOWNSTREAM_ONLY.
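
A rough sketch of the shape of that fix (a simplified assumption, not the actual downstream patch; gsyncd's lstat helper conventionally returns the errno as an int on failure, which is what the 'isinstance(st, int)' check in the traceback relies on):

import os

def lstat(path):
    # Simplified stand-in for syncdutils.lstat: return the stat result
    # on success, or the errno as an int on failure.
    try:
        return os.lstat(path)
    except OSError as e:
        return e.errno

def skip_rmdir(ty, entry_path):
    # The fix: perform the stat before the RMDIR check so 'st' is
    # always bound. If the directory still exists on the master (the
    # stat succeeded, i.e. 'st' is not an errno int), the RMDIR must
    # not be synced to the slave.
    st = lstat(entry_path)
    return ty in ['RMDIR'] and not isinstance(st, int)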

Comment 4 Kotresh HR 2017-07-19 08:44:05 UTC
Downstream Only Patch:
https://code.engineering.redhat.com/gerrit/#/c/112794/

Comment 7 Rochelle 2017-07-24 04:33:44 UTC
RMDIR cases passed with build : glusterfs-geo-replication-3.8.4-35.el7rhgs.x86_64

No worker crashes have been seen.
Moving this bug to verified.

Comment 9 errata-xmlrpc 2017-09-21 05:02:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774