Description of problem:
=========================
RMDIR at master is consistently crashing the workers, so rmdirs are not synced to the slave.

Worker crash:

[2017-07-18 12:44:07.840857] I [master(/bricks/brick0/master_brick0):1130:crawl] _GMaster: slave's time: (1500381620, 0)
[2017-07-18 12:44:07.844522] E [syncdutils(/bricks/brick0/master_brick0):296:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 204, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 780, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1582, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 570, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1141, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1116, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 999, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 826, in process_change
    if ty in ['RMDIR'] and not isinstance(st, int):
UnboundLocalError: local variable 'st' referenced before assignment
[2017-07-18 12:44:07.847419] I [syncdutils(/bricks/brick0/master_brick0):237:finalize] <top>: exiting.
[2017-07-18 12:44:07.853607] I [repce(/bricks/brick0/master_brick0):92:service_loop] RepceServer: terminating on reaching EOF.
[2017-07-18 12:44:07.854031] I [syncdutils(/bricks/brick0/master_brick0):237:finalize] <top>: exiting.
[2017-07-18 12:44:07.875060] I [gsyncdstatus(monitor):240:set_worker_status] GeorepStatus: Worker Status: Faulty
[2017-07-18 12:44:18.97970] I [monitor(monitor):275:monitor] Monitor: starting gsyncd worker(/bricks/brick0/master_brick0).
Slave node: ssh://root.37.105:gluster://localhost:slave

[2017-07-18 12:44:18.224348] I [resource(/bricks/brick0/master_brick0):1676:connect_remote] SSH: Initializing SSH connection between master and slave...
[2017-07-18 12:44:18.246416] I [changelogagent(/bricks/brick0/master_brick0):73:__init__] ChangelogAgent: Agent listining...
[2017-07-18 12:44:24.149849] I [resource(/bricks/brick0/master_brick0):1683:connect_remote] SSH: SSH connection between master and slave established. Time taken: 5.9251 secs

Version-Release number of selected component (if applicable):
==============================================================
glusterfs-geo-replication-3.8.4-34.el6rhs.x86_64
glusterfs-geo-replication-3.8.4-34.el7rhgs.x86_64

How reproducible:
=============
Always

Steps to Reproduce:
====================
Seen during an automation run which performs rmdir at master.

Actual results:
==============
Worker crashed on rmdir.

Expected results:
================
Worker should not crash.

Additional info:
============
Seen on both RHEL 7 and RHEL 6.
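The UnboundLocalError in the traceback above is the classic Python pattern where a local variable is assigned only on one branch but read unconditionally afterwards. A minimal, hypothetical reduction (the function below is illustrative, not the actual master.py code):

```python
def process_change(ty, entry_present):
    # Hypothetical reduction of master.py's process_change(): 'st' is
    # assigned only inside a conditional branch, so the later RMDIR
    # check raises UnboundLocalError whenever that branch is skipped.
    if entry_present:
        st = 0  # in gsyncd this value would come from a stat call
    if ty in ['RMDIR'] and not isinstance(st, int):
        return "skip rmdir"
    return "proceed"
```

Calling `process_change('RMDIR', False)` raises UnboundLocalError, matching the crash in the worker log.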
Analysis:
=========
With bug [1], we check for the directory's presence on the master; if the directory is still present, geo-rep does not proceed with the RMDIR. Upstream, the stat was already being done as part of [2], and the patch [3] for bug [1] simply reused that existing stat information. However, patch [2] was not taken into downstream while patch [3] was, leaving 'st' unassigned and causing this crash.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1468186
[2]: http://review.gluster.org/15110
[3]: https://review.gluster.org/#/c/17695

Solution:
=========
From the above analysis, the fix for this bug is to add the missing stat, and it is DOWNSTREAM_ONLY.
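A minimal sketch of the kind of fix described above, assuming gsyncd's convention where the stat helper returns either a stat result or the errno as a plain int on failure (which is what the `isinstance(st, int)` test in the traceback distinguishes). The names `lstat_or_errno` and `should_skip_rmdir` are illustrative, not the actual downstream patch:

```python
import os

def lstat_or_errno(path):
    # Mirrors the gsyncd-style stat helper: return the stat result on
    # success, or the errno as a plain int on failure.
    try:
        return os.lstat(path)
    except OSError as e:
        return e.errno

def should_skip_rmdir(path):
    # The fix is to perform this stat before the RMDIR check, so 'st'
    # is always bound. If the directory is still present on the master
    # (stat succeeded), geo-rep skips replaying the RMDIR on the slave.
    st = lstat_or_errno(path)
    return not isinstance(st, int)
```

With the stat in place, the RMDIR branch always sees a bound `st`, whether or not the entry still exists on the master brick.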
Downstream Only Patch: https://code.engineering.redhat.com/gerrit/#/c/112794/
RMDIR cases passed with build glusterfs-geo-replication-3.8.4-35.el7rhgs.x86_64. No worker crashes have been seen. Moving this bug to Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774