Description of problem:
The worker did not restart after it was aborted for not confirming within 60 seconds, and it remained in a zombie state. The corresponding agent kept running even though the worker had died. As a result, files from the corresponding brick were not synced to the slave.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2014-06-24 11:36:30.589910] I [master(/bricks/brick3/master_b9):452:crawlwrap] _GMaster: primary master with volume id c8712e0b-d171-4812-bda4-b8ad9f1032a3
...
[2014-06-24 11:36:30.604903] I [master(/bricks/brick3/master_b9):463:crawlwrap] _GMaster: crawl interval: 3 seconds
[2014-06-24 11:37:23.568633] I [monitor(monitor):225:monitor] Monitor: worker(/bricks/brick1/master_b1) not confirmed in 60 sec, aborting it
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Version-Release number of selected component (if applicable):
glusterfs-3.6.0.22-1.el6rhs

How reproducible:
Did not try to reproduce the issue.

Steps to Reproduce:
1. Create and start a geo-rep relationship between master and slave.
2. Create data on the master.
3. After this, the issue can occur at any time; the exact trigger is not known.

Actual results:
The worker was not restarted by the monitor after it died.

Expected results:
Whenever a worker dies, the monitor should restart it.

Additional info:
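To make the expected monitor behaviour concrete, here is a minimal Python sketch of a supervision loop. All names (`supervise`, `spawn_worker`, `CONFIRM_TIMEOUT`, the tuple it returns) are hypothetical stand-ins for the real gsyncd monitor/worker interaction, not the actual glusterfs code; the point is only that a worker which dies or fails to confirm within the timeout must always be respawned, which is exactly the step that was missing in this bug.

```python
CONFIRM_TIMEOUT = 60  # seconds a worker has to confirm before the monitor aborts it (matches the log above)

def supervise(spawn_worker, max_attempts=5):
    """Restart the worker every time it dies or fails to confirm in time.

    `spawn_worker` is a hypothetical callable standing in for forking the
    gsyncd worker process; it returns a (confirmed_in_time, exit_ok) pair.
    Returns the number of spawn attempts made.
    """
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        confirmed, exit_ok = spawn_worker()
        if not confirmed:
            # "worker not confirmed in 60 sec, aborting it" -- abort, then
            # loop around and respawn instead of leaving the slot empty
            continue
        if exit_ok:
            return attempts  # worker ran to completion
        # worker died after confirming -> restart as well
    return attempts
```

For example, a worker that fails to confirm twice and then comes up cleanly would be spawned three times by this loop, rather than being abandoned after the first abort as the buggy monitor did.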
Upstream patch sent: http://review.gluster.org/#/c/8194/
Downstream patch sent for review: https://code.engineering.redhat.com/gerrit/#/c/28127/
Verified on build glusterfs-3.6.0.25-1.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html
*** Bug 1114969 has been marked as a duplicate of this bug. ***