Bug 1112582

Summary: Dist-geo-rep: worker was not restarted by the monitor after it died and remained in a zombie state.
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Vijaykumar Koppad <vkoppad>
Component: geo-replication Assignee: Aravinda VK <avishwan>
Status: CLOSED ERRATA QA Contact: Bhaskar Bandari <bbandari>
Severity: high Docs Contact:
Priority: high    
Version: rhgs-3.0 CC: aavati, avishwan, bbandari, csaba, david.macdonald, nlevinki, nsathyan, ssamanta, vagarwal
Target Milestone: ---   
Target Release: RHGS 3.0.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: glusterfs-3.6.0.24-1 Doc Type: Known Issue
Doc Text:
Cause: If the geo-rep worker dies in the initial phase of establishing a connection to the slave, the worker process becomes defunct and the geo-rep monitor does not start the worker again. Consequence: Files are not synced from that node. Workaround (if any): Restart the geo-rep session. Result:
Story Points: ---
Clone Of:
: 1114003 (view as bug list) Environment:
Last Closed: 2014-09-22 19:42:47 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1114003    

Description Vijaykumar Koppad 2014-06-24 09:19:36 UTC
Description of problem: The worker did not restart after it was aborted for not confirming within 60 seconds, and it remained in a zombie state. The corresponding agent kept running even though the worker had died. As a result, files from the corresponding brick were not synced to the slave.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2014-06-24 11:36:30.589910] I [master(/bricks/brick3/master_b9):452:crawlwrap] _GMaster: primary master with volume id c8712e0b-d171-4812-bda4-b8ad9f1032a3 ...
[2014-06-24 11:36:30.604903] I [master(/bricks/brick3/master_b9):463:crawlwrap] _GMaster: crawl interval: 3 seconds
[2014-06-24 11:37:23.568633] I [monitor(monitor):225:monitor] Monitor: worker(/bricks/brick1/master_b1) not confirmed in 60 sec, aborting it
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
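For context on the "zombie" (defunct) state mentioned above: a child process that exits before its parent calls waitpid() lingers in the process table until it is reaped. The following is a minimal illustrative Python sketch of that mechanism, not geo-rep code; it is Linux-specific because it reads the process state from /proc.

```python
import os
import time

# Fork a child that exits immediately; the parent deliberately
# does not reap it right away, leaving it defunct ("zombie").
pid = os.fork()
if pid == 0:
    os._exit(0)          # child exits at once

time.sleep(0.2)          # give the child time to exit

# On Linux, the third field of /proc/<pid>/stat is the process
# state; 'Z' means zombie (exited but not yet reaped).
with open("/proc/%d/stat" % pid) as f:
    state = f.read().split()[2]
print(state)

# Reaping the child with waitpid() removes the zombie entry.
os.waitpid(pid, 0)
```

This is the state the worker was stuck in: exited but never reaped, so it showed up as defunct while the agent kept running.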


Version-Release number of selected component (if applicable): glusterfs-3.6.0.22-1.el6rhs


How reproducible: Did not try to reproduce the issue.


Steps to Reproduce:
1. Create and start a geo-rep session between the master and the slave.
2. Create data on the master.
3. The issue can then occur at any time; the exact trigger is not known.

Actual results: The worker was not restarted by the monitor after it died.


Expected results: Whenever the worker dies, it should be restarted by the monitor.
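The expected behavior can be sketched as a supervisor loop that always reaps the dead worker (so it never lingers as a zombie) and then respawns it. This is a hypothetical illustration of the pattern, not the actual glusterfs monitor implementation; the worker here is a stand-in that simply exits so the restart loop is visible.

```python
import os
import time

def run_worker():
    # Stand-in for the real geo-rep worker; it just runs
    # briefly and dies so the supervisor has something to restart.
    time.sleep(0.1)
    os._exit(1)

# Supervisor loop: reap the dead worker with waitpid() (avoiding
# a defunct entry) and start a fresh one. Bounded here for the demo;
# a real monitor would loop indefinitely with backoff and logging.
restarts = 0
while restarts < 3:
    pid = os.fork()
    if pid == 0:
        run_worker()
    os.waitpid(pid, 0)   # blocks until the worker dies; reaps it
    restarts += 1
print("worker restarted %d times" % restarts)
```

The key point relative to this bug: the monitor must both reap the dead worker and unconditionally respawn it, even when the worker died before its initial connection to the slave was confirmed.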

Additional info:

Comment 2 Aravinda VK 2014-06-27 12:36:46 UTC
Upstream patch sent: http://review.gluster.org/#/c/8194/

Comment 3 Aravinda VK 2014-06-30 11:41:58 UTC
Downstream patch sent for review: https://code.engineering.redhat.com/gerrit/#/c/28127/

Comment 4 Vijaykumar Koppad 2014-07-22 09:16:19 UTC
Verified on build glusterfs-3.6.0.25-1.

Comment 8 errata-xmlrpc 2014-09-22 19:42:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html

Comment 9 Aravinda VK 2015-03-11 17:47:37 UTC
*** Bug 1114969 has been marked as a duplicate of this bug. ***