Bug 1112582

Summary: Dist-geo-rep: worker was not restarted by the monitor after it died and remained in a zombie state.
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Vijaykumar Koppad <vkoppad>
Component: geo-replication Assignee: Aravinda VK <avishwan>
Status: CLOSED ERRATA QA Contact: Bhaskar Bandari <bbandari>
Severity: high Docs Contact:
Priority: high    
Version: rhgs-3.0 CC: aavati, avishwan, bbandari, csaba, david.macdonald, nlevinki, nsathyan, ssamanta, vagarwal
Target Milestone: ---   
Target Release: RHGS 3.0.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: glusterfs-3.6.0.24-1 Doc Type: Known Issue
Doc Text:
Cause: If the geo-rep worker dies in the initial phase of establishing a connection to the slave, the worker process becomes defunct and the geo-rep monitor does not start the worker again. Consequence: Files are not synced from that node. Workaround (if any): Restart the geo-rep session. Result:
Story Points: ---
Clone Of:
: 1114003 (view as bug list) Environment:
Last Closed: 2014-09-22 19:42:47 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1114003    

Description Vijaykumar Koppad 2014-06-24 09:19:36 UTC
Description of problem: The worker did not restart after it was aborted for not confirming within 60 seconds, and it remained in a zombie state. The corresponding agent kept running even though the worker had died. As a result, files from the corresponding brick were not synced to the slave.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2014-06-24 11:36:30.589910] I [master(/bricks/brick3/master_b9):452:crawlwrap] _GMaster: primary master with volume id c8712e0b-d171-4812-bda4-b8ad9f1032a3 ...
[2014-06-24 11:36:30.604903] I [master(/bricks/brick3/master_b9):463:crawlwrap] _GMaster: crawl interval: 3 seconds
[2014-06-24 11:37:23.568633] I [monitor(monitor):225:monitor] Monitor: worker(/bricks/brick1/master_b1) not confirmed in 60 sec, aborting it
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
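For context on the "zombie" (defunct) state mentioned above: a child process that exits before its parent calls waitpid() lingers in the process table until it is reaped. The following is a minimal illustrative Python sketch of that mechanism, not geo-rep code; it is Linux-specific because it reads the process state from /proc.

```python
import os
import time

# Fork a child that exits immediately; the parent deliberately
# does not reap it right away, leaving it defunct ("zombie").
pid = os.fork()
if pid == 0:
    os._exit(0)          # child exits at once

time.sleep(0.2)          # give the child time to exit

# On Linux, the third field of /proc/<pid>/stat is the process
# state; 'Z' means zombie (exited but not yet reaped).
with open("/proc/%d/stat" % pid) as f:
    state = f.read().split()[2]
print(state)

# Reaping the child with waitpid() removes the zombie entry.
os.waitpid(pid, 0)
```

This is the state the worker was stuck in: exited but never reaped, so it showed up as defunct while the agent kept running.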


Version-Release number of selected component (if applicable): glusterfs-3.6.0.22-1.el6rhs


How reproducible: Did not try to reproduce the issue.


Steps to Reproduce:
1. Create and start a geo-rep session between the master and the slave.
2. Create data on the master.
3. The issue can then occur at any time; the exact trigger is not known.

Actual results: The worker was not restarted by the monitor after it died.


Expected results: Whenever the worker dies, it should be restarted by the monitor.
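The expected behavior can be sketched as a supervisor loop that always reaps the dead worker (so it never lingers as a zombie) and then respawns it. This is a hypothetical illustration of the pattern, not the actual glusterfs monitor implementation; the worker here is a stand-in that simply exits so the restart loop is visible.

```python
import os
import time

def run_worker():
    # Stand-in for the real geo-rep worker; it just runs
    # briefly and dies so the supervisor has something to restart.
    time.sleep(0.1)
    os._exit(1)

# Supervisor loop: reap the dead worker with waitpid() (avoiding
# a defunct entry) and start a fresh one. Bounded here for the demo;
# a real monitor would loop indefinitely with backoff and logging.
restarts = 0
while restarts < 3:
    pid = os.fork()
    if pid == 0:
        run_worker()
    os.waitpid(pid, 0)   # blocks until the worker dies; reaps it
    restarts += 1
print("worker restarted %d times" % restarts)
```

The key point relative to this bug: the monitor must both reap the dead worker and unconditionally respawn it, even when the worker died before its initial connection to the slave was confirmed.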

Additional info:

Comment 2 Aravinda VK 2014-06-27 12:36:46 UTC
Upstream patch sent: http://review.gluster.org/#/c/8194/

Comment 3 Aravinda VK 2014-06-30 11:41:58 UTC
Downstream patch sent for review: https://code.engineering.redhat.com/gerrit/#/c/28127/

Comment 4 Vijaykumar Koppad 2014-07-22 09:16:19 UTC
Verified on build glusterfs-3.6.0.25-1.

Comment 8 errata-xmlrpc 2014-09-22 19:42:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html

Comment 9 Aravinda VK 2015-03-11 17:47:37 UTC
*** Bug 1114969 has been marked as a duplicate of this bug. ***