Bug 1112582 - Dist-geo-rep : worker was not restarted by monitor, after it died, and remained in zombie state.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: geo-replication
Version: rhgs-3.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: RHGS 3.0.0
Assignee: Aravinda VK
QA Contact: Bhaskar Bandari
URL:
Whiteboard:
Duplicates: 1114969
Depends On:
Blocks: 1114003
 
Reported: 2014-06-24 09:19 UTC by Vijaykumar Koppad
Modified: 2015-05-13 17:00 UTC (History)

Fixed In Version: glusterfs-3.6.0.24-1
Doc Type: Known Issue
Doc Text:
Cause: If a geo-rep worker dies in the initial phase of establishing a connection to the slave, the worker process becomes defunct and the geo-rep monitor does not start the worker again.
Consequence: Files are not synced from that node.
Workaround (if any): Restart the geo-rep session.
Result:
Clone Of:
Clones: 1114003
Environment:
Last Closed: 2014-09-22 19:42:47 UTC
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHEA-2014:1278 | 0 | normal | SHIPPED_LIVE | Red Hat Storage Server 3.0 bug fix and enhancement update | 2014-09-22 23:26:55 UTC

Description Vijaykumar Koppad 2014-06-24 09:19:36 UTC
Description of problem: A worker was not restarted after the monitor aborted it for not confirming within 60 seconds, and it remained in a zombie state. The corresponding agent kept running even though the worker had died. As a result, files from the corresponding brick were not synced to the slave.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2014-06-24 11:36:30.589910] I [master(/bricks/brick3/master_b9):452:crawlwrap] _GMaster: primary master with volume id c8712e0b-d171-4812-bda4-b8ad9f1032a3 ...
[2014-06-24 11:36:30.604903] I [master(/bricks/brick3/master_b9):463:crawlwrap] _GMaster: crawl interval: 3 seconds
[2014-06-24 11:37:23.568633] I [monitor(monitor):225:monitor] Monitor: worker(/bricks/brick1/master_b1) not confirmed in 60 sec, aborting it
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
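
For readers unfamiliar with the zombie (defunct) state mentioned above, here is a minimal Python sketch of how it arises on Linux. It is not taken from gsyncd: `proc_state` and `demo_zombie` are hypothetical names, and the child process is only a stand-in for a worker that dies right after start-up. A child that has exited stays in state `Z` until its parent calls wait on it.

```python
import subprocess
import sys
import time


def proc_state(pid):
    """Return the one-letter Linux process state from /proc, or None if the pid is gone."""
    try:
        with open("/proc/%d/stat" % pid) as f:
            data = f.read()
    except OSError:
        return None
    # The state field is the first token after the parenthesised command name.
    return data.rsplit(")", 1)[1].split()[0]


def demo_zombie():
    """Start a child that exits at once; report its state before and after reaping."""
    # Hypothetical stand-in for a worker that dies during start-up.
    child = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(1)"])

    # Watch the child WITHOUT calling wait()/poll(), so it is not reaped.
    deadline = time.time() + 5
    state = proc_state(child.pid)
    while state != "Z" and time.time() < deadline:
        time.sleep(0.05)
        state = proc_state(child.pid)
    before = state

    child.wait()  # reaping removes the zombie entry from the process table
    after = proc_state(child.pid)
    return before, after
```

On a Linux host, `demo_zombie()` should return `("Z", None)`: the dead child is a zombie until the parent reaps it, after which its `/proc` entry is gone.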


Version-Release number of selected component (if applicable): glusterfs-3.6.0.22-1.el6rhs


How reproducible: Didn't try to reproduce the issue.


Steps to Reproduce:
1. Create and start a geo-rep relationship between master and slave.
2. Create data on master.
3. After this, the issue can occur at any time; it is not clear exactly what triggers it.

Actual results: Worker was not restarted by monitor after it died.


Expected results: Whenever worker dies, it should be restarted by monitor. 
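
The expected behaviour can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not the actual gsyncd monitor or the patch referenced in this bug: `start_and_confirm` and `monitor` are hypothetical names, and "confirmation" is modelled as the worker writing one byte to its stdout. The point is that an unconfirmed worker is killed and then reaped with `wait()`, so it cannot linger as a zombie, and the loop then starts a fresh worker.

```python
import select
import subprocess
import sys


def start_and_confirm(cmd, timeout):
    """Start cmd; return the Popen if it confirms within timeout seconds, else None."""
    worker = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    # Assumption: the worker confirms by writing one byte to stdout.
    ready, _, _ = select.select([worker.stdout], [], [], timeout)
    if ready and worker.stdout.read(1):
        return worker
    # Timed out, or the worker died before confirming (EOF reads as b"").
    worker.kill()
    worker.wait()  # reap the child so it does not stay defunct
    worker.stdout.close()
    return None


def monitor(cmd, timeout, max_attempts):
    """Run the worker up to max_attempts times, restarting after each abort or exit."""
    events = []
    for _ in range(max_attempts):
        worker = start_and_confirm(cmd, timeout)
        if worker is None:
            events.append("aborted")
            continue  # restart instead of giving up on this brick
        events.append("confirmed")
        worker.wait()  # block until the worker exits, reaping it
        worker.stdout.close()
        events.append("exited")
    return events


# Hypothetical workers used only to exercise the sketch.
DYING_WORKER = [sys.executable, "-c", "import sys; sys.exit(1)"]
GOOD_WORKER = [sys.executable, "-c",
               "import sys; sys.stdout.write('k'); sys.stdout.flush()"]
```

For example, `monitor(DYING_WORKER, 10, 2)` makes two attempts and records an abort for each, while `monitor(GOOD_WORKER, 10, 1)` records a confirmation followed by a normal exit. The sketch relies on `select` over a pipe, so it assumes a POSIX system.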

Additional info:

Comment 2 Aravinda VK 2014-06-27 12:36:46 UTC
Upstream patch sent: http://review.gluster.org/#/c/8194/

Comment 3 Aravinda VK 2014-06-30 11:41:58 UTC
Downstream patch sent for review: https://code.engineering.redhat.com/gerrit/#/c/28127/

Comment 4 Vijaykumar Koppad 2014-07-22 09:16:19 UTC
Verified on build glusterfs-3.6.0.25-1.

Comment 8 errata-xmlrpc 2014-09-22 19:42:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html

Comment 9 Aravinda VK 2015-03-11 17:47:37 UTC
*** Bug 1114969 has been marked as a duplicate of this bug. ***

