Description of problem:
The worker did not restart after it was aborted for not confirming within 60 seconds, and it remained in a zombie state. The corresponding agent kept running even though the worker had died. As a result, files from the corresponding brick were not synced to the slave.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2014-06-24 11:36:30.589910] I [master(/bricks/brick3/master_b9):452:crawlwrap] _GMaster: primary master with volume id c8712e0b-d171-4812-bda4-b8ad9f1032a3
...
[2014-06-24 11:36:30.604903] I [master(/bricks/brick3/master_b9):463:crawlwrap] _GMaster: crawl interval: 3 seconds
[2014-06-24 11:37:23.568633] I [monitor(monitor):225:monitor] Monitor: worker(/bricks/brick1/master_b1) not confirmed in 60 sec, aborting it
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Version-Release number of selected component (if applicable):
glusterfs-3.6.0.22-1.el6rhs

How reproducible:
Did not try to reproduce the issue.

Steps to Reproduce:
1. Create and start a geo-rep relationship between master and slave.
2. Create data on the master.
3. After this, the issue can occur at any time; the exact trigger is not known.

Actual results:
The worker was not restarted by the monitor after it died.

Expected results:
Whenever a worker dies, the monitor should restart it.

Additional info:
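To make the expected monitor behaviour concrete, here is a minimal Python sketch of a supervision loop. All names (`supervise`, `spawn_worker`, `CONFIRM_TIMEOUT`, the tuple it returns) are hypothetical stand-ins for the real gsyncd monitor/worker interaction, not the actual glusterfs code; the point is only that a worker which dies or fails to confirm within the timeout must always be respawned, which is exactly the step that was missing in this bug.

```python
CONFIRM_TIMEOUT = 60  # seconds a worker has to confirm before the monitor aborts it (matches the log above)

def supervise(spawn_worker, max_attempts=5):
    """Restart the worker every time it dies or fails to confirm in time.

    `spawn_worker` is a hypothetical callable standing in for forking the
    gsyncd worker process; it returns a (confirmed_in_time, exit_ok) pair.
    Returns the number of spawn attempts made.
    """
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        confirmed, exit_ok = spawn_worker()
        if not confirmed:
            # "worker not confirmed in 60 sec, aborting it" -- abort, then
            # loop around and respawn instead of leaving the slot empty
            continue
        if exit_ok:
            return attempts  # worker ran to completion
        # worker died after confirming -> restart as well
    return attempts
```

For example, a worker that fails to confirm twice and then comes up cleanly would be spawned three times by this loop, rather than being abandoned after the first abort as the buggy monitor did.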
Upstream patch sent: http://review.gluster.org/#/c/8194/
Downstream patch sent for review: https://code.engineering.redhat.com/gerrit/#/c/28127/
Verified on build glusterfs-3.6.0.25-1.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html
*** Bug 1114969 has been marked as a duplicate of this bug. ***