+++ This bug was initially created as a clone of Bug #1339163 +++

Description of problem:
=======================
While the monitor was aborting the worker, it crashed as:

[2016-05-23 16:49:33.903965] I [monitor(monitor):326:monitor] Monitor: Changelog Agent died, Aborting Worker(/bricks/brick0/master_brick0)
[2016-05-23 16:49:33.904535] E [syncdutils(monitor):276:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 306, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 393, in wmon
    slave_host, master)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 327, in monitor
    os.kill(cpid, signal.SIGKILL)
OSError: [Errno 3] No such process

In an ideal scenario the monitor process should never go down: if the worker dies, the monitor kills the agent and restarts both; if the agent dies, the monitor kills the worker and restarts both. In this case, however, the agent died and the monitor crashed while trying to abort the worker. The geo-replication session will remain in a stopped state until it is restarted.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-geo-replication-3.7.9-5.el7rhgs.x86_64
glusterfs-3.7.9-5.el7rhgs.x86_64

How reproducible:
=================
Seen once during an automated regression test suite run.

Steps to Reproduce:
===================
Will work out the exact steps and update the BZ. In general the scenario is:
=> Kill the agent and check the monitor logs, where the monitor tries to abort the worker.
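To make the failure mode above concrete, here is a minimal stand-alone Python sketch (hypothetical illustration, not the syncdaemon sources): an unguarded os.kill() on a PID whose process has already exited and been reaped raises OSError with errno 3 (ESRCH, "No such process"), matching the traceback.

import os
import signal

# Fork a child that exits immediately, then reap it so its PID is gone.
pid = os.fork()
if pid == 0:
    os._exit(0)        # child: exit right away
os.waitpid(pid, 0)     # parent: reap the child

try:
    # Unguarded kill of a dead process, as in monitor.py line 327 above:
    os.kill(pid, signal.SIGKILL)
except OSError as e:
    print("os.kill failed: %s" % e)   # [Errno 3] No such process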
REVIEW: http://review.gluster.org/14512 (geo-rep: Handle Worker kill gracefully if worker already died) posted (#2) for review on master by Aravinda VK (avishwan)
REVIEW: http://review.gluster.org/14512 (geo-rep: Handle Worker kill gracefully if worker already died) posted (#3) for review on master by Aravinda VK (avishwan)
REVIEW: http://review.gluster.org/14512 (geo-rep: Handle Worker kill gracefully if worker already died) posted (#4) for review on master by Aravinda VK (avishwan)
REVIEW: http://review.gluster.org/14512 (geo-rep: Handle Worker kill gracefully if worker already died) posted (#5) for review on master by Aravinda VK (avishwan)
COMMIT: http://review.gluster.org/14512 committed in master by Aravinda VK (avishwan)
------
commit 4f4a94a35a24d781f3f0e584a8cb59c019e50d6f
Author: Aravinda VK <avishwan>
Date:   Tue May 24 14:13:29 2016 +0530

geo-rep: Handle Worker kill gracefully if worker already died

If the Agent dies for any reason, the monitor tries to kill the Worker as well. But if the Worker has also died, the kill command raises ESRCH: No such process.

[2016-05-23 16:49:33.903965] I [monitor(monitor):326:monitor] Monitor: Changelog Agent died, Aborting Worker(/bricks/brick0/master_brick0)
[2016-05-23 16:49:33.904535] E [syncdutils(monitor):276:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 306, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 393, in wmon
    slave_host, master)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 327, in monitor
    os.kill(cpid, signal.SIGKILL)
OSError: [Errno 3] No such process

With this patch, the monitor gracefully handles the case where the worker has already died.

Change-Id: I3ae5f816a3a197343b64540cf46f5453167fb660
Signed-off-by: Aravinda VK <avishwan>
BUG: 1339472
Reviewed-on: http://review.gluster.org/14512
Smoke: Gluster Build System <jenkins.com>
NetBSD-regression: NetBSD Build System <jenkins.org>
Reviewed-by: Kotresh HR <khiremat>
CentOS-regression: Gluster Build System <jenkins.com>
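For illustration, a sketch of the kind of guard such a fix applies (the helper name kill_if_alive and its exact structure are assumptions for this example, not code taken from the actual patch):

import errno
import os
import signal

def kill_if_alive(cpid, sig):
    # Send sig to cpid, tolerating a process that has already exited.
    try:
        os.kill(cpid, sig)
    except OSError as e:
        if e.errno != errno.ESRCH:
            raise          # unrelated failure: propagate
        # ESRCH: worker already died, nothing to abort

The monitor would then call, e.g., kill_if_alive(cpid, signal.SIGKILL) in place of a bare os.kill(), so an already-dead worker no longer takes the monitor down with it.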
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.9.0, please open a new bug report.

glusterfs-3.9.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2016-November/029281.html
[2] https://www.gluster.org/pipermail/gluster-users/