Bug 1339472 - [geo-rep]: Monitor crashed with [Errno 3] No such process
Summary: [geo-rep]: Monitor crashed with [Errno 3] No such process
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: geo-replication
Version: mainline
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Aravinda VK
QA Contact:
URL:
Whiteboard:
Depends On: 1339163
Blocks: 1341068 1341069
 
Reported: 2016-05-25 06:46 UTC by Aravinda VK
Modified: 2017-03-27 18:11 UTC (History)
5 users

Fixed In Version: glusterfs-3.9.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1339163
: 1341068 1341069 (view as bug list)
Environment:
Last Closed: 2017-03-27 18:11:33 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Aravinda VK 2016-05-25 06:46:15 UTC
+++ This bug was initially created as a clone of Bug #1339163 +++

Description of problem:
=======================

While Monitor was aborting the worker, it crashed as:

[2016-05-23 16:49:33.903965] I [monitor(monitor):326:monitor] Monitor: Changelog Agent died, Aborting Worker(/bricks/brick0/master_brick0)
[2016-05-23 16:49:33.904535] E [syncdutils(monitor):276:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 306, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 393, in wmon
    slave_host, master)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 327, in monitor
    os.kill(cpid, signal.SIGKILL)
OSError: [Errno 3] No such process



In the ideal scenario, the monitor process should never go down. If the worker dies, the monitor kills the agent and restarts both; if the agent dies, the monitor kills the worker and restarts both.

In this case, however, the agent died and the monitor crashed while trying to abort the worker.

The geo-rep session will remain in the stopped state until it is restarted.
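The crash comes from os.kill() raising OSError with errno ESRCH when the target worker has already exited. The graceful-handling pattern referenced by the fix can be sketched as follows (kill_worker is a hypothetical helper name for illustration, not the actual function in monitor.py):

```python
import errno
import os
import signal

def kill_worker(cpid):
    """Send SIGKILL to the worker process, tolerating the case
    where the worker has already died (ESRCH: No such process)."""
    try:
        os.kill(cpid, signal.SIGKILL)
    except OSError as e:
        if e.errno == errno.ESRCH:
            # Worker already exited; nothing left to abort.
            pass
        else:
            # Any other failure (e.g. EPERM) is still an error.
            raise
```

With this pattern, the monitor no longer propagates an unhandled OSError when the worker it is aborting is already gone.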

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-geo-replication-3.7.9-5.el7rhgs.x86_64
glusterfs-3.7.9-5.el7rhgs.x86_64

How reproducible:
=================
Observed once during an automated regression test run.


Steps to Reproduce:
===================
Exact steps will be worked out and the BZ updated. In general, the scenario is:

=> Kill the agent while the worker is already dead; the monitor then tries to abort the worker and crashes, as seen in the monitor logs.

Comment 1 Vijay Bellur 2016-05-25 06:47:16 UTC
REVIEW: http://review.gluster.org/14512 (geo-rep: Handle Worker kill gracefully if worker already died) posted (#2) for review on master by Aravinda VK (avishwan)

Comment 2 Vijay Bellur 2016-05-27 07:09:55 UTC
REVIEW: http://review.gluster.org/14512 (geo-rep: Handle Worker kill gracefully if worker already died) posted (#3) for review on master by Aravinda VK (avishwan)

Comment 3 Vijay Bellur 2016-05-30 07:15:47 UTC
REVIEW: http://review.gluster.org/14512 (geo-rep: Handle Worker kill gracefully if worker already died) posted (#4) for review on master by Aravinda VK (avishwan)

Comment 4 Vijay Bellur 2016-05-30 10:24:25 UTC
REVIEW: http://review.gluster.org/14512 (geo-rep: Handle Worker kill gracefully if worker already died) posted (#5) for review on master by Aravinda VK (avishwan)

Comment 5 Vijay Bellur 2016-05-30 15:12:08 UTC
COMMIT: http://review.gluster.org/14512 committed in master by Aravinda VK (avishwan) 
------
commit 4f4a94a35a24d781f3f0e584a8cb59c019e50d6f
Author: Aravinda VK <avishwan>
Date:   Tue May 24 14:13:29 2016 +0530

    geo-rep: Handle Worker kill gracefully if worker already died
    
    If the Agent dies for any reason, the monitor tries to kill the Worker
    as well. But if the worker has also died, the kill call raises ESRCH:
    No such process.
    
    [2016-05-23 16:49:33.903965] I [monitor(monitor):326:monitor] Monitor:
        Changelog Agent died, Aborting Worker(/bricks/brick0/master_brick0)
    [2016-05-23 16:49:33.904535] E [syncdutils(monitor):276:log_raise_exception]
        <top>: FAIL:
    Traceback (most recent call last):
      File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 306 in
      twrap
        tf(*aa)
      File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 393, in
      wmon
         slave_host, master)
      File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 327, in
      monitor
         os.kill(cpid, signal.SIGKILL)
         OSError: [Errno 3] No such process
    
    With this patch, the monitor gracefully handles the case where the
    worker has already died.
    
    Change-Id: I3ae5f816a3a197343b64540cf46f5453167fb660
    Signed-off-by: Aravinda VK <avishwan>
    BUG: 1339472
    Reviewed-on: http://review.gluster.org/14512
    Smoke: Gluster Build System <jenkins.com>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Kotresh HR <khiremat>
    CentOS-regression: Gluster Build System <jenkins.com>

Comment 6 Shyamsundar 2017-03-27 18:11:33 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.9.0, please open a new bug report.

glusterfs-3.9.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2016-November/029281.html
[2] https://www.gluster.org/pipermail/gluster-users/

