Bug 1341069 - [geo-rep]: Monitor crashed with [Errno 3] No such process
Summary: [geo-rep]: Monitor crashed with [Errno 3] No such process
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: geo-replication
Version: 3.8.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Aravinda VK
QA Contact:
URL:
Whiteboard:
Depends On: 1339163 1339472
Blocks: 1341068
 
Reported: 2016-05-31 08:11 UTC by Aravinda VK
Modified: 2016-06-16 12:32 UTC

Fixed In Version: glusterfs-3.8.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1339472
Environment:
Last Closed: 2016-06-16 12:32:47 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Aravinda VK 2016-05-31 08:11:56 UTC
+++ This bug was initially created as a clone of Bug #1339472 +++

+++ This bug was initially created as a clone of Bug #1339163 +++

Description of problem:
=======================

While the monitor was aborting the worker, it crashed with the following traceback:

[2016-05-23 16:49:33.903965] I [monitor(monitor):326:monitor] Monitor: Changelog Agent died, Aborting Worker(/bricks/brick0/master_brick0)
[2016-05-23 16:49:33.904535] E [syncdutils(monitor):276:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 306, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 393, in wmon
    slave_host, master)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 327, in monitor
    os.kill(cpid, signal.SIGKILL)
OSError: [Errno 3] No such process



Ideally, the monitor process should never go down. If the worker dies, the monitor kills the agent and restarts both. If the agent dies, the monitor kills the worker and restarts both.

In this case, the agent died and the monitor crashed while trying to abort the worker: os.kill raised ESRCH (No such process), which indicates that the worker had already exited.

The geo-rep session remains in the stopped state until it is restarted.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-geo-replication-3.7.9-5.el7rhgs.x86_64
glusterfs-3.7.9-5.el7rhgs.x86_64

How reproducible:
=================
Seen once during a run of the automated regression test suite.


Steps to Reproduce:
===================
I will work out the exact steps and update the BZ. In general, the scenario would be:

=> Kill the agent and watch the monitor logs, where the monitor tries to abort the worker.
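
The failure mode can be illustrated outside geo-rep with a minimal sketch. It is not the code from monitor.py or from the eventual patch. The helper name kill_if_alive and the throwaway child process are hypothetical, introduced only for illustration. The sketch assumes Python 3 on Linux, where os.kill() on a PID that has already exited and been reaped raises OSError with errno ESRCH, and it treats that one error as "already dead" instead of letting it propagate.

```python
import errno
import os
import signal
import subprocess
import sys


def kill_if_alive(pid, sig=signal.SIGKILL):
    """Send sig to pid. Return False if the process no longer exists.

    Hypothetical helper: only ESRCH ("No such process") is swallowed;
    any other OSError (for example EPERM) is re-raised.
    """
    try:
        os.kill(pid, sig)
        return True
    except OSError as e:
        if e.errno == errno.ESRCH:
            return False
        raise


# Stand-in for a worker that has already exited: start a child that
# terminates immediately, then reap it so its PID is gone.
child = subprocess.Popen([sys.executable, "-c", "pass"])
child.wait()

# Signal 0 performs only the existence check, so nothing is actually
# killed in this demonstration. A bare os.kill() here would raise
# OSError: [Errno 3] No such process, as in the traceback above.
already_gone = not kill_if_alive(child.pid, 0)
print("worker already gone:", already_gone)
```

With a guard of this kind, a caller can log that the worker was already dead and carry on with its restart logic rather than crashing.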

--- Additional comment from Vijay Bellur on 2016-05-25 02:47:16 EDT ---

REVIEW: http://review.gluster.org/14512 (geo-rep: Handle Worker kill gracefully if worker already died) posted (#2) for review on master by Aravinda VK (avishwan)

--- Additional comment from Vijay Bellur on 2016-05-27 03:09:55 EDT ---

REVIEW: http://review.gluster.org/14512 (geo-rep: Handle Worker kill gracefully if worker already died) posted (#3) for review on master by Aravinda VK (avishwan)

--- Additional comment from Vijay Bellur on 2016-05-30 03:15:47 EDT ---

REVIEW: http://review.gluster.org/14512 (geo-rep: Handle Worker kill gracefully if worker already died) posted (#4) for review on master by Aravinda VK (avishwan)

--- Additional comment from Vijay Bellur on 2016-05-30 06:24:25 EDT ---

REVIEW: http://review.gluster.org/14512 (geo-rep: Handle Worker kill gracefully if worker already died) posted (#5) for review on master by Aravinda VK (avishwan)

--- Additional comment from Vijay Bellur on 2016-05-30 11:12:08 EDT ---

COMMIT: http://review.gluster.org/14512 committed in master by Aravinda VK (avishwan) 
------
commit 4f4a94a35a24d781f3f0e584a8cb59c019e50d6f
Author: Aravinda VK <avishwan>
Date:   Tue May 24 14:13:29 2016 +0530

    geo-rep: Handle Worker kill gracefully if worker already died
    
    If Agent dies for any reason, monitor tries to kill Worker also. But
    if worker is also died then kill command raises error ESRCH: No such
    process.
    
    [2016-05-23 16:49:33.903965] I [monitor(monitor):326:monitor] Monitor:
        Changelog Agent died, Aborting Worker(/bricks/brick0/master_brick0)
    [2016-05-23 16:49:33.904535] E [syncdutils(monitor):276:log_raise_exception]
        <top>: FAIL:
    Traceback (most recent call last):
      File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 306 in
      twrap
        tf(*aa)
      File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 393, in
      wmon
         slave_host, master)
      File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 327, in
      monitor
         os.kill(cpid, signal.SIGKILL)
         OSError: [Errno 3] No such process
    
    With this patch, monitor will gracefully handle if worker is already died.
    
    Change-Id: I3ae5f816a3a197343b64540cf46f5453167fb660
    Signed-off-by: Aravinda VK <avishwan>
    BUG: 1339472
    Reviewed-on: http://review.gluster.org/14512
    Smoke: Gluster Build System <jenkins.com>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Kotresh HR <khiremat>
    CentOS-regression: Gluster Build System <jenkins.com>

Comment 1 Vijay Bellur 2016-05-31 08:14:18 UTC
REVIEW: http://review.gluster.org/14563 (geo-rep: Handle Worker kill gracefully if worker already died) posted (#1) for review on release-3.8 by Aravinda VK (avishwan)

Comment 2 Vijay Bellur 2016-06-13 13:25:33 UTC
REVIEW: http://review.gluster.org/14563 (geo-rep: Handle Worker kill gracefully if worker already died) posted (#2) for review on release-3.8 by Aravinda VK (avishwan)

Comment 3 Vijay Bellur 2016-06-13 13:47:04 UTC
COMMIT: http://review.gluster.org/14563 committed in release-3.8 by Niels de Vos (ndevos) 
------
commit 04e880a31f641659ebaf310898bbfc221d69e5fd
Author: Aravinda VK <avishwan>
Date:   Tue May 24 14:13:29 2016 +0530

    geo-rep: Handle Worker kill gracefully if worker already died
    
    If Agent dies for any reason, monitor tries to kill Worker also. But
    if worker is also died then kill command raises error ESRCH: No such
    process.
    
    [2016-05-23 16:49:33.903965] I [monitor(monitor):326:monitor] Monitor:
        Changelog Agent died, Aborting Worker(/bricks/brick0/master_brick0)
    [2016-05-23 16:49:33.904535] E [syncdutils(monitor):276:log_raise_exception]
        <top>: FAIL:
    Traceback (most recent call last):
      File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 306 in
      twrap
        tf(*aa)
      File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 393, in
      wmon
         slave_host, master)
      File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 327, in
      monitor
         os.kill(cpid, signal.SIGKILL)
         OSError: [Errno 3] No such process
    
    With this patch, monitor will gracefully handle if worker is already died.
    
    > Change-Id: I3ae5f816a3a197343b64540cf46f5453167fb660
    > Signed-off-by: Aravinda VK <avishwan>
    > BUG: 1339472
    > Reviewed-on: http://review.gluster.org/14512
    > Smoke: Gluster Build System <jenkins.com>
    > NetBSD-regression: NetBSD Build System <jenkins.org>
    > Reviewed-by: Kotresh HR <khiremat>
    > CentOS-regression: Gluster Build System <jenkins.com>
    (cherry picked from commit 4f4a94a35a24d781f3f0e584a8cb59c019e50d6f)
    
    Change-Id: I3ae5f816a3a197343b64540cf46f5453167fb660
    Signed-off-by: Aravinda VK <avishwan>
    BUG: 1341069
    Reviewed-on: http://review.gluster.org/14563
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.com>
    Reviewed-by: Niels de Vos <ndevos>
    Smoke: Gluster Build System <jenkins.com>

Comment 4 Niels de Vos 2016-06-16 12:32:47 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.8.0, please open a new bug report.

glusterfs-3.8.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://blog.gluster.org/2016/06/glusterfs-3-8-released/
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user

