Bug 1339163 - [geo-rep]: Monitor crashed with [Errno 3] No such process
Summary: [geo-rep]: Monitor crashed with [Errno 3] No such process
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: geo-replication
Version: rhgs-3.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.1.3
Assignee: Aravinda VK
QA Contact: Rahul Hinduja
URL:
Whiteboard:
Depends On:
Blocks: 1311817 1339472 1341068 1341069
 
Reported: 2016-05-24 09:43 UTC by Rahul Hinduja
Modified: 2016-06-23 05:24 UTC (History)
4 users

Fixed In Version: glusterfs-3.7.9-7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1339472
Environment:
Last Closed: 2016-06-23 05:24:10 UTC
Embargoed:




Links
System: Red Hat Product Errata
ID: RHBA-2016:1240
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat Gluster Storage 3.1 Update 3
Last Updated: 2016-06-23 08:51:28 UTC

Description Rahul Hinduja 2016-05-24 09:43:18 UTC
Description of problem:
=======================

While the monitor was aborting the worker, it crashed as follows:

[2016-05-23 16:49:33.903965] I [monitor(monitor):326:monitor] Monitor: Changelog Agent died, Aborting Worker(/bricks/brick0/master_brick0)
[2016-05-23 16:49:33.904535] E [syncdutils(monitor):276:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 306, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 393, in wmon
    slave_host, master)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 327, in monitor
    os.kill(cpid, signal.SIGKILL)
OSError: [Errno 3] No such process



In an ideal scenario the monitor process should never go down. If the worker dies, the monitor kills the agent and restarts both; if the agent dies, the monitor kills the worker and restarts both.

In this case the agent died, and the monitor crashed while trying to abort the worker.

The geo-rep session will remain in the stopped state until it is restarted.
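
For illustration, a minimal, hypothetical Python sketch (stand-alone, not gsyncd code) of the race behind the traceback: the worker can exit on its own between the moment the monitor notices the dead agent and the os.kill() call, at which point an unguarded SIGKILL raises OSError with errno ESRCH.

# Hypothetical stand-alone reproduction of the race; not gsyncd code.
import errno
import os
import signal
import time

# Fork a short-lived "worker" that exits on its own almost immediately.
cpid = os.fork()
if cpid == 0:
    os._exit(0)

time.sleep(0.1)       # give the worker time to exit
os.waitpid(cpid, 0)   # reap it, so the pid no longer exists

try:
    # This is what the monitor does when the agent dies: SIGKILL the worker.
    os.kill(cpid, signal.SIGKILL)
except OSError as e:
    # Without a guard like this, the exception propagates and the monitor
    # thread dies, which is the crash shown in the traceback above.
    assert e.errno == errno.ESRCH
    print("worker already gone; nothing to kill")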

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-geo-replication-3.7.9-5.el7rhgs.x86_64
glusterfs-3.7.9-5.el7rhgs.x86_64

How reproducible:
=================
Happened to see this once during the automated regression test suite.


Steps to Reproduce:
===================
Will work out the exact steps and update the BZ. In general, the scenario is:

=> Kill the agent and check the monitor logs, where the monitor tries to abort the worker.



Additional info:

Comment 3 Aravinda VK 2016-05-25 06:50:42 UTC
Upstream patch sent
http://review.gluster.org/#/c/14512/
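
For reference, the general shape of such a fix is to tolerate ESRCH when aborting a worker that has already exited. A minimal sketch of that approach (the helper name is hypothetical; see the upstream review for the actual change):

# Sketch of the general approach only; not the actual patch.
import errno
import os
import signal

def kill_if_running(pid, sig=signal.SIGKILL):
    """Send sig to pid, ignoring the case where the process is already gone."""
    try:
        os.kill(pid, sig)
    except OSError as e:
        if e.errno != errno.ESRCH:  # swallow only "No such process"
            raise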

Comment 8 Aravinda VK 2016-05-31 09:26:20 UTC
Downstream patch 
https://code.engineering.redhat.com/gerrit/#/c/75474/

Comment 10 Rahul Hinduja 2016-06-01 13:45:12 UTC
Verified with build: glusterfs-3.7.9-7

Steps to reproduce:
===================

=> Start the geo-rep session
=> Immediately kill the worker and then the agent


With build: glusterfs-3.7.9-6
++++++++++++++++++++++++++++++

[root@dhcp37-88 scripts]# gluster volume geo-replication red 10.70.37.213::hat stop
Stopping geo-replication session between red & 10.70.37.213::hat has been successful
[root@dhcp37-88 scripts]# gluster volume geo-replication red 10.70.37.213::hat start
Starting geo-replication session between red & 10.70.37.213::hat has been successful
[root@dhcp37-88 scripts]# 


[root@dhcp37-43 scripts]# ps -eaf | grep gsync | grep feedback  | awk {'print $2'} | xargs kill -9 

Usage:
 kill [options] <pid|name> [...]

Options:
 -a, --all              do not restrict the name-to-pid conversion to processes
                        with the same uid as the present process
 -s, --signal <sig>     send specified signal
 -q, --queue <sig>      use sigqueue(2) rather than kill(2)
 -p, --pid              print pids without signaling them
 -l, --list [=<signal>] list signal names, or convert one to a name
 -L, --table            list signal names and numbers

 -h, --help     display this help and exit
 -V, --version  output version information and exit

For more details see kill(1).
[root@dhcp37-43 scripts]# ps -eaf | grep gsync | grep feedback  | awk {'print $2'} | xargs kill -9 
[root@dhcp37-43 scripts]# ps -eaf | grep gsync | grep agent  | awk {'print $2'} | xargs kill -9 

Usage:
 kill [options] <pid|name> [...]

Options:
 -a, --all              do not restrict the name-to-pid conversion to processes
                        with the same uid as the present process
 -s, --signal <sig>     send specified signal
 -q, --queue <sig>      use sigqueue(2) rather than kill(2)
 -p, --pid              print pids without signaling them
 -l, --list [=<signal>] list signal names, or convert one to a name
 -L, --table            list signal names and numbers

 -h, --help     display this help and exit
 -V, --version  output version information and exit

For more details see kill(1).
[root@dhcp37-43 scripts]#

[2016-06-01 07:47:11.459875] I [monitor(monitor):326:monitor] Monitor: Changelog Agent died, Aborting Worker(/rhs/brick1/b2)
[2016-06-01 07:47:11.460282] E [syncdutils(monitor):276:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 306, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 393, in wmon
    slave_host, master)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 327, in monitor
    os.kill(cpid, signal.SIGKILL)
OSError: [Errno 3] No such process
[2016-06-01 07:47:11.471770] I [syncdutils(monitor):220:finalize] <top>: exiting.
[root@dhcp37-43 scripts]# 


With build: glusterfs-3.7.9-7
++++++++++++++++++++++++++++++

Carried out the same test and did not observe a monitor crash. Logs are as follows: the agent died and the monitor tried to abort the worker, but the worker had died in the startup phase. The monitor restarted the worker and did not crash in the process:

[2016-06-01 13:35:21.559650] I [monitor(monitor):326:monitor] Monitor: Changelog Agent died, Aborting Worker(/rhs/brick1/b2)
[2016-06-01 13:35:21.562530] I [monitor(monitor):326:monitor] Monitor: Changelog Agent died, Aborting Worker(/rhs/brick2/b4)
[2016-06-01 13:35:21.563463] I [monitor(monitor):343:monitor] Monitor: worker(/rhs/brick1/b2) died in startup phase
[2016-06-01 13:35:21.571918] I [monitor(monitor):343:monitor] Monitor: worker(/rhs/brick2/b4) died in startup phase
[2016-06-01 13:35:31.761452] I [monitor(monitor):73:get_slave_bricks_status] <top>: Unable to get list of up nodes of Debt, returning empty list: Another transaction is in progress for Debt. Please try again after sometime.
[2016-06-01 13:35:31.765465] I [monitor(monitor):266:monitor] Monitor: ------------------------------------------------------------
[2016-06-01 13:35:31.765834] I [monitor(monitor):267:monitor] Monitor: starting gsyncd worker
[2016-06-01 13:35:31.770814] I [monitor(monitor):266:monitor] Monitor: ------------------------------------------------------------
[2016-06-01 13:35:31.772129] I [monitor(monitor):267:monitor] Monitor: starting gsyncd worker
[2016-06-01 13:35:31.890531] I [changelogagent(agent):73:__init__] ChangelogAgent: Agent listining...
[2016-06-01 13:35:31.891576] I [changelogagent(agent):73:__init__] ChangelogAgent: Agent listining...
[2016-06-01 13:35:31.893448] I [gsyncd(/rhs/brick1/b2):698:main_i] <top>: syncing: gluster://localhost:Tech -> ssh://root.37.52:gluster://localhost:Debt
[2016-06-01 13:35:31.903400] I [gsyncd(/rhs/brick2/b4):698:main_i] <top>: syncing: gluster://localhost:Tech -> ssh://root.37.52:gluster://localhost:Debt
[2016-06-01 13:35:34.767489] I [master(/rhs/brick1/b2):83:gmaster_builder] <top>: setting up xsync change detection mode
[2016-06-01 13:35:34.767644] I [master(/rhs/brick2/b4):83:gmaster_builder] <top>: setting up xsync change detection mode


Moving this BZ to verified state.

Comment 12 errata-xmlrpc 2016-06-23 05:24:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1240

