Bug 1339163

Summary: [geo-rep]: Monitor crashed with [Errno 3] No such process
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Rahul Hinduja <rhinduja>
Component: geo-replication
Assignee: Aravinda VK <avishwan>
Status: CLOSED ERRATA
QA Contact: Rahul Hinduja <rhinduja>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: rhgs-3.1
CC: amukherj, avishwan, csaba, rcyriac
Target Milestone: ---
Keywords: Regression, ZStream
Target Release: RHGS 3.1.3
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-3.7.9-7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1339472 (view as bug list)
Environment:
Last Closed: 2016-06-23 05:24:10 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1311817, 1339472, 1341068, 1341069

Description Rahul Hinduja 2016-05-24 09:43:18 UTC
Description of problem:
=======================

While the monitor was aborting the worker, it crashed as follows:

[2016-05-23 16:49:33.903965] I [monitor(monitor):326:monitor] Monitor: Changelog Agent died, Aborting Worker(/bricks/brick0/master_brick0)
[2016-05-23 16:49:33.904535] E [syncdutils(monitor):276:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 306, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 393, in wmon
    slave_host, master)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 327, in monitor
    os.kill(cpid, signal.SIGKILL)
OSError: [Errno 3] No such process



In the ideal scenario the monitor process should never go down. If the worker dies, the monitor kills the agent and restarts both; if the agent dies, the monitor kills the worker and restarts both.

In this case, however, the agent died and the monitor crashed while trying to abort the worker.

The geo-rep session will remain in the Stopped state until it is started again.
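
The crash comes from the monitor sending SIGKILL to a worker PID that no longer exists. The patches linked in comments 3 and 8 presumably harden this path; a minimal sketch of such a guard, assuming the fix simply tolerates ESRCH (kill_if_running is a hypothetical helper name, not the actual gsyncd code):

# Hypothetical sketch, not the actual monitor.py change: tolerate a worker
# that has already exited by the time the monitor tries to abort it.
import errno
import os
import signal

def kill_if_running(pid, sig=signal.SIGKILL):
    """Send `sig` to `pid`, ignoring the case where the process is already gone."""
    try:
        os.kill(pid, sig)
    except OSError as e:
        if e.errno != errno.ESRCH:   # ESRCH == [Errno 3] No such process
            raise                    # any other error is still fatal

With a guard like this the monitor can log the abort attempt and fall through to restarting the worker and agent instead of dying.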

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-geo-replication-3.7.9-5.el7rhgs.x86_64
glusterfs-3.7.9-5.el7rhgs.x86_64

How reproducible:
=================
Seen once during an automated regression test suite run.


Steps to Reproduce:
===================
Exact steps will be worked out and the BZ updated. In general the scenario is:

=> Kill the agent and watch the monitor logs while the monitor tries to abort the worker (see the standalone sketch of the underlying race below).
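
For reference, the error itself is easy to reproduce outside gsyncd: sending SIGKILL to a child that has already exited and been reaped raises exactly the OSError in the traceback above. A standalone sketch (plain Python, not gsyncd code):

# Standalone demonstration of the race behind the crash: the "worker" child
# has already exited and been reaped when the "monitor" sends SIGKILL, so
# os.kill() raises OSError: [Errno 3] No such process.
import os
import signal

cpid = os.fork()
if cpid == 0:
    os._exit(0)                       # worker exits immediately (e.g. startup failure)

os.waitpid(cpid, 0)                   # reap the worker; its PID no longer exists
try:
    os.kill(cpid, signal.SIGKILL)     # the monitor's abort attempt
except OSError as e:
    print("os.kill failed: [Errno %d] %s" % (e.errno, e.strerror))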



Additional info:

Comment 3 Aravinda VK 2016-05-25 06:50:42 UTC
Upstream patch sent
http://review.gluster.org/#/c/14512/

Comment 8 Aravinda VK 2016-05-31 09:26:20 UTC
Downstream patch 
https://code.engineering.redhat.com/gerrit/#/c/75474/

Comment 10 Rahul Hinduja 2016-06-01 13:45:12 UTC
Verified with build: glusterfs-3.7.9-7

Steps to reproduce:
===================

Start Geo-Rep session
Immediately kill worker and then agent


With build: glusterfs-3.7.9-6
++++++++++++++++++++++++++++++

[root@dhcp37-88 scripts]# gluster volume geo-replication red 10.70.37.213::hat stop
Stopping geo-replication session between red & 10.70.37.213::hat has been successful
[root@dhcp37-88 scripts]# gluster volume geo-replication red 10.70.37.213::hat start
Starting geo-replication session between red & 10.70.37.213::hat has been successful
[root@dhcp37-88 scripts]# 


[root@dhcp37-43 scripts]# ps -eaf | grep gsync | grep feedback  | awk {'print $2'} | xargs kill -9 

Usage:
 kill [options] <pid|name> [...]

Options:
 -a, --all              do not restrict the name-to-pid conversion to processes
                        with the same uid as the present process
 -s, --signal <sig>     send specified signal
 -q, --queue <sig>      use sigqueue(2) rather than kill(2)
 -p, --pid              print pids without signaling them
 -l, --list [=<signal>] list signal names, or convert one to a name
 -L, --table            list signal names and numbers

 -h, --help     display this help and exit
 -V, --version  output version information and exit

For more details see kill(1).
[root@dhcp37-43 scripts]# ps -eaf | grep gsync | grep feedback  | awk {'print $2'} | xargs kill -9 
[root@dhcp37-43 scripts]# ps -eaf | grep gsync | grep agent  | awk {'print $2'} | xargs kill -9 

Usage:
 kill [options] <pid|name> [...]

Options:
 -a, --all              do not restrict the name-to-pid conversion to processes
                        with the same uid as the present process
 -s, --signal <sig>     send specified signal
 -q, --queue <sig>      use sigqueue(2) rather than kill(2)
 -p, --pid              print pids without signaling them
 -l, --list [=<signal>] list signal names, or convert one to a name
 -L, --table            list signal names and numbers

 -h, --help     display this help and exit
 -V, --version  output version information and exit

For more details see kill(1).
[root@dhcp37-43 scripts]#

[2016-06-01 07:47:11.459875] I [monitor(monitor):326:monitor] Monitor: Changelog Agent died, Aborting Worker(/rhs/brick1/b2)
[2016-06-01 07:47:11.460282] E [syncdutils(monitor):276:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 306, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 393, in wmon
    slave_host, master)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 327, in monitor
    os.kill(cpid, signal.SIGKILL)
OSError: [Errno 3] No such process
[2016-06-01 07:47:11.471770] I [syncdutils(monitor):220:finalize] <top>: exiting.
[root@dhcp37-43 scripts]# 


With build: glusterfs-3.7.9-7
++++++++++++++++++++++++++++++

Carried out the same test and did not observe a monitor crash. Logs are as follows: the agent died and the monitor tried to abort the worker, but the worker had already died in the startup phase. The monitor restarted the worker without crashing in the process:

[2016-06-01 13:35:21.559650] I [monitor(monitor):326:monitor] Monitor: Changelog Agent died, Aborting Worker(/rhs/brick1/b2)
[2016-06-01 13:35:21.562530] I [monitor(monitor):326:monitor] Monitor: Changelog Agent died, Aborting Worker(/rhs/brick2/b4)
[2016-06-01 13:35:21.563463] I [monitor(monitor):343:monitor] Monitor: worker(/rhs/brick1/b2) died in startup phase
[2016-06-01 13:35:21.571918] I [monitor(monitor):343:monitor] Monitor: worker(/rhs/brick2/b4) died in startup phase
[2016-06-01 13:35:31.761452] I [monitor(monitor):73:get_slave_bricks_status] <top>: Unable to get list of up nodes of Debt, returning empty list: Another transaction is in progress for Debt. Please try again after sometime.
[2016-06-01 13:35:31.765465] I [monitor(monitor):266:monitor] Monitor: ------------------------------------------------------------
[2016-06-01 13:35:31.765834] I [monitor(monitor):267:monitor] Monitor: starting gsyncd worker
[2016-06-01 13:35:31.770814] I [monitor(monitor):266:monitor] Monitor: ------------------------------------------------------------
[2016-06-01 13:35:31.772129] I [monitor(monitor):267:monitor] Monitor: starting gsyncd worker
[2016-06-01 13:35:31.890531] I [changelogagent(agent):73:__init__] ChangelogAgent: Agent listining...
[2016-06-01 13:35:31.891576] I [changelogagent(agent):73:__init__] ChangelogAgent: Agent listining...
[2016-06-01 13:35:31.893448] I [gsyncd(/rhs/brick1/b2):698:main_i] <top>: syncing: gluster://localhost:Tech -> ssh://root.37.52:gluster://localhost:Debt
[2016-06-01 13:35:31.903400] I [gsyncd(/rhs/brick2/b4):698:main_i] <top>: syncing: gluster://localhost:Tech -> ssh://root.37.52:gluster://localhost:Debt
[2016-06-01 13:35:34.767489] I [master(/rhs/brick1/b2):83:gmaster_builder] <top>: setting up xsync change detection mode
[2016-06-01 13:35:34.767644] I [master(/rhs/brick2/b4):83:gmaster_builder] <top>: setting up xsync change detection mode


Moving this BZ to verified state.

Comment 12 errata-xmlrpc 2016-06-23 05:24:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1240