Bug 1339163
Summary: | [geo-rep]: Monitor crashed with [Errno 3] No such process | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Rahul Hinduja <rhinduja>
Component: | geo-replication | Assignee: | Aravinda VK <avishwan>
Status: | CLOSED ERRATA | QA Contact: | Rahul Hinduja <rhinduja>
Severity: | urgent | Docs Contact: |
Priority: | unspecified | |
Version: | rhgs-3.1 | CC: | amukherj, avishwan, csaba, rcyriac
Target Milestone: | --- | Keywords: | Regression, ZStream
Target Release: | RHGS 3.1.3 | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | glusterfs-3.7.9-7 | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | |
: | 1339472 | Environment: |
Last Closed: | 2016-06-23 05:24:10 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1311817, 1339472, 1341068, 1341069 | |
Description
Rahul Hinduja
2016-05-24 09:43:18 UTC
Upstream patch sent: http://review.gluster.org/#/c/14512/

Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/75474/

Verified with build: glusterfs-3.7.9-7

Steps to reproduce:
===================
1. Start Geo-Rep session
2. Immediately kill worker and then agent

With build: glusterfs-3.7.9.6
++++++++++++++++++++++++++++++

[root@dhcp37-88 scripts]# gluster volume geo-replication red 10.70.37.213::hat stop
Stopping geo-replication session between red & 10.70.37.213::hat has been successful
[root@dhcp37-88 scripts]# gluster volume geo-replication red 10.70.37.213::hat start
Starting geo-replication session between red & 10.70.37.213::hat has been successful
[root@dhcp37-88 scripts]#

[root@dhcp37-43 scripts]# ps -eaf | grep gsync | grep feedback | awk {'print $2'} | xargs kill -9

Usage:
 kill [options] <pid|name> [...]

Options:
 -a, --all              do not restrict the name-to-pid conversion to processes
                        with the same uid as the present process
 -s, --signal <sig>     send specified signal
 -q, --queue <sig>      use sigqueue(2) rather than kill(2)
 -p, --pid              print pids without signaling them
 -l, --list [=<signal>] list signal names, or convert one to a name
 -L, --table            list signal names and numbers

 -h, --help     display this help and exit
 -V, --version  output version information and exit

For more details see kill(1).
[root@dhcp37-43 scripts]# ps -eaf | grep gsync | grep feedback | awk {'print $2'} | xargs kill -9
[root@dhcp37-43 scripts]# ps -eaf | grep gsync | grep agent | awk {'print $2'} | xargs kill -9

Usage:
 kill [options] <pid|name> [...]

Options:
 -a, --all              do not restrict the name-to-pid conversion to processes
                        with the same uid as the present process
 -s, --signal <sig>     send specified signal
 -q, --queue <sig>      use sigqueue(2) rather than kill(2)
 -p, --pid              print pids without signaling them
 -l, --list [=<signal>] list signal names, or convert one to a name
 -L, --table            list signal names and numbers

 -h, --help     display this help and exit
 -V, --version  output version information and exit

For more details see kill(1).
[root@dhcp37-43 scripts]#

[2016-06-01 07:47:11.459875] I [monitor(monitor):326:monitor] Monitor: Changelog Agent died, Aborting Worker(/rhs/brick1/b2)
[2016-06-01 07:47:11.460282] E [syncdutils(monitor):276:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 306, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 393, in wmon
    slave_host, master)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 327, in monitor
    os.kill(cpid, signal.SIGKILL)
OSError: [Errno 3] No such process
[2016-06-01 07:47:11.471770] I [syncdutils(monitor):220:finalize] <top>: exiting.
[root@dhcp37-43 scripts]#

With build: glusterfs-3.7.9.7
++++++++++++++++++++++++++++++

Carried out the same test and didn't observe a monitor crash. The logs are as follows: the agent died and the monitor tried to abort the worker, but the worker had already died in the startup phase. The monitor restarted the worker and did not crash in the process:

[2016-06-01 13:35:21.559650] I [monitor(monitor):326:monitor] Monitor: Changelog Agent died, Aborting Worker(/rhs/brick1/b2)
[2016-06-01 13:35:21.562530] I [monitor(monitor):326:monitor] Monitor: Changelog Agent died, Aborting Worker(/rhs/brick2/b4)
[2016-06-01 13:35:21.563463] I [monitor(monitor):343:monitor] Monitor: worker(/rhs/brick1/b2) died in startup phase
[2016-06-01 13:35:21.571918] I [monitor(monitor):343:monitor] Monitor: worker(/rhs/brick2/b4) died in startup phase
[2016-06-01 13:35:31.761452] I [monitor(monitor):73:get_slave_bricks_status] <top>: Unable to get list of up nodes of Debt, returning empty list: Another transaction is in progress for Debt. Please try again after sometime.
[2016-06-01 13:35:31.765465] I [monitor(monitor):266:monitor] Monitor: ------------------------------------------------------------
[2016-06-01 13:35:31.765834] I [monitor(monitor):267:monitor] Monitor: starting gsyncd worker
[2016-06-01 13:35:31.770814] I [monitor(monitor):266:monitor] Monitor: ------------------------------------------------------------
[2016-06-01 13:35:31.772129] I [monitor(monitor):267:monitor] Monitor: starting gsyncd worker
[2016-06-01 13:35:31.890531] I [changelogagent(agent):73:__init__] ChangelogAgent: Agent listining...
[2016-06-01 13:35:31.891576] I [changelogagent(agent):73:__init__] ChangelogAgent: Agent listining...
[2016-06-01 13:35:31.893448] I [gsyncd(/rhs/brick1/b2):698:main_i] <top>: syncing: gluster://localhost:Tech -> ssh://root.37.52:gluster://localhost:Debt
[2016-06-01 13:35:31.903400] I [gsyncd(/rhs/brick2/b4):698:main_i] <top>: syncing: gluster://localhost:Tech -> ssh://root.37.52:gluster://localhost:Debt
[2016-06-01 13:35:34.767489] I [master(/rhs/brick1/b2):83:gmaster_builder] <top>: setting up xsync change detection mode
[2016-06-01 13:35:34.767644] I [master(/rhs/brick2/b4):83:gmaster_builder] <top>: setting up xsync change detection mode

Moving this BZ to verified state.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1240
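
For readers unfamiliar with the failure mode: the traceback in this report shows os.kill(cpid, signal.SIGKILL) raising OSError with [Errno 3] (ESRCH) because the worker had already exited by the time the monitor tried to abort it. The sketch below illustrates one common way to tolerate that race in Python. It is not the upstream or downstream patch referenced in this bug, and kill_if_alive is a hypothetical helper name, not a function in gsyncd.

```python
import errno
import os
import signal


def kill_if_alive(pid, sig=signal.SIGKILL):
    """Send sig to pid (hypothetical helper, not part of gsyncd).

    Returns True if the signal was delivered, False if the process
    no longer exists (ESRCH). Any other OSError is re-raised.
    """
    try:
        os.kill(pid, sig)
        return True
    except OSError as e:
        # ESRCH is errno 3, "No such process": the target already exited,
        # so there is nothing left to kill and the caller can carry on.
        if e.errno == errno.ESRCH:
            return False
        raise
```

A caller in the position of the monitor could call kill_if_alive(cpid) and continue with its restart logic whichever value comes back, instead of letting the exception end the monitor thread.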