Bug 1339163
| Summary: | [geo-rep]: Monitor crashed with [Errno 3] No such process | |||
|---|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Rahul Hinduja <rhinduja> | |
| Component: | geo-replication | Assignee: | Aravinda VK <avishwan> | |
| Status: | CLOSED ERRATA | QA Contact: | Rahul Hinduja <rhinduja> | |
| Severity: | urgent | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | rhgs-3.1 | CC: | amukherj, avishwan, csaba, rcyriac | |
| Target Milestone: | --- | Keywords: | Regression, ZStream | |
| Target Release: | RHGS 3.1.3 | |||
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | glusterfs-3.7.9-7 | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1339472 (view as bug list) | Environment: | ||
| Last Closed: | 2016-06-23 05:24:10 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1311817, 1339472, 1341068, 1341069 | |||
Upstream patch: http://review.gluster.org/#/c/14512/
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/75474/
Verified with build: glusterfs-3.7.9-7
Steps to reproduce:
===================
Start Geo-Rep session
Immediately kill worker and then agent
With build: glusterfs-3.7.9-6
++++++++++++++++++++++++++++++
[root@dhcp37-88 scripts]# gluster volume geo-replication red 10.70.37.213::hat stop
Stopping geo-replication session between red & 10.70.37.213::hat has been successful
[root@dhcp37-88 scripts]# gluster volume geo-replication red 10.70.37.213::hat start
Starting geo-replication session between red & 10.70.37.213::hat has been successful
[root@dhcp37-88 scripts]#
[root@dhcp37-43 scripts]# ps -eaf | grep gsync | grep feedback | awk {'print $2'} | xargs kill -9
Usage:
kill [options] <pid|name> [...]
Options:
-a, --all do not restrict the name-to-pid conversion to processes
with the same uid as the present process
-s, --signal <sig> send specified signal
-q, --queue <sig> use sigqueue(2) rather than kill(2)
-p, --pid print pids without signaling them
-l, --list [=<signal>] list signal names, or convert one to a name
-L, --table list signal names and numbers
-h, --help display this help and exit
-V, --version output version information and exit
For more details see kill(1).
[root@dhcp37-43 scripts]# ps -eaf | grep gsync | grep feedback | awk {'print $2'} | xargs kill -9
[root@dhcp37-43 scripts]# ps -eaf | grep gsync | grep agent | awk {'print $2'} | xargs kill -9
Usage:
kill [options] <pid|name> [...]
Options:
-a, --all do not restrict the name-to-pid conversion to processes
with the same uid as the present process
-s, --signal <sig> send specified signal
-q, --queue <sig> use sigqueue(2) rather than kill(2)
-p, --pid print pids without signaling them
-l, --list [=<signal>] list signal names, or convert one to a name
-L, --table list signal names and numbers
-h, --help display this help and exit
-V, --version output version information and exit
For more details see kill(1).
[root@dhcp37-43 scripts]#
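(Note: the kill(1) usage text in the transcript above is printed whenever the ps/grep pipeline matches no process, so xargs ends up invoking kill -9 with no PID arguments; piping through xargs -r / --no-run-if-empty would avoid that noise. It does not affect the reproduction.)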
[2016-06-01 07:47:11.459875] I [monitor(monitor):326:monitor] Monitor: Changelog Agent died, Aborting Worker(/rhs/brick1/b2)
[2016-06-01 07:47:11.460282] E [syncdutils(monitor):276:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 306, in twrap
tf(*aa)
File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 393, in wmon
slave_host, master)
File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 327, in monitor
os.kill(cpid, signal.SIGKILL)
OSError: [Errno 3] No such process
[2016-06-01 07:47:11.471770] I [syncdutils(monitor):220:finalize] <top>: exiting.
[root@dhcp37-43 scripts]#
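The crash above follows from the kill sequence: by the time the monitor reacted to the dead agent and issued os.kill(cpid, signal.SIGKILL) on the worker, the worker process was already gone, so the call raised OSError [Errno 3] (ESRCH, "No such process") and the unhandled exception brought the monitor itself down.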
With build: glusterfs-3.7.9-7
++++++++++++++++++++++++++++++
Carried out the same test and did not observe a monitor crash. Logs are as follows: the agent died and the monitor tried to abort the worker, but the worker had already died in the startup phase. The monitor restarted the worker, and in the process the monitor did not crash:
[2016-06-01 13:35:21.559650] I [monitor(monitor):326:monitor] Monitor: Changelog Agent died, Aborting Worker(/rhs/brick1/b2)
[2016-06-01 13:35:21.562530] I [monitor(monitor):326:monitor] Monitor: Changelog Agent died, Aborting Worker(/rhs/brick2/b4)
[2016-06-01 13:35:21.563463] I [monitor(monitor):343:monitor] Monitor: worker(/rhs/brick1/b2) died in startup phase
[2016-06-01 13:35:21.571918] I [monitor(monitor):343:monitor] Monitor: worker(/rhs/brick2/b4) died in startup phase
[2016-06-01 13:35:31.761452] I [monitor(monitor):73:get_slave_bricks_status] <top>: Unable to get list of up nodes of Debt, returning empty list: Another transaction is in progress for Debt. Please try again after sometime.
[2016-06-01 13:35:31.765465] I [monitor(monitor):266:monitor] Monitor: ------------------------------------------------------------
[2016-06-01 13:35:31.765834] I [monitor(monitor):267:monitor] Monitor: starting gsyncd worker
[2016-06-01 13:35:31.770814] I [monitor(monitor):266:monitor] Monitor: ------------------------------------------------------------
[2016-06-01 13:35:31.772129] I [monitor(monitor):267:monitor] Monitor: starting gsyncd worker
[2016-06-01 13:35:31.890531] I [changelogagent(agent):73:__init__] ChangelogAgent: Agent listining...
[2016-06-01 13:35:31.891576] I [changelogagent(agent):73:__init__] ChangelogAgent: Agent listining...
[2016-06-01 13:35:31.893448] I [gsyncd(/rhs/brick1/b2):698:main_i] <top>: syncing: gluster://localhost:Tech -> ssh://root.37.52:gluster://localhost:Debt
[2016-06-01 13:35:31.903400] I [gsyncd(/rhs/brick2/b4):698:main_i] <top>: syncing: gluster://localhost:Tech -> ssh://root.37.52:gluster://localhost:Debt
[2016-06-01 13:35:34.767489] I [master(/rhs/brick1/b2):83:gmaster_builder] <top>: setting up xsync change detection mode
[2016-06-01 13:35:34.767644] I [master(/rhs/brick2/b4):83:gmaster_builder] <top>: setting up xsync change detection mode
Moving this BZ to verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1240
Description of problem:
=======================
While the monitor was aborting the worker, it crashed as:

[2016-05-23 16:49:33.903965] I [monitor(monitor):326:monitor] Monitor: Changelog Agent died, Aborting Worker(/bricks/brick0/master_brick0)
[2016-05-23 16:49:33.904535] E [syncdutils(monitor):276:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 306, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 393, in wmon
    slave_host, master)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 327, in monitor
    os.kill(cpid, signal.SIGKILL)
OSError: [Errno 3] No such process

In the ideal scenario the monitor process should never go down. If the worker dies, the monitor kills the agent and restarts both. If the agent dies, the monitor kills the worker and restarts both. In this case, however, the agent died and the monitor crashed while trying to abort the worker. The geo-rep session then remains in a stopped state until it is started again.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-geo-replication-3.7.9-5.el7rhgs.x86_64
glusterfs-3.7.9-5.el7rhgs.x86_64

How reproducible:
=================
Seen once during the automated regression test suite.

Steps to Reproduce:
===================
Will work on the steps and update the BZ. In general the scenario would be:
=> Kill the agent and check the monitor logs, where the monitor tries to abort the worker.

Additional info:
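For illustration, here is a minimal sketch (not the actual upstream patch, which is linked above) of the kind of guard that keeps the monitor alive when the worker it is aborting has already exited; the helper name kill_if_alive is hypothetical:

import errno
import os
import signal

def kill_if_alive(pid, sig=signal.SIGKILL):
    # Deliver `sig` to `pid`, tolerating the race where the process has
    # already exited: os.kill() then raises OSError with errno ESRCH
    # ("No such process"), which is exactly the failure in the traceback.
    try:
        os.kill(pid, sig)
    except OSError as e:
        if e.errno != errno.ESRCH:
            raise  # anything other than "already gone" is a real error

With a guard like this around the os.kill(cpid, signal.SIGKILL) call in monitor.py, an already-dead worker is treated the same as a successfully aborted one, and the monitor can go on to restart both worker and agent instead of crashing.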