Description of problem:
=======================
While the Monitor was aborting the worker, it crashed as:

[2016-05-23 16:49:33.903965] I [monitor(monitor):326:monitor] Monitor: Changelog Agent died, Aborting Worker(/bricks/brick0/master_brick0)
[2016-05-23 16:49:33.904535] E [syncdutils(monitor):276:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 306, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 393, in wmon
    slave_host, master)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 327, in monitor
    os.kill(cpid, signal.SIGKILL)
OSError: [Errno 3] No such process

Ideally, the monitor process should never go down. If the worker dies, the monitor kills the agent and restarts both; if the agent dies, the monitor kills the worker and restarts both. In this case, however, the agent died and the monitor crashed while trying to abort the worker. The geo-rep session will remain in a stopped state until it is started again.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-geo-replication-3.7.9-5.el7rhgs.x86_64
glusterfs-3.7.9-5.el7rhgs.x86_64

How reproducible:
=================
Seen once during an automated regression test run.

Steps to Reproduce:
===================
Will work on the exact steps and update the BZ. In general the scenario is:
=> Kill the agent; the monitor then tries to abort the worker (see monitor logs) and crashes.

Additional info:
Upstream patch sent http://review.gluster.org/#/c/14512/
Downstream patch https://code.engineering.redhat.com/gerrit/#/c/75474/
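The crash is a classic kill/exit race: between the monitor deciding to abort the worker and the `os.kill(cpid, signal.SIGKILL)` call, the worker PID already exited, so the kernel returns ESRCH and the unhandled `OSError` takes the monitor down. A minimal sketch of the defensive pattern (an illustration of the idea, not the literal upstream patch; `safe_kill` is a hypothetical helper name):

```python
import errno
import os
import signal


def safe_kill(pid, sig):
    """Send sig to pid, tolerating the race where the process already exited.

    ESRCH ("No such process") means the target is already gone, which is
    exactly the outcome a SIGKILL wanted anyway, so it is swallowed.
    Any other OSError (e.g. EPERM) is still raised.
    """
    try:
        os.kill(pid, sig)
    except OSError as e:
        if e.errno != errno.ESRCH:
            raise
```

With this guard, an agent death followed by an already-dead worker no longer crashes the monitor loop; the monitor can proceed to restart both processes.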
Verified with build: glusterfs-3.7.9-7

Steps to reproduce:
===================
1. Start the Geo-Rep session
2. Immediately kill the worker and then the agent

With build: glusterfs-3.7.9.6
++++++++++++++++++++++++++++++
[root@dhcp37-88 scripts]# gluster volume geo-replication red 10.70.37.213::hat stop
Stopping geo-replication session between red & 10.70.37.213::hat has been successful
[root@dhcp37-88 scripts]# gluster volume geo-replication red 10.70.37.213::hat start
Starting geo-replication session between red & 10.70.37.213::hat has been successful
[root@dhcp37-88 scripts]#

[root@dhcp37-43 scripts]# ps -eaf | grep gsync | grep feedback | awk {'print $2'} | xargs kill -9

Usage:
 kill [options] <pid|name> [...]

Options:
 -a, --all              do not restrict the name-to-pid conversion to processes
                        with the same uid as the present process
 -s, --signal <sig>     send specified signal
 -q, --queue <sig>      use sigqueue(2) rather than kill(2)
 -p, --pid              print pids without signaling them
 -l, --list [=<signal>] list signal names, or convert one to a name
 -L, --table            list signal names and numbers
 -h, --help             display this help and exit
 -V, --version          output version information and exit

For more details see kill(1).
[root@dhcp37-43 scripts]# ps -eaf | grep gsync | grep feedback | awk {'print $2'} | xargs kill -9
[root@dhcp37-43 scripts]# ps -eaf | grep gsync | grep agent | awk {'print $2'} | xargs kill -9

Usage:
 kill [options] <pid|name> [...]

Options:
 -a, --all              do not restrict the name-to-pid conversion to processes
                        with the same uid as the present process
 -s, --signal <sig>     send specified signal
 -q, --queue <sig>      use sigqueue(2) rather than kill(2)
 -p, --pid              print pids without signaling them
 -l, --list [=<signal>] list signal names, or convert one to a name
 -L, --table            list signal names and numbers
 -h, --help             display this help and exit
 -V, --version          output version information and exit

For more details see kill(1).
[root@dhcp37-43 scripts]#
[2016-06-01 07:47:11.459875] I [monitor(monitor):326:monitor] Monitor: Changelog Agent died, Aborting Worker(/rhs/brick1/b2)
[2016-06-01 07:47:11.460282] E [syncdutils(monitor):276:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 306, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 393, in wmon
    slave_host, master)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 327, in monitor
    os.kill(cpid, signal.SIGKILL)
OSError: [Errno 3] No such process
[2016-06-01 07:47:11.471770] I [syncdutils(monitor):220:finalize] <top>: exiting.
[root@dhcp37-43 scripts]#

With build: glusterfs-3.7.9.7
++++++++++++++++++++++++++++++
Carried out the same test and did not observe the monitor crash. The logs are as follows: the agent died and the monitor tried to abort the worker, but the worker had already died in the startup phase. The monitor restarted the worker, and did not crash in the process:

[2016-06-01 13:35:21.559650] I [monitor(monitor):326:monitor] Monitor: Changelog Agent died, Aborting Worker(/rhs/brick1/b2)
[2016-06-01 13:35:21.562530] I [monitor(monitor):326:monitor] Monitor: Changelog Agent died, Aborting Worker(/rhs/brick2/b4)
[2016-06-01 13:35:21.563463] I [monitor(monitor):343:monitor] Monitor: worker(/rhs/brick1/b2) died in startup phase
[2016-06-01 13:35:21.571918] I [monitor(monitor):343:monitor] Monitor: worker(/rhs/brick2/b4) died in startup phase
[2016-06-01 13:35:31.761452] I [monitor(monitor):73:get_slave_bricks_status] <top>: Unable to get list of up nodes of Debt, returning empty list: Another transaction is in progress for Debt. Please try again after sometime.
[2016-06-01 13:35:31.765465] I [monitor(monitor):266:monitor] Monitor: ------------------------------------------------------------
[2016-06-01 13:35:31.765834] I [monitor(monitor):267:monitor] Monitor: starting gsyncd worker
[2016-06-01 13:35:31.770814] I [monitor(monitor):266:monitor] Monitor: ------------------------------------------------------------
[2016-06-01 13:35:31.772129] I [monitor(monitor):267:monitor] Monitor: starting gsyncd worker
[2016-06-01 13:35:31.890531] I [changelogagent(agent):73:__init__] ChangelogAgent: Agent listining...
[2016-06-01 13:35:31.891576] I [changelogagent(agent):73:__init__] ChangelogAgent: Agent listining...
[2016-06-01 13:35:31.893448] I [gsyncd(/rhs/brick1/b2):698:main_i] <top>: syncing: gluster://localhost:Tech -> ssh://root.37.52:gluster://localhost:Debt
[2016-06-01 13:35:31.903400] I [gsyncd(/rhs/brick2/b4):698:main_i] <top>: syncing: gluster://localhost:Tech -> ssh://root.37.52:gluster://localhost:Debt
[2016-06-01 13:35:34.767489] I [master(/rhs/brick1/b2):83:gmaster_builder] <top>: setting up xsync change detection mode
[2016-06-01 13:35:34.767644] I [master(/rhs/brick2/b4):83:gmaster_builder] <top>: setting up xsync change detection mode

Moving this BZ to verified state.
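An aside on the `Usage: kill [options] ...` noise in the transcript above: it appears because the first `ps | grep` pipeline matched no PIDs, so `xargs` ran `kill -9` with no arguments and `kill` printed its usage text. A cleaner variant of the kill commands (a sketch; `-r`/`--no-run-if-empty` is GNU xargs, and the `[g]sync` bracket trick keeps the grep process itself out of the match):

```shell
# Kill the gsyncd worker (feedback) and agent processes.
# xargs -r skips running kill entirely when the pipeline produces no PIDs.
ps -eaf | grep '[g]sync' | grep feedback | awk '{print $2}' | xargs -r kill -9
ps -eaf | grep '[g]sync' | grep agent    | awk '{print $2}' | xargs -r kill -9
```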
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1240