If some error condition occurs upon which gsyncd should terminate, but it occurs in a non-main thread, the process is just left hanging there, because exceptions are thread-local. The simplest way to provoke this is to pass a bogus slave URL. This possibility defeats our monitoring strategy, given that the monitor process will think the worker is fine as long as it's running. So we must force an exit upon any uncaught exception.
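For reference, the idea behind the fix is roughly the following (a minimal sketch, not the actual gsyncd code; the twrap/start_thread names here are illustrative): wrap every thread's target function so that an escaped exception tears down the whole process instead of silently ending just that thread.

import os
import sys
import threading
import traceback

def twrap(tf, *args):
    # Run the thread's real target; on any uncaught exception, log it and
    # kill the whole process. sys.exit() or re-raising here would only end
    # this thread, because exceptions are thread-local.
    try:
        tf(*args)
    except:
        traceback.print_exc(file=sys.stderr)
        os._exit(1)

def start_thread(tf, *args):
    t = threading.Thread(target=twrap, args=(tf,) + args)
    t.daemon = True
    t.start()
    return t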
PATCH: http://patches.gluster.com/patch/6852 in master (syncdaemon: force termination for unhandled exception in any thread)
(In reply to comment #0)
> Simplest way of provoking it is just to pass a bogus slave url.

started gsyncd with some invalid slave:

# gluster volume gsync start :slave root.com:/dir_not_exists
gsync started
# gluster volume gsync status :slave
Gsync Status:
Master::slave Slave:ssh://root.149.250:file:///dir_not_exists Status:OK
PATCH: http://patches.gluster.com/patch/6906 in master (syncdaemon: yet another try to exit properly)
PATCH: http://patches.gluster.com/patch/6928 in master (syncdaemon: minor cleanups on termination)
this still comes if
- ssh slave, ssh hangs
- gluster slave, volume not started
*** Bug 2789 has been marked as a duplicate of this bug. ***
PATCH: http://patches.gluster.com/patch/6956 in master (syncdaemon: fix swallowing of exit value)
PATCH: http://patches.gluster.com/patch/7029 in master (syncdaemon: have the monitor kill the worker if it does not connect in 60 sec)
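The gist of that last patch, as I understand it (rough sketch only; the pipe-based readiness signal is just for illustration, not necessarily how the monitor actually signals the connection):

import os
import select
import signal

CONNECT_TIMEOUT = 60

def monitor_once(run_worker):
    # Spawn the worker in a child process; it must report "connected"
    # within CONNECT_TIMEOUT seconds or the monitor kills it.
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(r)
        run_worker(lambda: os.write(w, b"ok"))  # worker calls this once connected
        os._exit(0)
    os.close(w)
    ready, _, _ = select.select([r], [], [], CONNECT_TIMEOUT)
    if not ready:
        os.kill(pid, signal.SIGKILL)  # never connected: don't let it hang forever
    os.close(r)
    _, status = os.waitpid(pid, 0)
    return status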
(In reply to comment #5)
> this still comes if
> - ssh slave, ssh hangs
> - gluster slave, volume not started

For some reason the second case no longer brings this up, but causes an immediate crash in the worker. So to verify this, please use an ssh slave that hangs when you ssh to it. The simplest way is to ping google.com or microsoft.com, take the IP you see in ping, and specify an ssh url with that IP as the host for the slave (NB. using {google,microsoft}.com unresolved is not suggested because they resolve to multiple IPs, so it varies which one you get on a DNS lookup). Alternatively, if you don't want to provoke the big guys, just use any ssh slave but send a SIGSTOP to the main (root) sshd process before starting geo-replication.
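If it helps, something along these lines pins one concrete IP for the slave url instead of relying on a DNS lookup each time (purely illustrative; the hostname, user and path are made up):

import socket

host = socket.gethostbyname("google.com")   # pick one concrete IP, as with ping
slave_url = "ssh://root@%s:/nonexistent/dir" % host
print(slave_url)   # use this as the slave when starting geo-replication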
Tested with 3.2.3. Created a volume and started gsyncd with the google IP address as slave host; gsyncd no longer hangs when the crash occurs in a non-main thread.

Turned off sshd on the local machine and started gsyncd. The log displayed the following crash message until sshd was started again:

[2011-09-13 00:41:01.375370] I [monitor(monitor):42:monitor] Monitor: ------------------------------------------------------------
[2011-09-13 00:41:01.375754] I [monitor(monitor):43:monitor] Monitor: starting gsyncd worker
[2011-09-13 00:41:01.430033] I [gsyncd:286:main_i] <top>: syncing: gluster://localhost:gsync -> ssh://root.1.108:/slave44
[2011-09-13 00:41:01.767729] E [syncdutils:131:exception] <top>: FAIL:
Traceback (most recent call last):
  File "/opt/glusterfs/3.2.3/local/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 152, in twrap
    tf(*aa)
  File "/opt/glusterfs/3.2.3/local/libexec/glusterfs/python/syncdaemon/repce.py", line 118, in listen
    rid, exc, res = recv(self.inf)
  File "/opt/glusterfs/3.2.3/local/libexec/glusterfs/python/syncdaemon/repce.py", line 42, in recv
    return pickle.load(inf)
EOFError

----- started sshd here - files are now synced with slave ----

[2011-09-13 00:41:12.768661] I [monitor(monitor):42:monitor] Monitor: ------------------------------------------------------------
[2011-09-13 00:41:12.769083] I [monitor(monitor):43:monitor] Monitor: starting gsyncd worker
[2011-09-13 00:41:12.821748] I [gsyncd:286:main_i] <top>: syncing: gluster://localhost:gsync -> ssh://root.1.108:/slave44
[2011-09-13 00:41:21.718055] I [master:181:crawl] GMaster: new master is cd508b0d-aee0-4802-806d-0636634ad934
[2011-09-13 00:41:21.718314] I [master:187:crawl] GMaster: primary master with volume id cd508b0d-aee0-4802-806d-0636634ad934
...
[2011-09-13 00:41:28.331437] I [master:170:crawl] GMaster: ... done, took 6.644087 seconds
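Just to spell out why the traceback above ends in EOFError (illustration only, not the gsyncd code): the repce RPC layer reads pickled messages from the stream to the slave, and pickle.load() raises EOFError as soon as the peer closes the pipe, e.g. while sshd is down. With the fixes above, that exception now takes the worker down and the monitor respawns it, which is the restart loop visible in the log.

import io
import pickle

stream = io.BytesIO(pickle.dumps(("rid", None, "result")))
print(pickle.load(stream))   # a complete message decodes fine
try:
    pickle.load(stream)      # stream exhausted -> behaves like a closed pipe
except EOFError:
    print("peer went away; worker exits and the monitor respawns it")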