Bug 764468 (GLUSTER-2736)

Summary: gsyncd hangs if crash occurs in the non-main thread
Product: [Community] GlusterFS
Component: geo-replication
Version: mainline
Hardware: x86_64
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Reporter: Csaba Henk <csaba>
Assignee: Csaba Henk <csaba>
CC: gluster-bugs, lakshmipathi, rahulcs
Regression: RTNR
Doc Type: Bug Fix

Description Csaba Henk 2011-04-13 05:46:31 UTC
If an error condition occurs upon which gsyncd should terminate, but it arises in a non-main thread, gsyncd is just left hanging there, because exceptions are thread-local.

The simplest way to provoke it is to pass a bogus slave URL.

This possibility defeats our monitoring strategy, given that the monitor process will think the worker is fine as long as it's running. So we must force an exit upon any uncaught exception.
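To illustrate the mechanics (a minimal sketch of the idea, not the actual gsyncd source; the names tf/aa echo the syncdutils.twrap frames visible in the traceback in comment 10): in Python, an uncaught exception kills only the thread that raised it, so a wrapper around every thread target has to force a process-wide exit itself.

import os
import sys
import threading
import traceback

def twrap(tf, *aa):
    # Any exception escaping a thread must take down the whole
    # process; otherwise the monitor keeps seeing a "running" worker.
    try:
        tf(*aa)
    except Exception:
        traceback.print_exc(file=sys.stderr)
        os._exit(1)  # immediate process exit, not just thread death

# A worker that crashes: without the wrapper only this thread would
# die and the process would linger; with it, the process exits with 1.
threading.Thread(target=twrap, args=(lambda: 1 / 0,)).start()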

Comment 1 Anand Avati 2011-04-13 08:43:08 UTC
PATCH: http://patches.gluster.com/patch/6852 in master (syncdaemon: force termination for unhandled exception in any thread)

Comment 2 Lakshmipathi G 2011-04-15 06:44:22 UTC
(In reply to comment #0)
> The simplest way to provoke it is to pass a bogus slave URL.

Started gsyncd with an invalid slave:

# gluster volume gsync start :slave root.com:/dir_not_exists
gsync started

# gluster volume gsync status :slave 
Gsync Status:
Master::slave                 Slave:ssh://root.149.250:file:///dir_not_exists    Status:OK

Comment 3 Anand Avati 2011-04-16 08:10:35 UTC
PATCH: http://patches.gluster.com/patch/6906 in master (syncdaemon: yet another try to exit properly)

Comment 4 Anand Avati 2011-04-17 11:39:17 UTC
PATCH: http://patches.gluster.com/patch/6928 in master (syncdaemon: minor cleanups on termination)

Comment 5 Csaba Henk 2011-04-18 05:38:28 UTC
This still occurs if:
- the slave is an ssh slave and ssh hangs
- the slave is a gluster slave and the volume is not started

Comment 6 Csaba Henk 2011-04-18 09:20:04 UTC
*** Bug 2789 has been marked as a duplicate of this bug. ***

Comment 7 Anand Avati 2011-04-19 06:30:07 UTC
PATCH: http://patches.gluster.com/patch/6956 in master (syncdaemon: fix swallowing of exit value)
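(For illustration only, not the patch itself: one classic way an exit value gets swallowed in Python is that SystemExit raised in a non-main thread is silently ignored by the threading module, so the process exit status stays 0.)

import sys
import threading

def worker():
    sys.exit(1)  # SystemExit in a non-main thread is silently swallowed

t = threading.Thread(target=worker)
t.start()
t.join()
# The process is still alive here and will exit with status 0,
# even though the worker "exited" with 1.
print("worker gone, process exit status will be 0")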

Comment 8 Anand Avati 2011-04-22 07:52:50 UTC
PATCH: http://patches.gluster.com/patch/7029 in master (syncdaemon: have the monitor kill the worker if it does not connect in 60 sec)
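A rough sketch of the monitor-side timeout logic (hypothetical helper names, not the actual gsyncd monitor; gsyncd learns about the worker's connection over its own RPC channel):

import subprocess
import time

CONNECT_TIMEOUT = 60  # seconds the worker gets to establish its connection

def run_worker(worker_cmd, has_connected):
    # has_connected is a hypothetical callable reporting whether the
    # worker has checked in with the monitor.
    worker = subprocess.Popen(worker_cmd)
    deadline = time.time() + CONNECT_TIMEOUT
    while time.time() < deadline:
        if worker.poll() is not None:  # worker exited on its own
            return worker.returncode
        if has_connected():            # worker is up; just wait for it
            return worker.wait()
        time.sleep(1)
    worker.kill()                      # never connected within 60 sec
    return worker.wait()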

Comment 9 Csaba Henk 2011-04-28 11:06:09 UTC
(In reply to comment #5)
> This still occurs if:
> - the slave is an ssh slave and ssh hangs
> - the slave is a gluster slave and the volume is not started

For some reason the second case no longer brings this up, but instead causes an immediate crash in the worker.

So to verify it, please use an ssh slave that hangs when you ssh to it. The simplest way is to ping google.com or microsoft.com, take the IP you see in the ping output, and specify an ssh URL with that IP as the slave host. (NB: using {google,microsoft}.com unresolved is not recommended, because they resolve to multiple IPs and it varies which one a DNS lookup returns.)

Alternatively, if you don't want to provoke the big guys, just use any ssh slave, but send a SIGSTOP to the main (root) sshd process before starting geo-replication.
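For example (hypothetical PID; SIGSTOP suspends sshd so incoming ssh connections hang, SIGCONT resumes it afterwards):

import os
import signal

SSHD_PID = 1234  # hypothetical: PID of the main (root) sshd process

os.kill(SSHD_PID, signal.SIGSTOP)  # sshd suspended; ssh to this host hangs
# ... start geo-replication here and observe the worker ...
os.kill(SSHD_PID, signal.SIGCONT)  # resume sshd when done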

Comment 10 Lakshmipathi G 2011-09-13 09:47:00 UTC
Tested with 3.2.3: created a volume and started gsyncd with a Google IP address as the slave. gsyncd no longer hangs when a crash occurs in a non-main thread.

Turned off sshd on the local machine and started gsyncd.
The log displayed the following crash message until sshd was started again.

[2011-09-13 00:41:01.375370] I [monitor(monitor):42:monitor] Monitor: ------------------------------------------------------------
[2011-09-13 00:41:01.375754] I [monitor(monitor):43:monitor] Monitor: starting gsyncd worker
[2011-09-13 00:41:01.430033] I [gsyncd:286:main_i] <top>: syncing: gluster://localhost:gsync -> ssh://root.1.108:/slave44
[2011-09-13 00:41:01.767729] E [syncdutils:131:exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/opt/glusterfs/3.2.3/local/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 152, in twrap
    tf(*aa)
  File "/opt/glusterfs/3.2.3/local/libexec/glusterfs/python/syncdaemon/repce.py", line 118, in listen
    rid, exc, res = recv(self.inf)
  File "/opt/glusterfs/3.2.3/local/libexec/glusterfs/python/syncdaemon/repce.py", line 42, in recv
    return pickle.load(inf)
EOFError

-----
started sshd here - files are now synced with the slave.
-----
[2011-09-13 00:41:12.768661] I [monitor(monitor):42:monitor] Monitor: ------------------------------------------------------------
[2011-09-13 00:41:12.769083] I [monitor(monitor):43:monitor] Monitor: starting gsyncd worker
[2011-09-13 00:41:12.821748] I [gsyncd:286:main_i] <top>: syncing: gluster://localhost:gsync -> ssh://root.1.108:/slave44
[2011-09-13 00:41:21.718055] I [master:181:crawl] GMaster: new master is cd508b0d-aee0-4802-806d-0636634ad934
[2011-09-13 00:41:21.718314] I [master:187:crawl] GMaster: primary master with volume id cd508b0d-aee0-4802-806d-0636634ad934 ...
[2011-09-13 00:41:28.331437] I [master:170:crawl] GMaster: ... done, took 6.644087 seconds
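(The EOFError in the traceback above is what pickle.load raises when its input stream ends before a complete object arrives, i.e. when the ssh pipe to the slave goes away. A minimal demonstration:)

import io
import pickle

try:
    pickle.load(io.BytesIO(b""))  # empty stream, like a closed ssh pipe
except EOFError:
    print("peer closed the connection before sending a reply")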