| Summary: | gsyncd hangs if crash occurs in the non-main thread | ||
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Csaba Henk <csaba> |
| Component: | geo-replication | Assignee: | Csaba Henk <csaba> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | mainline | CC: | gluster-bugs, lakshmipathi, rahulcs |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | --- |
| Regression: | RTNR | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Csaba Henk
2011-04-13 05:46:31 UTC
PATCH: http://patches.gluster.com/patch/6852 in master (syncdaemon: force termination for unhandled exception in any thread)

(In reply to comment #0)
> Simplest way of provoking it is just to pass a bogus slave url.

Started gsyncd with an invalid slave:

# gluster volume gsync start :slave root.com:/dir_not_exists
gsync started
# gluster volume gsync status :slave
Gsync Status:
Master::slave
Slave:ssh://root.149.250:file:///dir_not_exists
Status:OK

PATCH: http://patches.gluster.com/patch/6906 in master (syncdaemon: yet another try to exit properly)

PATCH: http://patches.gluster.com/patch/6928 in master (syncdaemon: minor cleanups on termination)

This still comes up if:
- ssh slave, ssh hangs
- gluster slave, volume not started

*** Bug 2789 has been marked as a duplicate of this bug. ***

PATCH: http://patches.gluster.com/patch/6956 in master (syncdaemon: fix swallowing of exit value)

PATCH: http://patches.gluster.com/patch/7029 in master (syncdaemon: have the monitor kill the worker if it does not connect in 60 sec)

(In reply to comment #5)
> this still comes if
> - ssh slave, ssh hangs
> - gluster slave, volume not started

For some reason the second case no longer brings this up, but instead causes an immediate crash in the worker. So to verify the fix, please use an ssh slave that hangs when you ssh to it. The simplest way is to ping google.com or microsoft.com, take the IP you see in the ping output, and specify an ssh URL with that IP as the slave host (NB: using {google,microsoft}.com unresolved is not suggested, because they resolve to multiple IPs and which one you get varies per DNS lookup). Alternatively, if you don't want to provoke the big guys, just use any ssh slave but send a SIGSTOP to the main (root) sshd process before starting geo-replication.
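For reference, the approach named in patch 6852 — force termination on an unhandled exception in any thread — comes down to wrapping every thread's target function so a crash anywhere takes the whole process down instead of leaving a half-dead daemon. A minimal sketch of that general technique, in the spirit of the twrap wrapper visible in the traceback below (illustrative only, not the actual syncdaemon code; the worker function and exit status are assumptions):

```python
import os
import sys
import threading
import traceback

def twrap(target):
    # Wrap a thread's target so an unhandled exception is logged and then
    # terminates the whole process, not just the thread it happened in.
    def wrapped(*args, **kwargs):
        try:
            target(*args, **kwargs)
        except Exception:
            traceback.print_exc(file=sys.stderr)
            # os._exit() skips cleanup handlers and kills every thread at
            # once, so no other thread can keep the daemon hanging around.
            os._exit(1)
    return wrapped

def worker():
    # Stand-in for a gsyncd worker thread hitting a fatal error.
    raise RuntimeError("crash in non-main thread")

t = threading.Thread(target=twrap(worker))
t.start()
t.join()  # without the wrapper, the process would outlive the dead thread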
Tested with 3.2.3. Created a volume and started gsyncd with the Google IP address; gsyncd no longer hangs when a crash occurs in a non-main thread.

Turned off sshd on the local machine and started gsyncd. The log displayed the following crash message until sshd was started again:
[2011-09-13 00:41:01.375370] I [monitor(monitor):42:monitor] Monitor: ------------------------------------------------------------
[2011-09-13 00:41:01.375754] I [monitor(monitor):43:monitor] Monitor: starting gsyncd worker
[2011-09-13 00:41:01.430033] I [gsyncd:286:main_i] <top>: syncing: gluster://localhost:gsync -> ssh://root.1.108:/slave44
[2011-09-13 00:41:01.767729] E [syncdutils:131:exception] <top>: FAIL:
Traceback (most recent call last):
File "/opt/glusterfs/3.2.3/local/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 152, in twrap
tf(*aa)
File "/opt/glusterfs/3.2.3/local/libexec/glusterfs/python/syncdaemon/repce.py", line 118, in listen
rid, exc, res = recv(self.inf)
File "/opt/glusterfs/3.2.3/local/libexec/glusterfs/python/syncdaemon/repce.py", line 42, in recv
return pickle.load(inf)
EOFError
-----
started sshd here - files are now synced with slave.
-----
[2011-09-13 00:41:12.768661] I [monitor(monitor):42:monitor] Monitor: ------------------------------------------------------------
[2011-09-13 00:41:12.769083] I [monitor(monitor):43:monitor] Monitor: starting gsyncd worker
[2011-09-13 00:41:12.821748] I [gsyncd:286:main_i] <top>: syncing: gluster://localhost:gsync -> ssh://root.1.108:/slave44
[2011-09-13 00:41:21.718055] I [master:181:crawl] GMaster: new master is cd508b0d-aee0-4802-806d-0636634ad934
[2011-09-13 00:41:21.718314] I [master:187:crawl] GMaster: primary master with volume id cd508b0d-aee0-4802-806d-0636634ad934 ...
[2011-09-13 00:41:28.331437] I [master:170:crawl] GMaster: ... done, took 6.644087 seconds
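For context on the EOFError in the traceback above: recv() in repce.py (visible in the traceback) unpickles one message at a time from the stream connected to the remote peer, and pickle.load raises EOFError as soon as that stream is closed, e.g. while sshd is down. A minimal sketch of that behavior, with an empty in-memory stream standing in for the dead ssh pipe (the stand-in and the message text are assumptions, not repce code):

```python
import io
import pickle

def recv(inf):
    # repce-style receive: read one pickled message from a file-like stream.
    return pickle.load(inf)

# An empty, already-exhausted stream stands in for the pipe of a dead peer.
dead_pipe = io.BytesIO(b"")

try:
    recv(dead_pipe)
except EOFError:
    print("peer went away: stream ended before a complete message arrived")
```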