Bug 1002468 - Dist-geo-rep : geo-rep monitor process crashed in set_term_handler after worker crashed with IO error.
Status: CLOSED DUPLICATE of bug 1044420
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: geo-replication
Hardware: x86_64 Linux
Priority: unspecified Severity: high
Assigned To: Aravinda VK
Keywords: ZStream
Depends On:
Reported: 2013-08-29 05:36 EDT by Vijaykumar Koppad
Modified: 2015-05-12 14:18 EDT (History)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2014-12-29 01:18:23 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Description Vijaykumar Koppad 2013-08-29 05:36:31 EDT
Description of problem: The geo-rep monitor crashed in set_term_handler after a worker crashed with an I/O error.

The backtrace was:

[2013-08-29 13:00:53.777460] E [syncdutils(/bricks/brick1):206:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 133, in main
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 513, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1062, in service_loop
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 369, in crawlwrap
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 783, in crawl
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 744, in process
    if self.process_change(change, done, retry):
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 726, in process_change
    if self.syncdata(datas):
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 623, in syncdata
    if self.wait(self.FLAT_DIR_HIERARCHY, None):
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 545, in wait
    ret = j[-1]()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 527, in <lambda>
    self.jobtab[path].append((label, a, lambda : job(*a, **kw)))
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 617, in regjob
    st = lstat(se)
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 392, in lstat
    return os.lstat(e)
OSError: [Errno 5] Input/output error: '.gfid/39288d7f-ca78-40a6-96c9-ae22b4cd2e40'
[2013-08-29 13:00:53.778892] I [syncdutils(/bricks/brick1):158:finalize] <top>: exiting.
[2013-08-29 13:00:53.785870] I [monitor(monitor):81:set_state] Monitor: new state: faulty
[2013-08-29 13:01:03.798213] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------
[2013-08-29 13:01:03.798502] I [monitor(monitor):130:monitor] Monitor: starting gsyncd worker
[2013-08-29 13:01:03.897695] I [gsyncd(/bricks/brick1):503:main_i] <top>: syncing: gluster://localhost:master -> ssh://root@
[2013-08-29 13:02:03.859800] E [syncdutils(monitor):206:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 232, in twrap
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 203, in wmon
    cpid, _ = self.monitor(w, argv, cpids)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 161, in monitor
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 89, in terminate
    set_term_handler(lambda *a: set_term_handler())
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 298, in set_term_handler
    signal(SIGTERM, hook)
ValueError: signal only works in main thread
[2013-08-29 13:02:03.860609] I [syncdutils(monitor):158:finalize] <top>: exiting.


Version-Release number of selected component (if applicable):glusterfs-

How reproducible: Didn't try to reproduce

Steps to Reproduce:
These are the steps I performed before hitting this crash:
1. Create and start a geo-rep relationship between the master (dist-rep) and the slave (dist-rep).
2. Create files on the master as a non-root user using the command "./crefi.py -n 10 --multi  -b 10 -d 10 --random --max=500K --min=10  /mnt/master/" and let them sync to the slave.
3. Then run chmod, chown, and chgrp on all the files, using the commands "./crefi.py -n 10 --multi  -b 10 -d 10 --random --max=500K --min=10 --fop=chmod  /mnt/master/", "./crefi.py -n 10 --multi  -b 10 -d 10 --random --max=500K --min=10 --fop=chown  /mnt/master/", "./crefi.py -n 10 --multi  -b 10 -d 10 --random --max=500K --min=10 --fop=chgrp  /mnt/master/"
4. Then truncate all the files on the master using the command "./crefi.py -n 10 --multi  -b 10 -d 10 --random --max=500K --min=10 --fop=truncate  /mnt/master/"

Actual results: The monitor crashed in set_term_handler and the geo-rep status became defunct.

Expected results: The monitor should not crash.

Additional info:
Comment 2 Venky Shankar 2013-09-02 09:45:24 EDT
The monitor crash is a side effect of the worker process dying. This should not happen, because the monitor is responsible for restarting the worker if it gets killed (here that should be within a minute, as that is the interval at which the monitor checks the worker).

The problem here is that the signalling mechanism does not work as expected because the monitor process is multi-threaded, i.e. it runs one thread for each worker on a given node.
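
The ValueError in the traceback can be reproduced outside gsyncd. The following is a minimal sketch, assuming Python 3; it is not the gsyncd code and not the fix applied for this bug. Here set_term_handler is a simplified stand-in for the helper named in the traceback, and try_in_thread and safe_set_term_handler are hypothetical names introduced only for illustration. It shows that signal.signal() raises ValueError when called from a non-main thread, and one possible guard that skips installing the handler in that case.

```python
import signal
import threading


def set_term_handler(hook=signal.SIG_DFL):
    # Simplified stand-in for the helper in the traceback (assumption):
    # it only installs a SIGTERM handler.
    signal.signal(signal.SIGTERM, hook)


def try_in_thread():
    """Call set_term_handler from a non-main thread and collect any error."""
    errors = []

    def target():
        try:
            set_term_handler(lambda *a: None)
        except ValueError as exc:
            # CPython only allows signal.signal() in the main thread.
            errors.append(str(exc))

    t = threading.Thread(target=target)
    t.start()
    t.join()
    return errors


def safe_set_term_handler(hook=signal.SIG_DFL):
    """Hypothetical guard: install the handler only in the main thread.

    Returns True if the handler was installed, False if it was skipped.
    """
    if threading.current_thread() is threading.main_thread():
        signal.signal(signal.SIGTERM, hook)
        return True
    return False
```

Calling try_in_thread() returns a one-element list holding the error message, which matches the failure seen when a per-worker monitor thread reaches set_term_handler. A guard like safe_set_term_handler avoids the crash but also means the handler is not changed from worker threads, so a real fix would still need to decide where signal handling should live.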
Comment 6 Aravinda VK 2014-12-29 01:18:23 EST

*** This bug has been marked as a duplicate of bug 1044420 ***
