Description of problem:
geo-rep monitor crashed in set_term_handler after the worker crashed with an I/O error. The backtrace was:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-08-29 13:00:53.777460] E [syncdutils(/bricks/brick1):206:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 133, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 513, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1062, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 369, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 783, in crawl
    self.process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 744, in process
    if self.process_change(change, done, retry):
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 726, in process_change
    if self.syncdata(datas):
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 623, in syncdata
    if self.wait(self.FLAT_DIR_HIERARCHY, None):
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 545, in wait
    ret = j[-1]()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 527, in <lambda>
    self.jobtab[path].append((label, a, lambda : job(*a, **kw)))
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 617, in regjob
    st = lstat(se)
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 392, in lstat
    return os.lstat(e)
OSError: [Errno 5] Input/output error: '.gfid/39288d7f-ca78-40a6-96c9-ae22b4cd2e40'
[2013-08-29 13:00:53.778892] I [syncdutils(/bricks/brick1):158:finalize] <top>: exiting.
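For context on why the worker dies here: the traceback ends in the lstat() wrapper in syncdutils.py. A minimal sketch of such a wrapper is shown below (hypothetical, not the actual glusterfs source): expected "file vanished" errors are swallowed and returned as errno values, while anything unexpected, such as the EIO seen above, propagates and kills the worker.

```python
import errno
import os

def lstat(path):
    # Hypothetical sketch of a tolerant lstat wrapper. Races with
    # concurrent deletions (ENOENT) or stale NFS/gfid handles (ESTALE)
    # are expected during a crawl, so those are reported as errno
    # values for the caller to inspect. An unexpected error such as
    # EIO is re-raised -- in the report above this is what escapes
    # up through regjob() and terminates the worker.
    try:
        return os.lstat(path)
    except OSError as ex:
        if ex.errno in (errno.ENOENT, errno.ESTALE):
            return ex.errno
        raise
```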
[2013-08-29 13:00:53.785870] I [monitor(monitor):81:set_state] Monitor: new state: faulty
[2013-08-29 13:01:03.798213] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------
[2013-08-29 13:01:03.798502] I [monitor(monitor):130:monitor] Monitor: starting gsyncd worker
[2013-08-29 13:01:03.897695] I [gsyncd(/bricks/brick1):503:main_i] <top>: syncing: gluster://localhost:master -> ssh://root.43.25:gluster://localhost:slave
[2013-08-29 13:02:03.859800] E [syncdutils(monitor):206:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 232, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 203, in wmon
    cpid, _ = self.monitor(w, argv, cpids)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 161, in monitor
    self.terminate()
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 89, in terminate
    set_term_handler(lambda *a: set_term_handler())
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 298, in set_term_handler
    signal(SIGTERM, hook)
ValueError: signal only works in main thread
[2013-08-29 13:02:03.860609] I [syncdutils(monitor):158:finalize] <top>: exiting.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.24rhs-1.el6rhs.x86_64

How reproducible:
Didn't try to reproduce.

Steps to Reproduce:
These are the steps that were run before hitting this crash:
1. Create and start a geo-rep relationship between a master (dist-rep) and a slave (dist-rep) volume.
2. Create files on the master as a non-root user with the command
   "./crefi.py -n 10 --multi -b 10 -d 10 --random --max=500K --min=10 /mnt/master/"
   and let them sync to the slave.
3. Then chmod, chown, and chgrp all the files, using the commands
   "./crefi.py -n 10 --multi -b 10 -d 10 --random --max=500K --min=10 --fop=chmod /mnt/master/",
   "./crefi.py -n 10 --multi -b 10 -d 10 --random --max=500K --min=10 --fop=chown /mnt/master/",
   "./crefi.py -n 10 --multi -b 10 -d 10 --random --max=500K --min=10 --fop=chgrp /mnt/master/"
4. Truncate all the files on the master using the command
   "./crefi.py -n 10 --multi -b 10 -d 10 --random --max=500K --min=10 --fop=truncate /mnt/master/"

Actual results:
The monitor crashed in set_term_handler and the geo-rep status became defunct.

Expected results:
The monitor should not crash.

Additional info:
The monitor crash is a side effect of the worker process dying, which should not happen: the monitor is responsible for restarting the worker if it gets killed (here within a minute, as that is the interval at which the monitor checks the worker's health). The problem is that the signalling mechanism does not work as expected because the monitor process is multi-threaded, i.e. it runs one thread per worker on a given node, and signal handlers can only be installed from the main thread.
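The ValueError in the traceback can be reproduced in isolation: CPython's signal.signal() may only be called from the main thread, and raises ValueError anywhere else. A minimal sketch (the names worker_monitor and install_term_handler are illustrative stand-ins for wmon/terminate/set_term_handler in monitor.py and syncdutils.py, not the actual source):

```python
import signal
import threading

def install_term_handler():
    # Stand-in for set_term_handler(): installs a SIGTERM handler.
    # signal.signal() is only legal in the main thread; called from
    # any other thread it raises ValueError, which is exactly the
    # monitor crash in the traceback above.
    signal.signal(signal.SIGTERM, signal.SIG_DFL)

errors = []

def worker_monitor():
    # Stand-in for a per-worker monitor thread that ends up calling
    # the termination path, and with it the signal handler setup.
    try:
        install_term_handler()
    except ValueError as ex:
        errors.append(ex)

t = threading.Thread(target=worker_monitor)
t.start()
t.join()
```

After join(), errors holds a single ValueError ("signal only works in main thread ..."), matching the log. The usual fix is to install signal handlers once from the main thread and have worker threads request termination through another mechanism (an event or pipe) instead of touching signal() themselves.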
*** This bug has been marked as a duplicate of bug 1044420 ***