Bug 1002468 - Dist-geo-rep : geo-rep monitor process crashed in set_term_handler after worker crashed with IO error.
Summary: Dist-geo-rep : geo-rep monitor process crashed in set_term_handler after worker crashed with IO error.
Keywords:
Status: CLOSED DUPLICATE of bug 1044420
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: geo-replication
Version: 2.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Aravinda VK
QA Contact: amainkar
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-08-29 09:36 UTC by Vijaykumar Koppad
Modified: 2015-05-12 18:18 UTC (History)
CC: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-12-29 06:18:23 UTC
Embargoed:



Description Vijaykumar Koppad 2013-08-29 09:36:31 UTC
Description of problem: The geo-rep monitor process crashed in set_term_handler after a worker crashed with an IO error.

The backtrace was as follows:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-08-29 13:00:53.777460] E [syncdutils(/bricks/brick1):206:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 133, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 513, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1062, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 369, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 783, in crawl
    self.process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 744, in process
    if self.process_change(change, done, retry):
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 726, in process_change
    if self.syncdata(datas):
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 623, in syncdata
    if self.wait(self.FLAT_DIR_HIERARCHY, None):
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 545, in wait
    ret = j[-1]()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 527, in <lambda>
    self.jobtab[path].append((label, a, lambda : job(*a, **kw)))
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 617, in regjob
    st = lstat(se)
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 392, in lstat

    return os.lstat(e)
OSError: [Errno 5] Input/output error: '.gfid/39288d7f-ca78-40a6-96c9-ae22b4cd2e40'
[2013-08-29 13:00:53.778892] I [syncdutils(/bricks/brick1):158:finalize] <top>: exiting.
[2013-08-29 13:00:53.785870] I [monitor(monitor):81:set_state] Monitor: new state: faulty
[2013-08-29 13:01:03.798213] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------
[2013-08-29 13:01:03.798502] I [monitor(monitor):130:monitor] Monitor: starting gsyncd worker
[2013-08-29 13:01:03.897695] I [gsyncd(/bricks/brick1):503:main_i] <top>: syncing: gluster://localhost:master -> ssh://root.43.25:gluster://localhost:slave
[2013-08-29 13:02:03.859800] E [syncdutils(monitor):206:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 232, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 203, in wmon
    cpid, _ = self.monitor(w, argv, cpids)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 161, in monitor
    self.terminate()
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 89, in terminate
    set_term_handler(lambda *a: set_term_handler())
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 298, in set_term_handler
    signal(SIGTERM, hook)
ValueError: signal only works in main thread
[2013-08-29 13:02:03.860609] I [syncdutils(monitor):158:finalize] <top>: exiting.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
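The final ValueError above is the monitor crash itself: terminate() re-registers the SIGTERM handler via set_term_handler(), but Python only allows signal() to be called from the main thread, while the monitor runs one thread per worker. A minimal standalone sketch (not gsyncd code, only the Python standard library) that reproduces the same error:

# Sketch: calling signal.signal() from a non-main thread raises the same
# "signal only works in main thread" ValueError seen in the traceback above.
import signal
import threading

def register_term_handler():
    try:
        signal.signal(signal.SIGTERM, signal.SIG_DFL)
    except ValueError as err:
        print("monitor thread would die here: %s" % err)

t = threading.Thread(target=register_term_handler)
t.start()
t.join()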


Version-Release number of selected component (if applicable): glusterfs-3.4.0.24rhs-1.el6rhs.x86_64


How reproducible: Did not attempt to reproduce.


Steps to Reproduce:
These are the steps performed before hitting the crash:
1. Create and start a geo-rep relationship between the master (dist-rep) and the slave (dist-rep).
2. Create files on the master as a non-root user using the command "./crefi.py -n 10 --multi -b 10 -d 10 --random --max=500K --min=10 /mnt/master/" and let them sync to the slave.
3. Run chmod, chown, and chgrp on all the files using the commands "./crefi.py -n 10 --multi -b 10 -d 10 --random --max=500K --min=10 --fop=chmod /mnt/master/", "./crefi.py -n 10 --multi -b 10 -d 10 --random --max=500K --min=10 --fop=chown /mnt/master/", and "./crefi.py -n 10 --multi -b 10 -d 10 --random --max=500K --min=10 --fop=chgrp /mnt/master/".
4. Truncate all the files on the master using the command "./crefi.py -n 10 --multi -b 10 -d 10 --random --max=500K --min=10 --fop=truncate /mnt/master/".

Actual results: The monitor crashed in "set_term_handler" and the geo-rep status became defunct.


Expected results: The monitor should not crash.


Additional info:

Comment 2 Venky Shankar 2013-09-02 13:45:24 UTC
The monitor crash is a side effect of the worker process dying, and it obviously should not happen: the monitor is responsible for restarting the worker if it gets killed (here within a minute, as that is the interval at which the monitor checks on worker health).

The problem here is that the signalling mechanism does not work as expected because the monitor process is multi-threaded, i.e. there is one thread per worker on a given node.
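
One possible way to avoid the crash, shown here only as a hedged sketch and not as the actual fix tracked in bug 1044420, is to skip the handler registration when set_term_handler() is invoked from a non-main thread (the helper name mirrors syncdutils.py; the main-thread check uses the Python 3 threading API for brevity):

# Hypothetical guard, for illustration only.
import signal
import threading

def set_term_handler(hook=signal.SIG_DFL):
    # signal() may only be called from the main thread; per-worker monitor
    # threads skip re-registering the SIGTERM handler instead of crashing.
    if threading.current_thread() is not threading.main_thread():
        return
    signal.signal(signal.SIGTERM, hook)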

Comment 6 Aravinda VK 2014-12-29 06:18:23 UTC

*** This bug has been marked as a duplicate of bug 1044420 ***

