Description of problem:
Dist-geo-rep: after a remove-brick commit operation, one geo-rep instance gets killed and syncdutils.py errors are found in the log. The geo-rep session is defunct after that.

Version-Release number of selected component (if applicable):
3.4.0.33rhs-1.el6rhs.x86_64

How reproducible:
Haven't tried.

Steps to Reproduce:
1. Create and start a dist-rep volume and mount it. Start creating data on the master volume from the mount point.

mount point:
# mount | grep remove_xsync
10.70.35.179:/remove_xsync on /mnt/remove_xsync type fuse.glusterfs (rw,default_permissions,allow_other,max_read=131072)
10.70.35.179:/remove_xsync on /mnt/remove_xsync_nfs type nfs (rw,addr=10.70.35.179)

2. Create and start a geo-rep session between the master and slave volumes.

3. Remove brick(s) from the master volume with the start option:
# gluster volume remove-brick remove_xsync 10.70.35.179:/rhs/brick3/x3 10.70.35.235:/rhs/brick3/x3 start

4. Once remove-brick is completed, perform the commit operation:
# gluster volume remove-brick remove_xsync 10.70.35.179:/rhs/brick3/x3 10.70.35.235:/rhs/brick3/x3 status
# gluster volume remove-brick remove_xsync 10.70.35.179:/rhs/brick3/x3 10.70.35.235:/rhs/brick3/x3 commit

[root@old5 ~]# gluster v info remove_change
Volume Name: remove_change
Type: Distributed-Replicate
Volume ID: eb500199-37d4-4cb9-96ed-ae5bc1bf2498
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.35.179:/rhs/brick3/c1
Brick2: 10.70.35.235:/rhs/brick3/c1
Brick3: 10.70.35.179:/rhs/brick3/c2
Brick4: 10.70.35.235:/rhs/brick3/c2
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on

5. After some time the status was defunct and the log had the traceback below.

[root@old6 ~]# gluster vol geo remove_xsync status
NODE                          MASTER         SLAVE                               HEALTH     UPTIME
---------------------------------------------------------------------------------------------------------
old6.lab.eng.blr.redhat.com   remove_xsync   ssh://10.70.37.195::remove_xsync    defunct    N/A
old5.lab.eng.blr.redhat.com   remove_xsync   ssh://10.70.37.195::remove_xsync    Stable     16:11:35

log snippet:
[2013-09-16 14:58:43.673831] E [syncdutils(monitor):207:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 233, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 203, in wmon
    cpid, _ = self.monitor(w, argv, cpids)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 161, in monitor
    self.terminate()
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 89, in terminate
    set_term_handler(lambda *a: set_term_handler())
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 299, in set_term_handler
    signal(SIGTERM, hook)
ValueError: signal only works in main thread
[2013-09-16 14:58:44.734586] E [syncdutils(monitor):207:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 233, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 203, in wmon
    cpid, _ = self.monitor(w, argv, cpids)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 161, in monitor
    self.terminate()
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 89, in terminate
    set_term_handler(lambda *a: set_term_handler())
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 299, in set_term_handler
    signal(SIGTERM, hook)
ValueError: signal only works in main thread
[2013-09-16 14:58:47.82674] I [syncdutils(monitor):159:finalize] <top>: exiting.

Actual results:
Status was defunct and the log has the traceback above.

Expected results:
The log should not have a traceback. If the process was killed for some reason, there should be a log entry for that. Not able to get the reason behind the defunct state.

Additional info:
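The ValueError in the traceback comes from a CPython restriction: signal handlers can only be installed from the main thread, so the signal(SIGTERM, hook) call in syncdutils.set_term_handler() fails when the monitor reaches it from a worker thread. A minimal, gluster-independent sketch of that behaviour (the helper names below are illustrative, not taken from syncdaemon):

import threading
from signal import signal, SIGTERM

def set_term_handler(hook=lambda *a: None):
    # Install a SIGTERM handler, analogous to syncdutils.set_term_handler().
    signal(SIGTERM, hook)

set_term_handler()             # OK: called from the main thread

def worker():
    try:
        set_term_handler()     # same call, but from a non-main thread
    except ValueError as e:
        print("worker thread: %s" % e)   # "signal only works in main thread"

t = threading.Thread(target=worker)
t.start()
t.join()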
Verified with build: glusterfs-3.7.1-10.el6rhs.x86_64. We have an additional step to stop the geo-rep session before the commit. Did not observe the status going to a defunct state. Also, similar bugs 1002991 and 1044420 have been moved to verified. Moving this bug to the verified state as well. Will create or reopen the bug with proper steps to reproduce in case we hit it again.
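For reference, the verification flow with the extra stop step was roughly the following (slave URL taken from the status output above; exact geo-replication command syntax may differ between releases):

# gluster volume geo-replication remove_xsync 10.70.37.195::remove_xsync stop
# gluster volume remove-brick remove_xsync 10.70.35.179:/rhs/brick3/x3 10.70.35.235:/rhs/brick3/x3 commit
# gluster volume geo-replication remove_xsync 10.70.37.195::remove_xsync start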
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-1495.html