Bug 1010327

Summary: Dist-geo-rep : session status is defunct after syncdutils.py errors in log
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Rachana Patel <racpatel>
Component: geo-replication Assignee: Aravinda VK <avishwan>
Status: CLOSED ERRATA QA Contact: Rahul Hinduja <rhinduja>
Severity: medium Docs Contact:
Priority: medium    
Version: 2.1 CC: aavati, annair, avishwan, csaba, mzywusko, rhinduja, vagarwal
Target Milestone: ---   
Target Release: RHGS 3.1.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: glusterfs-3.7.0-2.el6rhs Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-07-29 04:29:08 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1202842, 1223636    

Description Rachana Patel 2013-09-20 14:07:41 UTC
Description of problem:
Dist-geo-rep: after a remove-brick commit operation, one geo-rep instance gets killed and syncdutils.py errors are found in the log. The geo-rep session is defunct after that.

Version-Release number of selected component (if applicable):
3.4.0.33rhs-1.el6rhs.x86_64

How reproducible:
haven't tried

Steps to Reproduce:
1. Create and start a dist-rep volume and mount it. Start creating data on the master volume from the mount point.

mount point:-
mount | grep remove_xsync
10.70.35.179:/remove_xsync on /mnt/remove_xsync type fuse.glusterfs (rw,default_permissions,allow_other,max_read=131072)
10.70.35.179:/remove_xsync on /mnt/remove_xsync_nfs type nfs (rw,addr=10.70.35.179)

2. Create and start a geo-rep session between the master and slave volumes.

3. Remove brick(s) from the master volume with the start option.

--> gluster volume remove-brick remove_xsync 10.70.35.179:/rhs/brick3/x3 10.70.35.235:/rhs/brick3/x3 start

4. Once remove-brick is completed, perform the commit operation:
 gluster volume remove-brick remove_xsync 10.70.35.179:/rhs/brick3/x3 10.70.35.235:/rhs/brick3/x3 status
 gluster volume remove-brick remove_xsync 10.70.35.179:/rhs/brick3/x3 10.70.35.235:/rhs/brick3/x3 commit

[root@old5 ~]# gluster v info remove_change
 
Volume Name: remove_change
Type: Distributed-Replicate
Volume ID: eb500199-37d4-4cb9-96ed-ae5bc1bf2498
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.35.179:/rhs/brick3/c1
Brick2: 10.70.35.235:/rhs/brick3/c1
Brick3: 10.70.35.179:/rhs/brick3/c2
Brick4: 10.70.35.235:/rhs/brick3/c2
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on

5. After some time the status was defunct and the log has a traceback as below.
[root@old6 ~]# gluster vol geo remove_xsync status
	NODE                           MASTER          SLAVE                               HEALTH     UPTIME         
---------------------------------------------------------------------------------------------------------
old6.lab.eng.blr.redhat.com    remove_xsync    ssh://10.70.37.195::remove_xsync    defunct    N/A            
old5.lab.eng.blr.redhat.com    remove_xsync    ssh://10.70.37.195::remove_xsync    Stable     16:11:35   


log snippet:-
[2013-09-16 14:58:43.673831] E [syncdutils(monitor):207:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 233, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 203, in wmon
    cpid, _ = self.monitor(w, argv, cpids)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 161, in monitor
    self.terminate()
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 89, in terminate
    set_term_handler(lambda *a: set_term_handler())
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 299, in set_term_handler
    signal(SIGTERM, hook)
ValueError: signal only works in main thread
[2013-09-16 14:58:44.734586] E [syncdutils(monitor):207:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 233, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 203, in wmon
    cpid, _ = self.monitor(w, argv, cpids)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 161, in monitor
    self.terminate()
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 89, in terminate
    set_term_handler(lambda *a: set_term_handler())
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 299, in set_term_handler
    signal(SIGTERM, hook)
ValueError: signal only works in main thread
[2013-09-16 14:58:47.82674] I [syncdutils(monitor):159:finalize] <top>: exiting.
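
For context, the "ValueError: signal only works in main thread" at the end of the traceback comes from a CPython restriction: signal handlers can only be installed from the main interpreter thread, while set_term_handler() is being reached here from a worker thread (twrap/wmon). A minimal standalone sketch (not the syncdaemon code itself) that reproduces the error and shows one possible guard; install_term_handler() is a hypothetical helper used only for illustration:

import signal
import threading

def install_term_handler():
    # Registers a SIGTERM handler, roughly what set_term_handler() does.
    signal.signal(signal.SIGTERM, lambda signum, frame: None)

def worker():
    try:
        install_term_handler()
    except ValueError as err:
        # Raised because signal.signal() is called outside the main thread.
        print("worker thread:", err)

t = threading.Thread(target=worker)
t.start()
t.join()

# One possible guard (an assumption, not necessarily the shipped fix):
# only touch signal handling when running in the main thread.
if threading.current_thread() is threading.main_thread():
    install_term_handler()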

Actual results:
The status was defunct and the log has a traceback.

Expected results:
The log should not have a traceback. If the process was killed for some reason, there should be a log entry for it. Currently it is not possible to determine the reason for the defunct status.

Additional info:

Comment 9 Rahul Hinduja 2015-07-16 12:29:10 UTC
Verified with build: glusterfs-3.7.1-10.el6rhs.x86_64

We have an additional step to stop the geo-rep session before commit. Did not observe the status going to a defunct state. Similar bugs 1002991 and 1044420 have also been moved to verified.

Moving this bug to the verified state too. Will create or reopen the bug with proper steps to reproduce in case we hit it again.

Comment 12 errata-xmlrpc 2015-07-29 04:29:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1495.html