Description of problem:
Dist-geo-rep: after a remove-brick commit operation, one geo-rep instance gets killed and syncdutils.py errors are found in the log. The geo-rep session is defunct after that.

Version-Release number of selected component (if applicable):
3.4.0.33rhs-1.el6rhs.x86_64

How reproducible:
Haven't tried.

Steps to Reproduce:
1. Create and start a dist-rep volume and mount it. Start creating data on the master volume from the mount point.

mount point:
# mount | grep remove_xsync
10.70.35.179:/remove_xsync on /mnt/remove_xsync type fuse.glusterfs (rw,default_permissions,allow_other,max_read=131072)
10.70.35.179:/remove_xsync on /mnt/remove_xsync_nfs type nfs (rw,addr=10.70.35.179)

2. Create and start a geo-rep session between the master and slave volumes.

3. Remove brick(s) from the master volume with the start option:
# gluster volume remove-brick remove_xsync 10.70.35.179:/rhs/brick3/x3 10.70.35.235:/rhs/brick3/x3 start

4. Once remove-brick is completed, perform the commit operation:
# gluster volume remove-brick remove_xsync 10.70.35.179:/rhs/brick3/x3 10.70.35.235:/rhs/brick3/x3 status
# gluster volume remove-brick remove_xsync 10.70.35.179:/rhs/brick3/x3 10.70.35.235:/rhs/brick3/x3 commit

[root@old5 ~]# gluster v info remove_change
Volume Name: remove_change
Type: Distributed-Replicate
Volume ID: eb500199-37d4-4cb9-96ed-ae5bc1bf2498
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.35.179:/rhs/brick3/c1
Brick2: 10.70.35.235:/rhs/brick3/c1
Brick3: 10.70.35.179:/rhs/brick3/c2
Brick4: 10.70.35.235:/rhs/brick3/c2
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on

5. After some time the status was defunct and the log had the traceback below.

[root@old6 ~]# gluster vol geo remove_xsync status
NODE                          MASTER         SLAVE                               HEALTH     UPTIME
---------------------------------------------------------------------------------------------------------
old6.lab.eng.blr.redhat.com   remove_xsync   ssh://10.70.37.195::remove_xsync    defunct    N/A
old5.lab.eng.blr.redhat.com   remove_xsync   ssh://10.70.37.195::remove_xsync    Stable     16:11:35

log snippet:
[2013-09-16 14:58:43.673831] E [syncdutils(monitor):207:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 233, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 203, in wmon
    cpid, _ = self.monitor(w, argv, cpids)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 161, in monitor
    self.terminate()
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 89, in terminate
    set_term_handler(lambda *a: set_term_handler())
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 299, in set_term_handler
    signal(SIGTERM, hook)
ValueError: signal only works in main thread
[2013-09-16 14:58:44.734586] E [syncdutils(monitor):207:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 233, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 203, in wmon
    cpid, _ = self.monitor(w, argv, cpids)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 161, in monitor
    self.terminate()
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 89, in terminate
    set_term_handler(lambda *a: set_term_handler())
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 299, in set_term_handler
    signal(SIGTERM, hook)
ValueError: signal only works in main thread
[2013-09-16 14:58:47.82674] I [syncdutils(monitor):159:finalize] <top>: exiting.

Actual results:
Status was defunct and the log has the traceback above.

Expected results:
The log should not have a traceback. If the process was killed for some reason, there should be a log entry for that. Not able to get the reason behind the defunct state.

Additional info:
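The ValueError in the traceback comes from a CPython restriction: signal handlers can only be installed from the main thread, so the signal(SIGTERM, hook) call in syncdutils.set_term_handler() fails when the monitor reaches it from a worker thread. A minimal, gluster-independent sketch of that behaviour (the helper names below are illustrative, not taken from syncdaemon):

import threading
from signal import signal, SIGTERM

def set_term_handler(hook=lambda *a: None):
    # Install a SIGTERM handler, analogous to syncdutils.set_term_handler().
    signal(SIGTERM, hook)

set_term_handler()             # OK: called from the main thread

def worker():
    try:
        set_term_handler()     # same call, but from a non-main thread
    except ValueError as e:
        print("worker thread: %s" % e)   # "signal only works in main thread"

t = threading.Thread(target=worker)
t.start()
t.join()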
Verified with build: glusterfs-3.7.1-10.el6rhs.x86_64. We have an additional step to stop the geo-rep session before the commit. Did not observe the status going to a defunct state. Also, similar bugs 1002991 and 1044420 have been moved to verified. Moving this bug to the verified state as well. Will create or reopen the bug with proper steps to reproduce in case we hit it again.
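For reference, the verification flow with the extra stop step was roughly the following (slave URL taken from the status output above; exact geo-replication command syntax may differ between releases):

# gluster volume geo-replication remove_xsync 10.70.37.195::remove_xsync stop
# gluster volume remove-brick remove_xsync 10.70.35.179:/rhs/brick3/x3 10.70.35.235:/rhs/brick3/x3 commit
# gluster volume geo-replication remove_xsync 10.70.37.195::remove_xsync start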
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-1495.html