Bug 1044420

Summary: dist-geo-rep: Python backtrace seen in geo-rep logs "ValueError: signal only works in main thread"
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: M S Vishwanath Bhat <vbhat>
Component: geo-replication
Assignee: Aravinda VK <avishwan>
Status: CLOSED ERRATA
QA Contact: M S Vishwanath Bhat <vbhat>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 2.1
CC: aavati, asrivast, avishwan, bbandari, csaba, fdacunha, fharshav, mzywusko, nsathyan, pmutha, sdharane, ssamanta, vgaikwad, vkoppad, vshankar, vumrao
Target Milestone: ---
Target Release: RHGS 3.0.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-3.4.0.60geo-1.el6rhs
Doc Type: Bug Fix
Doc Text:
Cause: On a geo-rep worker crash, geo-rep was trying to handle the signal from a worker thread, but due to a limitation in Python, signals can be handled only in the main thread. Consequence: The geo-rep monitor itself dies, and syncing from that node stops. Fix: The geo-rep worker crash is now handled gracefully in the code. Result: Even when a geo-rep worker crashes, the geo-rep monitor does not die. (A short illustration of this Python limitation follows the metadata fields below.)
Story Points: ---
Clone Of:
: 1078068 1126354
Environment:
Last Closed: 2014-09-22 19:30:36 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1073815, 1078068, 1126354
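
The Python limitation named in the Doc Text is easy to demonstrate outside gsyncd. The sketch below is illustrative only (it is not the gsyncd code): installing a SIGTERM handler from a worker thread raises the same ValueError, and catching it rather than letting it escape is the kind of graceful handling the fix describes.

# Illustrative sketch only (not gsyncd code): CPython allows signal
# handlers to be installed only from the main thread; doing it from a
# worker thread raises "ValueError: signal only works in main thread".
import signal
import threading

def set_term_handler():
    signal.signal(signal.SIGTERM, lambda signum, frame: None)

def worker():
    try:
        set_term_handler()          # raises ValueError in a worker thread
    except ValueError as exc:
        # Graceful handling: report the failure and keep going instead of
        # letting the exception escape and take down the monitor.
        print("could not set handler from worker thread: %s" % exc)

t = threading.Thread(target=worker)
t.start()
t.join()

set_term_handler()                  # works fine from the main thread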

Description M S Vishwanath Bhat 2013-12-18 09:45:32 UTC
Description of problem:
I am hitting a backtrace in the geo-rep logs every time during the geo-rep automation run. A possible (unconfirmed) impact of this is that the status always gets stuck in the Initializing phase.

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.50rhs-1.el6rhs.x86_64

How reproducible:
I have reproduced it many times in automation, but haven't tried manually.

Steps to Reproduce:
1. Run a simple geo-rep start and status with some I/O on the mount point.

Actual results:
[2013-12-17 14:10:25.543190] I [master(/bricks/common_masterg_brick0):1065:crawl] _GMaster: slave's time: (1387307406, 0)
[2013-12-17 14:10:41.187948] E [syncdutils(monitor):240:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 266, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 190, in wmon
    cpid, _ = self.monitor(w, argv, cpids)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 161, in monitor
    self.terminate()
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 89, in terminate
    set_term_handler(lambda *a: set_term_handler())
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 332, in set_term_handler
    signal(SIGTERM, hook)
ValueError: signal only works in main thread
[2013-12-17 14:10:41.189666] I [syncdutils(monitor):192:finalize] <top>: exiting.


Expected results:
There should be no Python backtraces and no errors in the logs for a simple geo-rep status.

Additional info:

The geo-rep status output was stuck in the "Initializing" phase the whole time.

MASTER NODE                          MASTER VOL        MASTER BRICK                     SLAVE                                               STATUS             CHECKPOINT STATUS    CRAWL STATUS        
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
rhsauto028.lab.eng.blr.redhat.com    common_masterg    /bricks/common_masterg_brick0    rhsauto023.lab.eng.blr.redhat.com::common_slaveg    Initializing...    N/A                  N/A                 
rhsauto028.lab.eng.blr.redhat.com    common_masterg    /bricks/common_masterg_brick2    rhsauto021.lab.eng.blr.redhat.com::common_slaveg    Initializing...    N/A                  N/A                 
rhsauto030.lab.eng.blr.redhat.com    common_masterg    /bricks/common_masterg_brick1    rhsauto021.lab.eng.blr.redhat.com::common_slaveg    Initializing...    N/A                  N/A                 
rhsauto030.lab.eng.blr.redhat.com    common_masterg    /bricks/common_masterg_brick3    rhsauto023.lab.eng.blr.redhat.com::common_slaveg    Initializing...    N/A                  N/A

Comment 1 M S Vishwanath Bhat 2013-12-18 09:46:23 UTC
Beaker job to get the logs:


https://beaker.engineering.redhat.com/jobs/562735

Comment 2 M S Vishwanath Bhat 2013-12-18 09:50:08 UTC
After logging into the machine, I found some more backtraces complaining that a file does not exist. That may explain the status being stuck in the 'Initializing' state. But how can that state be reached?


[2013-12-17 14:21:59.510833] I [master(/bricks/common_masterg_brick0):451:crawlwrap] _GMaster: 20 crawls, 0 turns
[2013-12-17 14:22:50.730902] I [master(/bricks/common_masterg_brick0):918:update_worker_status] _GMaster: Creating new /var/lib/glusterd/geo-replicati
on/common_masterg_rhsauto021.lab.eng.blr.redhat.com_common_slaveg/_bricks_common_masterg_brick0.status
[2013-12-17 14:22:50.731693] E [syncdutils(/bricks/common_masterg_brick0):240:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 150, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 540, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1157, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 459, in crawlwrap
    self.update_worker_remote_node()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 968, in update_worker_remote_node
    self.update_worker_status ('remote_node', remote_node)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 920, in update_worker_status
    with open(worker_status_file, 'wb') as f:
IOError: [Errno 2] No such file or directory: '/var/lib/glusterd/geo-replication/common_masterg_rhsauto021.lab.eng.blr.redhat.com_common_slaveg/_bricks_common_masterg_brick0.status'
[2013-12-17 14:22:50.735240] I [syncdutils(/bricks/common_masterg_brick0):192:finalize] <top>: exiting.
[2013-12-17 14:22:51.330389] I [master(/bricks/common_masterg_brick2):918:update_worker_status] _GMaster: Creating new /var/lib/glusterd/geo-replication/common_masterg_rhsauto021.lab.eng.blr.redhat.com_common_slaveg/_bricks_common_masterg_brick2.status
[2013-12-17 14:22:51.331115] E [syncdutils(/bricks/common_masterg_brick2):240:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):


But how can the status file disappear?
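
A hedged note on the IOError above (an observation, not a description of the actual fix): open() with mode 'wb' creates the file if it is missing, so ENOENT here can only mean the parent session directory under /var/lib/glusterd/geo-replication/ disappeared while the worker was running. A defensive writer could recreate the directory before writing, roughly like this hypothetical variant (not the real update_worker_status code):

# Hypothetical defensive status-file write (illustration only):
# recreate the session directory if it has vanished, then write.
import errno
import os

def write_worker_status(worker_status_file, payload):
    status_dir = os.path.dirname(worker_status_file)
    try:
        os.makedirs(status_dir)
    except OSError as exc:
        if exc.errno != errno.EEXIST:   # an existing directory is fine
            raise
    with open(worker_status_file, 'wb') as f:
        f.write(payload)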

Comment 4 Vijaykumar Koppad 2014-01-29 06:43:04 UTC
This has happened in build glusterfs-3.4.0.58rhs-1, and I have seen it many times:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2014-01-28 19:03:39.467679] E [syncdutils(monitor):240:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 266, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 190, in wmon
    cpid, _ = self.monitor(w, argv, cpids)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 161, in monitor
    self.terminate()
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 89, in terminate
    set_term_handler(lambda *a: set_term_handler())
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 332, in set_term_handler
    signal(SIGTERM, hook)
ValueError: signal only works in main thread
[2014-01-28 19:03:39.469316] I [syncdutils(monitor):192:finalize] <top>: exiting.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Once this crash happens, the status gets stuck in the Initializing state even though data continues to sync.

Comment 5 Vijaykumar Koppad 2014-02-12 11:24:22 UTC
This issue has been hit many times. When such a backtrace occurs, the monitor has been killed, and the status gets stuck in whatever state it was in before the monitor died.

In this case:


>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2014-02-12 16:42:03.223953] E [resource(/bricks/master_brick6):207:logerr] Popen: ssh> Write failed: Broken pipe
[2014-02-12 16:42:03.225075] I [syncdutils(/bricks/master_brick6):192:finalize] <top>: exiting.
[2014-02-12 16:42:03.238873] I [monitor(monitor):81:set_state] Monitor: new state: faulty
[2014-02-12 16:42:13.270226] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------
[2014-02-12 16:42:13.270837] I [monitor(monitor):130:monitor] Monitor: starting gsyncd worker
[2014-02-12 16:42:13.614228] I [gsyncd(/bricks/master_brick6):530:main_i] <top>: syncing: gluster://localhost:master -> ssh://root.43.174:gluster://localhost:slave
[2014-02-12 16:42:13.648677] E [syncdutils(/bricks/master_brick6):223:log_raise_exception] <top>: connection to peer is broken
[2014-02-12 16:42:13.649338] E [resource(/bricks/master_brick6):204:errlog] Popen: command "ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-CobNBj/061fc87d252b63093ab9bfb765588973.sock root.43.174 /nonexistent/gsyncd --session-owner f8513fdd-9281-49c6-a691-4f8634794875 -N --listen --timeout 120 gluster://localhost:slave" returned with 255, saying:
[2014-02-12 16:42:13.649745] E [resource(/bricks/master_brick6):207:logerr] Popen: ssh> ssh: connect to host 10.70.43.174 port 22: Connection refused
[2014-02-12 16:42:13.650429] I [syncdutils(/bricks/master_brick6):192:finalize] <top>: exiting.
[2014-02-12 16:42:23.664001] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------
[2014-02-12 16:42:23.664845] I [monitor(monitor):130:monitor] Monitor: starting gsyncd worker
[2014-02-12 16:42:24.51059] I [gsyncd(/bricks/master_brick6):530:main_i] <top>: syncing: gluster://localhost:master -> ssh://root.43.174:gluster://localhost:slave
[2014-02-12 16:42:55.674746] I [master(/bricks/master_brick10):451:crawlwrap] _GMaster: 0 crawls, 0 turns
[2014-02-12 16:42:56.537420] I [master(/bricks/master_brick2):451:crawlwrap] _GMaster: 0 crawls, 0 turns
[2014-02-12 16:43:23.715070] E [syncdutils(monitor):240:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 266, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 190, in wmon
    cpid, _ = self.monitor(w, argv, cpids)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 161, in monitor
    self.terminate()
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 89, in terminate
    set_term_handler(lambda *a: set_term_handler())
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 332, in set_term_handler
    signal(SIGTERM, hook)
ValueError: signal only works in main thread
[2014-02-12 16:43:23.716845] I [syncdutils(monitor):192:finalize] <top>: exiting.
[2014-02-12 16:43:25.311342] E [syncdutils(/bricks/master_brick6):223:log_raise_exception] <top>: connection to peer is broken
[2014-02-12 16:43:25.313623] E [resource(/bricks/master_brick6):204:errlog] Popen: command "ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-UYesaG/061fc87d252b63093ab9bfb765588973.sock root.43.174 /nonexistent/gsyncd --session-owner f8513fdd-9281-49c6-a691-4f8634794875 -N --listen --timeout 120 gluster://localhost:slave" returned with 1, saying:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>


Here, the status was Faulty before the monitor died, and it remained in that state. Although only the worker for brick6 got disconnected and died, the status kept showing Faulty for all the bricks.
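
To make the observed behaviour concrete: the monitor runs one watcher thread per brick, and once an exception escapes such a thread the monitor process exits and never updates the status again, so whatever state was last recorded (Faulty here, Initializing in the earlier comments) is what keeps being shown. The following is only a schematic sketch of that pattern, not the gsyncd source; containing the failure inside the thread wrapper is essentially what the fix does.

# Schematic sketch (not the gsyncd source) of per-brick watcher threads
# whose wrapper contains failures so the monitor process stays alive and
# can keep updating status, instead of exiting on the first worker crash.
import threading
import traceback

def twrap(target, *args):
    try:
        target(*args)
    except Exception:
        # Contain the failure: log it and let the monitor keep running.
        traceback.print_exc()

def watch_brick(brick):
    # Placeholder for the per-brick monitor loop; here it simply fails.
    raise RuntimeError("worker for %s crashed" % brick)

threads = [threading.Thread(target=twrap, args=(watch_brick, b))
           for b in ("brick6", "brick10")]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("monitor still alive after worker failures")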

Comment 24 M S Vishwanath Bhat 2014-04-08 07:07:44 UTC
Moving the bug to Verified, since I did not hit it in two Beaker job runs.

Comment 25 M S Vishwanath Bhat 2014-04-08 15:35:16 UTC
I ran the basic automated test in Beaker where we used to hit this same issue. The test case creates some files on the master, tries to sync them to the slave, and performs some CLI operations such as checkpoint on the master side. The syncing method is rsync.

In earlier builds we saw this issue pretty much all the time, but after the fix, in the 60geo build, I have not seen it. vkoppad, who had hit this issue in manual testing, also did not see it in 60geo. So I moved the bug to Verified.

Comment 27 Nagaprasad Sathyanarayana 2014-05-19 10:56:40 UTC
Setting flags required to add BZs to RHS 3.0 Errata

Comment 28 Aravinda VK 2014-07-07 06:55:50 UTC
*** Bug 1062159 has been marked as a duplicate of this bug. ***

Comment 30 errata-xmlrpc 2014-09-22 19:30:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html

Comment 31 Aravinda VK 2014-12-29 06:18:23 UTC
*** Bug 1002468 has been marked as a duplicate of this bug. ***

Comment 32 Aravinda VK 2015-01-05 08:34:47 UTC
*** Bug 1064598 has been marked as a duplicate of this bug. ***