Bug 1044420
| Summary: | dist-geo-rep: Python backtrace seen in geo-rep logs "ValueError: signal only works in main thread" | | |
| --- | --- | --- | --- |
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | M S Vishwanath Bhat <vbhat> |
| Component: | geo-replication | Assignee: | Aravinda VK <avishwan> |
| Status: | CLOSED ERRATA | QA Contact: | M S Vishwanath Bhat <vbhat> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 2.1 | CC: | aavati, asrivast, avishwan, bbandari, csaba, fdacunha, fharshav, mzywusko, nsathyan, pmutha, sdharane, ssamanta, vgaikwad, vkoppad, vshankar, vumrao |
| Target Milestone: | --- | | |
| Target Release: | RHGS 3.0.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-3.4.0.60geo-1.el6rhs | Doc Type: | Bug Fix |
| Story Points: | --- | | |
| Clone Of: | | | |
| : | 1078068 1126354 (view as bug list) | Environment: | |
| Last Closed: | 2014-09-22 19:30:36 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1073815, 1078068, 1126354 | | |

Doc Text:

Cause: When a geo-rep worker crashed, geo-rep tried to handle the resulting signal from a worker thread; because of a limitation in Python, signal handlers can be set only in the main thread.

Consequence: The geo-rep monitor itself dies, and syncing stops from that node.

Fix: Geo-rep worker crashes are now handled gracefully in the code.

Result: Even when a geo-rep worker crashes, the geo-rep monitor no longer dies.
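The Doc Text above comes down to a standard Python constraint: signal handlers can only be installed from the main thread, so when the monitor's worker-supervising thread calls set_term_handler() (which calls signal(SIGTERM, hook), as the tracebacks in the comments below show), Python raises the ValueError and the thread dies. The following Python 3 sketch is illustrative only: it is not the gsyncd source, and the main-thread guard it shows is just one possible defensive pattern, not necessarily the shipped fix.

```python
# Minimal, self-contained sketch (not the gsyncd code) of the limitation and
# one defensive pattern: only (re)install signal handlers from the main thread.
import signal
import threading

def set_term_handler_guarded(handler):
    # Hypothetical guard: skip the request when not running in the main
    # thread, instead of letting ValueError kill the calling thread.
    if threading.current_thread() is not threading.main_thread():
        return False
    signal.signal(signal.SIGTERM, handler)
    return True

def worker():
    try:
        # This is the kind of call the monitor thread made via
        # set_term_handler(); off the main thread it raises ValueError.
        signal.signal(signal.SIGTERM, signal.SIG_DFL)
    except ValueError as exc:
        print("worker thread:", exc)  # e.g. "signal only works in main thread"
    # With the guard, the same request is simply a no-op here:
    print("guarded call returned:", set_term_handler_guarded(signal.SIG_DFL))

t = threading.Thread(target=worker)
t.start()
t.join()
```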
Description
M S Vishwanath Bhat
2013-12-18 09:45:32 UTC
Beaker job ID to get the logs: https://beaker.engineering.redhat.com/jobs/562735

After logging into the machine, I found some more backtraces complaining that a file does not exist. That may explain the status being stuck in the 'Initializing' state. But how can that state be reached?

[2013-12-17 14:21:59.510833] I [master(/bricks/common_masterg_brick0):451:crawlwrap] _GMaster: 20 crawls, 0 turns
[2013-12-17 14:22:50.730902] I [master(/bricks/common_masterg_brick0):918:update_worker_status] _GMaster: Creating new /var/lib/glusterd/geo-replication/common_masterg_rhsauto021.lab.eng.blr.redhat.com_common_slaveg/_bricks_common_masterg_brick0.status
[2013-12-17 14:22:50.731693] E [syncdutils(/bricks/common_masterg_brick0):240:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 150, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 540, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1157, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 459, in crawlwrap
    self.update_worker_remote_node()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 968, in update_worker_remote_node
    self.update_worker_status ('remote_node', remote_node)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 920, in update_worker_status
    with open(worker_status_file, 'wb') as f:
IOError: [Errno 2] No such file or directory: '/var/lib/glusterd/geo-replication/common_masterg_rhsauto021.lab.eng.blr.redhat.com_common_slaveg/_bricks_common_masterg_brick0.status'
[2013-12-17 14:22:50.735240] I [syncdutils(/bricks/common_masterg_brick0):192:finalize] <top>: exiting.
[2013-12-17 14:22:51.330389] I [master(/bricks/common_masterg_brick2):918:update_worker_status] _GMaster: Creating new /var/lib/glusterd/geo-replication/common_masterg_rhsauto021.lab.eng.blr.redhat.com_common_slaveg/_bricks_common_masterg_brick2.status
[2013-12-17 14:22:51.331115] E [syncdutils(/bricks/common_masterg_brick2):240:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):

But how can the status file disappear? This happened with the build glusterfs-3.4.0.58rhs-1, and I have seen it many times.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2014-01-28 19:03:39.467679] E [syncdutils(monitor):240:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 266, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 190, in wmon
    cpid, _ = self.monitor(w, argv, cpids)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 161, in monitor
    self.terminate()
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 89, in terminate
    set_term_handler(lambda *a: set_term_handler())
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 332, in set_term_handler
    signal(SIGTERM, hook)
ValueError: signal only works in main thread
[2014-01-28 19:03:39.469316] I [syncdutils(monitor):192:finalize] <top>: exiting.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Once this crash happens, the status stays stuck in the 'Initializing' state even though data is still being synced. This issue has been hit many times.
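The IOError in the first traceback above is a separate symptom: update_worker_status() opens the per-brick status file for writing and fails because the session directory or file has vanished underneath it. A hypothetical defensive variant is sketched below; the helper name and on-disk format are assumptions for illustration and do not reflect the actual gsyncd fix.

```python
# Hypothetical defensive status update (illustration only, not gsyncd code):
# tolerate the session directory/status file disappearing instead of raising
# IOError: [Errno 2] No such file or directory.
import os

def update_worker_status_safe(worker_status_file, key, value):
    status_dir = os.path.dirname(worker_status_file)
    if status_dir and not os.path.isdir(status_dir):
        # Recreate the session directory if something removed it underneath us.
        os.makedirs(status_dir)
    # Append a simple key/value record; the real on-disk format is not shown here.
    with open(worker_status_file, 'a') as f:
        f.write("%s=%s\n" % (key, value))

# Example usage with a throwaway path:
update_worker_status_safe("/tmp/geo-rep-demo/_bricks_demo_brick0.status",
                          "remote_node", "slave-node-1")
```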
When there is such a "signal only works in main thread" backtrace, the monitor has been killed, and the status stays stuck in whatever state it was in before the monitor died. In this case:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2014-02-12 16:42:03.223953] E [resource(/bricks/master_brick6):207:logerr] Popen: ssh> Write failed: Broken pipe
[2014-02-12 16:42:03.225075] I [syncdutils(/bricks/master_brick6):192:finalize] <top>: exiting.
[2014-02-12 16:42:03.238873] I [monitor(monitor):81:set_state] Monitor: new state: faulty
[2014-02-12 16:42:13.270226] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------
[2014-02-12 16:42:13.270837] I [monitor(monitor):130:monitor] Monitor: starting gsyncd worker
[2014-02-12 16:42:13.614228] I [gsyncd(/bricks/master_brick6):530:main_i] <top>: syncing: gluster://localhost:master -> ssh://root.43.174:gluster://localhost:slave
[2014-02-12 16:42:13.648677] E [syncdutils(/bricks/master_brick6):223:log_raise_exception] <top>: connection to peer is broken
[2014-02-12 16:42:13.649338] E [resource(/bricks/master_brick6):204:errlog] Popen: command "ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-CobNBj/061fc87d252b63093ab9bfb765588973.sock root.43.174 /nonexistent/gsyncd --session-owner f8513fdd-9281-49c6-a691-4f8634794875 -N --listen --timeout 120 gluster://localhost:slave" returned with 255, saying:
[2014-02-12 16:42:13.649745] E [resource(/bricks/master_brick6):207:logerr] Popen: ssh> ssh: connect to host 10.70.43.174 port 22: Connection refused
[2014-02-12 16:42:13.650429] I [syncdutils(/bricks/master_brick6):192:finalize] <top>: exiting.
[2014-02-12 16:42:23.664001] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------
[2014-02-12 16:42:23.664845] I [monitor(monitor):130:monitor] Monitor: starting gsyncd worker
[2014-02-12 16:42:24.51059] I [gsyncd(/bricks/master_brick6):530:main_i] <top>: syncing: gluster://localhost:master -> ssh://root.43.174:gluster://localhost:slave
[2014-02-12 16:42:55.674746] I [master(/bricks/master_brick10):451:crawlwrap] _GMaster: 0 crawls, 0 turns
[2014-02-12 16:42:56.537420] I [master(/bricks/master_brick2):451:crawlwrap] _GMaster: 0 crawls, 0 turns
[2014-02-12 16:43:23.715070] E [syncdutils(monitor):240:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 266, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 190, in wmon
    cpid, _ = self.monitor(w, argv, cpids)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 161, in monitor
    self.terminate()
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 89, in terminate
    set_term_handler(lambda *a: set_term_handler())
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 332, in set_term_handler
    signal(SIGTERM, hook)
ValueError: signal only works in main thread
[2014-02-12 16:43:23.716845] I [syncdutils(monitor):192:finalize] <top>: exiting.
[2014-02-12 16:43:25.311342] E [syncdutils(/bricks/master_brick6):223:log_raise_exception] <top>: connection to peer is broken
[2014-02-12 16:43:25.313623] E [resource(/bricks/master_brick6):204:errlog] Popen: command "ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-UYesaG/061fc87d252b63093ab9bfb765588973.sock root.43.174 /nonexistent/gsyncd --session-owner f8513fdd-9281-49c6-a691-4f8634794875 -N --listen --timeout 120 gluster://localhost:slave" returned with 1, saying:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Here, before the monitor died, the status was faulty, and it remained in that state. Even though only the worker for brick6 got disconnected and died, the status showed faulty for all the bricks.
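A generic way to avoid this class of crash is to keep all signal-handler manipulation in the main thread and have the supervising thread only flag the condition. The sketch below illustrates that pattern under that assumption; it is not the actual gsyncd monitor code and not necessarily how the shipped fix works.

```python
# Sketch: a supervising thread requests shutdown via an Event instead of
# calling signal() itself; only the main thread touches signal handlers.
# Illustration only; names and structure are assumptions, not gsyncd code.
import signal
import threading

stop_requested = threading.Event()

def wmon():
    # ... supervise the worker subprocess here ...
    # On a fatal worker condition, just flag the main thread.
    stop_requested.set()

def main():
    t = threading.Thread(target=wmon)
    t.start()
    t.join()
    if stop_requested.is_set():
        # Safe: signal handlers are only changed in the main thread.
        signal.signal(signal.SIGTERM, signal.SIG_DFL)
        print("main thread handled the shutdown request")

if __name__ == "__main__":
    main()
```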
Moving the bug to Verified, since I did not hit it in two beaker job runs. I ran the basic automated test in beaker where we used to hit this same issue. The test case just creates some files on the master, tries to sync them to the slave, and performs some CLI operations such as checkpoint on the master side. The syncing method is rsync itself. In earlier builds we saw this issue pretty much all the time, but after the fix, in the 60geo build, I have not seen it. And vkoppad, who had hit this issue in manual testing, also did not see it in 60geo. So, moved to Verified.

Setting flags required to add BZs to RHS 3.0 Errata.

*** Bug 1062159 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html

*** Bug 1002468 has been marked as a duplicate of this bug. ***

*** Bug 1064598 has been marked as a duplicate of this bug. ***