Bug 1155335
| Summary: | ceph mon_status hangs | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Pete Zaitcev <zaitcev> |
| Component: | ceph | Assignee: | Boris Ranto <branto> |
| Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 21 | CC: | bkabrda, branto, david, dmalcolm, fedora, harm, ivazqueznet, jonathansteffan, kkeithle, linuxkidd, mstuchli, ncoghlan, rkuska, steve.capper, steve, tomspur, tradej |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | ceph-0.80.7-3.fc21 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-01-26 02:31:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Pete Zaitcev
2014-10-21 22:51:36 UTC
The hang happens at the end, where sys.exit() is called. Something gums up the handling of SystemExit, evidently. I'm not 100% sure, but it looks as though the server fails to close the connection and the client depends on that. I have not looked at the code yet, but here are the straces.

Good:

```
8303 recvfrom(3, "\350\202o\177\0\0\0\0\201\336[\214\0\0\0\0\0\0\0\0\1", 21, MSG_DONTWAIT, NULL, NULL) = 21
8303 poll([{fd=3, events=POLLIN|0x2000}], 1, 900000 <unfinished ...>
8300 sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"\10", 1}, {"\f\0\0\0\0\0\0\0", 8}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL|MSG_MORE) = 9
8280 write(1, "{\"name\":\"simbelmyne\",\"rank\":0,\"s"..., 360) = 360
8303 <... poll resumed> ) = 1 ([{fd=3, revents=POLLIN|POLLHUP|0x2000}])
```

Bad:

```
29096 recvfrom(3, "\350\202o\177\0\0\0\0\316\373\4*\0\0\0\0\0\0\0\0\1", 21, MSG_DONTWAIT, NULL, NULL) = 21
29096 poll([{fd=3, events=POLLIN|0x2000}], 1, 900000 <unfinished ...>
29093 sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"\10", 1}, {"\f\0\0\0\0\0\0\0", 8}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL|MSG_MORE <unfinished ...>
29093 <... sendmsg resumed> ) = 9
29075 write(1, "{\"name\":\"kvm-ichi\",\"rank\":0,\"sta"..., 327) = 327
```

Both clients send the \f message, but in the hang case there is no POLLHUP in response.

On the other hand, if we make clients talk to other servers with ceph -m host, then the fault seems to lie with the client: a client on the "bad" machine talking to a server on the "good" machine hangs, while a client on the "good" machine talking to a server on the "bad" machine works okay.

Never mind, comment #2 was terribly misleading; something else is really going on. The problem is that we have a __del__ that invokes shutdown(). When the interpreter exits, it runs all the destructors, so we end up in shutdown() while inside sys.exit(). shutdown() then uses run_in_thread() to call the C function rados_shutdown(), and run_in_thread() hangs:

```python
class RadosThread(threading.Thread):
    def __init__(self, target, args=None):
        self.args = args
        self.target = target
        threading.Thread.__init__(self)

    def run(self):
        self.retval = self.target(*self.args)


def run_in_thread(target, args, timeout=0):
    t = RadosThread(target, args)
    t.start()  # <========= HANGS HERE
```

It hangs like this:

```
29075 clone(child_stack=0x7fbf56a5eff0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fbf56a5f9d0, tls=0x7fbf56a5f700, child_tidptr=0x7fbf56a5f9d0) = 29103
29075 futex(0x20e3880, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
29103 set_robust_list(0x7fbf56a5f9e0, 24) = 0
29103 madvise(0x7fbf5625f000, 8368128, MADV_DONTNEED) = 0
29103 _exit(0) = ?
```

Basically the threading module clones a new thread and waits on a futex, but the new thread exits immediately without doing anything instead of waking our futex. And so we hang.
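For illustration, here is a minimal, self-contained sketch of the same pattern. It is not the rados.py source; the Cluster class and the simplified run_in_thread() helper are hypothetical stand-ins. On a Python 2.7 build carrying the regression, the marked start() call never returns; on an unaffected interpreter the script simply exits (newer Pythons may report an ignored error from __del__ instead of hanging).

```python
# Minimal sketch (hypothetical names, not the real rados.py code) of the hang
# described above: a destructor calls a run_in_thread()-style helper while the
# interpreter is already tearing itself down.
import threading


def run_in_thread(target, args=()):
    # Same shape as the rados.py helper: run the call on its own thread.
    t = threading.Thread(target=target, args=args)
    t.start()  # on an affected Python 2.7 build this never returns
    t.join()
    return t


class Cluster(object):
    def shutdown(self):
        run_in_thread(lambda: None)

    def __del__(self):
        # Destructors fire during sys.exit()/interpreter teardown, so this
        # reaches Thread.start() after new threads can no longer bootstrap.
        self.shutdown()


cluster = Cluster()
raise SystemExit(0)  # teardown collects `cluster`: __del__ -> shutdown() -> start()
```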
Hi, this was reported upstream a while back [1] and it turned out to be a regression in Python [2]. Hence, I'm reassigning this to python. The fix currently seems to be to revert the commit that caused the regression, or to rebase Python once it is fixed there (see [2] for details).

[1] http://tracker.ceph.com/issues/8797
[2] http://bugs.python.org/issue21963

Hi, I've cherry-picked the following from https://hg.python.org/cpython:

```
changeset:   93526:4ceca79d1c63
branch:      2.7
parent:      93521:8bc29f5ebeff
user:        Antoine Pitrou <solipsis>
date:        Fri Nov 21 02:04:21 2014 +0100
summary:     Issue #21963: backout issue #1856 patch (avoid crashes and lockups when ...
```

and rebuilt the package (for AArch64 running Fedora 21). This fixed the issues I had experienced with Ceph. Could this please be either cherry-picked into the Python package, or could the base revision be updated to one after 93526:4ceca79d1c63?

Cheers,
-- Steve

This is fixed in Ceph code. Details at: http://tracker.ceph.com/issues/8797

I just manually applied the patch detailed in that tracker and have been able to completely deploy my Ceph cluster on Fedora 21. Shifting the component from python back to ceph. Two files/RPMs will need to be updated:

* ceph: /usr/bin/ceph
* python-rados: /usr/lib/python2.7/site-packages/rados.py

Without these changes, Ceph is practically useless on Fedora 21.

Thanks,
Michael

ceph-0.80.7-3.fc21 has been submitted as an update for Fedora 21.
https://admin.fedoraproject.org/updates/ceph-0.80.7-3.fc21

Package ceph-0.80.7-3.fc21:
* should fix your issue,
* was pushed to the Fedora 21 testing repository,
* should be available at your local mirror within two days.

Update it with:
# su -c 'yum update --enablerepo=updates-testing ceph-0.80.7-3.fc21'
as soon as you are able to.

Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2015-0723/ceph-0.80.7-3.fc21
then log in and leave karma (feedback).

ceph-0.80.7-3.fc21 has been pushed to the Fedora 21 stable repository. If problems still persist, please make note of it in this bug report.
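For callers hit by this before the updated packages land: the underlying pattern is python-rados relying on __del__ to call shutdown() during interpreter exit, so one caller-side workaround is to shut the handle down explicitly before exiting. A rough sketch follows, assuming a standard /etc/ceph/ceph.conf path; it uses only calls that exist in the python-rados binding (Rados, connect, get_fsid, shutdown) and is not the upstream patch itself.

```python
# Caller-side workaround sketch (not the upstream fix): shut librados down
# explicitly so rados_shutdown() runs via run_in_thread() while the
# interpreter is still fully alive, not from __del__ during sys.exit().
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')  # path is an example
try:
    cluster.connect()
    print(cluster.get_fsid())
finally:
    cluster.shutdown()  # explicit shutdown; nothing is left for __del__ to do
```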