Bug 314641
| Summary: | CMAN dies after qdiskd calls cman_poll_quorum_device() | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Lon Hohberger <lhh> |
| Component: | openais | Assignee: | Steven Dake <sdake> |
| Status: | CLOSED ERRATA | QA Contact: | GFS Bugs <gfs-bugs> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 5.1 | CC: | cluster-maint, gustavo.prada, h.plankl, pkennedy, rkenna, sdake |
| Target Milestone: | --- | Keywords: | Regression |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | RHBA-2007-0599 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2007-11-07 17:00:18 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 253836 | | |
| Attachments: |
Created attachment 212761 [details]
Basic debugging (thread backtrace from aisexec on all nodes, qdisk backtrace on all nodes but 1, cman_tool backtrace)
It looks to me like aisexec is deadlocking because it doesn't like me calling openais_timer_add_duration() whilst in a timer callback function. Is there another way of doing a repeating timer?

Ahhh, there seems to be a bug :-)
According to the man page:
The pthread_equal() function shall return a non-zero value if t1 and t2
are equal; otherwise, zero shall be returned.
So I think we need to do this in openais_timer_add_absolute() and
openais_timer_add_duration()
- if (pthread_equal (pthread_self(), expiry_thread) == 0) {
+ if (pthread_equal (pthread_self(), expiry_thread) != 0) {
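
The return-value convention quoted from the man page is easy to get backwards, so here is a minimal sketch that exercises it in isolation. It is not the openais code: `expiry_thread` only borrows the name from the patch above and is a plain global here, and `called_from_expiry_thread()` and `check_from_other_thread()` are hypothetical helpers invented for this illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Stand-in for the ID of the timer expiry thread (illustrative global,
 * not the openais variable itself). */
static pthread_t expiry_thread;

/* Returns 1 when the calling thread is the expiry thread, 0 otherwise.
 * pthread_equal() returns non-zero when the two IDs are EQUAL, so
 * "!= 0" means "same thread" and "== 0" means "different thread". */
static int called_from_expiry_thread(void)
{
    return pthread_equal(pthread_self(), expiry_thread) != 0;
}

/* Thread body: record what the check reports from a different thread. */
static void *other_thread_main(void *arg)
{
    int *result = arg;
    *result = called_from_expiry_thread();
    return NULL;
}

/* Runs the check once from a freshly created thread and returns
 * the value that thread observed. */
static int check_from_other_thread(void)
{
    pthread_t tid;
    int result = -1;
    int rc;

    rc = pthread_create(&tid, NULL, other_thread_main, &result);
    assert(rc == 0);
    rc = pthread_join(tid, NULL);
    assert(rc == 0);
    (void)rc; /* silence unused warning when NDEBUG is defined */

    return result;
}
```

If a caller first sets `expiry_thread = pthread_self();`, then `called_from_expiry_thread()` reports 1 on that thread and `check_from_other_thread()` reports 0. Swapping `!= 0` for `== 0` inverts both answers, which is the kind of slip the one-character patch above addresses.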
[root@tng3-5 ~]# clustat
Member Status: Quorate

  Member Name    ID   Status
  ------ ----    ---- ------
  tng3-1         1    Online
  tng3-2         2    Online
  tng3-3         3    Online
  tng3-5         5    Online, Local
  /dev/sdd1      0    Online, Quorum Disk

[root@tng3-5 ~]# rpm -q openais
openais-0.80.3-6.el5

Preliminary tests pass.

[root@tng3-5 ~]# cman_tool status
Version: 6.0.1
Config Version: 9
Cluster Name: tng3-cluster
Cluster Id: 41908
Cluster Member: Yes
Cluster Generation: 2448
Membership state: Cluster-Member
Nodes: 4
Expected votes: 4
Total votes: 7
Quorum: 4
Active subsystems: 9
Flags:
Ports Bound: 0 11 177
Node name: tng3-5
Node ID: 5
Multicast addresses: 239.192.163.88
Node addresses: 10.15.89.178

[root@tng3-5 ~]# cman_tool members
cman_tool: unknown option members

[root@tng3-5 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   0   M      0   2007-10-02 12:15:25  /dev/sdd1
   1   M   2448   2007-10-02 12:15:11  tng3-1
   2   M   2448   2007-10-02 12:15:11  tng3-2
   3   M   2448   2007-10-02 12:15:11  tng3-3
   5   M   2444   2007-10-02 12:15:11  tng3-5

Going to do some functional tests.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHBA-2007-0599.html

*** Bug 253836 has been marked as a duplicate of this bug. ***

*** Bug 431382 has been marked as a duplicate of this bug. ***
Description of problem:

Version-Release number of selected component (if applicable):
2.0.70-1.el5

How reproducible:
???

Actual results:
CMAN hangs. At this point, cman_tool stops working, and all calls to libcman block. For example:

(gdb) thr a a bt

Thread 2 (Thread -1208284272 (LWP 2006)):
#0  0x0080f402 in __kernel_vsyscall ()
#1  0x0093f846 in nanosleep () from /lib/libc.so.6
#2  0x0093f66f in sleep () from /lib/libc.so.6
#3  0x080503b1 in score_thread_main (arg=0x8b9d238) at score.c:371
#4  0x00a2743b in start_thread () from /lib/libpthread.so.0
#5  0x0097efde in clone () from /lib/libc.so.6

Thread 1 (Thread -1208281408 (LWP 2005)):
#0  0x0080f402 in __kernel_vsyscall ()
#1  0x00a2e118 in recv () from /lib/libpthread.so.0
#2  0x080523d7 in cman_dispatch (handle=0x8b9d008, flags=26) at libcman.c:522
#3  0x08051511 in wait_for_reply (h=0x8b9d008, msg=0x8ba7208, max_len=6784) at libcman.c:80
#4  0x08051b2a in info_call (h=0x8b9d008, msgtype=7, inbuf=0x0, inlen=0, outbuf=0x8ba7208, outlen=6784) at libcman.c:287
#5  0x080526fa in cman_get_nodes (handle=0x8b9d008, maxnodes=16, retnodes=0xbfe9f1f0, nodes=0xbfe9f1f4) at libcman.c:606
#6  0x0804cf60 in check_cman (ctx=0xbfea1084, mask=0xbfea05c8 "\001", master_mask=0xbfea05c0 "\001") at main.c:551
#7  0x0804e27a in quorum_loop (ctx=0xbfea1084, ni=0xbfea0804, max=16) at main.c:1020
#8  0x0804fa2d in main (argc=2, argv=0xbfea14b4) at main.c:1536
#9  0x008c6dec in __libc_start_main () from /lib/libc.so.6
#10 0x08049591 in _start ()
#0  0x0080f402 in __kernel_vsyscall ()
(gdb)

Expected results:
Calls to libcman / cman_tool should never get stuck this way.

Additional info:
openais-0.80.3-5.el5