Bug 314641 - CMAN dies after qdiskd calls cman_poll_quorum_device()
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: openais
Version: 5.1
Hardware: All
OS: Linux
Priority: high
Severity: high
Assigned To: Steven Dake
QA Contact: GFS Bugs
Keywords: Regression
Duplicates: 253836 431382
Depends On:
Blocks: 253836
Reported: 2007-10-01 15:56 EDT by Lon Hohberger
Modified: 2016-04-26 12:49 EDT (History)

See Also:
Fixed In Version: RHBA-2007-0599
Doc Type: Bug Fix
Last Closed: 2007-11-07 12:00:18 EST


Attachments
Basic debugging (thread backtrace from aisexec on all nodes, qdisk backtrace on all nodes but 1, cman_tool backtrace) (6.43 KB, application/octet-stream)
2007-10-01 15:56 EDT, Lon Hohberger

Description Lon Hohberger 2007-10-01 15:56:00 EDT
Description of problem:


Version-Release number of selected component (if applicable): 2.0.70-1.el5
How reproducible: ???
Actual results: CMAN hangs.  At this point, cman_tool stops working, and all
calls to libcman block.  For example:

(gdb) thr a a bt

Thread 2 (Thread -1208284272 (LWP 2006)):
#0  0x0080f402 in __kernel_vsyscall ()
#1  0x0093f846 in nanosleep () from /lib/libc.so.6
#2  0x0093f66f in sleep () from /lib/libc.so.6
#3  0x080503b1 in score_thread_main (arg=0x8b9d238) at score.c:371
#4  0x00a2743b in start_thread () from /lib/libpthread.so.0
#5  0x0097efde in clone () from /lib/libc.so.6

Thread 1 (Thread -1208281408 (LWP 2005)):
#0  0x0080f402 in __kernel_vsyscall ()
#1  0x00a2e118 in recv () from /lib/libpthread.so.0
#2  0x080523d7 in cman_dispatch (handle=0x8b9d008, flags=26) at libcman.c:522
#3  0x08051511 in wait_for_reply (h=0x8b9d008, msg=0x8ba7208, max_len=6784)
    at libcman.c:80
#4  0x08051b2a in info_call (h=0x8b9d008, msgtype=7, inbuf=0x0, inlen=0, 
    outbuf=0x8ba7208, outlen=6784) at libcman.c:287
#5  0x080526fa in cman_get_nodes (handle=0x8b9d008, maxnodes=16, 
    retnodes=0xbfe9f1f0, nodes=0xbfe9f1f4) at libcman.c:606
#6  0x0804cf60 in check_cman (ctx=0xbfea1084, mask=0xbfea05c8 "\001", 
    master_mask=0xbfea05c0 "\001") at main.c:551
#7  0x0804e27a in quorum_loop (ctx=0xbfea1084, ni=0xbfea0804, max=16)
    at main.c:1020
#8  0x0804fa2d in main (argc=2, argv=0xbfea14b4) at main.c:1536
#9  0x008c6dec in __libc_start_main () from /lib/libc.so.6
#10 0x08049591 in _start ()
#0  0x0080f402 in __kernel_vsyscall ()
(gdb) 

Expected results: Calls to libcman / cman_tool should never get stuck this way.

Additional info: openais-0.80.3-5.el5
Comment 1 Lon Hohberger 2007-10-01 15:56:00 EDT
Created attachment 212761
Basic debugging (thread backtrace from aisexec on all nodes, qdisk backtrace on all nodes but 1, cman_tool backtrace)
Comment 2 Christine Caulfield 2007-10-02 09:47:15 EDT
It looks to me like aisexec is deadlocking because it doesn't like me calling
openais_timer_add_duration() whilst in a timer callback function.

Is there another way of doing a repeating timer?
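
For illustration only: one common way to get a repeating timer is to re-arm a
one-shot timer from inside its own callback. The sketch below is not openais
code. Every name in it (timer_mutex, timer_add, timer_add_locked, expire_once,
repeating_cb) is hypothetical, and it assumes the expiry loop holds a
non-recursive lock while callbacks run. Under that assumption, a callback that
takes the lock a second time would block forever on a default mutex; the sketch
uses an error-checking mutex so the second attempt returns EDEADLK instead of
hanging.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical timer state; none of these names come from openais. */
static pthread_mutex_t timer_mutex;
static void (*pending_cb)(void);  /* the single pending one-shot timer */
static int ticks;                 /* how many times the demo callback ran */
static int relock_result;         /* result of locking again inside a callback */

static void timer_init(void)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    assert(rc == 0);
    /* Error-checking type: a relock by the owner fails with EDEADLK. */
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    assert(rc == 0);
    rc = pthread_mutex_init(&timer_mutex, &attr);
    assert(rc == 0);
    (void)rc;
    pthread_mutexattr_destroy(&attr);
    pending_cb = NULL;
    ticks = 0;
    relock_result = 0;
}

/* Arm the timer. The caller must already hold timer_mutex. */
static void timer_add_locked(void (*cb)(void))
{
    pending_cb = cb;
}

/* Arm the timer from ordinary code that does not hold timer_mutex. */
static void timer_add(void (*cb)(void))
{
    pthread_mutex_lock(&timer_mutex);
    timer_add_locked(cb);
    pthread_mutex_unlock(&timer_mutex);
}

/* Demo callback: records what a second lock attempt does, then re-arms. */
static void repeating_cb(void)
{
    ticks++;
    relock_result = pthread_mutex_lock(&timer_mutex);
    if (relock_result == 0)
        pthread_mutex_unlock(&timer_mutex);
    timer_add_locked(repeating_cb);  /* re-arm without touching the lock */
}

/* One pass of the expiry loop: runs the pending callback with the lock held. */
static void expire_once(void)
{
    pthread_mutex_lock(&timer_mutex);
    void (*cb)(void) = pending_cb;
    pending_cb = NULL;
    if (cb != NULL)
        cb();
    pthread_mutex_unlock(&timer_mutex);
}
```

Calling timer_init(), arming once with timer_add(repeating_cb), and then
running expire_once() three times leaves ticks at 3 with the callback still
pending, while relock_result records the EDEADLK that a plain lock inside the
callback would have turned into a hang.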
Comment 3 Christine Caulfield 2007-10-02 10:31:14 EDT
ahhh, there seems to be a bug :-)

According to the man page:
 The pthread_equal() function shall return a non-zero value if t1 and t2
 are equal; otherwise, zero shall be returned.

So I think we need to make this change in openais_timer_add_absolute() and
openais_timer_add_duration():

-       if (pthread_equal (pthread_self(), expiry_thread) == 0) {
+       if (pthread_equal (pthread_self(), expiry_thread) != 0) {
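
A minimal sketch of the return-value convention quoted from the man page, for
readers who want to check it in isolation. It is not the openais source: the
helper names (mark_expiry_thread, in_expiry_thread, probe_from_other_thread)
are hypothetical, and only pthread_self(), pthread_equal(), pthread_create()
and pthread_join() are real library calls. The point is that "same thread"
must be tested with != 0, because pthread_equal() returns non-zero on a match.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the thread that runs timer callbacks. */
static pthread_t expiry_thread;

/* Record the calling thread as the expiry thread. */
static void mark_expiry_thread(void)
{
    expiry_thread = pthread_self();
}

/* Returns 1 when the caller is the expiry thread, 0 otherwise.
 * pthread_equal() returns non-zero for equal IDs, hence "!= 0". */
static int in_expiry_thread(void)
{
    return pthread_equal(pthread_self(), expiry_thread) != 0;
}

/* Thread body: stores the answer seen from a different thread. */
static void *probe(void *out)
{
    *(int *)out = in_expiry_thread();
    return NULL;
}

/* Runs in_expiry_thread() on a freshly created thread and returns its answer. */
static int probe_from_other_thread(void)
{
    pthread_t t;
    int result = -1;
    int rc = pthread_create(&t, NULL, probe, &result);
    assert(rc == 0);
    rc = pthread_join(t, NULL);
    assert(rc == 0);
    (void)rc;
    return result;
}
```

After mark_expiry_thread(), in_expiry_thread() returns 1 on the marking thread
and probe_from_other_thread() returns 0. With the comparison written as == 0,
both answers would be inverted, which is the mistake the diff above corrects.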
Comment 5 Lon Hohberger 2007-10-02 13:21:27 EDT
[root@tng3-5 ~]# clustat
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  tng3-1                                1 Online
  tng3-2                                2 Online
  tng3-3                                3 Online
  tng3-5                                5 Online, Local
  /dev/sdd1                             0 Online, Quorum Disk

[root@tng3-5 ~]# rpm -q openais
openais-0.80.3-6.el5


Preliminary tests pass.
Comment 6 Lon Hohberger 2007-10-02 13:39:01 EDT
[root@tng3-5 ~]# cman_tool status
Version: 6.0.1
Config Version: 9
Cluster Name: tng3-cluster
Cluster Id: 41908
Cluster Member: Yes
Cluster Generation: 2448
Membership state: Cluster-Member
Nodes: 4
Expected votes: 4
Total votes: 7
Quorum: 4  
Active subsystems: 9
Flags: 
Ports Bound: 0 11 177  
Node name: tng3-5
Node ID: 5
Multicast addresses: 239.192.163.88 
Node addresses: 10.15.89.178 
[root@tng3-5 ~]# cman_tool members
cman_tool: unknown option members
[root@tng3-5 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   0   M      0   2007-10-02 12:15:25  /dev/sdd1
   1   M   2448   2007-10-02 12:15:11  tng3-1
   2   M   2448   2007-10-02 12:15:11  tng3-2
   3   M   2448   2007-10-02 12:15:11  tng3-3
   5   M   2444   2007-10-02 12:15:11  tng3-5

going to do some functional tests.
Comment 12 errata-xmlrpc 2007-11-07 12:00:18 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0599.html
Comment 13 Lon Hohberger 2007-11-16 08:50:56 EST
*** Bug 253836 has been marked as a duplicate of this bug. ***
Comment 14 Lon Hohberger 2008-02-07 09:04:56 EST
*** Bug 431382 has been marked as a duplicate of this bug. ***
