Bug 314641

Summary: CMAN dies after qdiskd calls cman_poll_quorum_device()
Product: Red Hat Enterprise Linux 5 Reporter: Lon Hohberger <lhh>
Component: openais Assignee: Steven Dake <sdake>
Status: CLOSED ERRATA QA Contact: GFS Bugs <gfs-bugs>
Severity: high Docs Contact:
Priority: high    
Version: 5.1 CC: cluster-maint, gustavo.prada, h.plankl, pkennedy, rkenna, sdake
Target Milestone: --- Keywords: Regression
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: RHBA-2007-0599 Doc Type: Bug Fix
Last Closed: 2007-11-07 17:00:18 UTC Type: ---
Bug Blocks: 253836    
Attachments:
Description Flags
Basic debugging (thread backtrace from aisexec on all nodes, qdisk backtrace on all nodes but 1, cman_tool backtrace) none

Description Lon Hohberger 2007-10-01 19:56:00 UTC
Description of problem:


Version-Release number of selected component (if applicable): 2.0.70-1.el5
How reproducible: ???
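Actual results: CMAN hangs.  At this point, cman_tool stops working, and all
calls to libcman block.  For example, this backtrace (apparently from qdiskd,
judging by the frames) shows the main thread stuck in recv() underneath
cman_get_nodes():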

(gdb) thr a a bt

Thread 2 (Thread -1208284272 (LWP 2006)):
#0  0x0080f402 in __kernel_vsyscall ()
#1  0x0093f846 in nanosleep () from /lib/libc.so.6
#2  0x0093f66f in sleep () from /lib/libc.so.6
#3  0x080503b1 in score_thread_main (arg=0x8b9d238) at score.c:371
#4  0x00a2743b in start_thread () from /lib/libpthread.so.0
#5  0x0097efde in clone () from /lib/libc.so.6

Thread 1 (Thread -1208281408 (LWP 2005)):
#0  0x0080f402 in __kernel_vsyscall ()
#1  0x00a2e118 in recv () from /lib/libpthread.so.0
#2  0x080523d7 in cman_dispatch (handle=0x8b9d008, flags=26) at libcman.c:522
#3  0x08051511 in wait_for_reply (h=0x8b9d008, msg=0x8ba7208, max_len=6784)
    at libcman.c:80
#4  0x08051b2a in info_call (h=0x8b9d008, msgtype=7, inbuf=0x0, inlen=0, 
    outbuf=0x8ba7208, outlen=6784) at libcman.c:287
#5  0x080526fa in cman_get_nodes (handle=0x8b9d008, maxnodes=16, 
    retnodes=0xbfe9f1f0, nodes=0xbfe9f1f4) at libcman.c:606
#6  0x0804cf60 in check_cman (ctx=0xbfea1084, mask=0xbfea05c8 "\001", 
    master_mask=0xbfea05c0 "\001") at main.c:551
#7  0x0804e27a in quorum_loop (ctx=0xbfea1084, ni=0xbfea0804, max=16)
    at main.c:1020
#8  0x0804fa2d in main (argc=2, argv=0xbfea14b4) at main.c:1536
#9  0x008c6dec in __libc_start_main () from /lib/libc.so.6
#10 0x08049591 in _start ()
#0  0x0080f402 in __kernel_vsyscall ()
(gdb) 

Expected results: Calls to libcman / cman_tool should never get stuck this way.

Additional info: openais-0.80.3-5.el5

Comment 1 Lon Hohberger 2007-10-01 19:56:00 UTC
Created attachment 212761 [details]
Basic debugging (thread backtrace from aisexec on all nodes, qdisk backtrace on all nodes but 1, cman_tool backtrace)

Comment 2 Christine Caulfield 2007-10-02 13:47:15 UTC
It looks to me like aisexec is deadlocking because it doesn't like me calling
openais_timer_add_duration() whilst in a timer callback function.

Is there another way of doing a repeating timer?

Comment 3 Christine Caulfield 2007-10-02 14:31:14 UTC
ahhh, there seems to be a bug :-)

According to the man page:
 The pthread_equal() function shall return a non-zero value if t1 and t2
 are equal; otherwise, zero shall be returned.

So the existing test is inverted, and I think we need to make this change in
openais_timer_add_absolute() and openais_timer_add_duration():

-       if (pthread_equal (pthread_self(), expiry_thread) == 0) {
+       if (pthread_equal (pthread_self(), expiry_thread) != 0) {
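
To illustrate why the sense of that comparison matters, here is a minimal C sketch. It is not the openais source. It assumes (this is not confirmed in this report) that the timer expiry thread holds a mutex while it runs callbacks, and that the add function is meant to skip locking when it is called from that thread. The names timer_mutex, timer_add() and rearm_from_callback() are hypothetical. An error-checking mutex is used so that the self-deadlock is reported as EDEADLK instead of hanging the way aisexec did.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-ins for the timer module's state (not openais code). */
static pthread_mutex_t timer_mutex;
static pthread_t expiry_thread;

struct result { int inverted; int rc; };

/* Take the lock unless the caller is the expiry thread, which is assumed to
 * hold it already. `inverted` selects the buggy `== 0` form of the test. */
static int timer_add(int inverted)
{
    /* pthread_equal() returns non-zero when the two IDs are equal. */
    int same = pthread_equal(pthread_self(), expiry_thread);
    int skip_lock = inverted ? (same == 0) : (same != 0);

    if (!skip_lock) {
        int rc = pthread_mutex_lock(&timer_mutex);
        if (rc != 0)
            return rc;              /* EDEADLK on an error-checking mutex */
        /* ... the timer would be queued here ... */
        pthread_mutex_unlock(&timer_mutex);
    }
    return 0;
}

/* Simulates a timer callback that re-arms itself on the expiry thread. */
static void *expiry_main(void *arg)
{
    struct result *res = arg;

    expiry_thread = pthread_self();
    pthread_mutex_lock(&timer_mutex);    /* held while callbacks run */
    res->rc = timer_add(res->inverted);  /* the "callback" re-arms */
    pthread_mutex_unlock(&timer_mutex);
    return NULL;
}

/* Returns 0 if the re-arm succeeded, or the errno-style code if it failed. */
int rearm_from_callback(int inverted)
{
    struct result res = { inverted, -1 };
    pthread_mutexattr_t attr;
    pthread_t tid;
    int rc;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    rc = pthread_mutex_init(&timer_mutex, &attr);
    assert(rc == 0);
    pthread_mutexattr_destroy(&attr);

    rc = pthread_create(&tid, NULL, expiry_main, &res);
    assert(rc == 0);
    pthread_join(tid, NULL);
    pthread_mutex_destroy(&timer_mutex);
    return res.rc;
}
```

Under those assumptions, rearm_from_callback(0) (the `!= 0` form) skips the lock and succeeds, while rearm_from_callback(1) (the `== 0` form) tries to take a mutex the thread already holds. With a default, non-error-checking mutex that second case would typically block forever, which is consistent with the hang described above.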


Comment 5 Lon Hohberger 2007-10-02 17:21:27 UTC
[root@tng3-5 ~]# clustat
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  tng3-1                                1 Online
  tng3-2                                2 Online
  tng3-3                                3 Online
  tng3-5                                5 Online, Local
  /dev/sdd1                             0 Online, Quorum Disk

[root@tng3-5 ~]# rpm -q openais
openais-0.80.3-6.el5


Preliminary tests pass.

Comment 6 Lon Hohberger 2007-10-02 17:39:01 UTC
[root@tng3-5 ~]# cman_tool status
Version: 6.0.1
Config Version: 9
Cluster Name: tng3-cluster
Cluster Id: 41908
Cluster Member: Yes
Cluster Generation: 2448
Membership state: Cluster-Member
Nodes: 4
Expected votes: 4
Total votes: 7
Quorum: 4  
Active subsystems: 9
Flags: 
Ports Bound: 0 11 177  
Node name: tng3-5
Node ID: 5
Multicast addresses: 239.192.163.88 
Node addresses: 10.15.89.178 
[root@tng3-5 ~]# cman_tool members
cman_tool: unknown option members
[root@tng3-5 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   0   M      0   2007-10-02 12:15:25  /dev/sdd1
   1   M   2448   2007-10-02 12:15:11  tng3-1
   2   M   2448   2007-10-02 12:15:11  tng3-2
   3   M   2448   2007-10-02 12:15:11  tng3-3
   5   M   2444   2007-10-02 12:15:11  tng3-5

going to do some functional tests.

Comment 12 errata-xmlrpc 2007-11-07 17:00:18 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0599.html


Comment 13 Lon Hohberger 2007-11-16 13:50:56 UTC
*** Bug 253836 has been marked as a duplicate of this bug. ***

Comment 14 Lon Hohberger 2008-02-07 14:04:56 UTC
*** Bug 431382 has been marked as a duplicate of this bug. ***