Bug 582313 - stuck on sem_timedwait
Summary: stuck on sem_timedwait
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Fedora
Classification: Fedora
Component: corosync
Version: rawhide
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Jan Friesse
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: 582326
 
Reported: 2010-04-14 15:48 UTC by David Teigland
Modified: 2010-04-26 16:18 UTC
CC: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 582326
Environment:
Last Closed: 2010-04-26 16:18:02 UTC
Type: ---
Embargoed:


Attachments
Proposed patch (592 bytes, patch): 2010-04-16 11:38 UTC, Jan Friesse
fenced patch using new cpg api (9.72 KB, text/plain): 2010-04-16 23:14 UTC, David Teigland

Description David Teigland 2010-04-14 15:48:08 UTC
Description of problem:


When corosync exits, my application (fenced) gets stuck.

# strace -p 2005                                                                
Process 2005 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = -1 ETIMEDOUT (Connection timed out)
poll([{fd=14, events=0}], 1, 0)         = 1 ([{fd=14, revents=POLLNVAL}])
gettimeofday({1271185487, 264}, NULL)   = 0
futex(0x7f0a66f5b028, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, {1271185489, 0}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
poll([{fd=14, events=0}], 1, 0)         = 1 ([{fd=14, revents=POLLNVAL}])
gettimeofday({1271185489, 198}, NULL)   = 0
futex(0x7f0a66f5b028, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, {1271185491, 0}, ffffffff) = -1 ETIMEDOUT (Connection timed out)

0x000000338d00d417 in sem_timedwait () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install cman-3.0.7-1.fc12.x86_64
(gdb) bt
#0  0x000000338d00d417 in sem_timedwait () from /lib64/libpthread.so.0          
#1  0x0000003713e02311 in reply_receive (ipc_instance=0x2379ed0,                
    res_msg=0x7ffff68c6a50, res_len=16) at coroipcc.c:476
#2  0x0000003713e02e7e in coroipcc_msg_send_reply_receive (                     
    handle=3265522690949120001, iov=0x7ffff68c6a80, iov_len=1,
    res_msg=0x7ffff68c6a50, res_len=16) at coroipcc.c:1045
#3  0x0000003713a01ed3 in cpg_finalize (handle=5902762718137417729)             
    at cpg.c:238
#4  0x0000000000403542 in close_cpg_daemon ()                                   
    at /root/stable3/fence/fenced/cpg.c:2311
#5  0x000000000040b26d in loop (argc=<value optimized out>,                     
    argv=<value optimized out>) at /root/stable3/fence/fenced/main.c:831
#6  main (argc=<value optimized out>, argv=<value optimized out>)               
    at /root/stable3/fence/fenced/main.c:1045

The corosync process has exited. I'm using trunk. I think this happens every time corosync is killed via cman.
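
For context on the strace output: revents=POLLNVAL means the polled descriptor is no longer a valid open fd, which fits corosync having exited and taken the IPC connection with it. A tiny standalone demo of that poll() behavior (illustration only, not from the corosync sources):

/* Demo: polling a descriptor that has been closed yields POLLNVAL,
 * which is the pattern repeating in the strace above. */
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = dup(STDIN_FILENO);   /* any valid fd */
    close(fd);                    /* ...now stale, like the dead IPC fd */

    struct pollfd pfd = { .fd = fd, .events = 0, .revents = 0 };
    int rc = poll(&pfd, 1, 0);

    printf("poll() = %d, POLLNVAL set: %s\n",
           rc, (pfd.revents & POLLNVAL) ? "yes" : "no");
    return 0;
}

The library keeps seeing that condition but goes back to sem_timedwait() anyway, so the cpg_finalize() call never returns.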

Comment 1 David Teigland 2010-04-14 17:47:21 UTC
My cluster is busy with something else at the moment, but to reproduce this I'd first try:

1. service cman start on node1 and node2
2. cman_tool kill -n node1 from node2
3. check if fenced is stuck on node1

If that doesn't do it, I'd try:

1. service cman start on node1, node2, and node3
2. create a network partition: node1 | node2, node3
3. remove the network partition
4. node2 or node3 should kill node1
5. check if fenced is stuck on node1

Comment 2 David Teigland 2010-04-14 19:41:34 UTC
I verified that cman_tool kill reproduces the problem (the only difference is that I have four nodes in my cluster). I repeated the test twice; the problem reproduced on the second try.

Also note that I'm using the latest cpg patch, which adds the totem/ringid callbacks.

Comment 4 David Teigland 2010-04-15 15:46:05 UTC
Jan, good point; the fenced I'm using has been updated to use the new cpg_model_initialize API. I'll send you a patch with the fenced changes.
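
For anyone reading along who hasn't used the newer interface: a rough sketch of registering the totem/ringid callback through cpg_model_initialize(). The names and signatures below are taken from the cpg.h that later shipped with released corosync; the trunk API at the time of this report may have differed slightly, and the group name is made up.

/* Rough sketch of the newer CPG model API: register deliver, confchg and
 * totem/ringid callbacks via cpg_model_initialize().
 * Build roughly with: cc demo.c -lcpg */
#include <stdio.h>
#include <string.h>
#include <corosync/cpg.h>

static void deliver_cb(cpg_handle_t handle, const struct cpg_name *group,
                       uint32_t nodeid, uint32_t pid, void *msg, size_t msg_len)
{
    printf("message from nodeid %u pid %u (%zu bytes)\n", nodeid, pid, msg_len);
}

static void confchg_cb(cpg_handle_t handle, const struct cpg_name *group,
                       const struct cpg_address *members, size_t n_members,
                       const struct cpg_address *left, size_t n_left,
                       const struct cpg_address *joined, size_t n_joined)
{
    printf("confchg: %zu members, %zu left, %zu joined\n",
           n_members, n_left, n_joined);
}

/* The callback the fenced patch is interested in: reports the totem ring
 * id and node list whenever the ring changes. */
static void totem_confchg_cb(cpg_handle_t handle, struct cpg_ring_id ring_id,
                             uint32_t n_members, const uint32_t *members)
{
    printf("totem confchg: ring %u:%llu, %u members\n",
           ring_id.nodeid, (unsigned long long)ring_id.seq, n_members);
}

int main(void)
{
    cpg_model_v1_data_t model_data;
    cpg_handle_t handle;
    struct cpg_name group;
    cs_error_t err;

    memset(&model_data, 0, sizeof(model_data));
    model_data.cpg_deliver_fn = deliver_cb;
    model_data.cpg_confchg_fn = confchg_cb;
    model_data.cpg_totem_confchg_fn = totem_confchg_cb;

    err = cpg_model_initialize(&handle, CPG_MODEL_V1,
                               (cpg_model_data_t *)&model_data, NULL);
    if (err != CS_OK) {
        fprintf(stderr, "cpg_model_initialize failed: %d\n", err);
        return 1;
    }

    strcpy(group.value, "demo_group");       /* made-up group name */
    group.length = strlen(group.value);
    cpg_join(handle, &group);
    /* ... cpg_dispatch() loop would go here ... */
    cpg_finalize(handle);
    return 0;
}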

Comment 5 Jan Friesse 2010-04-16 11:38:17 UTC
Created attachment 407079 [details]
Proposed patch

This patch handles POLLNVAL. The return value of poll() is also handled more carefully now.
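
Not the actual patch above, but a minimal sketch of the kind of handling it describes, with a made-up wait_for_reply() helper: treat a poll() failure or POLLNVAL/POLLERR/POLLHUP on the IPC fd as fatal instead of looping back into sem_timedwait():

/* Illustrative only (not the coroipcc source): wait up to 2 seconds at a
 * time for the reply semaphore, but give up if the IPC fd has gone bad.
 * Build roughly with: cc -pthread demo.c */
#include <errno.h>
#include <poll.h>
#include <semaphore.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static int wait_for_reply(int ipc_fd, sem_t *reply_sem)
{
    for (;;) {
        struct pollfd pfd = { .fd = ipc_fd, .events = 0, .revents = 0 };

        /* events=0: only error conditions are reported, matching the
         * poll() calls visible in the strace in the description. */
        int rc = poll(&pfd, 1, 0);
        if (rc < 0 && errno != EINTR)
            return -errno;

        /* This is the check that matters: without it, an exited corosync
         * leaves the fd invalid, poll() keeps reporting POLLNVAL, and the
         * sem_timedwait() below is retried forever. */
        if (rc > 0 && (pfd.revents & (POLLERR | POLLHUP | POLLNVAL)))
            return -ECONNRESET;

        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);
        ts.tv_sec += 2;

        if (sem_timedwait(reply_sem, &ts) == 0)
            return 0;                     /* reply arrived */
        if (errno != ETIMEDOUT && errno != EINTR)
            return -errno;
        /* timed out: go around and check the fd again */
    }
}

int main(void)
{
    sem_t sem;
    sem_init(&sem, 0, 0);

    /* Create and immediately close an fd so poll() reports POLLNVAL,
     * imitating the connection to an exited corosync. */
    int fd = dup(STDIN_FILENO);
    close(fd);

    printf("wait_for_reply: %d\n", wait_for_reply(fd, &sem));
    sem_destroy(&sem);
    return 0;
}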

Comment 6 David Teigland 2010-04-16 15:42:00 UTC
Thanks, I'll try the patch. Sorry I didn't get you the fenced patch I'm using; I was too busy debugging and forgot.

Comment 7 David Teigland 2010-04-16 23:13:29 UTC
Honza, using the patch, I've tried both tests above a couple of times and have not seen fenced get stuck. I'll try a few more times next week and let you know.

Comment 8 David Teigland 2010-04-16 23:14:33 UTC
Created attachment 407208 [details]
fenced patch using new cpg api

Here's the fenced version that I was seeing trouble with, in case you'd like to try it.

Comment 9 Jan Friesse 2010-04-19 14:07:39 UTC
Dave,
I tried to reproduce the bug (without the patch I sent and WITH the fenced patch you sent), but could not. Are you using Fedora rawhide? If so, this looks to me like an incompatibility in how poll() behaves and what it returns in a newer kernel?/glibc?/???.

In any case, that coroipcc code was not very well written, so the patch should be included in corosync.

Comment 10 David Teigland 2010-04-19 20:20:35 UTC
Tried this several more times using the patch and haven't seen the hang, so I suggest we call it a fix.

(Using F12 with recent devel kernel.)

Comment 11 Jan Friesse 2010-04-26 16:18:02 UTC
The patch is now included upstream as svn revision 2789, so I'm closing this bug.

