Bug 582313

Summary: stuck on sem_timedwait
Product: Fedora
Reporter: David Teigland <teigland>
Component: corosync
Assignee: Jan Friesse <jfriesse>
Status: CLOSED UPSTREAM
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: low
Version: rawhide
CC: agk, fdinitto, sdake
Hardware: All
OS: Linux
Doc Type: Bug Fix
Bug Blocks: 582326 (view as bug list)
Last Closed: 2010-04-26 16:18:02 UTC
Attachments:
  Proposed patch
  fenced patch using new cpg api

Description David Teigland 2010-04-14 15:48:08 UTC
Description of problem:


When corosync exits, my application (fenced) gets stuck.

# strace -p 2005                                                                
Process 2005 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = -1 ETIMEDOUT (Connection timed out)
poll([{fd=14, events=0}], 1, 0)         = 1 ([{fd=14, revents=POLLNVAL}])
gettimeofday({1271185487, 264}, NULL)   = 0
futex(0x7f0a66f5b028, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, {1271185489, 0}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
poll([{fd=14, events=0}], 1, 0)         = 1 ([{fd=14, revents=POLLNVAL}])
gettimeofday({1271185489, 198}, NULL)   = 0
futex(0x7f0a66f5b028, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, {1271185491, 0}, ffffffff) = -1 ETIMEDOUT (Connection timed out)

0x000000338d00d417 in sem_timedwait () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install cman-3.0.7-1.fc12.x86_64
(gdb) bt
#0  0x000000338d00d417 in sem_timedwait () from /lib64/libpthread.so.0          
#1  0x0000003713e02311 in reply_receive (ipc_instance=0x2379ed0,                
    res_msg=0x7ffff68c6a50, res_len=16) at coroipcc.c:476
#2  0x0000003713e02e7e in coroipcc_msg_send_reply_receive (                     
    handle=3265522690949120001, iov=0x7ffff68c6a80, iov_len=1,
    res_msg=0x7ffff68c6a50, res_len=16) at coroipcc.c:1045
#3  0x0000003713a01ed3 in cpg_finalize (handle=5902762718137417729)             
    at cpg.c:238
#4  0x0000000000403542 in close_cpg_daemon ()                                   
    at /root/stable3/fence/fenced/cpg.c:2311
#5  0x000000000040b26d in loop (argc=<value optimized out>,                     
    argv=<value optimized out>) at /root/stable3/fence/fenced/main.c:831
#6  main (argc=<value optimized out>, argv=<value optimized out>)               
    at /root/stable3/fence/fenced/main.c:1045

The corosync process has exited.  I'm using trunk.  I think this happens every time corosync is killed via cman.


Comment 1 David Teigland 2010-04-14 17:47:21 UTC
My cluster is busy with something else at the moment, but to reproduce this I'd first try:

service cman start on node1 and node2
cman_tool kill -n node1 from node2
check if fenced is stuck on node1

If that doesn't do it, I'd try:

service cman start on node1, node2, node3
create network partition: node1 | node2, node3
remove network partition
node2 or node3 should kill node1
check if fenced is stuck on node1

Comment 2 David Teigland 2010-04-14 19:41:34 UTC
Verified that cman_tool kill reproduces the problem (the only difference is that I have four nodes in my cluster).  I repeated the test twice; the problem reproduced on the second try.

Also note that I'm using the latest cpg patch adding the totem/ringid callbacks.

Comment 4 David Teigland 2010-04-15 15:46:05 UTC
Jan, good point, the fenced I'm using is updated to use the new cpg_model_initialize api.  I'll send you a patch with the fenced changes.

Comment 5 Jan Friesse 2010-04-16 11:38:17 UTC
Created attachment 407079 [details]
Proposed patch

The patch handles POLLNVAL, and the return value of poll() is now handled more carefully as well.

Comment 6 David Teigland 2010-04-16 15:42:00 UTC
Thanks, I'll try the patch.  Sorry I didn't get you the fenced patch I'm using; I was too busy debugging and forgot.

Comment 7 David Teigland 2010-04-16 23:13:29 UTC
Honza, using the patch, I've tried both tests above a couple of times and have not seen fenced get stuck.  I'll try a few more times next week and let you know.

Comment 8 David Teigland 2010-04-16 23:14:33 UTC
Created attachment 407208 [details]
fenced patch using new cpg api

Here's the fenced version that I was seeing troubles with in case you'd like to try it.

Comment 9 Jan Friesse 2010-04-19 14:07:39 UTC
Dave,
I was trying to reproduce the bug (WITHOUT the patch I sent and WITH the fenced patch you sent), unsuccessfully.  Are you using Fedora rawhide?  If so, it looks to me like an incompatibility in how poll works and what it returns on a newer kernel/glibc/etc.

Anyway, that part of coroipcc was not very well written, so the patch should be included in corosync.

Comment 10 David Teigland 2010-04-19 20:20:35 UTC
Tried this several more times using the patch and haven't seen the hang, so I suggest we call it a fix.

(Using F12 with recent devel kernel.)

Comment 11 Jan Friesse 2010-04-26 16:18:02 UTC
Patch is now included in upstream as svn revision 2789, so I'm closing this bug.