Bug 582313

Summary: stuck on sem_timedwait
Product: Fedora
Reporter: David Teigland <teigland>
Component: corosync
Assignee: Jan Friesse <jfriesse>
Status: CLOSED UPSTREAM
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: low
Version: rawhide
CC: agk, fdinitto, sdake
Hardware: All
OS: Linux
Doc Type: Bug Fix
Bug Blocks: 582326 (view as bug list)
Last Closed: 2010-04-26 16:18:02 UTC
Attachments:
  Proposed patch
  fenced patch using new cpg api

Description David Teigland 2010-04-14 15:48:08 UTC
Description of problem:


When corosync exits, my application (fenced) gets stuck.

# strace -p 2005                                                                
Process 2005 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = -1 ETIMEDOUT (Connection timed out)
poll([{fd=14, events=0}], 1, 0)         = 1 ([{fd=14, revents=POLLNVAL}])
gettimeofday({1271185487, 264}, NULL)   = 0
futex(0x7f0a66f5b028, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, {1271185489, 0}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
poll([{fd=14, events=0}], 1, 0)         = 1 ([{fd=14, revents=POLLNVAL}])
gettimeofday({1271185489, 198}, NULL)   = 0
futex(0x7f0a66f5b028, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, {1271185491, 0}, ffffffff) = -1 ETIMEDOUT (Connection timed out)

0x000000338d00d417 in sem_timedwait () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install cman-3.0.7-1.fc12.x86_64
(gdb) bt
#0  0x000000338d00d417 in sem_timedwait () from /lib64/libpthread.so.0          
#1  0x0000003713e02311 in reply_receive (ipc_instance=0x2379ed0,                
    res_msg=0x7ffff68c6a50, res_len=16) at coroipcc.c:476
#2  0x0000003713e02e7e in coroipcc_msg_send_reply_receive (                     
    handle=3265522690949120001, iov=0x7ffff68c6a80, iov_len=1,
    res_msg=0x7ffff68c6a50, res_len=16) at coroipcc.c:1045
#3  0x0000003713a01ed3 in cpg_finalize (handle=5902762718137417729)             
    at cpg.c:238
#4  0x0000000000403542 in close_cpg_daemon ()                                   
    at /root/stable3/fence/fenced/cpg.c:2311
#5  0x000000000040b26d in loop (argc=<value optimized out>,                     
    argv=<value optimized out>) at /root/stable3/fence/fenced/main.c:831
#6  main (argc=<value optimized out>, argv=<value optimized out>)               
    at /root/stable3/fence/fenced/main.c:1045

The corosync process has exited.  I'm using trunk.  I think this happens every time corosync is killed via cman.


Comment 1 David Teigland 2010-04-14 17:47:21 UTC
My cluster is busy with something else at the moment, but to reproduce this I'd first try:

service cman start on node1 and node2
cman_tool kill -n node1 from node2
check if fenced is stuck on node1

If that doesn't do it, I'd try:

service cman start on node1, node2, node3
create network partition: node1 | node2, node3
remove network partition
node2 or node3 should kill node1
check if fenced is stuck on node1

Comment 2 David Teigland 2010-04-14 19:41:34 UTC
Verified that cman_tool kill reproduces the problem (the only difference is that I have four nodes in my cluster).  I repeated the test twice; the problem reproduced on the second try.

Also note that I'm using the latest cpg patch adding the totem/ringid callbacks.

Comment 4 David Teigland 2010-04-15 15:46:05 UTC
Jan, good point, the fenced I'm using is updated to use the new cpg_model_initialize api.  I'll send you a patch with the fenced changes.

Comment 5 Jan Friesse 2010-04-16 11:38:17 UTC
Created attachment 407079 [details]
Proposed patch

The patch handles POLLNVAL, and the return value of poll() is now handled more carefully as well.

Comment 6 David Teigland 2010-04-16 15:42:00 UTC
Thanks, I'll try the patch.  Sorry I didn't get you the fenced patch I'm using; I was too busy debugging and forgot.

Comment 7 David Teigland 2010-04-16 23:13:29 UTC
Honza, using the patch, I've tried both tests above a couple of times and have not seen fenced get stuck.  I'll try a few more times next week and let you know.

Comment 8 David Teigland 2010-04-16 23:14:33 UTC
Created attachment 407208 [details]
fenced patch using new cpg api

Here's the fenced version that I was seeing troubles with in case you'd like to try it.

Comment 9 Jan Friesse 2010-04-19 14:07:39 UTC
Dave,
I was trying to reproduce the bug (WITHOUT the patch I sent and WITH the fenced patch you sent), unsuccessfully.  Are you using Fedora rawhide?  If so, it looks to me like an incompatibility in how poll works and what it returns on a newer kernel/glibc/etc.

Anyway, that part of coroipcc was not very well written, so the patch should be included in corosync.

Comment 10 David Teigland 2010-04-19 20:20:35 UTC
Tried this several more times using the patch and haven't seen the hang, so I suggest we call it a fix.

(Using F12 with recent devel kernel.)

Comment 11 Jan Friesse 2010-04-26 16:18:02 UTC
Patch is now included in upstream as svn revision 2789, so I'm closing this bug.