Description of problem:
When corosync exits, my application (fenced) gets stuck.

# strace -p 2005
Process 2005 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = -1 ETIMEDOUT (Connection timed out)
poll([{fd=14, events=0}], 1, 0) = 1 ([{fd=14, revents=POLLNVAL}])
gettimeofday({1271185487, 264}, NULL) = 0
futex(0x7f0a66f5b028, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, {1271185489, 0}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
poll([{fd=14, events=0}], 1, 0) = 1 ([{fd=14, revents=POLLNVAL}])
gettimeofday({1271185489, 198}, NULL) = 0
futex(0x7f0a66f5b028, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, {1271185491, 0}, ffffffff) = -1 ETIMEDOUT (Connection timed out)

0x000000338d00d417 in sem_timedwait () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install cman-3.0.7-1.fc12.x86_64
(gdb) bt
#0  0x000000338d00d417 in sem_timedwait () from /lib64/libpthread.so.0
#1  0x0000003713e02311 in reply_receive (ipc_instance=0x2379ed0, res_msg=0x7ffff68c6a50, res_len=16) at coroipcc.c:476
#2  0x0000003713e02e7e in coroipcc_msg_send_reply_receive (handle=3265522690949120001, iov=0x7ffff68c6a80, iov_len=1, res_msg=0x7ffff68c6a50, res_len=16) at coroipcc.c:1045
#3  0x0000003713a01ed3 in cpg_finalize (handle=5902762718137417729) at cpg.c:238
#4  0x0000000000403542 in close_cpg_daemon () at /root/stable3/fence/fenced/cpg.c:2311
#5  0x000000000040b26d in loop (argc=<value optimized out>, argv=<value optimized out>) at /root/stable3/fence/fenced/main.c:831
#6  main (argc=<value optimized out>, argv=<value optimized out>) at /root/stable3/fence/fenced/main.c:1045

The corosync process has exited. Using trunk. I think it happens every time corosync is killed via cman.
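For what it's worth, the failure mode in that backtrace boils down to a reply loop that keeps retrying sem_timedwait() even after poll() has reported the IPC fd as dead. A simplified sketch of the pattern (illustrative only, not the actual coroipcc.c source; wait_for_reply and its parameters are made-up names):

#include <errno.h>
#include <poll.h>
#include <semaphore.h>
#include <time.h>

/* Sketch of the hang: each semaphore timeout polls the server fd with
 * events == 0, but the POLLNVAL revent (set once corosync has exited
 * and the fd is invalid) is never acted on, so the loop spins forever. */
static int wait_for_reply(sem_t *reply_sem, int server_fd)
{
        for (;;) {
                struct timespec ts;
                struct pollfd pfd;

                clock_gettime(CLOCK_REALTIME, &ts);
                ts.tv_sec += 2;
                if (sem_timedwait(reply_sem, &ts) == 0)
                        return 0;               /* reply arrived */
                if (errno != ETIMEDOUT)
                        return -1;

                pfd.fd = server_fd;
                pfd.events = 0;
                poll(&pfd, 1, 0);
                /* BUG: pfd.revents == POLLNVAL is ignored here, which
                 * matches the poll/futex cycle in the strace above. */
        }
}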
My cluster is busy with something else at the moment, but to reproduce this I'd first try:

1. service cman start on node1 and node2
2. cman_tool kill -n node1 from node2
3. check if fenced is stuck on node1

If that doesn't do it, I'd try:

1. service cman start on node1, node2, node3
2. create network partition: node1 | node2, node3
3. remove network partition
4. node2 or node3 should kill node1
5. check if fenced is stuck on node1
Verified that cman_tool kill will reproduce the problem (the only difference is that I have four nodes in my cluster). I repeated the test twice; the problem reproduced on the second try. Also note that I'm using the latest cpg patch adding the totem/ringid callbacks.
Jan, good point: the fenced I'm using is updated to use the new cpg_model_initialize API. I'll send you a patch with the fenced changes.
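For anyone following along, with the new API fenced initializes cpg roughly like the sketch below. This is my reading of the cpg.h model-v1 interface; the callback bodies are placeholders and error handling is omitted:

#include <corosync/cpg.h>

static void deliver_cb(cpg_handle_t h, const struct cpg_name *group,
                       uint32_t nodeid, uint32_t pid, void *msg, size_t len)
{
        /* process a delivered group message */
}

static void confchg_cb(cpg_handle_t h, const struct cpg_name *group,
                       const struct cpg_address *members, size_t n_members,
                       const struct cpg_address *left, size_t n_left,
                       const struct cpg_address *joined, size_t n_joined)
{
        /* process a group membership change */
}

static void totem_confchg_cb(cpg_handle_t h, struct cpg_ring_id ring_id,
                             uint32_t n_members, const uint32_t *members)
{
        /* process a totem ring id / membership change (the new callback) */
}

static int setup_cpg(cpg_handle_t *handle)
{
        cpg_model_v1_data_t model_data = {
                .model                 = CPG_MODEL_V1,
                .cpg_deliver_fn        = deliver_cb,
                .cpg_confchg_fn        = confchg_cb,
                .cpg_totem_confchg_fn  = totem_confchg_cb,
        };

        return (cpg_model_initialize(handle, CPG_MODEL_V1,
                                     (cpg_model_data_t *)&model_data,
                                     NULL) == CS_OK) ? 0 : -1;
}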
Created attachment 407079 [details]
Proposed patch

Patch which handles POLLNVAL. The return value of poll() is also handled more carefully now.
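To spell out the idea: after a semaphore timeout, a poll() failure or a POLLERR/POLLHUP/POLLNVAL revent is now treated as "server is gone" and the wait is aborted instead of retried. A rough sketch of that shape, reusing the illustrative names from the description above (the real change is in the attached patch):

#include <errno.h>
#include <poll.h>
#include <semaphore.h>
#include <time.h>

/* Fixed variant of the earlier sketch: poll() errors and error revents
 * abort the wait instead of letting the loop spin forever. */
static int wait_for_reply_fixed(sem_t *reply_sem, int server_fd)
{
        for (;;) {
                struct timespec ts;
                struct pollfd pfd;
                int rc;

                clock_gettime(CLOCK_REALTIME, &ts);
                ts.tv_sec += 2;
                if (sem_timedwait(reply_sem, &ts) == 0)
                        return 0;               /* reply arrived */
                if (errno != ETIMEDOUT)
                        return -1;

                pfd.fd = server_fd;
                pfd.events = 0;
                rc = poll(&pfd, 1, 0);
                if (rc == -1 && errno == EINTR)
                        continue;               /* interrupted; retry */
                if (rc == -1)
                        return -1;              /* poll itself failed */
                if (rc == 1 &&
                    (pfd.revents & (POLLERR | POLLHUP | POLLNVAL)))
                        return -1;              /* server gone: give up */
        }
}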
Thanks, I'll try the patch. Sorry I didn't get you the fenced patch I'm using; I was too busy debugging and forgot.
Honza, using the patch, I've tried both tests above a couple of times and have not seen fenced get stuck. I'll try a few more times next week and let you know.
Created attachment 407208 [details]
fenced patch using new cpg api

Here's the fenced version I was seeing trouble with, in case you'd like to try it.
Dave, I was trying to reproduce the bug (without the patch I sent and WITH the fenced patch you sent), unsuccessfully. Are you using Fedora Rawhide? If so, this looks to me like an incompatibility in how poll() works and what it returns in a newer kernel?/glibc?/???. In any case, that part of coroipcc was not very well written, so the patch should be included in corosync.
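One data point that holds regardless of kernel/glibc version: POSIX poll() reports an invalid fd by setting POLLNVAL in revents even when events == 0, and such an fd still counts toward the return value, so the caller has to inspect revents. A quick standalone check (a hypothetical test program, not part of the tree):

#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        int fds[2];
        struct pollfd pfd;

        /* create a pipe, then close both ends so fds[0] is invalid */
        if (pipe(fds) != 0)
                return 1;
        close(fds[0]);
        close(fds[1]);

        pfd.fd = fds[0];
        pfd.events = 0;
        printf("rc=%d revents=0x%x (POLLNVAL=0x%x)\n",
               poll(&pfd, 1, 0), pfd.revents, POLLNVAL);
        return 0;
}

On Linux this prints rc=1 with POLLNVAL set in revents, matching the strace in the description.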
Tried this several more times using the patch and haven't seen the hang, so I suggest we call it a fix. (Using F12 with a recent devel kernel.)
The patch is now included upstream as svn revision 2789, so I'm closing this bug.