Description of problem:

The fenced process can no longer shut down if corosync exits ungracefully. This prevents cleanup (such as "service cman stop") from proceeding (the cluster has been in this state for 36+ hours).

Version-Release number of selected component (if applicable):

cman-3.1.1-1.fc14.x86_64

How reproducible:

Seen reasonably often due to a bug in corosync.

Steps to Reproduce:
1. "service cman start" on one cluster node
2. killall -9 corosync

Actual results:

(Logs from fence_tool dump)

1300575979 check_quorum not quorate
1300576021 cluster is down, exiting
1300576021 daemon cpg_dispatch error 2
1300576021 cpg_dispatch error 2

fenced does not exit.

Expected results:

fenced _does_ exit.

Additional info:

[root@pcmk-8 ~]# fence_tool ls
fence domain
member count  1
victim count  3
victim now    0
master nodeid 0
wait state    quorum
members

After attaching with gdb, the stack was as follows:

sem_timedwait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_timedwait.S:106
106     movq %rax, %r9
(gdb) where
#0  sem_timedwait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_timedwait.S:106
#1  0x00007eff48b31895 in ipc_sem_wait (handle=<value optimized out>, iov=<value optimized out>, iov_len=<value optimized out>, res_msg=0x7fff393903a0, res_len=16) at ../include/corosync/coroipc_ipc.h:215
#2  reply_receive (handle=<value optimized out>, iov=<value optimized out>, iov_len=<value optimized out>, res_msg=0x7fff393903a0, res_len=16) at coroipcc.c:515
#3  coroipcc_msg_send_reply_receive (handle=<value optimized out>, iov=<value optimized out>, iov_len=<value optimized out>, res_msg=0x7fff393903a0, res_len=16) at coroipcc.c:971
#4  0x00007eff4a34bc89 in cpg_finalize (handle=<value optimized out>) at cpg.c:238
#5  0x00000000004094ea in close_cpg_daemon () at /usr/src/debug/cluster-3.1.1/fence/fenced/cpg.c:2381
#6  0x000000000040c833 in loop (argc=<value optimized out>, argv=<value optimized out>) at /usr/src/debug/cluster-3.1.1/fence/fenced/main.c:831
#7  main (argc=<value optimized out>, argv=<value optimized out>) at /usr/src/debug/cluster-3.1.1/fence/fenced/main.c:1041
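For reference, a minimal standalone sketch of the scenario fenced is in (assumptions: corosync 1.x cpg API, link with -lcpg; the group name and timing are illustrative, not taken from fenced): hold a cpg handle, kill corosync out from under it, then call cpg_finalize() and see whether it returns or blocks in the IPC semaphore wait, as in frame #4 of the backtrace above.

    /*
     * Sketch only: join a cpg group, wait while corosync is killed
     * externally, then finalize the handle. Build: gcc -o repro repro.c -lcpg
     */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <corosync/cpg.h>

    int main(void)
    {
            cpg_handle_t handle;
            cpg_callbacks_t callbacks;
            struct cpg_name group;
            cs_error_t err;

            memset(&callbacks, 0, sizeof(callbacks));   /* no dispatch in this sketch */
            strcpy(group.value, "fenced:repro");        /* illustrative group name */
            group.length = strlen(group.value);

            err = cpg_initialize(&handle, &callbacks);
            if (err != CS_OK) {
                    fprintf(stderr, "cpg_initialize failed: %d\n", err);
                    return 1;
            }
            err = cpg_join(handle, &group);
            if (err != CS_OK) {
                    fprintf(stderr, "cpg_join failed: %d\n", err);
                    return 1;
            }

            printf("joined; run 'killall -9 corosync' now, finalizing in 10s\n");
            sleep(10);

            /* This is the call fenced is stuck in (frame #4 of the backtrace). */
            err = cpg_finalize(handle);
            printf("cpg_finalize returned %d\n", err);
            return 0;
    }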
Verified by running testcpg with the dispatch code changed not to exit if cpg_dispatch returns an error, and by stopping corosync prior to running testcpg.
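For orientation, a rough sketch (not the actual modified testcpg) of the kind of change described above: a loop that keeps calling cpg_dispatch even after it starts returning an error (e.g. once corosync has been killed), then finalizes the handle and reports whether finalize returned.

    /*
     * Sketch only: keep dispatching despite errors, then finalize.
     * 'handle' is assumed to come from an earlier cpg_initialize/cpg_join.
     */
    #include <stdio.h>
    #include <unistd.h>
    #include <corosync/cpg.h>

    static void dispatch_then_finalize(cpg_handle_t handle)
    {
            cs_error_t err;
            int i;

            for (i = 0; i < 10; i++) {
                    err = cpg_dispatch(handle, CS_DISPATCH_ONE);
                    if (err != CS_OK)
                            fprintf(stderr, "cpg_dispatch error %d, continuing\n", err);
                    sleep(1);
            }
            err = cpg_finalize(handle);
            printf("cpg_finalize returned %d\n", err);
    }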
Created attachment 486746 [details] My test case - works as it should = DOESN'T reproduce the problem
Steve, I've tried the test case in attachment 486746 [details]. It didn't lead to anything abnormal. I've also tried the following script:

for i in `seq 1 100`; do service cman start && sleep 2 && killall -9 corosync && sleep 5 && pidof fenced && exit; done

Again, without success. (I didn't notice that you had reassigned the bug to yourself.)
I've attached my test case, which simulates this scenario. Comment #1 was in error - the upstream testcpg.c is missing a quit = 1 assignment. It appears that cpg_finalize works as expected under non-racy conditions. Perhaps there is some race condition.
Created attachment 486830 [details] testcpg which should duplicate the issue but does not
Created attachment 487092 [details] another testcpg which creates 100 threads and initializes/finalizes after a kill

This test case also doesn't reproduce the problem.
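For context, a condensed sketch of the threaded variant (illustrative only, not the attached test case; assumes corosync 1.x cpg API, build with -lcpg -lpthread): each thread initializes and immediately finalizes a cpg handle after corosync has been killed, looking for a hang or race in the client IPC layer.

    /* Sketch: 100 threads each do cpg_initialize()/cpg_finalize(). */
    #include <pthread.h>
    #include <string.h>
    #include <corosync/cpg.h>

    #define NTHREADS 100

    static void *worker(void *arg)
    {
            cpg_handle_t handle;
            cpg_callbacks_t callbacks;

            (void)arg;
            memset(&callbacks, 0, sizeof(callbacks));
            /* initialize may fail once corosync is gone; finalize only on success */
            if (cpg_initialize(&handle, &callbacks) == CS_OK)
                    cpg_finalize(handle);
            return NULL;
    }

    int main(void)
    {
            pthread_t threads[NTHREADS];
            int i;

            for (i = 0; i < NTHREADS; i++)
                    pthread_create(&threads[i], NULL, worker, NULL);
            for (i = 0; i < NTHREADS; i++)
                    pthread_join(threads[i], NULL);
            return 0;
    }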
More detail from Andrew: he triggered the failure in corosync by issuing an IPC command to a nonexistent service engine. After this operation, fenced locked up and failed to exit.

Specifically, Andrew issued an IPC command to service engine 9 (cman), with cman loaded, after the system was in a stable state. cman has no IPC functions. I'm not sure how this would result in a lockup in that IPC client function, but it's worth more investigation to determine whether this is a general bug or a "user error" bug that triggers odd behavior. If it is user error, I am inclined to close this bug as WONTFIX, unless there is an easy solution to the problem.

Honza, try making a small program which does some IPC that would trigger a crash and see whether fenced locks up in that situation. If it does, identify whether it is only user error that would generate this problem, or a generic system problem.

Regards
-steve
Steve, Andrew,

to me, the description

"Specifically, Andrew issued an IPC command to service engine 9 (cman), with cman loaded, after the system was in a stable state"

and the gdb backtrace from the bug description (sem_timedwait -> ipc_sem_wait -> reply_receive -> coroipcc_msg_send_reply_receive -> cpg_finalize -> close_cpg_daemon)

look like two different things and two different problems. The gdb backtrace looks like a VALID library call (I was not able to reproduce the hang). The invalid service call is improper use of coroipc. I found one small problem in coroipcs (patch included in the next comment).

Andrew, are you really sure that the gdb trace and the invalid service call come from the same test/failure case?
Created attachment 488470 [details] Deny connect to a service without an initfn
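For context, a self-contained, hypothetical sketch of the idea behind that patch. The struct, array, and function names below are illustrative stand-ins, not the actual corosync internals touched by the attachment: the point is simply to deny an IPC connection aimed at a service engine slot that is empty or has no library init function, instead of letting the connect proceed and fail later.

    /* Hypothetical names throughout; illustrates the guard, not the real patch. */
    #include <stddef.h>

    #define SERVICE_SLOTS 64

    struct service_engine {
            const char *name;
            int (*lib_init_fn) (void *conn);  /* NULL if the engine has no IPC library interface */
    };

    static struct service_engine *services[SERVICE_SLOTS];

    /* Return 0 to accept an incoming IPC connection for 'service', -1 to deny it. */
    static int allow_service_connect(unsigned int service)
    {
            if (service >= SERVICE_SLOTS ||
                services[service] == NULL ||
                services[service]->lib_init_fn == NULL)
                    return -1;   /* deny: no engine, or no init function, at this slot */
            return 0;
    }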
(In reply to comment #8)
> Andrew, are you really sure that the gdb trace and the invalid service call
> come from the same test/failure case?

The timing seems right. But I could be mistaken.
(In reply to comment #10)
> (In reply to comment #8)
> > Andrew, are you really sure that the gdb trace and the invalid service call
> > come from the same test/failure case?
>
> The timing seems right. But I could be mistaken.

OK, I will try to ask again in a different way.

From the gdb stack trace it looks like you didn't make any change in the library, right? If so, there is no way to call cman IPC and get such a backtrace.

Also, from the stack trace it can be seen that the code is not stuck at all. It's in the sem_timedwait function, which the corosync IPC code calls with a 2-second timeout.

Are you sure that you didn't change fenced in some way so that it was calling cpg_finalize forever? I know, a stupid question, but I'm really not able to find the main problem. So if you are able to reproduce, can you please either:
- give me access to that machine, or
- step a few instructions in gdb to see where the code loops
(In reply to comment #11)
> From the gdb stack trace it looks like you didn't make any change in the
> library, right? If so, there is no way to call cman IPC and get such a
> backtrace.

It's not calling cman IPC.

A long time ago pacemaker was using slot 9 for its plugin. Then it was decided that pacemaker should use slot 10, but people were already deploying clusters with 9 by then, so we made it a configure-time option.

So crm_node is trying to do IPC to slot 9, thinking that our plugin will be on the other end. Except the cman plugin is there and the world goes boom.

While it's not exactly ideal that pacemaker does this (we've since made changes to avoid it), Steve and I agree that it shouldn't be possible to crash corosync this way.

> Also, from the stack trace it can be seen that the code is not stuck at all.
> It's in the sem_timedwait function, which the corosync IPC code calls with a
> 2-second timeout.
>
> Are you sure that you didn't change fenced in some way so that it was calling
> cpg_finalize forever?

I did not touch it in any way.

> I know, a stupid question, but I'm really not able to find the main problem.
> So if you are able to reproduce, can you please either:
> - give me access to that machine, or
> - step a few instructions in gdb to see where the code loops

With cman loaded, run:
crm_node -p -A
(In reply to comment #12)
> So crm_node is trying to do IPC to slot 9, thinking that our plugin will be
> on the other end. Except the cman plugin is there and the world goes boom.
>
> With cman loaded, run:
> crm_node -p -A

Andrew, what result should I get if I run crm_node -p -A? If it is a corosync crash, then great, I'm able to reproduce it, and it's even already fixed: https://bugzilla.redhat.com/show_bug.cgi?id=688904.

But what I don't understand is where fenced comes into play. If I understand it correctly, the scenario is the following:
- FC14
- start cman
- crm_node -p -A ---> corosync crash, different bug

Now, if I then try to stop fenced, it should get stuck. Correct?

Regards,
  Honza
(In reply to comment #13)
> Andrew, what result should I get if I run crm_node -p -A? If it is a corosync
> crash, then great, I'm able to reproduce it, and it's even already fixed:
> https://bugzilla.redhat.com/show_bug.cgi?id=688904.

Correct.

> But what I don't understand is where fenced comes into play. If I understand
> it correctly, the scenario is the following:
> - FC14
> - start cman
> - crm_node -p -A ---> corosync crash, different bug
>
> Now, if I then try to stop fenced, it should get stuck. Correct?

Also correct. The stack I saw was fenced trying to clean up and exit after corosync crashed/disappeared.
As discussed on IRC, I'm closing this bug. I was not able to reproduce it with the following script:

for i in `seq 100`; do service cman start; sleep 1; crm_node -p -A; service cman stop; done

I'm not sure the bug is really in the corosync IPC libraries. It could also be here:

    list_for_each_entry(fd, &domains, list) {
            if (fd->cpg_handle)
                    cpg_finalize(fd->cpg_handle);
    }

If the list is corrupted, this can loop forever.

Andrew will reopen the bug if he hits the problem again.