Description of problem:

The fenced process can no longer shut down if corosync exits ungracefully. This prevents cleanup (such as "service cman stop") from proceeding (the cluster has been in this state for 36+ hours).

Version-Release number of selected component (if applicable):

cman-3.1.1-1.fc14.x86_64

How reproducible:

Seen reasonably often due to a bug in corosync.

Steps to Reproduce:
1. "service cman start" on one cluster node
2. killall -9 corosync

Actual results:

(Logs from fence_tool dump)

1300575979 check_quorum not quorate
1300576021 cluster is down, exiting
1300576021 daemon cpg_dispatch error 2
1300576021 cpg_dispatch error 2

fenced does not exit.

Expected results:

fenced _does_ exit.

Additional info:

[root@pcmk-8 ~]# fence_tool ls
fence domain
member count  1
victim count  3
victim now    0
master nodeid 0
wait state    quorum
members

After attaching with gdb, the stack was as follows:

sem_timedwait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_timedwait.S:106
106     movq %rax, %r9
(gdb) where
#0  sem_timedwait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_timedwait.S:106
#1  0x00007eff48b31895 in ipc_sem_wait (handle=<value optimized out>, iov=<value optimized out>, iov_len=<value optimized out>, res_msg=0x7fff393903a0, res_len=16) at ../include/corosync/coroipc_ipc.h:215
#2  reply_receive (handle=<value optimized out>, iov=<value optimized out>, iov_len=<value optimized out>, res_msg=0x7fff393903a0, res_len=16) at coroipcc.c:515
#3  coroipcc_msg_send_reply_receive (handle=<value optimized out>, iov=<value optimized out>, iov_len=<value optimized out>, res_msg=0x7fff393903a0, res_len=16) at coroipcc.c:971
#4  0x00007eff4a34bc89 in cpg_finalize (handle=<value optimized out>) at cpg.c:238
#5  0x00000000004094ea in close_cpg_daemon () at /usr/src/debug/cluster-3.1.1/fence/fenced/cpg.c:2381
#6  0x000000000040c833 in loop (argc=<value optimized out>, argv=<value optimized out>) at /usr/src/debug/cluster-3.1.1/fence/fenced/main.c:831
#7  main (argc=<value optimized out>, argv=<value optimized out>) at /usr/src/debug/cluster-3.1.1/fence/fenced/main.c:1041
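For reference, a minimal standalone sketch of the scenario fenced is in (assumptions: corosync 1.x cpg API, link with -lcpg; the group name and timing are illustrative, not taken from fenced): hold a cpg handle, kill corosync out from under it, then call cpg_finalize() and see whether it returns or blocks in the IPC semaphore wait, as in frame #4 of the backtrace above.

    /*
     * Sketch only: join a cpg group, wait while corosync is killed
     * externally, then finalize the handle. Build: gcc -o repro repro.c -lcpg
     */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <corosync/cpg.h>

    int main(void)
    {
            cpg_handle_t handle;
            cpg_callbacks_t callbacks;
            struct cpg_name group;
            cs_error_t err;

            memset(&callbacks, 0, sizeof(callbacks));   /* no dispatch in this sketch */
            strcpy(group.value, "fenced:repro");        /* illustrative group name */
            group.length = strlen(group.value);

            err = cpg_initialize(&handle, &callbacks);
            if (err != CS_OK) {
                    fprintf(stderr, "cpg_initialize failed: %d\n", err);
                    return 1;
            }
            err = cpg_join(handle, &group);
            if (err != CS_OK) {
                    fprintf(stderr, "cpg_join failed: %d\n", err);
                    return 1;
            }

            printf("joined; run 'killall -9 corosync' now, finalizing in 10s\n");
            sleep(10);

            /* This is the call fenced is stuck in (frame #4 of the backtrace). */
            err = cpg_finalize(handle);
            printf("cpg_finalize returned %d\n", err);
            return 0;
    }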
Verified by running testcpg with the dispatch code changed not to exit if cpg_dispatch returns an error, and by stopping corosync prior to running testcpg.
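For orientation, a rough sketch (not the actual modified testcpg) of the kind of change described above: a loop that keeps calling cpg_dispatch even after it starts returning an error (e.g. once corosync has been killed), then finalizes the handle and reports whether finalize returned.

    /*
     * Sketch only: keep dispatching despite errors, then finalize.
     * 'handle' is assumed to come from an earlier cpg_initialize/cpg_join.
     */
    #include <stdio.h>
    #include <unistd.h>
    #include <corosync/cpg.h>

    static void dispatch_then_finalize(cpg_handle_t handle)
    {
            cs_error_t err;
            int i;

            for (i = 0; i < 10; i++) {
                    err = cpg_dispatch(handle, CS_DISPATCH_ONE);
                    if (err != CS_OK)
                            fprintf(stderr, "cpg_dispatch error %d, continuing\n", err);
                    sleep(1);
            }
            err = cpg_finalize(handle);
            printf("cpg_finalize returned %d\n", err);
    }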
Created attachment 486746 [details] My test case - works as it should = DOESN'T reproduce the problem
Steve, I've tried the test case in attachment 486746 [details]. It didn't lead to anything abnormal. I've also tried the following script:

for i in `seq 1 100`; do service cman start && sleep 2 && killall -9 corosync && sleep 5 && pidof fenced && exit; done

Again, without success. (I didn't notice that you had reassigned the bug to yourself.)
I've attached my test case, which simulates this scenario. Comment #1 was in error - the upstream testcpg.c is missing a quit = 1 assignment. It appears that cpg_finalize works as expected under non-racy conditions. Perhaps there is some race condition.
Created attachment 486830 [details] testcpg which should duplicate the issue but does not
Created attachment 487092 [details] another testcpg which creates 100 threads and initializes/finalizes after a kill

This test case also doesn't reproduce the problem.
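For context, a condensed sketch of the threaded variant (illustrative only, not the attached test case; assumes corosync 1.x cpg API, build with -lcpg -lpthread): each thread initializes and immediately finalizes a cpg handle after corosync has been killed, looking for a hang or race in the client IPC layer.

    /* Sketch: 100 threads each do cpg_initialize()/cpg_finalize(). */
    #include <pthread.h>
    #include <string.h>
    #include <corosync/cpg.h>

    #define NTHREADS 100

    static void *worker(void *arg)
    {
            cpg_handle_t handle;
            cpg_callbacks_t callbacks;

            (void)arg;
            memset(&callbacks, 0, sizeof(callbacks));
            /* initialize may fail once corosync is gone; finalize only on success */
            if (cpg_initialize(&handle, &callbacks) == CS_OK)
                    cpg_finalize(handle);
            return NULL;
    }

    int main(void)
    {
            pthread_t threads[NTHREADS];
            int i;

            for (i = 0; i < NTHREADS; i++)
                    pthread_create(&threads[i], NULL, worker, NULL);
            for (i = 0; i < NTHREADS; i++)
                    pthread_join(threads[i], NULL);
            return 0;
    }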
More detail from Andrew: he triggered the failure in corosync by issuing an IPC command to a nonexistent service engine. After this operation, fenced locked up and failed to exit.

Specifically, Andrew issued an IPC command to service engine 9 (cman), with cman loaded, after the system was in a stable state. cman has no IPC functions. I'm not sure how this would result in a lockup in that IPC client function, but it's worth more investigation to determine whether this is a general bug or a "user error" bug that triggers odd behavior. If it is user error, I am inclined to close this bug as WONTFIX, unless there is an easy solution to the problem.

Honza, try making a small program which does some IPC that would trigger a crash and see whether fenced locks up in that situation. If it does, identify whether it is only user error that would generate this problem, or a generic system problem.

Regards
-steve
Steve, Andrew,

to me, the description

"Specifically, Andrew issued an IPC command to service engine 9 (cman), with cman loaded, after the system was in a stable state"

and the gdb backtrace from the bug description (sem_timedwait -> ipc_sem_wait -> reply_receive -> coroipcc_msg_send_reply_receive -> cpg_finalize -> close_cpg_daemon)

look like two different things and two different problems. The gdb backtrace looks like a VALID library call (I was not able to reproduce the hang). The invalid service call is improper use of coroipc. I found one small problem in coroipcs (patch included in the next comment).

Andrew, are you really sure that the gdb trace and the invalid service call come from the same test/failure case?
Created attachment 488470 [details] Deny connect to a service without an initfn
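For context, a self-contained, hypothetical sketch of the idea behind that patch. The struct, array, and function names below are illustrative stand-ins, not the actual corosync internals touched by the attachment: the point is simply to deny an IPC connection aimed at a service engine slot that is empty or has no library init function, instead of letting the connect proceed and fail later.

    /* Hypothetical names throughout; illustrates the guard, not the real patch. */
    #include <stddef.h>

    #define SERVICE_SLOTS 64

    struct service_engine {
            const char *name;
            int (*lib_init_fn) (void *conn);  /* NULL if the engine has no IPC library interface */
    };

    static struct service_engine *services[SERVICE_SLOTS];

    /* Return 0 to accept an incoming IPC connection for 'service', -1 to deny it. */
    static int allow_service_connect(unsigned int service)
    {
            if (service >= SERVICE_SLOTS ||
                services[service] == NULL ||
                services[service]->lib_init_fn == NULL)
                    return -1;   /* deny: no engine, or no init function, at this slot */
            return 0;
    }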
(In reply to comment #8)
> Andrew, are you really sure that the gdb trace and the invalid service call
> come from the same test/failure case?

The timing seems right. But I could be mistaken.
(In reply to comment #10)
> (In reply to comment #8)
> > Andrew, are you really sure that the gdb trace and the invalid service call
> > come from the same test/failure case?
>
> The timing seems right. But I could be mistaken.

OK, I will try to ask again in a different way.

From the gdb stack trace it looks like you didn't make any change in the library, right? If so, there is no way to call cman IPC and get such a backtrace.

Also, from the stack trace it can be seen that the code is not stuck at all. It's in the sem_timedwait function, which the corosync IPC code calls with a 2-second timeout.

Are you sure that you didn't change fenced in some way so that it was calling cpg_finalize forever? I know, a stupid question, but I'm really not able to find the main problem. So if you are able to reproduce, can you please either:
- give me access to that machine, or
- step a few instructions in gdb to see where the code loops
(In reply to comment #11)
> From the gdb stack trace it looks like you didn't make any change in the
> library, right? If so, there is no way to call cman IPC and get such a
> backtrace.

It's not calling cman IPC.

A long time ago pacemaker was using slot 9 for its plugin. Then it was decided that pacemaker should use slot 10, but people were already deploying clusters with 9 by then, so we made it a configure-time option.

So crm_node is trying to do IPC to slot 9, thinking that our plugin will be on the other end. Except the cman plugin is there and the world goes boom.

While it's not exactly ideal that pacemaker does this (we've since made changes to avoid it), Steve and I agree that it shouldn't be possible to crash corosync this way.

> Also, from the stack trace it can be seen that the code is not stuck at all.
> It's in the sem_timedwait function, which the corosync IPC code calls with a
> 2-second timeout.
>
> Are you sure that you didn't change fenced in some way so that it was calling
> cpg_finalize forever?

I did not touch it in any way.

> I know, a stupid question, but I'm really not able to find the main problem.
> So if you are able to reproduce, can you please either:
> - give me access to that machine, or
> - step a few instructions in gdb to see where the code loops

With cman loaded, run:
crm_node -p -A
(In reply to comment #12)
> So crm_node is trying to do IPC to slot 9, thinking that our plugin will be
> on the other end. Except the cman plugin is there and the world goes boom.
>
> With cman loaded, run:
> crm_node -p -A

Andrew, what result should I get if I run crm_node -p -A? If it is a corosync crash, then great, I'm able to reproduce it, and it's even already fixed: https://bugzilla.redhat.com/show_bug.cgi?id=688904.

But what I don't understand is where fenced comes into play. If I understand it correctly, the scenario is the following:
- FC14
- start cman
- crm_node -p -A ---> corosync crash, different bug

Now, if I then try to stop fenced, it should get stuck. Correct?

Regards,
  Honza
(In reply to comment #13)
> Andrew, what result should I get if I run crm_node -p -A? If it is a corosync
> crash, then great, I'm able to reproduce it, and it's even already fixed:
> https://bugzilla.redhat.com/show_bug.cgi?id=688904.

Correct.

> But what I don't understand is where fenced comes into play. If I understand
> it correctly, the scenario is the following:
> - FC14
> - start cman
> - crm_node -p -A ---> corosync crash, different bug
>
> Now, if I then try to stop fenced, it should get stuck. Correct?

Also correct. The stack I saw was fenced trying to clean up and exit after corosync crashed/disappeared.
As discussed on IRC, I'm closing this bug. I was not able to reproduce it with the following script:

for i in `seq 100`; do service cman start; sleep 1; crm_node -p -A; service cman stop; done

I'm not sure the bug is really in the corosync IPC libraries. It could also be here:

    list_for_each_entry(fd, &domains, list) {
            if (fd->cpg_handle)
                    cpg_finalize(fd->cpg_handle);
    }

If the list is corrupted, this can loop forever.

Andrew will reopen the bug if he hits the problem again.