508353 – corosync-cfgtool -H fails during transition

Summary: corosync-cfgtool -H fails during transition

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	corosync
Sub Component:
Version:	rawhide
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	---
Assignee:	Christine Caulfield
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2009-06-26 17:23 UTC by David Teigland
Modified:	2009-08-11 15:59 UTC
CC List:	3 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2009-08-11 15:59:15 UTC
Type:	---
Embargoed:
Dependent Products:

Description David Teigland 2009-06-26 17:23:23 UTC

Description of problem:

When corosync is handling a membership change, corosync-cfgtool -H fails to shut it down.

# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M  19836   2009-06-26 12:11:48  bull-01
   2   M  19840   2009-06-26 12:11:48  bull-02
   4   M  19840   2009-06-26 12:11:48  bull-04
   5   M  19840   2009-06-26 12:11:48  bull-05

# iptables -A OUTPUT -s `corosync-cfgtool -a 0` -p udp --dport 5405 -j DROP; sleep 5; corosync-cfgtool -H
Shutting down corosync

# ps ax | grep corosync
 2697 ?        SLsl   0:00 corosync -f
 2708 pts/0    S+     0:00 grep corosync

# corosync-cfgtool -H
Shutting down corosync
Could not shutdown (error = 14)

# ps ax | grep corosync
 2697 ?        SLsl   0:00 corosync -f
 2712 pts/0    S+     0:00 grep corosync



Comment 1 Christine Caulfield 2009-07-16 14:08:36 UTC

How long did you wait for corosync to shut down?

If you try to shut it down while the cluster is in transition, there will be a delay before corosync shuts down. If you request another shutdown in the meantime, you will get CS_ERR_EXIST (the "error = 14" in your output) because a shutdown is already in progress.
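
A minimal sketch of the "request once, then wait" approach described above, rather than re-issuing the shutdown and hitting CS_ERR_EXIST. The helper name wait_for_exit and the 60-second timeout are illustrative assumptions, not part of corosync:

```shell
#!/bin/sh
# wait_for_exit PID TIMEOUT_SECONDS  (hypothetical helper, not a corosync tool)
# Polls about once per second until PID no longer exists.
# Returns 0 if the process is gone, 1 if it is still alive at the timeout.
wait_for_exit() {
    pid=$1
    remaining=$2
    # kill -0 sends no signal; it only tests whether the process exists.
    while kill -0 "$pid" 2>/dev/null; do
        [ "$remaining" -le 0 ] && return 1
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}

# Usage sketch (commented out; needs a running corosync):
# pid=$(pidof corosync)
# corosync-cfgtool -H          # request shutdown exactly once
# wait_for_exit "$pid" 60 || echo "corosync still running after 60s" >&2
```

Polling the process rather than repeating corosync-cfgtool -H keeps the "shutdown already in progress" error out of the picture, so a timeout points at a real hang.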

Comment 2 David Teigland 2009-07-16 18:12:04 UTC

I didn't note how long I waited; I'll have to try again.

Comment 3 David Teigland 2009-07-17 21:54:45 UTC

Tried again; it's still running after 5 minutes. I straced it for a few seconds:

# strace -p 13905 -c
Process 13905 attached - interrupt to quit
^CProcess 13905 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
   nan    0.000000           0        86           poll
   nan    0.000000           0        61        61 sendmsg
   nan    0.000000           0        18           recvmsg
   nan    0.000000           0       257           gettimeofday
   nan    0.000000           0         1           restart_syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000                   423        61 total

Comment 4 Christine Caulfield 2009-07-20 07:47:43 UTC

That's very odd; it works fine for me. Can you ping me on IRC and let me have a look at your system, please?

Comment 5 Christine Caulfield 2009-08-10 09:23:44 UTC

I wonder if the leave message gets lost because it's part of the previous ring? Steve, is this (even remotely) possible?

Maybe there should be a timer to make cman shut down even if that message never arrives.

Comment 6 Christine Caulfield 2009-08-11 09:04:08 UTC

Ah, I see the problem here.

You're blocking all corosync traffic, not only to the other nodes but also to the node itself. So the LEAVE message never arrives back; in fact NO messages ever arrive back. If you do a "cman_tool nodes" you can see that the node has a totally broken view of the world, because it can't even talk to itself.

I'm tempted to close this NOTABUG because it's an artificial situation. If you unplug a switch instead, the node will still be able to talk to itself and form a consensus.
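
For illustration only, here is one way a test might cut a node off from its peers without also dropping its own traffic, which is the difference described above. This is a sketch under assumptions: it filters inbound packets by peer source address instead of dropping everything on OUTPUT, the helper name print_peer_drop_rules is hypothetical, and the 192.0.2.x addresses are documentation placeholders. It only prints the rules and executes nothing:

```shell
#!/bin/sh
# print_peer_drop_rules PORT PEER_ADDR...  (hypothetical helper)
# Dry run: prints one iptables command per peer address; nothing is executed.
# Each rule drops UDP packets arriving from that peer on the given port,
# so packets carrying the node's own source address are not matched.
print_peer_drop_rules() {
    port=$1
    shift
    for peer in "$@"; do
        echo "iptables -A INPUT -s $peer -p udp --dport $port -j DROP"
    done
}

# Placeholder peer addresses; substitute the real addresses of the other nodes.
print_peer_drop_rules 5405 192.0.2.2 192.0.2.4
```

Note that this blocks only what the node receives; the peers would still see its outgoing traffic, so it is not equivalent to unplugging a switch.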

Comment 7 David Teigland 2009-08-11 14:25:19 UTC

All I'm looking for is a way of getting rid of corosync without leaking ipc semaphores.  Using kill leaks them, but corosync-cfgtool -H did not leak them (when it worked).  Did I hear that the current ipc-of-the-month doesn't use shared memory semaphores?  Would that make all this a moot point?

Comment 8 Christine Caulfield 2009-08-11 14:28:21 UTC

It's not corosync-cfgtool that's the problem, it's the iptables rules. If you don't use those then it's all fine as far as I can tell.

A normal kill (not -9) should work without leaking resources. I believe a signal handler is installed for that. If that doesn't work then it's a bug (but not this one!).
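
A rough sketch of how one might check for leaked semaphores around a SIGTERM, assuming the Linux (util-linux) "ipcs -s" output format in which each semaphore array row starts with a hex key. The helper name count_sem_arrays is hypothetical:

```shell
#!/bin/sh
# count_sem_arrays  (hypothetical helper)
# Reads `ipcs -s` output on stdin and prints the number of semaphore arrays,
# counted as rows whose first column looks like a hex key (0x...).
count_sem_arrays() {
    awk '$1 ~ /^0x/ { n++ } END { print n + 0 }'
}

# Usage sketch (commented out; needs a running corosync):
# before=$(ipcs -s | count_sem_arrays)
# killall corosync             # sends SIGTERM by default
# ... wait for the process to exit ...
# after=$(ipcs -s | count_sem_arrays)
# echo "semaphore arrays: before=$before after=$after"
```

The counts cover every semaphore array on the system, not only corosync's, so they are best compared on an otherwise idle test node.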

Comment 9 David Teigland 2009-08-11 15:20:47 UTC

OK, I've tried killall corosync (SIGTERM), and sometimes that will work after 10-20 seconds and a couple of tries.  I have one instance here where it won't terminate at all.

Comment 10 Christine Caulfield 2009-08-11 15:59:15 UTC

16:58 < chrissie> dct: I'll close that bug shall I? - we've got well beyond its scope now
16:58 < dct> yep
