Bug 508353 - corosync-cfgtool -H fails during transition
Status: CLOSED NOTABUG
Product: Fedora
Classification: Fedora
Component: corosync
Version: rawhide
Hardware: All
OS: Linux
Priority: low
Severity: medium
Assigned To: Christine Caulfield
QA Contact: Fedora Extras Quality Assurance
Reported: 2009-06-26 13:23 EDT by David Teigland
Modified: 2009-08-11 11:59 EDT
Doc Type: Bug Fix
Last Closed: 2009-08-11 11:59:15 EDT
Description David Teigland 2009-06-26 13:23:23 EDT
Description of problem:

While corosync is handling a membership change, corosync-cfgtool -H fails to shut it down.

# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M  19836   2009-06-26 12:11:48  bull-01
   2   M  19840   2009-06-26 12:11:48  bull-02
   4   M  19840   2009-06-26 12:11:48  bull-04
   5   M  19840   2009-06-26 12:11:48  bull-05

# iptables -A OUTPUT -s `corosync-cfgtool -a 0` -p udp --dport 5405 -j DROP; sleep 5; corosync-cfgtool -H
Shutting down corosync

# ps ax | grep corosync
 2697 ?        SLsl   0:00 corosync -f
 2708 pts/0    S+     0:00 grep corosync

# corosync-cfgtool -H
Shutting down corosync
Could not shutdown (error = 14)

# ps ax | grep corosync
 2697 ?        SLsl   0:00 corosync -f
 2712 pts/0    S+     0:00 grep corosync


Comment 1 Christine Caulfield 2009-07-16 10:08:36 EDT
How long did you wait for corosync to shut down?

If you try to shut it down while the cluster is in transition there will be a delay before corosync gets shut down. In the meantime if you try another shutdown you will get CS_ERR_EXIST because there is already a shutdown in progress.
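
A minimal sketch of that advice, assuming a POSIX shell: request the shutdown once, then poll for the process to exit rather than re-running corosync-cfgtool -H, since a second request during the delay returns error 14 (CS_ERR_EXIST). The helper name wait_for_exit and the 60-second timeout are illustrative assumptions, not part of corosync.

```shell
#!/bin/sh
# Hypothetical helper: wait up to $2 seconds for process $1 to exit.
# Returns 0 if the process is gone, 1 if it is still running at the timeout.
wait_for_exit() {
    pid=$1
    timeout=$2
    elapsed=0
    # kill -0 sends no signal; it only checks that the process can be signalled.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Usage sketch (not run here): request the shutdown once, then wait.
# pid=$(pidof corosync) && corosync-cfgtool -H && wait_for_exit "$pid" 60
```

A non-zero return means corosync is still alive after the timeout, which is the situation reported in this bug, as opposed to a shutdown that is merely delayed.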
Comment 2 David Teigland 2009-07-16 14:12:04 EDT
I didn't note how long I waited; I'll have to try again.
Comment 3 David Teigland 2009-07-17 17:54:45 EDT
Tried again; it's still running after 5 minutes. I straced it for a few seconds:

# strace -p 13905 -c
Process 13905 attached - interrupt to quit
^CProcess 13905 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
   nan    0.000000           0        86           poll
   nan    0.000000           0        61        61 sendmsg
   nan    0.000000           0        18           recvmsg
   nan    0.000000           0       257           gettimeofday
   nan    0.000000           0         1           restart_syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000                   423        61 total
Comment 4 Christine Caulfield 2009-07-20 03:47:43 EDT
That's very odd; it works fine for me. Can you ping me on IRC and let me have a look at your system, please?
Comment 5 Christine Caulfield 2009-08-10 05:23:44 EDT
I wonder if the leave message gets lost because it's part of the previous ring? Steve, is this (even remotely) possible?

Maybe there should be a timer to make cman shut down even if that message never arrives.
Comment 6 Christine Caulfield 2009-08-11 05:04:08 EDT
Ah, I see the problem here.

You're blocking all corosync traffic not only for the other nodes but for itself. So the LEAVE message never arrives back - in fact NO messages ever arrive back. If you do a "cman_tool nodes" you can see that the node has a totally broken view of the world, because it can't even talk to itself.

I'm tempted to close this NOTABUG because it's a false situation. If you unplug a switch then the node will be able to talk to itself and form a consensus.
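
For comparison, a hedged sketch of rules that would leave the node's own traffic alone: drop only inbound packets whose source is a peer, instead of dropping everything the node sends. This is an untested assumption about how to approximate an unplugged switch. It blocks one direction only (peers would still hear this node), and the helper name and peer addresses are hypothetical. Port 5405 is taken from the rule in the description. The function only prints the commands; nothing is applied.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) one iptables rule per peer address.
# Each rule drops inbound UDP to port 5405 from that peer only, so packets
# the local node sends to itself are not matched.
peer_drop_rules() {
    for peer in "$@"; do
        printf 'iptables -A INPUT -s %s -p udp --dport 5405 -j DROP\n' "$peer"
    done
}

# Example with made-up peer addresses:
# peer_drop_rules 192.0.2.2 192.0.2.4 192.0.2.5
```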
Comment 7 David Teigland 2009-08-11 10:25:19 EDT
All I'm looking for is a way of getting rid of corosync without leaking ipc semaphores.  Using kill leaks them, but corosync-cfgtool -H did not leak them (when it worked).  Did I hear that the current ipc-of-the-month doesn't use shared memory semaphores?  Would that make all this a moot point?
Comment 8 Christine Caulfield 2009-08-11 10:28:21 EDT
It's not corosync-cfgtool that's the problem, it's the iptables rules. If you don't use those then it's all fine as far as I can tell.

A normal kill (not -9) should work without leaking resources; I believe there's a signal handler installed for it. If that doesn't work then it's a bug (but not this one!).
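
One rough way to check for a leak is to count semaphore arrays before and after stopping the daemon. The sketch below assumes the util-linux `ipcs -s` layout, in which every array line begins with a hexadecimal key. The helper name count_sem_arrays is hypothetical, and the comparison is only indicative because other processes may create or remove semaphores in the meantime.

```shell
#!/bin/sh
# Hypothetical helper: count semaphore arrays in `ipcs -s` output read from
# stdin. Assumes each array line starts with a hex key such as 0x00000000.
count_sem_arrays() {
    # grep -c prints the match count; it exits 1 on zero matches, so mask that.
    grep -c '^0x' || true
}

# Usage sketch (not run here):
# before=$(ipcs -s | count_sem_arrays)
# killall corosync        # sends SIGTERM by default
# ...wait for corosync to exit...
# after=$(ipcs -s | count_sem_arrays)
# [ "$after" -le "$before" ] || echo "possible semaphore leak"
```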
Comment 9 David Teigland 2009-08-11 11:20:47 EDT
OK, I've tried killall corosync (SIGTERM), and sometimes that works after 10-20 seconds and a couple of tries.  I have one instance here where it won't terminate at all.
Comment 10 Christine Caulfield 2009-08-11 11:59:15 EDT
16:58 < chrissie> dct: I'll close that bug shall I? - we've got well beyond its scope now
16:58 < dct> yep
