Description of problem:
When corosync is dealing with a membership change, corosync-cfgtool -H fails to stop it.

# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M  19836   2009-06-26 12:11:48  bull-01
   2   M  19840   2009-06-26 12:11:48  bull-02
   4   M  19840   2009-06-26 12:11:48  bull-04
   5   M  19840   2009-06-26 12:11:48  bull-05
# iptables -A OUTPUT -s `corosync-cfgtool -a 0` -p udp --dport 5405 -j DROP; sleep 5; corosync-cfgtool -H
Shutting down corosync
# ps ax | grep corosync
 2697 ?        SLsl   0:00 corosync -f
 2708 pts/0    S+     0:00 grep corosync
# corosync-cfgtool -H
Shutting down corosync
Could not shutdown (error = 14)
# ps ax | grep corosync
 2697 ?        SLsl   0:00 corosync -f
 2712 pts/0    S+     0:00 grep corosync

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
How long did you wait for corosync to shut down? If you try to shut it down while the cluster is in transition there will be a delay before corosync gets shut down. In the meantime if you try another shutdown you will get CS_ERR_EXIST because there is already a shutdown in progress.
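Since a second -H only returns CS_ERR_EXIST while the first shutdown is still pending, one option is to issue the shutdown once and then poll for the process to go away. Below is a minimal sketch of that idea. The wait_for_exit helper is hypothetical (it is not part of corosync), the 60-second timeout is an arbitrary choice, and it assumes you run it as root so that kill -0 can see the corosync process.

```shell
#!/bin/sh
# Hypothetical helper (not part of corosync): poll until PID has exited,
# giving up after TIMEOUT seconds. Returns 0 if the process is gone,
# 1 if it is still running when the timeout expires.
wait_for_exit() {
    pid=$1
    timeout=$2
    elapsed=0
    # kill -0 sends no signal; it only checks that the PID can be signalled.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Intended use (commented out so the sketch is safe to source anywhere):
#   pid=$(pidof corosync)
#   corosync-cfgtool -H
#   wait_for_exit "$pid" 60 || echo "corosync still running after 60s"
```

This only tells you how long the shutdown takes. It does not help if the shutdown never completes.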
I didn't make a note of how long I waited; I'll have to try again.
Tried again, it's still running after 5 minutes. straced it for a few seconds,

# strace -p 13905 -c
Process 13905 attached - interrupt to quit
^CProcess 13905 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
   nan    0.000000           0        86           poll
   nan    0.000000           0        61        61 sendmsg
   nan    0.000000           0        18           recvmsg
   nan    0.000000           0       257           gettimeofday
   nan    0.000000           0         1           restart_syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000                   423        61 total
That's very odd, it works fine for me. Can you ping me on IRC and let me have a look at your system, please?
I wonder if the leave message gets lost because it's part of the previous ring? Steve, is this (even remotely) possible? Maybe there should be a timer to make cman shut down even if that message never arrives.
Ah, I see the problem here. You're blocking all corosync traffic not only for the other nodes but for itself. So the LEAVE message never arrives back - in fact NO messages ever arrive back. If you do a "cman_tool nodes" you can see that the node has a totally broken view of the world, because it can't even talk to itself. I'm tempted to close this NOTABUG because it's a false situation. If you unplug a switch then the node will be able to talk to itself and form a consensus.
All I'm looking for is a way of getting rid of corosync without leaking ipc semaphores. Using kill leaks them, but corosync-cfgtool -H did not leak them (when it worked). Did I hear that the current ipc-of-the-month doesn't use shared memory semaphores? Would that make all this a moot point?
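For checking whether a given way of stopping corosync leaks semaphores, a rough sketch is to count the SysV semaphore arrays reported by ipcs -s before corosync is started and again after it has been stopped. The count_sem_arrays helper below is hypothetical, and it assumes the Linux util-linux output layout in which each array line begins with a hexadecimal key.

```shell
#!/bin/sh
# Hypothetical helper: count SysV semaphore arrays listed on stdin.
# Assumes the util-linux "ipcs -s" layout, where every array line
# starts with a hexadecimal key such as 0x00000000.
count_sem_arrays() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so mask the exit status to keep callers using "set -e" happy.
    grep -c '^0x' || true
}

# Intended use (commented out so the sketch is safe to source anywhere):
#   before=$(ipcs -s | count_sem_arrays)   # baseline, corosync not running
#   ... start corosync, then stop it with the method under test ...
#   after=$(ipcs -s | count_sem_arrays)
#   echo "semaphore arrays: before=$before after=$after"
```

A higher count afterwards suggests a leak, provided nothing else on the box created semaphores in the meantime.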
It's not corosync-cfgtool that's the problem, it's the iptables rules. If you don't use those then it's all fine as far as I can tell. A normal kill (not -9) should work without leaking resources. There's a signal handler that's installed I believe. If that doesn't work then it's a bug (but not this one!).
OK, I've tried killall corosync (SIGTERM), and sometimes that will work after 10-20 seconds and a couple of tries. I have one instance here where it won't terminate at all.
16:58 < chrissie> dct: I'll close that bug shall I? - we've got well beyond its scope now
16:58 < dct> yep