Red Hat Bugzilla – Bug 888857
fenced sometimes fails to terminate properly
Last modified: 2013-11-21 06:22:52 EST
Description of problem:
Sometimes it happens that fenced does not terminate properly when it's expected to do so.

Version-Release number of selected component (if applicable):
cman-3.0.12.1-46.el6.x86_64

How reproducible:
seems to be around 5%

Steps to Reproduce:
1. set a dummy fencing agent (/bin/true)
2. block corosync traffic (5404/5405 UDP)
3. wait for the nodes to react
4. remove the block once the other nodes see the node as offline
5. cman receives a kill command from the other nodes
6. fenced should terminate in a couple of seconds (and here it sometimes does not terminate)

Actual results:
fenced spinning

Expected results:
fenced terminated

Additional info:
David Teigland 2012-12-17 13:05:48 EST

The fenced design unfortunately prevents it from processing the clean/orderly removal of a node while waiting for fencing of a failed node to complete. Based on that, the two options are:

1. delay the clean removal of a node until fencing is complete (current behavior)
2. leave the fence domain forcibly, which will make the leaving node appear to have failed, causing it to be fenced

The policy of "service cman stop" has always been to shut down the node cleanly, and to either pause or fail if a component cannot be stopped cleanly. The existing method of doing a forcible shutdown is "cman_tool leave force". The forcible/unclean exit of corosync causes any other component to also exit uncleanly (the "cluster is down, exiting" messages in the logs).

David Teigland 2012-12-19 10:44:36 EST

Looked at Jaroslav's cluster where fenced was stuck spinning. The debug log shows that corosync had exited, causing fenced to also exit:

1355930274 cpg_dispatch error 2
1355930274 cluster is down, exiting
1355930274 daemon cpg_dispatch error 2
1355930274 cpg_dispatch error 2

fenced happened to call into cpg and got stuck spinning in ipc:

[pid 20511] poll([{fd=13, events=0}], 1, 0) = 0 (Timeout)
[pid 20511] futex(0x7f1087ac5070, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, {1355930276, 0}, ffffffff) = -1 ETIMEDOUT (Connection timed out)

#0  0x00007f1086c81811 in sem_timedwait () from /lib64/libpthread.so.0
#1  0x00007f108589d13c in coroipcc_msg_send_reply_receive () from /usr/lib64/libcoroipcc.so.4
#2  0x00007f1086e95987 in cpg_finalize () from /usr/lib64/libcpg.so.4
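For context, a minimal sketch of the error path recorded in the debug log above, assuming corosync's libcpg API: when corosync dies, cpg_dispatch() stops returning CS_OK (error 2 corresponds to CS_ERR_LIBRARY), and the daemon treats that as the cluster being down. This is only an illustration of the pattern, not fenced's actual dispatch code; the function names here are invented.

#include <stdio.h>
#include <corosync/cpg.h>

/* Illustrative only: a cpg dispatch handler that treats any non-CS_OK
 * return as the corosync connection being gone. */
static void process_cpg(cpg_handle_t handle)
{
	cs_error_t err = cpg_dispatch(handle, CS_DISPATCH_ALL);

	if (err != CS_OK) {
		/* err == 2 is CS_ERR_LIBRARY, matching the
		 * "cpg_dispatch error 2" lines in the log above. */
		fprintf(stderr, "cpg_dispatch error %d\n", (int)err);
		fprintf(stderr, "cluster is down, exiting\n");
		/* a real daemon would begin its shutdown here */
	}
}

int main(void)
{
	cpg_handle_t handle;
	cpg_callbacks_t callbacks = { 0 };

	/* Connects to the local corosync instance (must be running). */
	if (cpg_initialize(&handle, &callbacks) != CS_OK)
		return 1;

	process_cpg(handle);

	/* cpg_finalize() is the call shown stuck in the backtrace above. */
	cpg_finalize(handle);
	return 0;
}

Build against libcpg (e.g. -lcpg) on a node where corosync is available.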
The first part of comment 1 is not related to the second part, which is the subject of this bz. In the second part, fenced gets stuck spinning in corosync ipc after corosync has died, so it seems to be yet another corosync ipc issue. This is not terribly critical because we don't typically care much what happens in the case where corosync is killed while being used. The node will be fenced/rebooted in a short time anyway.
This request was not resolved in time for the current release. Red Hat invites you to ask your support representative to propose this request, if still desired, for consideration in the next release of Red Hat Enterprise Linux.
Few notes before the weekend:

The bug is very easily reproducible with:

while true; do cman_tool join -D; fenced; fence_tool join; sleep 2; fence_tool ls; killall -9 corosync; sleep 4; done

When fenced is stuck, it is waiting on a socket. This socket (as it turned out) is NOT created by corosync/corosynclib but by syslog (it's a DGRAM socket; corosync uses only STREAM sockets).

By running a more elaborate command:

for i in `seq 1 200`; do cman_tool join -D; nohup strace -f fenced &>/tmp/fd-$i & sleep 1; fence_tool join; sleep 2; fence_tool ls; echo $i - `pidof fenced` >> /tmp/fds.txt; killall -9 corosync; sleep 4; done

it's possible to find out that:

[pid 1733] futex(0x7f3452e34010, FUTEX_WAKE, 1) = 0
[pid 1733] poll([{fd=15, events=POLLIN}], 1, 0) = 1 ([{fd=15, revents=POLLIN|POLLHUP}])
[pid 1733] gettimeofday({1357302429, 210530}, NULL) = 0
[pid 1733] gettimeofday({1357302429, 210567}, NULL) = 0
[pid 1733] close(15) = 0

and fd 15 is then...

[pid 1734] connect(15, {sa_family=AF_FILE, path="/dev/log"}, 110) = 0

...opened by syslog. And after that:

...
[pid 1733] poll([{fd=15, events=0}], 1, 0) = 0 (Timeout)
...

corosync is waiting on fd 15. The question is, who is closing fd 15? The complete strace is attached in the next comment.

DCT: Isn't it by any chance possible that fenced is closing the socket returned by the corosync lib?
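The fd-number reuse at the heart of this can be shown with a tiny standalone C program (unrelated to the fenced/corosync sources, purely a demo): POSIX hands the lowest free descriptor number to the next open/socket call, so once the corosync ipc fd is closed, syslog's datagram socket to /dev/log can come back with the very same number.

/* Standalone demo (not fenced code) of POSIX fd-number reuse. */
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	/* Stand-in for the corosync ipc connection (a STREAM socket). */
	int ipc_fd = socket(AF_UNIX, SOCK_STREAM, 0);
	printf("ipc fd = %d\n", ipc_fd);

	/* What closing the fd on POLLHUP effectively does. */
	close(ipc_fd);

	/* Stand-in for syslog's DGRAM socket to /dev/log. */
	int log_fd = socket(AF_UNIX, SOCK_DGRAM, 0);
	printf("log fd = %d (same number, unrelated socket)\n", log_fd);

	/* Any library still holding the old number now talks to the
	 * wrong socket, which is why the corosync lib calls go bad. */
	close(log_fd);
	return 0;
}

Running it typically prints the same number twice, which is exactly the pattern in the strace: close(15) in fenced followed by syslog's connect() coming back on fd 15.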
Created attachment 672406 [details] Complete strace dump
[pid 1733] poll([{fd=15, events=POLLIN}], 1, 0) = 1 ([{fd=15, revents=POLLIN|POLLHUP}])
[pid 1733] gettimeofday({1357302429, 210530}, NULL) = 0
[pid 1733] gettimeofday({1357302429, 210567}, NULL) = 0
[pid 1733] close(15) = 0

From the strace, fd 15 is the fd for corosync ipc. poll on that fd returns POLLIN|POLLHUP. When fenced sees POLLHUP, it closes the fd directly in deadfn(). [*] fenced then reuses fd 15 for syslog. corosync ipc still thinks it's using fd 15, so the lib calls go bad.

[*] The way that fenced handles POLLHUP is by calling the client connection's deadfn() function. When there is no client-specific deadfn(), the default client_dead() function is used. client_dead() closes the fd. corosync connections are supposed to use the cluster_dead() function, not the client_dead() function. But, there's a corosync client connection added in fd_join() that is not specifying cluster_dead(), so it gets client_dead(). I believe this is the problem. cluster_dead() does not close the fd, just causes the daemon to exit.

So, this will probably fix the problem:

-	ci = client_add(cpg_fd_domain, process_cpg_domain, NULL);
+	ci = client_add(cpg_fd_domain, process_cpg_domain, cluster_dead);
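To make the mechanism concrete, here is a simplified sketch of the client/deadfn pattern described above. The names client_add(), client_dead() and cluster_dead() come from the comment; the client table, MAX_CLIENTS, workfn and the poll loop are invented for illustration and do not match the real fenced source.

#include <poll.h>
#include <unistd.h>

#define MAX_CLIENTS 16

/* Simplified model of the daemon's client table: each registered fd
 * has a work callback for POLLIN and a dead callback for POLLHUP. */
struct client {
	int fd;
	void (*workfn)(int ci);
	void (*deadfn)(int ci);
};

static struct client clients[MAX_CLIENTS];
static struct pollfd pollfds[MAX_CLIENTS];
static int client_count;
static int daemon_quit;

/* Default dead handler: closes the fd, which frees the fd number for
 * reuse -- the hazard described above when applied to corosync ipc. */
static void client_dead(int ci)
{
	close(clients[ci].fd);
	clients[ci].fd = -1;
	pollfds[ci].fd = -1;
}

/* Dead handler meant for corosync connections: leave the fd alone so
 * its number cannot be recycled, and just ask the daemon to exit. */
static void cluster_dead(int ci)
{
	(void)ci;
	daemon_quit = 1;
}

/* Register an fd; with a NULL deadfn the default client_dead() is
 * used, which is how the cpg_fd_domain connection went wrong. */
static int client_add(int fd, void (*workfn)(int), void (*deadfn)(int))
{
	if (client_count >= MAX_CLIENTS)
		return -1;

	int ci = client_count++;
	clients[ci].fd = fd;
	clients[ci].workfn = workfn;
	clients[ci].deadfn = deadfn ? deadfn : client_dead;
	pollfds[ci].fd = fd;
	pollfds[ci].events = POLLIN;
	return ci;
}

static void main_loop(void)
{
	while (!daemon_quit) {
		if (poll(pollfds, client_count, -1) <= 0)
			continue;
		for (int ci = 0; ci < client_count; ci++) {
			if (pollfds[ci].fd < 0)
				continue;
			if (pollfds[ci].revents & POLLIN)
				clients[ci].workfn(ci);
			if (pollfds[ci].revents & (POLLHUP | POLLERR))
				clients[ci].deadfn(ci);
		}
	}
}

With cluster_dead() registered for the cpg connection, as in the one-line change above, the POLLHUP path no longer closes the fd, so syslog can never be handed the number that libcoroipcc still holds.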
Created attachment 672501 [details] patch
Created attachment 672503 [details]
patch

Fix the problem in fenced, dlm_controld and gfs_controld.
6.5 it is. jkortus to test dct's patch before the patch can be pushed.
pushed to RHEL6 branch https://git.fedorahosted.org/cgit/cluster.git/commit/?h=RHEL6&id=62a5ab048683b4ce77f0eeaba81d3330da574d41
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1617.html