Bug 888857 - fenced sometimes fails to terminate properly
Summary: fenced sometimes fails to terminate properly
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: cluster
Version: 6.4
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Assignee: David Teigland
QA Contact: Cluster QE
URL:
Whiteboard:
Keywords:
Depends On:
Blocks:
 
Reported: 2012-12-19 16:34 UTC by Jaroslav Kortus
Modified: 2013-11-21 11:22 UTC (History)
8 users

Cause: 

At closedown, fenced and related daemons could close a file descriptor that was still referenced by the corosync libraries.

Consequence: 
This could cause fenced (and other daemons, though fenced seems to be the most affected) to fail to terminate properly on shutdown.

Fix: 
The file descriptor is not closed, but marked unused in the daemons.

Result: 
The daemons now terminate reliably when cman is shut down.
Clone Of:
Last Closed: 2013-11-21 11:22:52 UTC


Attachments
Complete strace dump (36.57 KB, application/octet-stream)
2013-01-04 13:55 UTC, Jan Friesse
no flags Details
patch (1.25 KB, text/plain)
2013-01-04 17:50 UTC, David Teigland
no flags Details
patch (2.18 KB, text/plain)
2013-01-04 17:58 UTC, David Teigland
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2013:1617 normal SHIPPED_LIVE cluster and gfs2-utils bug fix update 2013-11-20 21:38:45 UTC

Description Jaroslav Kortus 2012-12-19 16:34:25 UTC
Description of problem:
fenced sometimes fails to terminate properly when it is expected to do so.


Version-Release number of selected component (if applicable):
cman-3.0.12.1-46.el6.x86_64

How reproducible:
seems to be around 5%

Steps to Reproduce:
1. set dummy fencing agent (/bin/true)
2. block corosync traffic (5404/5405 UDP)
3. wait for nodes to react
4. remove block when others see it offline
5. cman receives kill command from other nodes
6. fenced should terminate in a couple of seconds (and here it sometimes does not terminate)
  
Actual results:
fenced spinning

Expected results:
fenced terminated

Additional info:

Comment 1 Jaroslav Kortus 2012-12-19 16:35:24 UTC
David Teigland 2012-12-17 13:05:48 EST

The fenced design unfortunately prevents it from processing the clean/orderly removal of a node while waiting for fencing of a failed node to complete.
Based on that, the two options are:

1. delay the clean removal of a node until fencing is complete
   (current behavior)

2. leave the fence domain forcibly, which will make the leaving
   node appear to have failed, causing it to be fenced

The policy of "service cman stop" has always been to shut down the node cleanly,
and to either pause or fail if a component cannot be stopped cleanly.

The existing method of doing a forcible shutdown is "cman_tool leave force".
The forcible/unclean exit of corosync causes any other component to also exit
uncleanly (the "cluster is down, exiting" messages in the logs).



David Teigland 2012-12-19 10:44:36 EST

Looked at Jaroslav's cluster where fenced was stuck spinning.
The debug log shows that corosync had exited, causing fenced to
also exit.

1355930274 cpg_dispatch error 2
1355930274 cluster is down, exiting
1355930274 daemon cpg_dispatch error 2
1355930274 cpg_dispatch error 2

fenced happened to call into cpg and got stuck spinning in ipc:

[pid 20511] poll([{fd=13, events=0}], 1, 0) = 0 (Timeout)
[pid 20511] futex(0x7f1087ac5070, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, {1355930276, 0}, ffffffff) = -1 ETIMEDOUT (Connection timed out)

#0  0x00007f1086c81811 in sem_timedwait () from /lib64/libpthread.so.0
#1  0x00007f108589d13c in coroipcc_msg_send_reply_receive ()
    from /usr/lib64/libcoroipcc.so.4
#2  0x00007f1086e95987 in cpg_finalize () from /usr/lib64/libcpg.so.4

Comment 4 David Teigland 2012-12-19 16:45:20 UTC
The first part of comment 1 is not related to the second part, which is the subject of this bz.

In the second part, fenced gets stuck spinning in corosync ipc after corosync has died, so it seems to be yet another corosync ipc issue.  This is not terribly critical because we don't typically care much what happens in the case where corosync is killed while being used.  The node will be fenced/rebooted in a short time anyway.

Comment 5 RHEL Product and Program Management 2012-12-24 06:53:01 UTC
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 8 Jan Friesse 2013-01-04 13:55:04 UTC
A few notes before the weekend:
The bug is easily reproducible with:

while true;do cman_tool join -D ; fenced; fence_tool join; sleep 2; fence_tool ls;  killall -9 corosync;sleep 4;done

When fenced is stuck, it is waiting on a socket. This socket (as it turned out) is NOT created by corosync/corosynclib but by syslog (it is a DGRAM socket; corosync uses only STREAM sockets).

By running a slightly more magic command:
for i in `seq 1 200`;do cman_tool join -D ; nohup strace -f fenced &>/tmp/fd-$i & sleep 1; fence_tool join; sleep 2; fence_tool ls;echo $i - `pidof fenced` >> /tmp/fds.txt;killall -9 corosync;sleep 4;done

it's possible to find out that:
[pid  1733] futex(0x7f3452e34010, FUTEX_WAKE, 1) = 0
[pid  1733] poll([{fd=15, events=POLLIN}], 1, 0) = 1 ([{fd=15, revents=POLLIN|POLLHUP}])
[pid  1733] gettimeofday({1357302429, 210530}, NULL) = 0
[pid  1733] gettimeofday({1357302429, 210567}, NULL) = 0
[pid  1733] close(15)                   = 0

(and fd 15 is then...
[pid  1734] connect(15, {sa_family=AF_FILE, path="/dev/log"}, 110) = 0

...opened by syslog. And after that:

...
[pid  1733] poll([{fd=15, events=0}], 1, 0) = 0 (Timeout)
...

corosync is waiting on fd 15.

Question is, who is closing fd 15?

The complete strace is attached in the next comment.

DCT:
Isn't it by any chance possible that fenced is closing a socket returned by the corosync lib?

Comment 9 Jan Friesse 2013-01-04 13:55:30 UTC
Created attachment 672406 [details]
Complete strace dump

Comment 10 David Teigland 2013-01-04 17:42:16 UTC
[pid  1733] poll([{fd=15, events=POLLIN}], 1, 0) = 1 ([{fd=15, revents=POLLIN|POLLHUP}])
[pid  1733] gettimeofday({1357302429, 210530}, NULL) = 0
[pid  1733] gettimeofday({1357302429, 210567}, NULL) = 0
[pid  1733] close(15)                   = 0

From the strace, fd 15 is the fd for corosync ipc.
poll on that fd returns POLLIN|POLLHUP.
When fenced sees POLLHUP, it closes the fd directly in deadfn(). [*]
fenced then reuses fd 15 for syslog.
corosync ipc still thinks it's using fd 15, so the lib calls go bad.

[*]
The way that fenced handles POLLHUP is by calling the client connection's deadfn() function.  When there is no client-specific deadfn(), the default client_dead() function is used, and client_dead() closes the fd.  corosync connections are supposed to use the cluster_dead() function instead, which does not close the fd but just causes the daemon to exit.  However, there's a corosync client connection added in fd_join() that is not specifying cluster_dead(), so it gets client_dead().  I believe this is the problem.

So, this will probably fix the problem:

- ci = client_add(cpg_fd_domain, process_cpg_domain, NULL);
+ ci = client_add(cpg_fd_domain, process_cpg_domain, cluster_dead);

Comment 11 David Teigland 2013-01-04 17:50:42 UTC
Created attachment 672501 [details]
patch

Comment 12 David Teigland 2013-01-04 17:58:16 UTC
Created attachment 672503 [details]
patch

Fix the problem in fenced, dlm_controld and gfs_controld.

Comment 13 Fabio Massimo Di Nitto 2013-01-14 16:26:49 UTC
6.5 it is.

jkortus to test dct patch before patch can be pushed around

Comment 26 errata-xmlrpc 2013-11-21 11:22:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1617.html

