Bug 888857 - fenced sometimes fails to terminate properly
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: cluster
Version: 6.4
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: David Teigland
QA Contact: Cluster QE
Docs Contact:
Depends On:
Blocks:
 
Reported: 2012-12-19 11:34 EST by Jaroslav Kortus
Modified: 2013-11-21 06:22 EST
CC List: 8 users

See Also:
Fixed In Version: cluster-3.0.12.1-53.el6
Doc Type: Bug Fix
Doc Text:
Cause: At closedown, fenced and related daemons could close a file descriptor that was still referenced by the corosync libraries. Consequence: This could cause fenced (and other daemons, but fenced seems to be most affected) to fail to terminate properly on shutdown. Fix: The file descriptor is not closed, but marked unused in the daemons. Result: The daemons now reliably terminate correctly when cman is shut down.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-11-21 06:22:52 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
Complete strace dump (36.57 KB, application/octet-stream) - 2013-01-04 08:55 EST, Jan Friesse
patch (1.25 KB, text/plain) - 2013-01-04 12:50 EST, David Teigland
patch (2.18 KB, text/plain) - 2013-01-04 12:58 EST, David Teigland

Description Jaroslav Kortus 2012-12-19 11:34:25 EST
Description of problem:
Sometimes fenced does not terminate properly when it is expected to.


Version-Release number of selected component (if applicable):
cman-3.0.12.1-46.el6.x86_64

How reproducible:
seems to be around 5%

Steps to Reproduce:
1. set dummy fencing agent (/bin/true)
2. block corosync traffic (5404/5405 UDP)
3. wait for nodes to react
4. remove block when others see it offline
5. cman receives kill command from other nodes
6. fenced should terminate in a couple of seconds (and here it sometimes does not)
  
Actual results:
fenced spinning

Expected results:
fenced terminated

Additional info:
Comment 1 Jaroslav Kortus 2012-12-19 11:35:24 EST
Quoting David Teigland 2012-12-17 13:05:48 EST:

The fenced design unfortunately prevents it from processing the clean/orderly removal of a node while waiting for fencing of a failed node to complete.
Based on that, the two options are:

1. delay the clean removal of a node until fencing is complete
   (current behavior)

2. leave the fence domain forcibly, which will make the leaving
   node appear to have failed, causing it to be fenced

The policy of "service cman stop" has always been to shut down the node cleanly,
and to either pause or fail if a component cannot be stopped cleanly.

The existing method of doing a forcible shutdown is "cman_tool leave force".
The forcible/unclean exit of corosync causes any other component to also exit
uncleanly (the "cluster is down, exiting" messages in the logs.)



Quoting David Teigland 2012-12-19 10:44:36 EST:

Looked at Jaroslav's cluster where fenced was stuck spinning.
The debug log shows that corosync had exited, causing fenced to
also exit.

1355930274 cpg_dispatch error 2
1355930274 cluster is down, exiting
1355930274 daemon cpg_dispatch error 2
1355930274 cpg_dispatch error 2

fenced happened to call into cpg and got stuck spinning in ipc:

[pid 20511] poll([{fd=13, events=0}], 1, 0) = 0 (Timeout)
[pid 20511] futex(0x7f1087ac5070, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, {1355930276, 0}, ffffffff) = -1 ETIMEDOUT (Connection timed out)

#0  0x00007f1086c81811 in sem_timedwait () from /lib64/libpthread.so.0
#1  0x00007f108589d13c in coroipcc_msg_send_reply_receive ()
    from /usr/lib64/libcoroipcc.so.4
#2  0x00007f1086e95987 in cpg_finalize () from /usr/lib64/libcpg.so.4
Comment 4 David Teigland 2012-12-19 11:45:20 EST
The first part of comment 1 is not related to the second part, which is the subject of this bz.

In the second part, fenced gets stuck spinning in corosync ipc after corosync has died, so it seems to be yet another corosync ipc issue.  This is not terribly critical because we don't typically care much what happens in the case where corosync is killed while being used.  The node will be fenced/rebooted in a short time anyway.
Comment 5 RHEL Product and Program Management 2012-12-24 01:53:01 EST
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.
Comment 8 Jan Friesse 2013-01-04 08:55:04 EST
A few notes before the weekend:
The bug is very easily reproducible with:

while true;do cman_tool join -D ; fenced; fence_tool join; sleep 2; fence_tool ls;  killall -9 corosync;sleep 4;done

When fenced is stuck, it is waiting on a socket. This socket (as it turned out) is NOT created by corosync/corosynclib but by syslog (it's a DGRAM socket; corosync uses only STREAM sockets).

By running a more magic command:
for i in `seq 1 200`;do cman_tool join -D ; nohup strace -f fenced &>/tmp/fd-$i & sleep 1; fence_tool join; sleep 2; fence_tool ls;echo $i - `pidof fenced` >> /tmp/fds.txt;killall -9 corosync;sleep 4;done

it's possible to find out that:
[pid  1733] futex(0x7f3452e34010, FUTEX_WAKE, 1) = 0
[pid  1733] poll([{fd=15, events=POLLIN}], 1, 0) = 1 ([{fd=15, revents=POLLIN|POLLHUP}])
[pid  1733] gettimeofday({1357302429, 210530}, NULL) = 0
[pid  1733] gettimeofday({1357302429, 210567}, NULL) = 0
[pid  1733] close(15)                   = 0

(and fd 15 is then...
[pid  1734] connect(15, {sa_family=AF_FILE, path="/dev/log"}, 110) = 0

...opened by syslog. And after that:

...
[pid  1733] poll([{fd=15, events=0}], 1, 0) = 0 (Timeout)
...

the corosync library code (inside fenced) is waiting on fd 15.

The question is: who is closing fd 15?

The complete strace is attached in the next comment.

DCT:
Isn't it by any chance possible that fenced is closing the socket returned by the corosync lib?
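
To illustrate the fd-reuse hazard described above, here is a minimal standalone C sketch (an illustration only, not fenced or corosync code; the two socket() calls merely stand in for the corosync IPC connection and the syslog connection to /dev/log):

/* Minimal standalone sketch of the fd-reuse hazard: after close(), the next
 * descriptor-creating call usually returns the same number, so a library that
 * still remembers the old fd would then be polling someone else's socket. */
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* "library" creates its IPC socket and remembers the fd number */
    int lib_fd = socket(AF_UNIX, SOCK_STREAM, 0);
    printf("library ipc fd: %d\n", lib_fd);

    /* application sees POLLHUP and closes the fd behind the library's back */
    close(lib_fd);

    /* the next thing that needs a descriptor (e.g. syslog opening /dev/log)
     * gets the lowest free number, typically the one just closed */
    int log_fd = socket(AF_UNIX, SOCK_DGRAM, 0);
    printf("syslog-style fd: %d\n", log_fd);

    if (log_fd == lib_fd)
        printf("fd %d was reused; the library's calls now hit the wrong socket\n",
               log_fd);

    close(log_fd);
    return 0;
}

Run in a process with only the standard descriptors open, both calls typically return the same number, which is the fd 15 collision visible in the strace above.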
Comment 9 Jan Friesse 2013-01-04 08:55:30 EST
Created attachment 672406 [details]
Complete strace dump
Comment 10 David Teigland 2013-01-04 12:42:16 EST
[pid  1733] poll([{fd=15, events=POLLIN}], 1, 0) = 1 ([{fd=15, revents=POLLIN|POLLHUP}])
[pid  1733] gettimeofday({1357302429, 210530}, NULL) = 0
[pid  1733] gettimeofday({1357302429, 210567}, NULL) = 0
[pid  1733] close(15)                   = 0

From the strace, fd 15 is the fd for corosync ipc.
poll on that fd returns POLLIN|POLLHUP.
When fenced sees POLLHUP, it closes the fd directly in deadfn(). [*]
fenced then reuses fd 15 for syslog.
corosync ipc still thinks it's using fd 15, so the lib calls go bad.

[*]
The way that fenced handles POLLHUP is by calling the client connection's deadfn() function. When there is no client-specific deadfn(), the default client_dead() function is used, and client_dead() closes the fd. corosync connections are supposed to use the cluster_dead() function, not the client_dead() function, but there is a corosync client connection added in fd_join() that does not specify cluster_dead(), so it gets client_dead(). I believe this is the problem. cluster_dead() does not close the fd; it just causes the daemon to exit.

So, this will probably fix the problem:

- ci = client_add(cpg_fd_domain, process_cpg_domain, NULL);
+ ci = client_add(cpg_fd_domain, process_cpg_domain, cluster_dead);
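
For illustration, a rough sketch of the poll()/deadfn dispatch described above; the names (client_add, client_dead, cluster_dead) follow this comment, but the code is an approximation for readability, not the actual fenced source:

/* Rough sketch of the poll()/deadfn dispatch described in this comment. */
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_CLIENTS 16

struct client {
    int fd;
    void (*workfn)(int ci);   /* called on POLLIN */
    void (*deadfn)(int ci);   /* called on POLLHUP/POLLERR */
};

static struct client clients[MAX_CLIENTS];
static struct pollfd pollfds[MAX_CLIENTS];
static int daemon_quit;

/* default deadfn: closes the fd - wrong for corosync connections, since
 * the corosync libraries still reference that descriptor */
static void client_dead(int ci)
{
    close(clients[ci].fd);
    clients[ci].fd = -1;
    pollfds[ci].fd = -1;
}

/* corosync deadfn: leave the fd alone and just shut the daemon down */
static void cluster_dead(int ci)
{
    (void)ci;
    daemon_quit = 1;
}

/* register a connection; deadfn == NULL falls back to client_dead, which
 * is exactly the trap the fd_join() connection fell into */
static int client_add(int fd, void (*workfn)(int), void (*deadfn)(int))
{
    for (int ci = 0; ci < MAX_CLIENTS; ci++) {
        if (clients[ci].fd <= 0) {
            clients[ci].fd = fd;
            clients[ci].workfn = workfn;
            clients[ci].deadfn = deadfn ? deadfn : client_dead;
            pollfds[ci].fd = fd;
            pollfds[ci].events = POLLIN;
            return ci;
        }
    }
    return -1;
}

static void main_loop(int nclients)
{
    while (!daemon_quit) {
        poll(pollfds, nclients, 1000);
        for (int ci = 0; ci < nclients; ci++) {
            if ((pollfds[ci].revents & POLLIN) && clients[ci].workfn)
                clients[ci].workfn(ci);
            if (pollfds[ci].revents & (POLLHUP | POLLERR))
                clients[ci].deadfn(ci);
        }
    }
}

int main(void)
{
    int sv[2];
    socketpair(AF_UNIX, SOCK_STREAM, 0, sv);

    /* register the "corosync" connection with the correct deadfn */
    int n = client_add(sv[0], NULL, cluster_dead) + 1;

    close(sv[1]);     /* peer goes away -> POLLHUP on sv[0] */
    main_loop(n);     /* cluster_dead() sets daemon_quit; sv[0] stays open */
    return 0;
}

Under these assumptions, the one-line change above amounts to registering cluster_dead() for the cpg_fd_domain connection, so the descriptor is left untouched for the corosync library and the daemon simply exits.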
Comment 11 David Teigland 2013-01-04 12:50:42 EST
Created attachment 672501 [details]
patch
Comment 12 David Teigland 2013-01-04 12:58:16 EST
Created attachment 672503 [details]
patch

Fix the problem in fenced, dlm_controld and gfs_controld.
Comment 13 Fabio Massimo Di Nitto 2013-01-14 11:26:49 EST
6.5 it is.

jkortus to test dct's patch before the patch can be pushed.
Comment 26 errata-xmlrpc 2013-11-21 06:22:52 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1617.html
