Bug 888857 - fenced sometimes fails to terminate properly
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: cluster
Version: 6.4
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: David Teigland
QA Contact: Cluster QE
Docs Contact:
Depends On:
Blocks:
 
Reported: 2012-12-19 11:34 EST by Jaroslav Kortus
Modified: 2013-11-21 06:22 EST
CC List: 8 users

See Also:
Fixed In Version: cluster-3.0.12.1-53.el6
Doc Type: Bug Fix
Doc Text:
Cause: At closedown, fenced and related daemons could close a file descriptor that was still referenced by the corosync libraries. Consequence: This could cause fenced (and other daemons, but fenced seems to be most affected) to fail to terminate properly on shutdown. Fix: The file descriptor is not closed, but marked unused in the daemons. Result: The daemons now reliably terminate correctly when cman is shut down.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-11-21 06:22:52 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
Complete strace dump (36.57 KB, application/octet-stream) - 2013-01-04 08:55 EST, Jan Friesse
patch (1.25 KB, text/plain) - 2013-01-04 12:50 EST, David Teigland
patch (2.18 KB, text/plain) - 2013-01-04 12:58 EST, David Teigland

Description Jaroslav Kortus 2012-12-19 11:34:25 EST
Description of problem:
Sometimes fenced does not terminate properly when it is expected to.


Version-Release number of selected component (if applicable):
cman-3.0.12.1-46.el6.x86_64

How reproducible:
seems to be around 5%

Steps to Reproduce:
1. set dummy fencing agent (/bin/true)
2. block corosync traffic (5404/5405 UDP)
3. wait for nodes to react
4. remove block when others see it offline
5. cman receives kill command from other nodes
6. fenced should terminate in a couple of seconds (and here it sometimes does not)
  
Actual results:
fenced spinning

Expected results:
fenced terminated

Additional info:
Comment 1 Jaroslav Kortus 2012-12-19 11:35:24 EST
Quoting David Teigland 2012-12-17 13:05:48 EST:

The fenced design unfortunately prevents it from processing the clean/orderly removal of a node while waiting for fencing of a failed node to complete.
Based on that, the two options are:

1. delay the clean removal of a node until fencing is complete
   (current behavior)

2. leave the fence domain forcibly, which will make the leaving
   node appear to have failed, causing it to be fenced

The policy of "service cman stop" has always been to shut down the node cleanly,
and to either pause or fail if a component cannot be stopped cleanly.

The existing method of doing a forcible shutdown is "cman_tool leave force".
The forcible/unclean exit of corosync causes any other component to also exit
uncleanly (the "cluster is down, exiting" messages in the logs.)



Quoting David Teigland 2012-12-19 10:44:36 EST:

Looked at Jaroslav's cluster where fenced was stuck spinning.
The debug log shows that corosync had exited, causing fenced to
also exit.

1355930274 cpg_dispatch error 2
1355930274 cluster is down, exiting
1355930274 daemon cpg_dispatch error 2
1355930274 cpg_dispatch error 2

fenced happened to call into cpg and got stuck spinning in ipc:

[pid 20511] poll([{fd=13, events=0}], 1, 0) = 0 (Timeout)
[pid 20511] futex(0x7f1087ac5070, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, {1355930276, 0}, ffffffff) = -1 ETIMEDOUT (Connection timed out)

#0  0x00007f1086c81811 in sem_timedwait () from /lib64/libpthread.so.0
#1  0x00007f108589d13c in coroipcc_msg_send_reply_receive ()
    from /usr/lib64/libcoroipcc.so.4
#2  0x00007f1086e95987 in cpg_finalize () from /usr/lib64/libcpg.so.4
Comment 4 David Teigland 2012-12-19 11:45:20 EST
The first part of comment 1 is not related to the second part, which is the subject of this bz.

In the second part, fenced gets stuck spinning in corosync ipc after corosync has died, so it seems to be yet another corosync ipc issue.  This is not terribly critical because we don't typically care much what happens in the case where corosync is killed while being used.  The node will be fenced/rebooted in a short time anyway.
Comment 5 RHEL Product and Program Management 2012-12-24 01:53:01 EST
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.
Comment 8 Jan Friesse 2013-01-04 08:55:04 EST
A few notes before the weekend:
The bug is very easily reproducible with:

while true;do cman_tool join -D ; fenced; fence_tool join; sleep 2; fence_tool ls;  killall -9 corosync;sleep 4;done

When fenced is stuck, it is waiting on a socket. This socket (as it turned out) is NOT created by corosync/corosynclib but by syslog (it's a DGRAM socket; corosync uses only STREAM sockets).

By running a more magic command:
for i in `seq 1 200`;do cman_tool join -D ; nohup strace -f fenced &>/tmp/fd-$i & sleep 1; fence_tool join; sleep 2; fence_tool ls;echo $i - `pidof fenced` >> /tmp/fds.txt;killall -9 corosync;sleep 4;done

it's possible to find out that:
[pid  1733] futex(0x7f3452e34010, FUTEX_WAKE, 1) = 0
[pid  1733] poll([{fd=15, events=POLLIN}], 1, 0) = 1 ([{fd=15, revents=POLLIN|POLLHUP}])
[pid  1733] gettimeofday({1357302429, 210530}, NULL) = 0
[pid  1733] gettimeofday({1357302429, 210567}, NULL) = 0
[pid  1733] close(15)                   = 0

(and fd 15 is then...
[pid  1734] connect(15, {sa_family=AF_FILE, path="/dev/log"}, 110) = 0

...opened by syslog. And after that:

...
[pid  1733] poll([{fd=15, events=0}], 1, 0) = 0 (Timeout)
...

the corosync library code (inside fenced) is waiting on fd 15.

The question is: who is closing fd 15?

The complete strace is attached in the next comment.

DCT:
Isn't it by any chance possible that fenced is closing the socket returned by the corosync lib?
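
To illustrate the fd-reuse hazard described above, here is a minimal standalone C sketch (an illustration only, not fenced or corosync code; the two socket() calls merely stand in for the corosync IPC connection and the syslog connection to /dev/log):

/* Minimal standalone sketch of the fd-reuse hazard: after close(), the next
 * descriptor-creating call usually returns the same number, so a library that
 * still remembers the old fd would then be polling someone else's socket. */
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* "library" creates its IPC socket and remembers the fd number */
    int lib_fd = socket(AF_UNIX, SOCK_STREAM, 0);
    printf("library ipc fd: %d\n", lib_fd);

    /* application sees POLLHUP and closes the fd behind the library's back */
    close(lib_fd);

    /* the next thing that needs a descriptor (e.g. syslog opening /dev/log)
     * gets the lowest free number, typically the one just closed */
    int log_fd = socket(AF_UNIX, SOCK_DGRAM, 0);
    printf("syslog-style fd: %d\n", log_fd);

    if (log_fd == lib_fd)
        printf("fd %d was reused; the library's calls now hit the wrong socket\n",
               log_fd);

    close(log_fd);
    return 0;
}

Run in a process with only the standard descriptors open, both calls typically return the same number, which is the fd 15 collision visible in the strace above.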
Comment 9 Jan Friesse 2013-01-04 08:55:30 EST
Created attachment 672406 [details]
Complete strace dump
Comment 10 David Teigland 2013-01-04 12:42:16 EST
[pid  1733] poll([{fd=15, events=POLLIN}], 1, 0) = 1 ([{fd=15, revents=POLLIN|POLLHUP}])
[pid  1733] gettimeofday({1357302429, 210530}, NULL) = 0
[pid  1733] gettimeofday({1357302429, 210567}, NULL) = 0
[pid  1733] close(15)                   = 0

From the strace, fd 15 is the fd for corosync ipc.
poll on that fd returns POLLIN|POLLHUP.
When fenced sees POLLHUP, it closes the fd directly in deadfn(). [*]
fenced then reuses fd 15 for syslog.
corosync ipc still thinks it's using fd 15, so the lib calls go bad.

[*]
The way that fenced handles POLLHUP is by calling the client connection's deadfn() function. When there is no client-specific deadfn(), the default client_dead() function is used, and client_dead() closes the fd. corosync connections are supposed to use the cluster_dead() function, not the client_dead() function, but there is a corosync client connection added in fd_join() that does not specify cluster_dead(), so it gets client_dead(). I believe this is the problem. cluster_dead() does not close the fd; it just causes the daemon to exit.

So, this will probably fix the problem:

- ci = client_add(cpg_fd_domain, process_cpg_domain, NULL);
+ ci = client_add(cpg_fd_domain, process_cpg_domain, cluster_dead);
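
For illustration, a rough sketch of the poll()/deadfn dispatch described above; the names (client_add, client_dead, cluster_dead) follow this comment, but the code is an approximation for readability, not the actual fenced source:

/* Rough sketch of the poll()/deadfn dispatch described in this comment. */
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_CLIENTS 16

struct client {
    int fd;
    void (*workfn)(int ci);   /* called on POLLIN */
    void (*deadfn)(int ci);   /* called on POLLHUP/POLLERR */
};

static struct client clients[MAX_CLIENTS];
static struct pollfd pollfds[MAX_CLIENTS];
static int daemon_quit;

/* default deadfn: closes the fd - wrong for corosync connections, since
 * the corosync libraries still reference that descriptor */
static void client_dead(int ci)
{
    close(clients[ci].fd);
    clients[ci].fd = -1;
    pollfds[ci].fd = -1;
}

/* corosync deadfn: leave the fd alone and just shut the daemon down */
static void cluster_dead(int ci)
{
    (void)ci;
    daemon_quit = 1;
}

/* register a connection; deadfn == NULL falls back to client_dead, which
 * is exactly the trap the fd_join() connection fell into */
static int client_add(int fd, void (*workfn)(int), void (*deadfn)(int))
{
    for (int ci = 0; ci < MAX_CLIENTS; ci++) {
        if (clients[ci].fd <= 0) {
            clients[ci].fd = fd;
            clients[ci].workfn = workfn;
            clients[ci].deadfn = deadfn ? deadfn : client_dead;
            pollfds[ci].fd = fd;
            pollfds[ci].events = POLLIN;
            return ci;
        }
    }
    return -1;
}

static void main_loop(int nclients)
{
    while (!daemon_quit) {
        poll(pollfds, nclients, 1000);
        for (int ci = 0; ci < nclients; ci++) {
            if ((pollfds[ci].revents & POLLIN) && clients[ci].workfn)
                clients[ci].workfn(ci);
            if (pollfds[ci].revents & (POLLHUP | POLLERR))
                clients[ci].deadfn(ci);
        }
    }
}

int main(void)
{
    int sv[2];
    socketpair(AF_UNIX, SOCK_STREAM, 0, sv);

    /* register the "corosync" connection with the correct deadfn */
    int n = client_add(sv[0], NULL, cluster_dead) + 1;

    close(sv[1]);     /* peer goes away -> POLLHUP on sv[0] */
    main_loop(n);     /* cluster_dead() sets daemon_quit; sv[0] stays open */
    return 0;
}

Under these assumptions, the one-line change above amounts to registering cluster_dead() for the cpg_fd_domain connection, so the descriptor is left untouched for the corosync library and the daemon simply exits.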
Comment 11 David Teigland 2013-01-04 12:50:42 EST
Created attachment 672501 [details]
patch
Comment 12 David Teigland 2013-01-04 12:58:16 EST
Created attachment 672503 [details]
patch

Fix the problem in fenced, dlm_controld and gfs_controld.
Comment 13 Fabio Massimo Di Nitto 2013-01-14 11:26:49 EST
6.5 it is.

jkortus to test dct's patch before the patch can be pushed.
Comment 26 errata-xmlrpc 2013-11-21 06:22:52 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1617.html
