Bug 1707851 - unsatisfactory recovery from pacemaker-daemons stalled via SIGSTOP
Summary: unsatisfactory recovery from pacemaker-daemons stalled via SIGSTOP
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: pacemaker
Version: 9.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: rc
Target Release: 9.0
Assignee: Ken Gaillot
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On: 2031865
Blocks:
 
Reported: 2019-05-08 14:49 UTC by Klaus Wenninger
Modified: 2022-05-17 12:23 UTC
CC List: 6 users

Fixed In Version: pacemaker-2.1.2-3.el9
Doc Type: Enhancement
Doc Text:
Feature: Pacemaker now monitors its component subdaemons for IPC responsiveness.
Reason: Previously, if a daemon stopped being responsive (for example, after receiving a SIGSTOP signal), the cluster might not detect any problem.
Result: Now, Pacemaker will detect unresponsive subdaemons and recover them if necessary.
Clone Of:
Clones: 2031865
Environment:
Last Closed: 2022-05-17 12:20:40 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments


Links
System                  ID              Last Updated
Cluster Labs            5356            2020-07-31 17:40:31 UTC
Red Hat Issue Tracker   CLUSTERQE-5489  2022-03-15 03:03:17 UTC
Red Hat Product Errata  RHBA-2022:2293  2022-05-17 12:20:51 UTC

Description Klaus Wenninger 2019-05-08 14:49:16 UTC
Description of problem:

When the pacemaker daemons are stalled using SIGSTOP, this leads to:

- very sluggish recovery when done on non-DC nodes
- no recovery at all when done on the DC


Version-Release number of selected component (if applicable):

2.0.1-5.el8

How reproducible:

100%

Steps to Reproduce:
1. killall -STOP pacemaker-...
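For reference, the full daemon list (as used during verification in comment 23 below) is:

  killall -STOP pacemakerd pacemaker-based pacemaker-fenced pacemaker-execd \
      pacemaker-attrd pacemaker-schedulerd pacemaker-controld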

Actual results:

If this is done on the DC nothing happens at all.
If done on a non-DC node, the node is discovered to be unclean and fenced, but only after a long timeout (a few minutes).

Expected results:

Recovery actions (e.g. fencing) are started after a couple of seconds.

Additional info:

When we are running sbd, stalling 'pacemaker-based' leads to immediate recovery, because sbd detects that it can't get the node state from pacemaker via the CIB (as long as sbd is used without a shared disk; with the disk, sbd would be content having access to the disk).
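For context, a minimal sketch of such a diskless (watchdog-only) sbd setup in /etc/sysconfig/sbd; the variable names are standard sbd options, but the values are illustrative, not taken from this report:

  # Watchdog-only mode: SBD_DEVICE is left unset, so sbd judges node
  # health from the node state in the CIB rather than from a shared disk.
  SBD_WATCHDOG_DEV=/dev/watchdog    # hardware watchdog device (illustrative)
  SBD_WATCHDOG_TIMEOUT=5            # seconds until the watchdog fires (illustrative)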

Comment 1 Klaus Wenninger 2019-05-08 16:23:56 UTC
Behaviour when stalling the corosync daemon is, by the way, a little
different. The remaining nodes will form a new partition with a new DC,
which then decides to fence the node whose corosync is stalled.
Of course, stalling corosync breaks the path from the CIB of the
new DC back to the CIB of the node with corosync stalled.
Thus, when using sbd with watchdog-fencing, pacemaker-watcher isn't
going to read the 'unclean' state from the CIB and thus won't trigger
self-fencing.
This is where

bz1702727 - sbd doesn't detect non-responsive corosync-daemon

comes into play.
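For context, watchdog fencing with diskless sbd is enabled through the stonith-watchdog-timeout cluster property; a common recommendation is roughly twice the sbd watchdog timeout, and the value below is only illustrative:

  pcs property set stonith-watchdog-timeout=10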

Comment 2 Patrik Hagara 2020-03-23 09:51:39 UTC
qa_ack+, reproducer in the description and comment #1

Comment 16 Ken Gaillot 2021-11-22 16:51:14 UTC
It turns out this will require changes in libqb for a full fix. This bz might end up getting bumped to 8.7, or we might implement a partial fix for 8.6.

Comment 17 Ken Gaillot 2021-12-13 22:12:10 UTC
The fix for this depends on the libqb feature in Bug 2031865, which will likely land in 9.0 but not make RHEL 8 until 8.7, so this bz is being re-targeted to 9.0.

Comment 19 Ken Gaillot 2022-01-19 23:04:27 UTC
Fixed upstream as of commit 4b60aa100
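The commit can be inspected in a clone of the upstream repository, e.g.:

  git clone https://github.com/ClusterLabs/pacemaker.git
  cd pacemaker
  git show 4b60aa100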

Comment 23 Patrik Hagara 2022-02-25 11:05:23 UTC
before
======

> [root@virt-146 ~]# rpm -q pacemaker libqb
> pacemaker-2.1.0-8.el8.x86_64
> libqb-1.0.3-12.el8.x86_64
> [root@virt-146 ~]# pcs status
> Cluster name: STSRHTS15235
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-146 (version 2.1.0-8.el8-7c3f660707) - partition with quorum
>   * Last updated: Fri Feb 25 10:44:32 2022
>   * Last change:  Fri Feb 25 10:24:02 2022 by root via cibadmin on virt-144
>   * 3 nodes configured
>   * 3 resource instances configured
> 
> Node List:
>   * Online: [ virt-144 virt-145 virt-146 ]
> 
> Full List of Resources:
>   * fence-virt-144      (stonith:fence_xvm):     Started virt-144
>   * fence-virt-145      (stonith:fence_xvm):     Started virt-145
>   * fence-virt-146      (stonith:fence_xvm):     Started virt-146
> 
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
> [root@virt-146 ~]# killall -STOP pacemakerd pacemaker-based pacemaker-fenced pacemaker-execd pacemaker-attrd pacemaker-schedulerd pacemaker-controld
> [root@virt-146 ~]# ps faux | grep pacemaker
> root       50480  0.0  0.2 134948 10236 ?        Ts   10:23   0:00 /usr/sbin/pacemakerd
> haclust+   50481  0.0  0.5 156104 22728 ?        Ts   10:23   0:00  \_ /usr/libexec/pacemaker/pacemaker-based
> root       50482  0.0  0.3 154060 15440 ?        Ts   10:23   0:00  \_ /usr/libexec/pacemaker/pacemaker-fenced
> root       50483  0.0  0.2 116964 10176 ?        Ts   10:23   0:00  \_ /usr/libexec/pacemaker/pacemaker-execd
> haclust+   50484  0.0  0.2 145064 12360 ?        Ts   10:23   0:00  \_ /usr/libexec/pacemaker/pacemaker-attrd
> haclust+   50485  0.0  0.6 160532 26332 ?        Ts   10:23   0:00  \_ /usr/libexec/pacemaker/pacemaker-schedulerd
> haclust+   50486  0.0  0.4 202912 17812 ?        Ts   10:23   0:00  \_ /usr/libexec/pacemaker/pacemaker-controld
> root       56689  0.0  0.0  25980  3352 ?        S    10:58   0:00  \_ sh -c ps faux | grep pacemaker
> root       56691  0.0  0.0  12136  1044 ?        S    10:58   0:00      \_ grep pacemaker

Result: minutes pass, the stalled DC does not get fenced, and the other nodes log nothing at all.


after
=====

> [root@virt-499 ~]# rpm -q pacemaker libqb
> pacemaker-2.1.2-4.el9.x86_64
> libqb-2.0.3-7.el9.x86_64
> [root@virt-499 ~]# pcs status
> Cluster name: STSRHTS12845
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-499 (version 2.1.2-4.el9-ada5c3b36e2) - partition with quorum
>   * Last updated: Fri Feb 25 11:21:58 2022
>   * Last change:  Fri Feb 25 10:20:41 2022 by root via cibadmin on virt-497
>   * 3 nodes configured
>   * 3 resource instances configured
> 
> Node List:
>   * Online: [ virt-497 virt-498 virt-499 ]
> 
> Full List of Resources:
>   * fence-virt-497      (stonith:fence_xvm):     Started virt-497
>   * fence-virt-498      (stonith:fence_xvm):     Started virt-498
>   * fence-virt-499      (stonith:fence_xvm):     Started virt-499
> 
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
> [root@virt-499 ~]# killall -STOP pacemakerd pacemaker-based pacemaker-fenced pacemaker-execd pacemaker-attrd pacemaker-schedulerd pacemaker-controld
> [root@virt-499 ~]# ps faux | grep pacemaker
> root       62034  0.0  0.0   6416  2208 pts/0    S+   11:22   0:00              \_ grep --color=auto pacemaker
> root       54199  0.0  0.2  32312 11580 ?        Ts   10:20   0:01 /usr/sbin/pacemakerd
> haclust+   54200  0.0  0.6  49468 24768 ?        Ts   10:20   0:01  \_ /usr/libexec/pacemaker/pacemaker-based
> root       54201  0.0  0.4  41588 17456 ?        Ts   10:20   0:01  \_ /usr/libexec/pacemaker/pacemaker-fenced
> root       54202  0.0  0.3  26632 12200 ?        Ts   10:20   0:01  \_ /usr/libexec/pacemaker/pacemaker-execd
> haclust+   54203  0.0  0.3  39464 15280 ?        Ts   10:20   0:01  \_ /usr/libexec/pacemaker/pacemaker-attrd
> haclust+   54204  0.0  0.7  62092 28464 ?        Ts   10:20   0:01  \_ /usr/libexec/pacemaker/pacemaker-schedulerd
> haclust+   54205  0.0  0.4  90128 20088 ?        Ts   10:20   0:01  \_ /usr/libexec/pacemaker/pacemaker-controld

Result: same as before the fix; the rest of the cluster does not notice that the DC is stalled.

Only after unblocking the pacemakerd process (but not the other pacemaker-{base,fence,exec,attr,scheduler,control}d daemons) using `killall -CONT pacemakerd` is the DC finally fenced, with a few seconds' delay. (Before this fix, unblocking pacemakerd on the DC had no effect; the cluster remained in the "zombie" state.)

Still, this seems like only a marginal improvement over the previous behavior.


Peeking at the code changes, I'm surprised this was implemented in a way that requires the DC's pacemakerd to be alive and well in order to detect stalls of the other pacemaker-*d daemons.

@kgaillot is there any way for the other nodes (i.e. not the DC itself) to detect that the DC's daemons are stalled? Or is the pacemakerd code considered simple enough (read: practically impossible to deadlock/stall due to, e.g., disk/network/other blocking operations)?

Comment 24 Ken Gaillot 2022-02-25 16:10:22 UTC
> Peeking at the code changes, I'm surprised this was implemented in a way
> that requires the DC's pacemakerd to be alive and well in order to detect
> stalls of the other pacemaker-*d daemons.
> 
> @kgaillot is there any way for the other nodes (i.e. not the DC
> itself) to detect that the DC's daemons are stalled? Or is the pacemakerd
> code considered simple enough (read: practically impossible to
> deadlock/stall due to, e.g., disk/network/other blocking operations)?

That's correct, this fix applies only to the subdaemons, not pacemakerd itself. The idea is that clusters can use sbd to monitor pacemakerd itself. And of course systemd will respawn pacemakerd if it crashes.
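A rough sketch of how both layers can be checked on a test node (the subdaemon name is one of those listed in comment 23; exact log wording may vary by version):

  # Stall a single subdaemon; pacemakerd should detect the unresponsive
  # IPC endpoint and recover it, per the behaviour verified above.
  killall -STOP pacemaker-execd
  journalctl -u pacemaker -f

  # pacemakerd itself is covered by systemd's restart policy:
  systemctl show pacemaker.service --property=Restart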

Comment 27 errata-xmlrpc 2022-05-17 12:20:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (new packages: pacemaker), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:2293

