Bug 1707851
| Summary: | unsatisfactory recovery from pacemaker-daemons stalled via SIGSTOP | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Klaus Wenninger <kwenning> |
| Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | medium | Priority: | high |
| Version: | 9.0 | CC: | ccaulfie, cluster-maint, jseunghw, kgaillot, msmazova, phagara |
| Target Milestone: | rc | Keywords: | Triaged |
| Target Release: | 9.0 | Hardware: | Unspecified |
| OS: | Unspecified | Doc Type: | Enhancement |
| Fixed In Version: | pacemaker-2.1.2-3.el9 | Type: | Bug |
| Clones: | 2031865 (view as bug list) | Bug Depends On: | 2031865 |
| Last Closed: | 2022-05-17 12:20:40 UTC | | |

Doc Text:

Feature: Pacemaker now monitors its component subdaemons for IPC responsiveness.

Reason: Previously, if a daemon stopped being responsive (for example, after receiving a SIGSTOP signal), the cluster might not detect any problem.

Result: Now, Pacemaker will detect unresponsive subdaemons and recover them if necessary.
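As a quick illustration of the behavior described in the Doc Text, one might stall a single subdaemon and watch pacemakerd recover it. This is a sketch only: the choice of pacemaker-execd is illustrative, and exact log messages vary by version.

    # On a node running pacemaker-2.1.2-3.el9 or later, stall one
    # Pacemaker subdaemon (the local executor here) via SIGSTOP:
    killall -STOP pacemaker-execd

    # With the fix, pacemakerd should notice the subdaemon no longer
    # answers IPC checks and recover it; follow the logs to observe:
    journalctl -fu pacemaker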
Description
Klaus Wenninger
2019-05-08 14:49:16 UTC
Behaviour with a stalled corosync daemon is, by the way, a little different. The remaining nodes will form a new partition with a new DC, which then decides to fence the node whose corosync is stalled. Of course, stalling corosync breaks the path from the new DC's CIB back to the CIB of the node with corosync stalled. Thus, when using sbd with watchdog fencing, sbd's pacemaker watcher is not going to read the 'unclean' state from the CIB and therefore won't trigger self-fencing. This is where bz1702727 ("sbd doesn't detect non-responsive corosync-daemon") comes into the game.

It turns out this will require changes in libqb for a full fix. This bz might end up getting bumped to 8.7, or we might implement a partial fix for 8.6.

The fix for this depends on the libqb feature in Bug 2031865, which will likely land in 9.0 but not make RHEL 8 until 8.7, so this bz is being re-targeted to 9.0.

Fixed upstream as of commit 4b60aa100

before
======

> [root@virt-146 ~]# rpm -q pacemaker libqb
> pacemaker-2.1.0-8.el8.x86_64
> libqb-1.0.3-12.el8.x86_64
> [root@virt-146 ~]# pcs status
> Cluster name: STSRHTS15235
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-146 (version 2.1.0-8.el8-7c3f660707) - partition with quorum
>   * Last updated: Fri Feb 25 10:44:32 2022
>   * Last change:  Fri Feb 25 10:24:02 2022 by root via cibadmin on virt-144
>   * 3 nodes configured
>   * 3 resource instances configured
>
> Node List:
>   * Online: [ virt-144 virt-145 virt-146 ]
>
> Full List of Resources:
>   * fence-virt-144 (stonith:fence_xvm): Started virt-144
>   * fence-virt-145 (stonith:fence_xvm): Started virt-145
>   * fence-virt-146 (stonith:fence_xvm): Started virt-146
>
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
> [root@virt-146 ~]# killall -STOP pacemakerd pacemaker-based pacemaker-fenced pacemaker-execd pacemaker-attrd pacemaker-schedulerd pacemaker-controld
> [root@virt-146 ~]# ps faux | grep pacemaker
> root      50480  0.0  0.2 134948 10236 ?  Ts  10:23  0:00 /usr/sbin/pacemakerd
> haclust+  50481  0.0  0.5 156104 22728 ?  Ts  10:23  0:00  \_ /usr/libexec/pacemaker/pacemaker-based
> root      50482  0.0  0.3 154060 15440 ?  Ts  10:23  0:00  \_ /usr/libexec/pacemaker/pacemaker-fenced
> root      50483  0.0  0.2 116964 10176 ?  Ts  10:23  0:00  \_ /usr/libexec/pacemaker/pacemaker-execd
> haclust+  50484  0.0  0.2 145064 12360 ?  Ts  10:23  0:00  \_ /usr/libexec/pacemaker/pacemaker-attrd
> haclust+  50485  0.0  0.6 160532 26332 ?  Ts  10:23  0:00  \_ /usr/libexec/pacemaker/pacemaker-schedulerd
> haclust+  50486  0.0  0.4 202912 17812 ?  Ts  10:23  0:00  \_ /usr/libexec/pacemaker/pacemaker-controld
> root      56689  0.0  0.0  25980  3352 ?  S   10:58  0:00  \_ sh -c ps faux | grep pacemaker
> root      56691  0.0  0.0  12136  1044 ?  S   10:58  0:00      \_ grep pacemaker

result: minutes pass, the stalled DC does not get fenced, and the other nodes log nothing at all.
after
=====

> [root@virt-499 ~]# rpm -q pacemaker libqb
> pacemaker-2.1.2-4.el9.x86_64
> libqb-2.0.3-7.el9.x86_64
> [root@virt-499 ~]# pcs status
> Cluster name: STSRHTS12845
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-499 (version 2.1.2-4.el9-ada5c3b36e2) - partition with quorum
>   * Last updated: Fri Feb 25 11:21:58 2022
>   * Last change:  Fri Feb 25 10:20:41 2022 by root via cibadmin on virt-497
>   * 3 nodes configured
>   * 3 resource instances configured
>
> Node List:
>   * Online: [ virt-497 virt-498 virt-499 ]
>
> Full List of Resources:
>   * fence-virt-497 (stonith:fence_xvm): Started virt-497
>   * fence-virt-498 (stonith:fence_xvm): Started virt-498
>   * fence-virt-499 (stonith:fence_xvm): Started virt-499
>
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
> [root@virt-499 ~]# killall -STOP pacemakerd pacemaker-based pacemaker-fenced pacemaker-execd pacemaker-attrd pacemaker-schedulerd pacemaker-controld
> [root@virt-499 ~]# ps faux | grep pacemaker
> root      62034  0.0  0.0   6416  2208 pts/0  S+  11:22  0:00  \_ grep --color=auto pacemaker
> root      54199  0.0  0.2  32312 11580 ?      Ts  10:20  0:01 /usr/sbin/pacemakerd
> haclust+  54200  0.0  0.6  49468 24768 ?      Ts  10:20  0:01  \_ /usr/libexec/pacemaker/pacemaker-based
> root      54201  0.0  0.4  41588 17456 ?      Ts  10:20  0:01  \_ /usr/libexec/pacemaker/pacemaker-fenced
> root      54202  0.0  0.3  26632 12200 ?      Ts  10:20  0:01  \_ /usr/libexec/pacemaker/pacemaker-execd
> haclust+  54203  0.0  0.3  39464 15280 ?      Ts  10:20  0:01  \_ /usr/libexec/pacemaker/pacemaker-attrd
> haclust+  54204  0.0  0.7  62092 28464 ?      Ts  10:20  0:01  \_ /usr/libexec/pacemaker/pacemaker-schedulerd
> haclust+  54205  0.0  0.4  90128 20088 ?      Ts  10:20  0:01  \_ /usr/libexec/pacemaker/pacemaker-controld

result: same as before the fix; the rest of the cluster does not notice that the DC is stalled.

Only after unblocking the pacemakerd process (but not the other pacemaker-{base,fence,exec,attr,scheduler,control}d daemons) using `killall -CONT pacemakerd` is the DC finally fenced, with a delay of a few seconds. (Before this fix, unblocking pacemakerd on the DC had no effect; the cluster remained in the "zombie" state.)

Still, this seems like only a marginal improvement compared to the previous behavior. Peeking at the code changes, I'm surprised this was implemented in a way that requires the DC's pacemakerd to be alive and well in order to detect stalls of the other pacemaker-*d daemons.

@kgaillot, is there any way for the other nodes (i.e. not the DC itself) to detect that the DC's daemons are stalled? Or is the pacemakerd code considered simple enough (read: practically impossible to deadlock/stall due to e.g. disk/network/other blocking operations)?

> peeking at the code changes, i'm surprised this was implemented in a way
> that the DC's pacemakerd must be alive & well in order to detect the other
> pacemaker-*d daemon stalls.
>
> @kgaillot is there any way for the other nodes (ie. not the DC
> itself) to detect that the DC's daemons are stalled? or is the pacemakerd
> code considered simple enough (read: practically impossible to
> deadlock/stall due to eg. disk/network/other blocking operations)?
That's correct: this fix applies only to the subdaemons, not to pacemakerd itself. The idea is that clusters can use sbd to monitor pacemakerd, and of course systemd will respawn pacemakerd if it crashes.
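For completeness, a minimal sketch of the sbd side of this, assuming the sbd package is installed and a hardware watchdog device is available. The option names are from /etc/sysconfig/sbd; the timeout values are illustrative, and the general idea is that if sbd stops getting a healthy response from the local Pacemaker, it stops feeding the watchdog and the node self-fences.

    # /etc/sysconfig/sbd -- watchdog-based self-fencing:
    SBD_WATCHDOG_DEV=/dev/watchdog
    SBD_WATCHDOG_TIMEOUT=5
    # Let sbd's pacemaker watcher track the local Pacemaker instance:
    SBD_PACEMAKER=yes

    # Cluster property so the DC assumes a lost node has self-fenced
    # via its watchdog after this long:
    pcs property set stonith-watchdog-timeout=10s

On RHEL, `pcs stonith sbd enable` can set up most of this automatically.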
Moving to verified as per https://bugzilla.redhat.com/show_bug.cgi?id=1707851#c23 and https://bugzilla.redhat.com/show_bug.cgi?id=1707851#c24.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (new packages: pacemaker), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:2293