Bug 1725236 - Pacemaker does not recover from the scheduler not replying to the controller
Summary: Pacemaker does not recover from the scheduler not replying to the controller
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: pacemaker
Version: 8.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: pre-dev-freeze
Target Release: 8.2
Assignee: Ken Gaillot
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On: 1596125 1752538
Blocks: 1707454 1776957
 
Reported: 2019-06-28 19:02 UTC by Ken Gaillot
Modified: 2021-07-10 14:02 UTC
CC List: 9 users

Fixed In Version: pacemaker-2.0.3-1.el8
Doc Type: Bug Fix
Doc Text:
Cause: When Pacemaker's controller process sent a request to the scheduler process, it waited indefinitely for a response.
Consequence: If the scheduler was unable to reply (for example, because the kernel had blocked it), the cluster could not take any further action.
Fix: The controller now puts a 2-minute timeout on the scheduler's response.
Result: If the scheduler cannot reply, the controller exits after 2 minutes, allowing the rest of the cluster to fence the node and proceed with recovery.
Clone Of: 1596125
Environment:
Last Closed: 2020-04-28 15:38:28 UTC
Type: Bug
Target Upstream Version:
Embargoed:




Links:
  Red Hat Product Errata RHEA-2020:1609 (last updated 2020-04-28 15:39:10 UTC)

Description Ken Gaillot 2019-06-28 19:02:13 UTC
+++ This bug was initially created as a clone of Bug #1596125 +++

--- Additional comment from Ken Gaillot on 2019-06-28 18:59:22 UTC ---

The QA failure is a related but separate issue. During that incident, the pengine got paused shortly after the disk was quiesced and remained paused until the disk was active again. The pengine was functioning long enough for the crmd to successfully send it the CIB for processing, but then got paused and never replied. Surprisingly, the crmd currently doesn't have any timeout on the pengine's response, which is why no recovery happens.
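
To make the fix concrete: the pattern is a response timer armed when the request is sent and cancelled when the reply arrives. Below is a minimal, hypothetical C sketch using a GLib main-loop timer (Pacemaker's daemons are GLib-based); it illustrates the technique under those assumptions and is not the actual code of commit 8ba3075. If no reply arrives within 2 minutes, the process exits so the surviving nodes can fence it and recover.

>   /* Hypothetical sketch, not Pacemaker's actual code: arm a response
>    * timer when sending an IPC request, mirroring how the fix caps the
>    * controller's wait for the scheduler at 2 minutes.
>    * Build with: gcc sketch.c $(pkg-config --cflags --libs glib-2.0)
>    */
>   #include <glib.h>
>   #include <stdio.h>
>   #include <stdlib.h>
>
>   #define RESPONSE_TIMEOUT_MS (2 * 60 * 1000)  /* the fix's 2-minute cap */
>
>   static guint reply_timer = 0;
>
>   /* Fires only if no reply arrived in time: exit so the rest of the
>    * cluster can fence this node and proceed with recovery. */
>   static gboolean
>   reply_timed_out(gpointer data)
>   {
>       fprintf(stderr, "No scheduler reply within 2 minutes; exiting\n");
>       exit(EXIT_FAILURE);
>   }
>
>   /* Simulated reply handler: cancel the pending timer, then carry on. */
>   static gboolean
>   deliver_reply(gpointer data)
>   {
>       if (reply_timer != 0) {
>           g_source_remove(reply_timer);
>           reply_timer = 0;
>       }
>       printf("Reply received; processing would continue normally\n");
>       g_main_loop_quit((GMainLoop *) data);
>       return G_SOURCE_REMOVE;
>   }
>
>   int
>   main(void)
>   {
>       GMainLoop *loop = g_main_loop_new(NULL, FALSE);
>
>       /* "Send" the request, then arm the response timer. */
>       reply_timer = g_timeout_add(RESPONSE_TIMEOUT_MS, reply_timed_out, NULL);
>
>       /* Simulate a reply after 1 second; remove this line to watch the
>        * timeout path fire instead (after 2 minutes). */
>       g_timeout_add(1000, deliver_reply, loop);
>
>       g_main_loop_run(loop);
>       g_main_loop_unref(loop);
>       return 0;
>   }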

Comment 1 Ken Gaillot 2019-07-11 19:23:17 UTC
Fixed upstream by commit 8ba3075

Comment 5 Markéta Smazová 2020-01-29 16:01:50 UTC
After the fix:

>   [root@duck-01 11:44:14 ~]# rpm -q pacemaker
>   pacemaker-2.0.3-4.el8.x86_64


>   [root@duck-01 11:44:47 ~]# pcs status
>   Cluster name: duck
>   Cluster Summary:
>     * Stack: corosync
>     * Current DC: duck-02 (version 2.0.3-4.el8-4b1f869f0f) - partition with quorum
>     * Last updated: Wed Jan 29 11:45:02 2020
>     * Last change:  Wed Jan 29 11:43:53 2020 by root via cibadmin on duck-01
>     * 3 nodes configured
>     * 12 resource instances configured

>   Node List:
>     * Online: [ duck-01 duck-02 duck-03 ]

>   Full List of Resources:
>     * fencing-duck01	(stonith:fence_ipmilan):	Started duck-01
>     * fencing-duck02	(stonith:fence_ipmilan):	Started duck-02
>     * fencing-duck03	(stonith:fence_ipmilan):	Started duck-03
>     * Clone Set: locking-clone [locking]:
>       * Started: [ duck-01 duck-02 duck-03 ]
>     * first	(ocf::pacemaker:Dummy):	Started duck-01
>     * second	(ocf::pacemaker:Dummy):	Started duck-02
>     * third	(ocf::pacemaker:Dummy):	Started duck-03

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled

Current DC is duck-02. As in the original incident, the DC's disk is quiesced to block the scheduler, and the Dummy resource's state file is removed so that its next monitor reports a failure requiring a scheduler run.

>   [root@duck-01 11:46:24 ~]# qarsh duck-02 'echo quiesce > /sys/class/block/sda/device/state'
>   [root@duck-01 11:48:08 ~]# qarsh duck-02 'rm -f /run/Dummy-second.state'

The resource on node duck-02 is marked as FAILED.

>   [root@duck-01 11:49:46 ~]# pcs status
>   Cluster name: duck
>   Cluster Summary:
>     * Stack: corosync
>     * Current DC: duck-02 (version 2.0.3-4.el8-4b1f869f0f) - partition with quorum
>     * Last updated: Wed Jan 29 11:49:49 2020
>     * Last change:  Wed Jan 29 11:43:53 2020 by root via cibadmin on duck-01
>     * 3 nodes configured
>     * 12 resource instances configured

>   Node List:
>     * Online: [ duck-01 duck-02 duck-03 ]

>   Full List of Resources:
>     * fencing-duck01	(stonith:fence_ipmilan):	Started duck-01
>     * fencing-duck02	(stonith:fence_ipmilan):	Started duck-02
>     * fencing-duck03	(stonith:fence_ipmilan):	Started duck-03
>     * Clone Set: locking-clone [locking]:
>       * Started: [ duck-01 duck-02 duck-03 ]
>     * first	(ocf::pacemaker:Dummy):	Started duck-01
>     * second	(ocf::pacemaker:Dummy):	 FAILED duck-02
>     * third	(ocf::pacemaker:Dummy):	Started duck-03

>   Failed Resource Actions:
>     * second_monitor_10000 on duck-02 'not running' (7): call=39, status='complete', exitreason='', last-rc-change='2020-01-29 11:48:22 +01:00', queued=0ms, exec=0ms

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled

Two minutes later, the controller's response timeout expires and node duck-02 is fenced.

>   [root@duck-03 11:50:46 ~]# crm_mon -m 3

>   Cluster Summary:
>     * Stack: corosync
>     * Current DC:	duck-01 (version 2.0.3-4.el8-4b1f869f0f) - partition with quorum
>     * Last updated: Wed Jan 29 11:51:41 2020
>     * Last change:  Wed Jan 29 11:43:53 2020 by root via cibadmin on duck-01
>     * 3 nodes configured
>     * 12 resource instances configured

>   Node List:
>     * Online: [ duck-01 duck-03 ]
>     * OFFLINE: [ duck-02 ]

>   Active Resources:
>     * fencing-duck01	(stonith:fence_ipmilan):        Started duck-01
>     * fencing-duck02	(stonith:fence_ipmilan):        Started duck-03
>     * fencing-duck03	(stonith:fence_ipmilan):        Started duck-03
>     * Clone Set: locking-clone [locking]:
>       * Started: [ duck-01 duck-03 ]
>     * first	(ocf::pacemaker:Dummy): Started duck-01
>     * second	(ocf::pacemaker:Dummy): Started duck-01
>     * third	(ocf::pacemaker:Dummy): Started duck-03

>   Fencing History:
>     * reboot of duck-02 successful: delegate=duck-03, client=pacemaker-controld.14440, origin=duck-01, completed='2020-01-29 11:50:35 +01:00'

A few minutes later, the rebooted node duck-02 rejoins the cluster and the resource is restarted.
>   [root@duck-01 11:59:13 ~]# pcs status
>   Cluster name: duck
>   Cluster Summary:
>     * Stack: corosync
>     * Current DC: duck-01 (version 2.0.3-4.el8-4b1f869f0f) - partition with quorum
>     * Last updated: Wed Jan 29 11:59:20 2020
>     * Last change:  Wed Jan 29 11:43:53 2020 by root via cibadmin on duck-01
>     * 3 nodes configured
>     * 12 resource instances configured

>   Node List:
>     * Online: [ duck-01 duck-02 duck-03 ]

>   Full List of Resources:
>     * fencing-duck01	(stonith:fence_ipmilan):	Started duck-01
>     * fencing-duck02	(stonith:fence_ipmilan):	Started duck-03
>     * fencing-duck03	(stonith:fence_ipmilan):	Started duck-02
>     * Clone Set: locking-clone [locking]:
>       * Started: [ duck-01 duck-02 duck-03 ]
>     * first	(ocf::pacemaker:Dummy):	Started duck-01
>     * second	(ocf::pacemaker:Dummy):	Started duck-02
>     * third	(ocf::pacemaker:Dummy):	Started duck-03

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled
Log excerpts:

>   Jan 29 11:48:22 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-based     [14435] (cib_perform_op) 	info: ++ /cib/status/node_state[@id='2']/transient_attributes[@id='2']/instance_attributes[@id='status-2']:  <nvpair id="status-2-last-failure-second.monitor_10000" name="last-failure-second#monitor_10000" value="1580294902"/>
>   Jan 29 11:48:22 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-based     [14435] (cib_process_request) 	info: Completed cib_modify operation for section status: OK (rc=0, origin=duck-01/attrd/6, version=0.16.13)
>   Jan 29 11:48:27 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-based     [14435] (cib_process_ping) 	info: Reporting our current digest to duck-02: b7bd9edb459302e6699948e555eaa0ac for 0.16.13 (0x55eb19aa2950 0)

[...two minutes later...]

>   Jan 29 11:50:22 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld  [14440] (pcmk_cpg_membership) 	info: Group crmd event 3: duck-02 (node 2 pid 14435) left for unknown reason
>   Jan 29 11:50:22 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld  [14440] (crm_update_peer_proc) 	info: pcmk_cpg_membership: Node duck-02[2] - corosync-cpg is now offline
>   Jan 29 11:50:22 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld  [14440] (peer_update_callback) 	info: Node duck-02 is no longer a peer | DC=duck-02 old=0x4000000 new=0x0000000
>   Jan 29 11:50:22 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld  [14440] (peer_update_callback) 	notice: Our peer on the DC (duck-02) is dead
>   Jan 29 11:50:22 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld  [14440] (update_dc) 	info: Unset DC. Was duck-02

>   Jan 29 11:50:24 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-fenced    [14436] (initiate_remote_stonith_op) 	notice: Requesting peer fencing (reboot) targeting duck-02 | id=e06fd919-56db-4e3e-9bcf-bdd39f3e6e1b state=0
>   Jan 29 11:50:24 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-fenced    [14436] (can_fence_host_with_device) 	notice: fencing-duck02 is eligible to fence (reboot) duck-02: static-list
>   Jan 29 11:50:31 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-based     [14435] (pcmk_cpg_membership) 	info: Group cib event 3: duck-02 (node 2 pid 14430) left via cluster exit
>   Jan 29 11:50:31 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-based     [14435] (crm_update_peer_proc) 	info: pcmk_cpg_membership: Node duck-02[2] - corosync-cpg is now offline
>   Jan 29 11:50:31 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemakerd          [14434] (pcmk_cpg_membership) 	info: Group pacemakerd event 3: duck-02 (node 2 pid 14429) left via cluster exit
>   Jan 29 11:50:31 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemakerd          [14434] (crm_update_peer_proc) 	info: pcmk_cpg_membership: Node duck-02[2] - corosync-cpg is now offline
>   Jan 29 11:50:31 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-fenced    [14436] (pcmk_cpg_membership) 	info: Group stonith-ng event 3: duck-02 (node 2 pid 14431) left via cluster exit
>   Jan 29 11:50:31 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-fenced    [14436] (crm_update_peer_proc) 	info: pcmk_cpg_membership: Node duck-02[2] - corosync-cpg is now offline
>   Jan 29 11:50:31 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld  [14440] (peer_update_callback) 	info: Cluster node duck-02 is now lost (was member)

[...three minutes later...]

>   Jan 29 11:53:19 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld  [14440] (peer_update_callback) 	info: Cluster node duck-02 is now member (was lost)
>   Jan 29 11:53:20 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemakerd          [14434] (pcmk_cpg_membership) 	info: Group pacemakerd event 4: duck-02 (node 2 pid 2196) is member
>   Jan 29 11:53:20 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemakerd          [14434] (crm_update_peer_proc) 	info: pcmk_cpg_membership: Node duck-02[2] - corosync-cpg is now online
>   Jan 29 11:53:20 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-fenced    [14436] (pcmk_cpg_membership) 	info: Group stonith-ng event 4: duck-02 (node 2 pid 2274) is member
>   Jan 29 11:53:20 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-fenced    [14436] (crm_update_peer_proc) 	info: pcmk_cpg_membership: Node duck-02[2] - corosync-cpg is now online
>   Jan 29 11:53:20 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-based     [14435] (pcmk_cpg_membership) 	info: Group cib event 4: duck-02 (node 2 pid 2273) is member
>   Jan 29 11:53:20 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-based     [14435] (crm_update_peer_proc) 	info: pcmk_cpg_membership: Node duck-02[2] - corosync-cpg is now online
>   Jan 29 11:53:21 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld  [14440] (pcmk_cpg_membership) 	info: Group crmd event 4: duck-02 (node 2 pid 2280) is member
>   Jan 29 11:53:21 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld  [14440] (crm_update_peer_proc) 	info: pcmk_cpg_membership: Node duck-02[2] - corosync-cpg is now online

Comment 7 errata-xmlrpc 2020-04-28 15:38:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:1609

