Bug 1725236
| Summary: | Pacemaker does not recover from the scheduler not replying to the controller | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Ken Gaillot <kgaillot> |
| Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | | |
| Version: | 8.0 | CC: | cluster-maint, cluster-qe, kgaillot, knoha, mnovacek, msmazova, phagara, pzimek, sbradley |
| Target Milestone: | pre-dev-freeze | Flags: | pm-rhel: mirror+ |
| Target Release: | 8.2 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | pacemaker-2.0.3-1.el8 | Doc Type: | Bug Fix |
| Doc Text: | Cause: When Pacemaker's controller process sent a request to the scheduler process, it waited indefinitely for a response.<br>Consequence: If the scheduler was unable to reply (for example, if the kernel had blocked it), the cluster was unable to take any further action.<br>Fix: The controller now puts a 2-minute timeout on the scheduler's response.<br>Result: If the scheduler cannot reply, the controller will exit after 2 minutes, allowing the rest of the cluster to fence the node and proceed with recovery. | Story Points: | --- |
| Clone Of: | 1596125 | Environment: | |
| Last Closed: | 2020-04-28 15:38:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1596125, 1752538 | | |
| Bug Blocks: | 1707454, 1776957 | | |
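The fix described in the Doc Text follows a common pattern: bound the wait for an IPC reply and treat expiry as fatal, so that the node's peers can fence it and recover. Below is a minimal, hypothetical sketch of that pattern only — it is not Pacemaker source, and the names (`await_scheduler_reply`, the queue standing in for the IPC channel) are illustrative:

```python
import queue

# Hypothetical sketch of the fix's shape (not Pacemaker source):
# instead of blocking forever on the scheduler's reply, wait with a
# deadline and treat expiry as a fatal condition.

def await_scheduler_reply(replies: "queue.Queue[str]", deadline_s: float) -> str:
    """Wait up to deadline_s seconds for a reply; report a timeout otherwise."""
    try:
        return replies.get(timeout=deadline_s)
    except queue.Empty:
        # In the actual fix, the controller exits at this point (after
        # 2 minutes), which lets the rest of the cluster fence the node
        # and proceed with recovery.
        return "timeout"

if __name__ == "__main__":
    q: "queue.Queue[str]" = queue.Queue()
    # No reply ever arrives, simulating a scheduler blocked by the kernel.
    print(await_scheduler_reply(q, 0.05))  # -> timeout
```

The key design point, as stated in the Doc Text, is that the timeout converts a silent hang into a detectable failure: a dead controller is something the surviving nodes already know how to handle (fencing), whereas an indefinitely blocked one is not.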
Description by Ken Gaillot, 2019-06-28 19:02:13 UTC
Fixed upstream by commit 8ba3075

After fix:

```
[root@duck-01 11:44:14 ~]# rpm -q pacemaker
pacemaker-2.0.3-4.el8.x86_64

[root@duck-01 11:44:47 ~]# pcs status
Cluster name: duck
Cluster Summary:
  * Stack: corosync
  * Current DC: duck-02 (version 2.0.3-4.el8-4b1f869f0f) - partition with quorum
  * Last updated: Wed Jan 29 11:45:02 2020
  * Last change: Wed Jan 29 11:43:53 2020 by root via cibadmin on duck-01
  * 3 nodes configured
  * 12 resource instances configured

Node List:
  * Online: [ duck-01 duck-02 duck-03 ]

Full List of Resources:
  * fencing-duck01 (stonith:fence_ipmilan): Started duck-01
  * fencing-duck02 (stonith:fence_ipmilan): Started duck-02
  * fencing-duck03 (stonith:fence_ipmilan): Started duck-03
  * Clone Set: locking-clone [locking]:
    * Started: [ duck-01 duck-02 duck-03 ]
  * first (ocf::pacemaker:Dummy): Started duck-01
  * second (ocf::pacemaker:Dummy): Started duck-02
  * third (ocf::pacemaker:Dummy): Started duck-03

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
```

Current DC is duck-02.

```
[root@duck-01 11:46:24 ~]# qarsh duck-02 'echo quiesce > /sys/class/block/sda/device/state'
[root@duck-01 11:48:08 ~]# qarsh duck-02 'rm -f /run/Dummy-second.state'
```

Resource on node duck-02 is marked as failed.
```
[root@duck-01 11:49:46 ~]# pcs status
Cluster name: duck
Cluster Summary:
  * Stack: corosync
  * Current DC: duck-02 (version 2.0.3-4.el8-4b1f869f0f) - partition with quorum
  * Last updated: Wed Jan 29 11:49:49 2020
  * Last change: Wed Jan 29 11:43:53 2020 by root via cibadmin on duck-01
  * 3 nodes configured
  * 12 resource instances configured

Node List:
  * Online: [ duck-01 duck-02 duck-03 ]

Full List of Resources:
  * fencing-duck01 (stonith:fence_ipmilan): Started duck-01
  * fencing-duck02 (stonith:fence_ipmilan): Started duck-02
  * fencing-duck03 (stonith:fence_ipmilan): Started duck-03
  * Clone Set: locking-clone [locking]:
    * Started: [ duck-01 duck-02 duck-03 ]
  * first (ocf::pacemaker:Dummy): Started duck-01
  * second (ocf::pacemaker:Dummy): FAILED duck-02
  * third (ocf::pacemaker:Dummy): Started duck-03

Failed Resource Actions:
  * second_monitor_10000 on duck-02 'not running' (7): call=39, status='complete', exitreason='', last-rc-change='2020-01-29 11:48:22 +01:00', queued=0ms, exec=0ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
```

Two minutes later, node duck-02 is fenced.
```
[root@duck-03 11:50:46 ~]# crm_mon -m 3
Cluster Summary:
  * Stack: corosync
  * Current DC: duck-01 (version 2.0.3-4.el8-4b1f869f0f) - partition with quorum
  * Last updated: Wed Jan 29 11:51:41 2020
  * Last change: Wed Jan 29 11:43:53 2020 by root via cibadmin on duck-01
  * 3 nodes configured
  * 12 resource instances configured

Node List:
  * Online: [ duck-01 duck-03 ]
  * OFFLINE: [ duck-02 ]

Active Resources:
  * fencing-duck01 (stonith:fence_ipmilan): Started duck-01
  * fencing-duck02 (stonith:fence_ipmilan): Started duck-03
  * fencing-duck03 (stonith:fence_ipmilan): Started duck-03
  * Clone Set: locking-clone [locking]:
    * Started: [ duck-01 duck-03 ]
  * first (ocf::pacemaker:Dummy): Started duck-01
  * second (ocf::pacemaker:Dummy): Started duck-01
  * third (ocf::pacemaker:Dummy): Started duck-03

Fencing History:
  * reboot of duck-02 successful: delegate=duck-03, client=pacemaker-controld.14440, origin=duck-01, completed='2020-01-29 11:50:35 +01:00'
```

After a few minutes, node duck-02 is rebooted and the resource is restarted.
```
[root@duck-01 11:59:13 ~]# pcs status
Cluster name: duck
Cluster Summary:
  * Stack: corosync
  * Current DC: duck-01 (version 2.0.3-4.el8-4b1f869f0f) - partition with quorum
  * Last updated: Wed Jan 29 11:59:20 2020
  * Last change: Wed Jan 29 11:43:53 2020 by root via cibadmin on duck-01
  * 3 nodes configured
  * 12 resource instances configured

Node List:
  * Online: [ duck-01 duck-02 duck-03 ]

Full List of Resources:
  * fencing-duck01 (stonith:fence_ipmilan): Started duck-01
  * fencing-duck02 (stonith:fence_ipmilan): Started duck-03
  * fencing-duck03 (stonith:fence_ipmilan): Started duck-02
  * Clone Set: locking-clone [locking]:
    * Started: [ duck-01 duck-02 duck-03 ]
  * first (ocf::pacemaker:Dummy): Started duck-01
  * second (ocf::pacemaker:Dummy): Started duck-02
  * third (ocf::pacemaker:Dummy): Started duck-03

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
```

log excerpts:

```
Jan 29 11:48:22 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-based [14435] (cib_perform_op) info: ++ /cib/status/node_state[@id='2']/transient_attributes[@id='2']/instance_attributes[@id='status-2']: <nvpair id="status-2-last-failure-second.monitor_10000" name="last-failure-second#monitor_10000" value="1580294902"/>
Jan 29 11:48:22 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-based [14435] (cib_process_request) info: Completed cib_modify operation for section status: OK (rc=0, origin=duck-01/attrd/6, version=0.16.13)
Jan 29 11:48:27 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-based [14435] (cib_process_ping) info: Reporting our current digest to duck-02: b7bd9edb459302e6699948e555eaa0ac for 0.16.13 (0x55eb19aa2950 0)

[... two minutes later...]

Jan 29 11:50:22 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [14440] (pcmk_cpg_membership) info: Group crmd event 3: duck-02 (node 2 pid 14435) left for unknown reason
Jan 29 11:50:22 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [14440] (crm_update_peer_proc) info: pcmk_cpg_membership: Node duck-02[2] - corosync-cpg is now offline
Jan 29 11:50:22 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [14440] (peer_update_callback) info: Node duck-02 is no longer a peer | DC=duck-02 old=0x4000000 new=0x0000000
Jan 29 11:50:22 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [14440] (peer_update_callback) notice: Our peer on the DC (duck-02) is dead
Jan 29 11:50:22 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [14440] (update_dc) info: Unset DC. Was duck-02
Jan 29 11:50:24 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-fenced [14436] (initiate_remote_stonith_op) notice: Requesting peer fencing (reboot) targeting duck-02 | id=e06fd919-56db-4e3e-9bcf-bdd39f3e6e1b state=0
Jan 29 11:50:24 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-fenced [14436] (can_fence_host_with_device) notice: fencing-duck02 is eligible to fence (reboot) duck-02: static-list
Jan 29 11:50:31 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-based [14435] (pcmk_cpg_membership) info: Group cib event 3: duck-02 (node 2 pid 14430) left via cluster exit
Jan 29 11:50:31 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-based [14435] (crm_update_peer_proc) info: pcmk_cpg_membership: Node duck-02[2] - corosync-cpg is now offline
Jan 29 11:50:31 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemakerd [14434] (pcmk_cpg_membership) info: Group pacemakerd event 3: duck-02 (node 2 pid 14429) left via cluster exit
Jan 29 11:50:31 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemakerd [14434] (crm_update_peer_proc) info: pcmk_cpg_membership: Node duck-02[2] - corosync-cpg is now offline
Jan 29 11:50:31 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-fenced [14436] (pcmk_cpg_membership) info: Group stonith-ng event 3: duck-02 (node 2 pid 14431) left via cluster exit
Jan 29 11:50:31 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-fenced [14436] (crm_update_peer_proc) info: pcmk_cpg_membership: Node duck-02[2] - corosync-cpg is now offline
Jan 29 11:50:31 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [14440] (peer_update_callback) info: Cluster node duck-02 is now lost (was member)

[...three minutes later...]

Jan 29 11:53:19 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [14440] (peer_update_callback) info: Cluster node duck-02 is now member (was lost)
Jan 29 11:53:20 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemakerd [14434] (pcmk_cpg_membership) info: Group pacemakerd event 4: duck-02 (node 2 pid 2196) is member
Jan 29 11:53:20 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemakerd [14434] (crm_update_peer_proc) info: pcmk_cpg_membership: Node duck-02[2] - corosync-cpg is now online
Jan 29 11:53:20 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-fenced [14436] (pcmk_cpg_membership) info: Group stonith-ng event 4: duck-02 (node 2 pid 2274) is member
Jan 29 11:53:20 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-fenced [14436] (crm_update_peer_proc) info: pcmk_cpg_membership: Node duck-02[2] - corosync-cpg is now online
Jan 29 11:53:20 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-based [14435] (pcmk_cpg_membership) info: Group cib event 4: duck-02 (node 2 pid 2273) is member
Jan 29 11:53:20 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-based [14435] (crm_update_peer_proc) info: pcmk_cpg_membership: Node duck-02[2] - corosync-cpg is now online
Jan 29 11:53:21 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [14440] (pcmk_cpg_membership) info: Group crmd event 4: duck-02 (node 2 pid 2280) is member
Jan 29 11:53:21 duck-01.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [14440] (crm_update_peer_proc) info: pcmk_cpg_membership: Node duck-02[2] - corosync-cpg is now online
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:1609