Bug 1956687 - Pacemaker shuts down if a node is in the CPG membership when a stonith action against it completes
Summary: Pacemaker shuts down if a node is in the CPG membership when a stonith action...
Keywords:
Status: NEW
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: pacemaker
Version: 8.3
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: beta
Assignee: Ken Gaillot
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-05-04 08:44 UTC by Reid Wahl
Modified: 2021-05-04 14:08 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Target Upstream Version:


Attachments

Description Reid Wahl 2021-05-04 08:44:51 UTC
Description of problem:

This issue is closely related to BZ1941108. This one might be trickier to address. I've included a workaround at the bottom of this description.

If a node is still in the CPG membership when a stonith action against it completes, and the node reboots and starts corosync before a new membership is formed without it, then its pacemaker process shuts itself down. A high token timeout makes this more likely to occur.

In the scenario below, node 1 (8-0-23) fences node 2 (8-0-24) due to token loss. Node 2 rejoins the membership after fencing is initiated but before it reboots. When the reboot succeeds, node 1 tries to send the remote_op_done message to the whole CPG membership but gets stuck trying to send to node 2, which is unresponsive, so the stonith operation hangs. When node 2 boots up and starts corosync and pacemaker, it receives the message, detects that it was just fenced, and shuts its cluster stack down. Manual intervention is then required to bring it back online.

May  2 23:32:51 fastvm-rhel-8-0-23 corosync[1722]:  [TOTEM ] A processor failed, forming new configuration.
May  2 23:33:15 fastvm-rhel-8-0-23 corosync[1722]:  [TOTEM ] A new membership (1.116da) was formed. Members left: 2
May  2 23:33:15 fastvm-rhel-8-0-23 corosync[1722]:  [TOTEM ] Failed to receive the leave message. failed: 2
May  2 23:33:15 fastvm-rhel-8-0-23 corosync[1722]:  [CPG   ] downlist left_list: 1 received
May  2 23:33:15 fastvm-rhel-8-0-23 corosync[1722]:  [QUORUM] Members[1]: 1
May  2 23:33:15 fastvm-rhel-8-0-23 corosync[1722]:  [MAIN  ] Completed service synchronization, ready to provide service.
May  2 23:33:16 fastvm-rhel-8-0-23 pacemaker-schedulerd[1739]: warning: Cluster node node2 will be fenced: peer is no longer part of the cluster
...
May  2 23:33:17 fastvm-rhel-8-0-23 corosync[1722]:  [KNET  ] rx: host: 2 link: 0 is up
May  2 23:33:17 fastvm-rhel-8-0-23 corosync[1722]:  [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
May  2 23:33:17 fastvm-rhel-8-0-23 corosync[1722]:  [TOTEM ] A new membership (1.116de) was formed. Members joined: 2
May  2 23:33:17 fastvm-rhel-8-0-23 corosync[1722]:  [CPG   ] downlist left_list: 1 received
May  2 23:33:17 fastvm-rhel-8-0-23 corosync[1722]:  [CPG   ] downlist left_list: 0 received
May  2 23:33:17 fastvm-rhel-8-0-23 corosync[1722]:  [QUORUM] Members[2]: 1 2
May  2 23:33:17 fastvm-rhel-8-0-23 corosync[1722]:  [MAIN  ] Completed service synchronization, ready to provide service.
...
May  2 23:33:37 fastvm-rhel-8-0-23 pacemaker-fenced[1736]: notice: Operation 'reboot' [34077] (call 24 from pacemaker-controld.1740) targeting node2 using xvm2 returned 0 (OK)
...
May  2 23:33:47 fastvm-rhel-8-0-23 corosync[1722]:  [TOTEM ] Token has not been received in 15000 ms
May  2 23:33:52 fastvm-rhel-8-0-23 corosync[1722]:  [TOTEM ] A processor failed, forming new configuration.
...
May  2 23:34:02 fastvm-rhel-8-0-23 corosync[1722]:  [KNET  ] rx: host: 2 link: 0 is up
May  2 23:34:02 fastvm-rhel-8-0-23 corosync[1722]:  [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
May  2 23:34:02 fastvm-rhel-8-0-23 corosync[1722]:  [TOTEM ] A new membership (1.116e7) was formed. Members joined: 2 left: 2
May  2 23:34:02 fastvm-rhel-8-0-23 corosync[1722]:  [TOTEM ] Failed to receive the leave message. failed: 2
May  2 23:34:02 fastvm-rhel-8-0-23 corosync[1722]:  [CPG   ] downlist left_list: 0 received
May  2 23:34:02 fastvm-rhel-8-0-23 corosync[1722]:  [CPG   ] downlist left_list: 1 received
...
May  2 23:34:02 fastvm-rhel-8-0-23 corosync[1722]:  [QUORUM] Members[2]: 1 2
May  2 23:34:02 fastvm-rhel-8-0-23 corosync[1722]:  [MAIN  ] Completed service synchronization, ready to provide service.


May 02 23:33:50 [1086] fastvm-rhel-8-0-24 corosync notice  [MAIN  ] Corosync Cluster Engine 3.0.3 starting up
...
May 02 23:34:02 [1086] fastvm-rhel-8-0-24 corosync notice  [QUORUM] Members[2]: 1 2

May 02 23:34:02 fastvm-rhel-8-0-24 pacemaker-fenced    [1330] (remote_op_done)  notice: Operation 'reboot' targeting node2 by node1 for pacemaker-controld.1740@node1: OK | id=0a662097
May 02 23:34:02 fastvm-rhel-8-0-24 pacemaker-controld  [1334] (tengine_stonith_notify)  crit: We were allegedly just fenced by node1 for node1!
May 02 23:34:02 fastvm-rhel-8-0-24 pacemakerd          [1307] (pcmk_child_exit)     warning: Shutting cluster down because pacemaker-controld[1334] had fatal failure

-----

Version-Release number of selected component (if applicable):

I'm testing on the upstream master branch. The reporting customer encountered this with pacemaker-2.0.3-5.el8_2.3 on Google Cloud Platform.

-----

How reproducible:

Often. Always, if timing is correct.

-----

Steps to Reproduce:

I suggest reproducing on small VMs (on-prem or cloud). This way, the nodes should be able to reboot quickly.

I'm using the following stonith config when testing with node2 as the fence target. The 120s delay on xvm1 ensures that node 1 won't get fenced, while the 20s delay on xvm2 gives corosync membership changes some time to cause a problem before node 2 actually reboots.

 Resource: xvm1 (class=stonith type=fence_xvm)
  Attributes: pcmk_delay_base=120s pcmk_host_map=node1:fastvm-rhel-8.0-23 pcmk_reboot_timeout=240s
  Operations: monitor interval=60s (xvm1-monitor-interval-60s)
 Resource: xvm2 (class=stonith type=fence_xvm)
  Attributes: pcmk_delay_base=20s pcmk_host_map=node2:fastvm-rhel-8.0-24 pcmk_reboot_timeout=240s
  Operations: monitor interval=60s (xvm2-monitor-interval-60s)
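
For reference, this configuration can be created with pcs commands roughly like the following (a sketch only; the resource names, host maps, and delays are taken from the config above, and exact option syntax may vary by pcs version):

    pcs stonith create xvm1 fence_xvm pcmk_host_map="node1:fastvm-rhel-8.0-23" \
        pcmk_delay_base=120s pcmk_reboot_timeout=240s op monitor interval=60s
    pcs stonith create xvm2 fence_xvm pcmk_host_map="node2:fastvm-rhel-8.0-24" \
        pcmk_delay_base=20s pcmk_reboot_timeout=240s op monitor interval=60s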


I'm using a corosync token timeout of 20000 ms. The consensus timeout (the time it takes for a new membership to form after a token loss) defaults to 1.2 * token == 24000 ms.
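
The relevant totem settings in corosync.conf look roughly like this (a fragment only; the token value is from my test setup, and consensus is shown as a comment because it is left at its default):

    totem {
        version: 2
        token: 20000
        # consensus not set explicitly, so it defaults to 1.2 * token = 24000 ms
    }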

The cluster is enabled (`pcs cluster enable --all`) to start at boot on both nodes.


From a third machine, I run the following:

ssh root@node2 'date && firewall-cmd --direct --add-rule ipv4 filter OUTPUT 2 -p udp --dport=5405 -j DROP; date && sleep 30; date && firewall-cmd --direct --remove-rule ipv4 filter OUTPUT 2 -p udp --dport=5405 -j DROP'

Output:
Tue May  4 01:27:37 PDT 2021
success
Tue May  4 01:27:38 PDT 2021
Tue May  4 01:28:08 PDT 2021
success

-----

Actual results:

Node 2 gets fenced. When it boots up and starts pacemaker, it receives the remote_op_done message and kills its pacemaker instance.

Node 1:
May  4 01:27:57 fastvm-rhel-8-0-23 corosync[1722]:  [TOTEM ] A processor failed, forming new configuration.
May  4 01:28:19 fastvm-rhel-8-0-23 corosync[1722]:  [KNET  ] rx: host: 2 link: 0 is up
May  4 01:28:19 fastvm-rhel-8-0-23 corosync[1722]:  [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
May  4 01:28:21 fastvm-rhel-8-0-23 corosync[1722]:  [TOTEM ] A new membership (1.116f4) was formed. Members left: 2
May  4 01:28:21 fastvm-rhel-8-0-23 corosync[1722]:  [TOTEM ] Failed to receive the leave message. failed: 2
May  4 01:28:21 fastvm-rhel-8-0-23 corosync[1722]:  [CPG   ] downlist left_list: 1 received
May  4 01:28:21 fastvm-rhel-8-0-23 corosync[1722]:  [QUORUM] Members[1]: 1
...
May  4 01:28:22 fastvm-rhel-8-0-23 pacemaker-schedulerd[1739]: warning: Cluster node node2 will be fenced: peer is no longer part of the cluster
...
May  4 01:28:22 fastvm-rhel-8-0-23 pacemaker-fenced[1736]: notice: Delaying 'reboot' action targeting node2 using xvm2 for 20s
May  4 01:28:23 fastvm-rhel-8-0-23 corosync[1722]:  [TOTEM ] A new membership (1.116f8) was formed. Members joined: 2
May  4 01:28:23 fastvm-rhel-8-0-23 corosync[1722]:  [CPG   ] downlist left_list: 1 received
May  4 01:28:23 fastvm-rhel-8-0-23 corosync[1722]:  [CPG   ] downlist left_list: 0 received
May  4 01:28:23 fastvm-rhel-8-0-23 corosync[1722]:  [QUORUM] Members[2]: 1 2
May  4 01:28:23 fastvm-rhel-8-0-23 corosync[1722]:  [MAIN  ] Completed service synchronization, ready to provide service.
...
May  4 01:28:45 fastvm-rhel-8-0-23 pacemaker-fenced[1736]: notice: Operation 'reboot' [43895] (call 28 from pacemaker-controld.1740) targeting node2 using xvm2 returned 0 (OK)
...
May  4 01:29:01 fastvm-rhel-8-0-23 corosync[1722]:  [TOTEM ] A processor failed, forming new configuration.
...
May  4 01:29:09 fastvm-rhel-8-0-23 corosync[1722]:  [KNET  ] rx: host: 2 link: 0 is up
May  4 01:29:09 fastvm-rhel-8-0-23 corosync[1722]:  [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
May  4 01:29:09 fastvm-rhel-8-0-23 corosync[1722]:  [TOTEM ] A new membership (1.116fc) was formed. Members joined: 2 left: 2
May  4 01:29:09 fastvm-rhel-8-0-23 corosync[1722]:  [TOTEM ] Failed to receive the leave message. failed: 2
May  4 01:29:09 fastvm-rhel-8-0-23 corosync[1722]:  [CPG   ] downlist left_list: 0 received
May  4 01:29:09 fastvm-rhel-8-0-23 corosync[1722]:  [CPG   ] downlist left_list: 1 received
...
May  4 01:29:09 fastvm-rhel-8-0-23 corosync[1722]:  [QUORUM] Members[2]: 1 2
May  4 01:29:09 fastvm-rhel-8-0-23 pacemaker-fenced[1736]: notice: Operation 'reboot' targeting node2 by node1 for pacemaker-controld.1740@node1: OK
May  4 01:29:09 fastvm-rhel-8-0-23 pacemaker-controld[1740]: notice: Peer node2 was terminated (reboot) by node1 on behalf of pacemaker-controld.1740: OK
...
May  4 01:29:10 fastvm-rhel-8-0-23 corosync[1722]:  [CFG   ] Node 2 was shut down by sysadmin
May  4 01:29:10 fastvm-rhel-8-0-23 corosync[1722]:  [TOTEM ] A new membership (1.11700) was formed. Members left: 2
May  4 01:29:10 fastvm-rhel-8-0-23 corosync[1722]:  [CPG   ] downlist left_list: 1 received
May  4 01:29:10 fastvm-rhel-8-0-23 corosync[1722]:  [QUORUM] Members[1]: 1
May  4 01:29:10 fastvm-rhel-8-0-23 corosync[1722]:  [MAIN  ] Completed service synchronization, ready to provide service.



Node 2:
May 04 01:29:09 [1155] fastvm-rhel-8-0-24 corosync notice  [TOTEM ] A new membership (1.116fc) was formed. Members joined: 1
...
May 04 01:29:09 [1155] fastvm-rhel-8-0-24 corosync notice  [QUORUM] Members[2]: 1 2
...
May 04 01:29:09 fastvm-rhel-8-0-24 pacemaker-fenced    [1319] (remote_op_done)  notice: Operation 'reboot' targeting node2 by node1 for pacemaker-controld.1740@node1: OK | id=b69b57a1
May 04 01:29:09 fastvm-rhel-8-0-24 pacemaker-controld  [1323] (tengine_stonith_notify)  crit: We were allegedly just fenced by node1 for node1!

-----

Expected results:

Node 2's pacemaker remains online after it reboots due to fencing.

-----

Workaround:

Add a startup delay to the systemd corosync.service. That way, corosync doesn't start immediately when the fenced node boots, so there's time for a new CPG membership to form without the fenced node, and the stonith action can complete. See this article[1] for details.
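
As an illustration only (see the article for the supported procedure; the path and the 60-second value below are assumptions, not recommendations), such a delay could be added with a systemd drop-in along these lines, followed by `systemctl daemon-reload` on the node:

    # /etc/systemd/system/corosync.service.d/delay-start.conf
    [Service]
    ExecStartPre=/bin/sleep 60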

For this to work without causing other issues (e.g., the one described in BZ1941108), the pcmk_reboot_timeout should also be increased, as described in this article[2].
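
For example, something along these lines (the 300s value is only illustrative; it should be sized to cover the corosync startup delay plus the normal reboot time):

    pcs stonith update xvm2 pcmk_reboot_timeout=300s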


[1] A node shuts down Pacemaker after getting fenced by fence_gce and rejoining the cluster on Google Cloud Platform (https://access.redhat.com/solutions/5644441)
[2] A fenced node was rebooted twice in a row in a Pacemaker cluster (https://access.redhat.com/solutions/5895361)

