Bug 2048811 - No fencing action triggered when rebooting a cluster node with fencing agent
Summary: No fencing action triggered when rebooting a cluster node with fencing agent
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: pacemaker
Version: 8.2
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Assignee: Ken Gaillot
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-01-31 21:50 UTC by lpham
Modified: 2023-07-03 15:21 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-07-03 15:21:33 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
crm_report -S for both cluster nodes (299.20 KB, application/x-bzip)
2022-02-02 17:02 UTC, lpham
crm_report -S for lanpcmk-srv-1 (540.72 KB, application/x-bzip)
2022-02-02 17:07 UTC, lpham


Links:
Red Hat Issue Tracker RHELPLAN-110474 (last updated 2022-01-31 21:56:25 UTC)

Description lpham 2022-01-31 21:50:22 UTC
Description of problem:
A fencing agent was configured for the cluster.
The cluster has two full cluster nodes (running Corosync) and two Pacemaker Remote nodes.
All hosts were virtual machines (VMs) running RHEL 8.2.
When a VM acting as a cluster node was rebooted with "reboot -f", fencing was expected to be triggered for that host, but it was not.

Version-Release number of selected component (if applicable):
RHEL 8.2

How reproducible:
Every time

Steps to Reproduce:
1. Configure a Pacemaker cluster with a fence agent (an illustrative fence device sketch follows these steps)
2. Reboot a cluster node (a VM host) with "reboot -f"
3. Observe that the host reboots but no fencing action is triggered
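For context, a fence device for VM cluster nodes is usually created with pcs. The report does not say which fence agent was used, so the agent, address, and host map below are assumptions for illustration only:

# Hypothetical fence device for two VM cluster nodes (agent and options assumed)
pcs stonith create fence_vms fence_virsh ip=hypervisor.example.com \
    username=root identity_file=/root/.ssh/id_rsa \
    pcmk_host_map="lanpcmk-srv-1:vm-srv-1;lanpcmk-srv-2:vm-srv-2"
pcs property set stonith-enabled=true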

Actual results:
No fencing action is triggered when a cluster node is rebooted in a cluster that has a fencing agent configured.

Expected results:
A fencing action is expected to be triggered when the host is rebooted.

Additional info:
I saw the following entries in /var/log/messages (on the other cluster node):
Jan 28 08:43:15 lanpcmk-srv-1 pacemaker-based[1589]: notice: Node lanpcmk-srv-2 state is now lost
Jan 28 08:43:15 lanpcmk-srv-1 pacemaker-based[1589]: notice: Purged 1 peer with id=2 and/or uname=lanpcmk-srv-2 from the membership cache
Jan 28 08:43:15 lanpcmk-srv-1 pacemaker-attrd[1594]: notice: Node lanpcmk-srv-2 state is now lost
Jan 28 08:43:15 lanpcmk-srv-1 pacemaker-attrd[1594]: notice: Removing all lanpcmk-srv-2 attributes for peer loss
Jan 28 08:43:15 lanpcmk-srv-1 pacemaker-attrd[1594]: notice: Purged 1 peer with id=2 and/or uname=lanpcmk-srv-2 from the membership cache
Jan 28 08:43:15 lanpcmk-srv-1 pacemaker-controld[1597]: notice: Node lanpcmk-srv-2 state is now lost
Jan 28 08:43:15 lanpcmk-srv-1 pacemaker-controld[1597]: warning: Stonith/shutdown of node lanpcmk-srv-2 was not expected
Jan 28 08:43:15 lanpcmk-srv-1 pacemaker-fenced[1590]: notice: Node lanpcmk-srv-2 state is now lost
Jan 28 08:43:15 lanpcmk-srv-1 pacemaker-fenced[1590]: notice: Purged 1 peer with id=2 and/or uname=lanpcmk-srv-2 from the membership cache
Jan 28 08:43:15 lanpcmk-srv-1 pacemaker-controld[1597]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Jan 28 08:43:15 lanpcmk-srv-1 pacemaker-controld[1597]: warning: Stonith/shutdown of node lanpcmk-srv-2 was not expected

Comment 1 Reid Wahl 2022-02-01 05:43:16 UTC
(In reply to lpham from comment #0)
Can you provide a sosreport (or a crm_report if that's not possible) from each cluster node?

Comment 2 lpham 2022-02-02 17:02:55 UTC
Created attachment 1858725 [details]
crm_report -S for both cluster nodes

The cluster nodes were: lanpcmk-srv-1 and lanpcmk-srv-2
The host that was rebooted: lanpcmk-srv-2
Time of reboot: Feb 2 08:27:00

Comment 3 lpham 2022-02-02 17:07:09 UTC
Created attachment 1858727 [details]
crm_report -S for lanpcmk-srv-1

Added the missing file: missing-fence-for-fast-reboot.lanpcmk-srv-1.tar.bz1

Comment 4 Reid Wahl 2022-02-02 23:03:47 UTC
Thanks :) Node 2 rebooted, started corosync, and formed a local membership before node 1 formed a new corosync membership that excluded node 2 and requested fencing.

I can't remember if we have an existing BZ for this. I know we have a couple for CPG membership-related race conditions.

-----

Node 2 was rebooted around 08:27:14.

Feb 02 08:27:14  ca(cf_regress1_129)[1834158]:    INFO: ca[1834158]: 299: Entering CF resource monitor...
Feb 02 08:27:31.559 lanpcmk-srv-2.fyre.ibm.com pacemakerd          [1541] (crm_log_init)    info: Changed active directory to /var/lib/pacemaker/cores


The corosync totem token timeout is 10000 ms, which means the consensus timeout defaults to 12000 ms. We finally formed a new corosync membership that excluded node 2 at 08:27:36.
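For reference, consensus defaults to 1.2 * token when it is not set explicitly (per corosync.conf(5)). A minimal sketch of the relevant totem section, with values inferred from the log messages rather than copied from the attached reports (cluster_name is a placeholder):

totem {
    version: 2
    cluster_name: lanpcmk
    token: 10000
    # consensus is not set, so it defaults to 1.2 * token = 12000 ms
}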

Feb  2 08:27:18 lanpcmk-srv-1 corosync[937038]:  [KNET  ] libknet.h:log_deliver_fn:682 link: host: 2 link: 0 is down
Feb  2 08:27:18 lanpcmk-srv-1 corosync[937038]:  [KNET  ] libknet.h:log_deliver_fn:682 host: host: 2 (passive) best link: 0 (pri: 1)
Feb  2 08:27:18 lanpcmk-srv-1 corosync[937038]:  [KNET  ] libknet.h:log_deliver_fn:677 host: host: 2 has no active links
Feb  2 08:27:22 lanpcmk-srv-1 corosync[937038]:  [TOTEM ] totemsrp.c:timer_function_orf_token_warning:1730 Token has not been received in 7500 ms
Feb  2 08:27:24 lanpcmk-srv-1 corosync[937038]:  [TOTEM ] totemsrp.c:timer_function_orf_token_timeout:1746 A processor failed, forming new configuration.
Feb  2 08:27:36 lanpcmk-srv-1 corosync[937038]:  [TOTEM ] totemsrp.c:memb_state_operational_enter:2096 A new membership (1.ab9) was formed. Members left: 2
Feb  2 08:27:36 lanpcmk-srv-1 corosync[937038]:  [TOTEM ] totemsrp.c:memb_state_operational_enter:2101 Failed to receive the leave message. failed: 2
Feb  2 08:27:36 lanpcmk-srv-1 corosync[937038]:  [QUORUM] vsf_quorum.c:log_view_list:131 Members[1]: 1
Feb  2 08:27:36 lanpcmk-srv-1 corosync[937038]:  [MAIN  ] main.c:corosync_sync_completed:296 Completed service synchronization, ready to provide service.


Pacemaker started to react.

Feb  2 08:27:36 lanpcmk-srv-1 pacemaker-fenced[937206]: notice: Node lanpcmk-srv-2 state is now lost
Feb  2 08:27:36 lanpcmk-srv-1 pacemaker-fenced[937206]: notice: Purged 1 peer with id=2 and/or uname=lanpcmk-srv-2 from the membership cache
Feb  2 08:27:36 lanpcmk-srv-1 pacemakerd[937199]: notice: Node lanpcmk-srv-2 state is now lost
Feb  2 08:27:36 lanpcmk-srv-1 pacemaker-controld[937210]: notice: Node lanpcmk-srv-2 state is now lost
Feb  2 08:27:36 lanpcmk-srv-1 pacemaker-controld[937210]: warning: Our DC node (lanpcmk-srv-2) left the cluster
...
Feb  2 08:27:36 lanpcmk-srv-1 pacemaker-controld[937210]: notice: State transition S_NOT_DC -> S_ELECTION
Feb  2 08:27:36 lanpcmk-srv-1 pacemaker-controld[937210]: notice: State transition S_ELECTION -> S_INTEGRATION


Node 2's corosync had already started up after reboot and formed a membership (containing only itself) at 08:27:28.

Feb  2 08:27:28 lanpcmk-srv-2 corosync[1057]:  [TOTEM ] totemsrp.c:memb_state_operational_enter:2096 A new membership (2.aba) was formed. Members joined: 2
Feb  2 08:27:28 lanpcmk-srv-2 corosync[1057]:  [QUORUM] vsf_quorum.c:log_view_list:131 Members[1]: 2
Feb  2 08:27:28 lanpcmk-srv-2 corosync[1057]:  [MAIN  ] main.c:corosync_sync_completed:296 Completed service synchronization, ready to provide service.


So almost immediately after node 1's corosync formed a membership excluding node 2 and pacemaker started to react, corosync formed a new membership that included node 2.

Feb  2 08:27:36 lanpcmk-srv-1 corosync[937038]:  [KNET  ] libknet.h:log_deliver_fn:682 rx: host: 2 link: 0 is up
Feb  2 08:27:36 lanpcmk-srv-1 corosync[937038]:  [KNET  ] libknet.h:log_deliver_fn:682 host: host: 2 (passive) best link: 0 (pri: 1)
Feb  2 08:27:37 lanpcmk-srv-1 corosync[937038]:  [TOTEM ] totemsrp.c:memb_state_operational_enter:2096 A new membership (1.abe) was formed. Members joined: 2
Feb  2 08:27:37 lanpcmk-srv-1 corosync[937038]:  [QUORUM] vsf_quorum.c:log_view_list:131 Members[2]: 1 2
Feb  2 08:27:37 lanpcmk-srv-1 corosync[937038]:  [MAIN  ] main.c:corosync_sync_completed:296 Completed service synchronization, ready to provide service.
Feb  2 08:27:37 lanpcmk-srv-1 pacemakerd[937199]: notice: Node lanpcmk-srv-2 state is now member


Looking at pacemaker.log:

Feb 02 08:27:36.519 lanpcmk-srv-1.fyre.ibm.com pacemaker-controld  [937210] (reap_dead_nodes)   warning: Our DC node (lanpcmk-srv-2) left the cluster
...
Feb 02 08:27:36.520 lanpcmk-srv-1.fyre.ibm.com pacemaker-controld  [937210] (peer_update_callback)  info: Node lanpcmk-srv-2 is no longer a peer | DC=<none> old=0x4000000 new=0x0000000
...
Feb 02 08:27:37.356 lanpcmk-srv-1.fyre.ibm.com pacemakerd          [937199] (pcmk_cpg_membership)   info: Group pacemakerd event 3: lanpcmk-srv-2 (node 2 pid 1541) is member
Feb 02 08:27:37.356 lanpcmk-srv-1.fyre.ibm.com pacemakerd          [937199] (crm_update_peer_proc)  info: pcmk_cpg_membership: Node lanpcmk-srv-2[2] - corosync-cpg is now online
Feb 02 08:27:37.627 lanpcmk-srv-1.fyre.ibm.com pacemaker-controld  [937210] (crm_update_peer_state_iter)    notice: Node lanpcmk-srv-2 state is now member | nodeid=2 previous=lost source=pcmk_quorum_notification
Feb 02 08:27:37.628 lanpcmk-srv-1.fyre.ibm.com pacemaker-controld  [937210] (join_make_offer)   info: Sending join-1 offer to lanpcmk-srv-2
Feb 02 08:27:37.661 lanpcmk-srv-1.fyre.ibm.com pacemaker-based     [937205] (cib_perform_op)    info: +  /cib/status/node_state[@id='2']:  @crmd=online, @crm-debug-origin=peer_update_callback
...
Feb 02 08:27:38.711 lanpcmk-srv-1.fyre.ibm.com pacemaker-schedulerd[937209] (determine_online_status_fencing)   info: Node lanpcmk-srv-2 is active
Feb 02 08:27:38.711 lanpcmk-srv-1.fyre.ibm.com pacemaker-schedulerd[937209] (determine_online_status)   info: Node lanpcmk-srv-2 is online

--------------------

I tested on pacemaker-2.1.0-8.el8 and corosync-3.1.5-1.el8 using what I think is an analogous setup. The behavior at the pacemaker level is the same: the rebooted node doesn't get fenced.

But the corosync behavior seems like it may have changed/improved: the new membership that includes the rebooted node forms sooner instead of waiting for the consensus timeout on the surviving node.
  - On the lanpcmk cluster, the consensus timeout had to elapse on node 1 before node 1 recognized that node 2 had rejoined.
  - On my test cluster, the consensus timeout would not have expired on the surviving node until 14:02:05. The rebooted node's rejoin appears to have interrupted the consensus delay sooner than that (at 14:01:55) and caused a new membership to form that included both nodes.

I set my corosync token timeout to 20000 ms (letting consensus default to 24000 ms). I fast-rebooted node 1 at 14:01:24. Node 1 was never scheduled to be fenced.
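A sketch of how that timeout change can be applied, assuming corosync.conf is edited directly (the exact method used for this test isn't shown here):

# In /etc/corosync/corosync.conf on every node:
totem {
    version: 2
    token: 20000
    # consensus left unset, so it defaults to 1.2 * 20000 = 24000 ms
}

# Ask all corosync instances to reload corosync.conf (a full corosync restart
# may be needed if a changed parameter is not runtime-reconfigurable):
corosync-cfgtool -R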

[root@fastvm-rhel-8-0-23 ~]# date && reboot -f
Wed Feb  2 14:01:24 PST 2022

[root@fastvm-rhel-8-0-24 ~]# tail -f /var/log/messages
...
Feb  2 14:01:28 fastvm-rhel-8-0-24 corosync[373325]:  [KNET  ] link: host: 1 link: 0 is down
Feb  2 14:01:28 fastvm-rhel-8-0-24 corosync[373325]:  [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb  2 14:01:28 fastvm-rhel-8-0-24 corosync[373325]:  [KNET  ] host: host: 1 has no active links
Feb  2 14:01:36 fastvm-rhel-8-0-24 corosync[373325]:  [TOTEM ] Token has not been received in 15000 ms
Feb  2 14:01:41 fastvm-rhel-8-0-24 corosync[373325]:  [TOTEM ] A processor failed, forming new configuration: token timed out (20000ms), waiting 24000ms for consensus.
Feb  2 14:01:52 fastvm-rhel-8-0-24 corosync[373325]:  [KNET  ] rx: host: 1 link: 0 is up
Feb  2 14:01:52 fastvm-rhel-8-0-24 corosync[373325]:  [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb  2 14:01:55 fastvm-rhel-8-0-24 corosync[373325]:  [QUORUM] Sync members[2]: 1 2
Feb  2 14:01:55 fastvm-rhel-8-0-24 corosync[373325]:  [QUORUM] Sync joined[1]: 1
Feb  2 14:01:55 fastvm-rhel-8-0-24 corosync[373325]:  [QUORUM] Sync left[1]: 1
Feb  2 14:01:55 fastvm-rhel-8-0-24 corosync[373325]:  [TOTEM ] A new membership (1.12320) was formed. Members joined: 1 left: 1
Feb  2 14:01:55 fastvm-rhel-8-0-24 corosync[373325]:  [TOTEM ] Failed to receive the leave message. failed: 1
Feb  2 14:01:55 fastvm-rhel-8-0-24 pacemaker-attrd[373343]: notice: Node node1 state is now lost
Feb  2 14:01:55 fastvm-rhel-8-0-24 pacemaker-attrd[373343]: notice: Removing all node1 attributes for peer loss
Feb  2 14:01:55 fastvm-rhel-8-0-24 pacemaker-attrd[373343]: notice: Purged 1 peer with id=1 and/or uname=node1 from the membership cache
Feb  2 14:01:55 fastvm-rhel-8-0-24 pacemaker-fenced[373341]: notice: Node node1 state is now lost
Feb  2 14:01:55 fastvm-rhel-8-0-24 pacemaker-fenced[373341]: notice: Purged 1 peer with id=1 and/or uname=node1 from the membership cache
Feb  2 14:01:55 fastvm-rhel-8-0-24 pacemaker-based[373340]: notice: Node node1 state is now lost
Feb  2 14:01:55 fastvm-rhel-8-0-24 pacemaker-based[373340]: notice: Purged 1 peer with id=1 and/or uname=node1 from the membership cache
Feb  2 14:01:55 fastvm-rhel-8-0-24 pacemaker-controld[373345]: warning: Stonith/shutdown of node node1 was not expected
Feb  2 14:01:55 fastvm-rhel-8-0-24 pacemaker-controld[373345]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Feb  2 14:01:55 fastvm-rhel-8-0-24 pacemaker-attrd[373343]: notice: Node node1 state is now member
Feb  2 14:01:55 fastvm-rhel-8-0-24 corosync[373325]:  [QUORUM] Members[2]: 1 2
Feb  2 14:01:55 fastvm-rhel-8-0-24 corosync[373325]:  [MAIN  ] Completed service synchronization, ready to provide service.


Node 1's logs:
Feb 02 14:01:43 [1048] fastvm-rhel-8-0-23 corosync notice  [MAIN  ] Corosync Cluster Engine 3.1.5 starting up
...
Feb 02 14:01:43 [1048] fastvm-rhel-8-0-23 corosync notice  [QUORUM] Sync members[1]: 1
Feb 02 14:01:43 [1048] fastvm-rhel-8-0-23 corosync notice  [QUORUM] Sync joined[1]: 1
Feb 02 14:01:43 [1048] fastvm-rhel-8-0-23 corosync notice  [TOTEM ] A new membership (1.1231c) was formed. Members joined: 1
Feb 02 14:01:43 [1048] fastvm-rhel-8-0-23 corosync notice  [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Feb 02 14:01:43 [1048] fastvm-rhel-8-0-23 corosync notice  [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Feb 02 14:01:43 [1048] fastvm-rhel-8-0-23 corosync notice  [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Feb 02 14:01:43 [1048] fastvm-rhel-8-0-23 corosync notice  [QUORUM] Members[1]: 1
Feb 02 14:01:43 [1048] fastvm-rhel-8-0-23 corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
Feb 02 14:01:55 [1048] fastvm-rhel-8-0-23 corosync info    [KNET  ] rx: host: 2 link: 0 is up
Feb 02 14:01:55 [1048] fastvm-rhel-8-0-23 corosync info    [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Feb 02 14:01:55 [1048] fastvm-rhel-8-0-23 corosync notice  [QUORUM] Sync members[2]: 1 2
Feb 02 14:01:55 [1048] fastvm-rhel-8-0-23 corosync notice  [QUORUM] Sync joined[1]: 2
Feb 02 14:01:55 [1048] fastvm-rhel-8-0-23 corosync notice  [TOTEM ] A new membership (1.12320) was formed. Members joined: 2
Feb 02 14:01:55 [1048] fastvm-rhel-8-0-23 corosync notice  [QUORUM] This node is within the primary component and will provide service.
Feb 02 14:01:55 [1048] fastvm-rhel-8-0-23 corosync notice  [QUORUM] Members[2]: 1 2
Feb 02 14:01:55 [1048] fastvm-rhel-8-0-23 corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.

-----

You should be able to work around the issue in the meantime by disabling the cluster from starting on boot, or by adding a startup delay, so that a rebooted node doesn't automatically rejoin before the surviving node can fence it (a sketch of both options follows). We'll discuss further how to approach the behavior observed above.
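Both workaround options, sketched for a pcs-managed RHEL 8 cluster. The 60-second delay is an arbitrary example value; size it to comfortably exceed the consensus timeout plus the time fencing takes:

# Option 1: don't start the cluster services automatically at boot; start them
# manually once the reason for the node loss is understood
pcs cluster disable --all

# Option 2: delay corosync startup with a systemd drop-in so a fast-rebooting
# node cannot rejoin before the surviving node reacts
mkdir -p /etc/systemd/system/corosync.service.d
cat > /etc/systemd/system/corosync.service.d/delay.conf <<'EOF'
[Service]
ExecStartPre=/bin/sleep 60
EOF
systemctl daemon-reload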

Comment 5 Reid Wahl 2022-02-02 23:53:31 UTC
(In reply to Reid Wahl from comment #4)
> We'll discuss further how to approach the behavior observed above.

Possibly by tracking daemon uptime in pacemaker as we've discussed in the past, so that we don't allow a node to rejoin (or we fence it) if its daemons started after the last time we heard from it.

Comment 6 Ken Gaillot 2022-02-03 15:58:04 UTC
This is (mostly) intended behavior. We don't want to fence a node due to a brief network blip, for example. As long as the node rejoins successfully, and probes can establish the current state of all resources, fencing becomes unnecessary.

There are some exceptions that we don't handle currently, where we should really fence no matter what, but I don't think any apply in this case. I don't think it should matter whether the node rebooted -- as long as we can successfully probe it, it doesn't matter what happened before that.

Some users with quickly rebooting VMs put delays in their boot-up process before starting the cluster, to avoid situations like this.

Comment 7 Reid Wahl 2022-02-03 19:25:56 UTC
A quick warning about lpham's specific case (not especially relevant to this BZ): wait_for_all is set to 0 in corosync.conf, which leaves you vulnerable to split brain before a new corosync membership containing both nodes forms. The rebooted node can become quorate on its own and start running resources that should be controlled by the surviving node, which could lead to problems depending on the resource configuration. A sketch of the safer setting follows.
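For reference, a sketch of the relevant quorum section in /etc/corosync/corosync.conf. The two_node setting is assumed here because only two corosync nodes exist; per votequorum(5), two_node: 1 enables wait_for_all automatically, and explicitly setting wait_for_all: 0 (as in this cluster) removes that protection:

quorum {
    provider: corosync_votequorum
    two_node: 1
    wait_for_all: 1
}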

