Bug 2048811
| Summary: | No fencing action triggered when rebooting a cluster node with fencing agent | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | lpham |
| Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> |
| Status: | CLOSED NOTABUG | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 8.2 | CC: | cluster-maint, gerry, nwahl |
| Target Milestone: | rc | Keywords: | Triaged |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Last Closed: | 2023-07-03 15:21:33 UTC | Type: | Bug |

Description
lpham
2022-01-31 21:50:22 UTC
(In reply to lpham from comment #0) Can you provide a sosreport (or a crm_report if that's not possible) from each cluster node?

Created attachment 1858725 [details]
crm_report -S for both cluster nodes
The cluster nodes were: lanpcmk-srv-1 and lanpcmk-srv-2
The host that was rebooted: lanpcmk-srv-2
Time of reboot: Feb 2 08:27:00
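
For reference, this kind of data collection can be reproduced with something like the sketch below. The time window and output path are illustrative assumptions, not values taken from this report:

```sh
# Full diagnostic bundle for the support case (run on each node).
sosreport

# If sosreport is not possible, gather just the cluster logs with crm_report:
# -S collects from the local node only, -f sets the start of the time window.
crm_report -S -f "2022-02-02 08:00:00" /tmp/crm-report-$(hostname -s)
```
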
Created attachment 1858727 [details]
crm_report -S for lanpcmk-srv-1

Added the missing file: missing-fence-for-fast-reboot.lanpcmk-srv-1.tar.bz1

Thanks :) Node 2 rebooted, started corosync, and formed a local membership before node 1 formed a new corosync membership that excluded node 2 and requested fencing. I can't remember whether we have an existing BZ for this. I know we have a couple for CPG membership-related race conditions.

-----

Node 2 was rebooted around 08:27:14.

```
Feb 02 08:27:14 ca(cf_regress1_129)[1834158]: INFO: ca[1834158]: 299: Entering CF resource monitor...
Feb 02 08:27:31.559 lanpcmk-srv-2.fyre.ibm.com pacemakerd [1541] (crm_log_init) info: Changed active directory to /var/lib/pacemaker/cores
```

The corosync totem token timeout is 10000 ms, which means consensus defaults to 12000 ms. Node 1 finally formed a new corosync membership that excluded node 2 at 08:27:36.

```
Feb 2 08:27:18 lanpcmk-srv-1 corosync[937038]: [KNET ] libknet.h:log_deliver_fn:682 link: host: 2 link: 0 is down
Feb 2 08:27:18 lanpcmk-srv-1 corosync[937038]: [KNET ] libknet.h:log_deliver_fn:682 host: host: 2 (passive) best link: 0 (pri: 1)
Feb 2 08:27:18 lanpcmk-srv-1 corosync[937038]: [KNET ] libknet.h:log_deliver_fn:677 host: host: 2 has no active links
Feb 2 08:27:22 lanpcmk-srv-1 corosync[937038]: [TOTEM ] totemsrp.c:timer_function_orf_token_warning:1730 Token has not been received in 7500 ms
Feb 2 08:27:24 lanpcmk-srv-1 corosync[937038]: [TOTEM ] totemsrp.c:timer_function_orf_token_timeout:1746 A processor failed, forming new configuration.
Feb 2 08:27:36 lanpcmk-srv-1 corosync[937038]: [TOTEM ] totemsrp.c:memb_state_operational_enter:2096 A new membership (1.ab9) was formed. Members left: 2
Feb 2 08:27:36 lanpcmk-srv-1 corosync[937038]: [TOTEM ] totemsrp.c:memb_state_operational_enter:2101 Failed to receive the leave message. failed: 2
Feb 2 08:27:36 lanpcmk-srv-1 corosync[937038]: [QUORUM] vsf_quorum.c:log_view_list:131 Members[1]: 1
Feb 2 08:27:36 lanpcmk-srv-1 corosync[937038]: [MAIN ] main.c:corosync_sync_completed:296 Completed service synchronization, ready to provide service.
```

Pacemaker started to react.

```
Feb 2 08:27:36 lanpcmk-srv-1 pacemaker-fenced[937206]: notice: Node lanpcmk-srv-2 state is now lost
Feb 2 08:27:36 lanpcmk-srv-1 pacemaker-fenced[937206]: notice: Purged 1 peer with id=2 and/or uname=lanpcmk-srv-2 from the membership cache
Feb 2 08:27:36 lanpcmk-srv-1 pacemakerd[937199]: notice: Node lanpcmk-srv-2 state is now lost
Feb 2 08:27:36 lanpcmk-srv-1 pacemaker-controld[937210]: notice: Node lanpcmk-srv-2 state is now lost
Feb 2 08:27:36 lanpcmk-srv-1 pacemaker-controld[937210]: warning: Our DC node (lanpcmk-srv-2) left the cluster
...
Feb 2 08:27:36 lanpcmk-srv-1 pacemaker-controld[937210]: notice: State transition S_NOT_DC -> S_ELECTION
Feb 2 08:27:36 lanpcmk-srv-1 pacemaker-controld[937210]: notice: State transition S_ELECTION -> S_INTEGRATION
```

Node 2's corosync had already started up after reboot and formed a membership (containing only itself) at 08:27:28.

```
Feb 2 08:27:28 lanpcmk-srv-2 corosync[1057]: [TOTEM ] totemsrp.c:memb_state_operational_enter:2096 A new membership (2.aba) was formed. Members joined: 2
Feb 2 08:27:28 lanpcmk-srv-2 corosync[1057]: [QUORUM] vsf_quorum.c:log_view_list:131 Members[1]: 2
Feb 2 08:27:28 lanpcmk-srv-2 corosync[1057]: [MAIN ] main.c:corosync_sync_completed:296 Completed service synchronization, ready to provide service.
```
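
As an aside on the timing math above: corosync derives consensus from token when consensus is not set explicitly (1.2 * token, so token: 10000 yields consensus: 12000 ms). The sketch below shows where those values live and one way to confirm them at runtime; it is not this cluster's actual corosync.conf, and the cmap key names assume corosync 3.x:

```sh
# /etc/corosync/corosync.conf (excerpt):
#   totem {
#       version: 2
#       token: 10000
#       # consensus: 12000   <- implicit default of 1.2 * token when unset
#   }

# Confirm the effective values on a running node (milliseconds):
corosync-cmapctl | grep -E 'runtime\.config\.totem\.(token|consensus)'

# After editing token/consensus on all nodes, reload the configuration
# cluster-wide (or restart corosync if your version does not apply totem
# changes on reload):
corosync-cfgtool -R
```
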
So almost immediately after node 1's corosync formed a membership excluding node 2 and pacemaker started to react, corosync formed a new membership that included node 2.

```
Feb 2 08:27:36 lanpcmk-srv-1 corosync[937038]: [KNET ] libknet.h:log_deliver_fn:682 rx: host: 2 link: 0 is up
Feb 2 08:27:36 lanpcmk-srv-1 corosync[937038]: [KNET ] libknet.h:log_deliver_fn:682 host: host: 2 (passive) best link: 0 (pri: 1)
Feb 2 08:27:37 lanpcmk-srv-1 corosync[937038]: [TOTEM ] totemsrp.c:memb_state_operational_enter:2096 A new membership (1.abe) was formed. Members joined: 2
Feb 2 08:27:37 lanpcmk-srv-1 corosync[937038]: [QUORUM] vsf_quorum.c:log_view_list:131 Members[2]: 1 2
Feb 2 08:27:37 lanpcmk-srv-1 corosync[937038]: [MAIN ] main.c:corosync_sync_completed:296 Completed service synchronization, ready to provide service.
Feb 2 08:27:37 lanpcmk-srv-1 pacemakerd[937199]: notice: Node lanpcmk-srv-2 state is now member
```

Looking at pacemaker.log:

```
Feb 02 08:27:36.519 lanpcmk-srv-1.fyre.ibm.com pacemaker-controld [937210] (reap_dead_nodes) warning: Our DC node (lanpcmk-srv-2) left the cluster
...
Feb 02 08:27:36.520 lanpcmk-srv-1.fyre.ibm.com pacemaker-controld [937210] (peer_update_callback) info: Node lanpcmk-srv-2 is no longer a peer | DC=<none> old=0x4000000 new=0x0000000
...
Feb 02 08:27:37.356 lanpcmk-srv-1.fyre.ibm.com pacemakerd [937199] (pcmk_cpg_membership) info: Group pacemakerd event 3: lanpcmk-srv-2 (node 2 pid 1541) is member
Feb 02 08:27:37.356 lanpcmk-srv-1.fyre.ibm.com pacemakerd [937199] (crm_update_peer_proc) info: pcmk_cpg_membership: Node lanpcmk-srv-2[2] - corosync-cpg is now online
Feb 02 08:27:37.627 lanpcmk-srv-1.fyre.ibm.com pacemaker-controld [937210] (crm_update_peer_state_iter) notice: Node lanpcmk-srv-2 state is now member | nodeid=2 previous=lost source=pcmk_quorum_notification
Feb 02 08:27:37.628 lanpcmk-srv-1.fyre.ibm.com pacemaker-controld [937210] (join_make_offer) info: Sending join-1 offer to lanpcmk-srv-2
Feb 02 08:27:37.661 lanpcmk-srv-1.fyre.ibm.com pacemaker-based [937205] (cib_perform_op) info: + /cib/status/node_state[@id='2']: @crmd=online, @crm-debug-origin=peer_update_callback
...
Feb 02 08:27:38.711 lanpcmk-srv-1.fyre.ibm.com pacemaker-schedulerd[937209] (determine_online_status_fencing) info: Node lanpcmk-srv-2 is active
Feb 02 08:27:38.711 lanpcmk-srv-1.fyre.ibm.com pacemaker-schedulerd[937209] (determine_online_status) info: Node lanpcmk-srv-2 is online
```

--------------------

I tested on pacemaker-2.1.0-8.el8 and corosync-3.1.5-1.el8 using what I think is an analogous setup. The behavior at the pacemaker level is the same: the rebooted node doesn't get fenced. But the corosync behavior seems like it may have changed/improved: the new membership that includes the rebooted node forms sooner instead of waiting for the consensus timeout on the surviving node.

- On the lanpcmk cluster, the consensus timeout had to elapse on node 1 before node 1 recognized that node 2 had rejoined.
- On my test cluster, the consensus timeout would have expired on the surviving node at 14:02:05. The fenced node's rejoin seems to interrupt the consensus delay sooner than that (14:01:55), causing a new membership to form that included both nodes.

I set my corosync token timeout to 20000 (letting consensus default to 24000). I fast-rebooted node 1 at 14:01:24. Node 1 was never scheduled to be fenced.

```
[root@fastvm-rhel-8-0-23 ~]# date && reboot -f
Wed Feb 2 14:01:24 PST 2022

[root@fastvm-rhel-8-0-24 ~]# tail -f /var/log/messages
...
Feb 2 14:01:28 fastvm-rhel-8-0-24 corosync[373325]: [KNET ] link: host: 1 link: 0 is down
Feb 2 14:01:28 fastvm-rhel-8-0-24 corosync[373325]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 2 14:01:28 fastvm-rhel-8-0-24 corosync[373325]: [KNET ] host: host: 1 has no active links
Feb 2 14:01:36 fastvm-rhel-8-0-24 corosync[373325]: [TOTEM ] Token has not been received in 15000 ms
Feb 2 14:01:41 fastvm-rhel-8-0-24 corosync[373325]: [TOTEM ] A processor failed, forming new configuration: token timed out (20000ms), waiting 24000ms for consensus.
Feb 2 14:01:52 fastvm-rhel-8-0-24 corosync[373325]: [KNET ] rx: host: 1 link: 0 is up
Feb 2 14:01:52 fastvm-rhel-8-0-24 corosync[373325]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 2 14:01:55 fastvm-rhel-8-0-24 corosync[373325]: [QUORUM] Sync members[2]: 1 2
Feb 2 14:01:55 fastvm-rhel-8-0-24 corosync[373325]: [QUORUM] Sync joined[1]: 1
Feb 2 14:01:55 fastvm-rhel-8-0-24 corosync[373325]: [QUORUM] Sync left[1]: 1
Feb 2 14:01:55 fastvm-rhel-8-0-24 corosync[373325]: [TOTEM ] A new membership (1.12320) was formed. Members joined: 1 left: 1
Feb 2 14:01:55 fastvm-rhel-8-0-24 corosync[373325]: [TOTEM ] Failed to receive the leave message. failed: 1
Feb 2 14:01:55 fastvm-rhel-8-0-24 pacemaker-attrd[373343]: notice: Node node1 state is now lost
Feb 2 14:01:55 fastvm-rhel-8-0-24 pacemaker-attrd[373343]: notice: Removing all node1 attributes for peer loss
Feb 2 14:01:55 fastvm-rhel-8-0-24 pacemaker-attrd[373343]: notice: Purged 1 peer with id=1 and/or uname=node1 from the membership cache
Feb 2 14:01:55 fastvm-rhel-8-0-24 pacemaker-fenced[373341]: notice: Node node1 state is now lost
Feb 2 14:01:55 fastvm-rhel-8-0-24 pacemaker-fenced[373341]: notice: Purged 1 peer with id=1 and/or uname=node1 from the membership cache
Feb 2 14:01:55 fastvm-rhel-8-0-24 pacemaker-based[373340]: notice: Node node1 state is now lost
Feb 2 14:01:55 fastvm-rhel-8-0-24 pacemaker-based[373340]: notice: Purged 1 peer with id=1 and/or uname=node1 from the membership cache
Feb 2 14:01:55 fastvm-rhel-8-0-24 pacemaker-controld[373345]: warning: Stonith/shutdown of node node1 was not expected
Feb 2 14:01:55 fastvm-rhel-8-0-24 pacemaker-controld[373345]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Feb 2 14:01:55 fastvm-rhel-8-0-24 pacemaker-attrd[373343]: notice: Node node1 state is now member
Feb 2 14:01:55 fastvm-rhel-8-0-24 corosync[373325]: [QUORUM] Members[2]: 1 2
Feb 2 14:01:55 fastvm-rhel-8-0-24 corosync[373325]: [MAIN ] Completed service synchronization, ready to provide service.
```

Node 1's logs:

```
Feb 02 14:01:43 [1048] fastvm-rhel-8-0-23 corosync notice [MAIN ] Corosync Cluster Engine 3.1.5 starting up
...
Feb 02 14:01:43 [1048] fastvm-rhel-8-0-23 corosync notice [QUORUM] Sync members[1]: 1
Feb 02 14:01:43 [1048] fastvm-rhel-8-0-23 corosync notice [QUORUM] Sync joined[1]: 1
Feb 02 14:01:43 [1048] fastvm-rhel-8-0-23 corosync notice [TOTEM ] A new membership (1.1231c) was formed. Members joined: 1
Feb 02 14:01:43 [1048] fastvm-rhel-8-0-23 corosync notice [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Feb 02 14:01:43 [1048] fastvm-rhel-8-0-23 corosync notice [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Feb 02 14:01:43 [1048] fastvm-rhel-8-0-23 corosync notice [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Feb 02 14:01:43 [1048] fastvm-rhel-8-0-23 corosync notice [QUORUM] Members[1]: 1
Feb 02 14:01:43 [1048] fastvm-rhel-8-0-23 corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Feb 02 14:01:55 [1048] fastvm-rhel-8-0-23 corosync info [KNET ] rx: host: 2 link: 0 is up
Feb 02 14:01:55 [1048] fastvm-rhel-8-0-23 corosync info [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Feb 02 14:01:55 [1048] fastvm-rhel-8-0-23 corosync notice [QUORUM] Sync members[2]: 1 2
Feb 02 14:01:55 [1048] fastvm-rhel-8-0-23 corosync notice [QUORUM] Sync joined[1]: 2
Feb 02 14:01:55 [1048] fastvm-rhel-8-0-23 corosync notice [TOTEM ] A new membership (1.12320) was formed. Members joined: 2
Feb 02 14:01:55 [1048] fastvm-rhel-8-0-23 corosync notice [QUORUM] This node is within the primary component and will provide service.
Feb 02 14:01:55 [1048] fastvm-rhel-8-0-23 corosync notice [QUORUM] Members[2]: 1 2
Feb 02 14:01:55 [1048] fastvm-rhel-8-0-23 corosync notice [MAIN ] Completed service synchronization, ready to provide service.
```

-----

You should be able to avoid the issue in the meantime by disabling the cluster from starting on boot (or adding a delay), so that a node doesn't automatically rejoin quickly after fencing. We'll discuss further how to approach the behavior observed above.

(In reply to Reid Wahl from comment #4)
> We'll discuss further how to approach the behavior observed above.

Possibly by tracking daemon uptime in pacemaker, as we've discussed in the past, so that we don't allow a node to rejoin (or we fence it) if its daemons started after the last time we heard from it.

This is (mostly) intended behavior. We don't want to fence a node due to a brief network blip, for example. As long as the node rejoins successfully, and probes can establish the current state of all resources, fencing becomes unnecessary. There are some exceptions that we don't handle currently, where we should really fence no matter what, but I don't think any apply in this case. I don't think it should matter whether the node rebooted -- as long as we can successfully probe it, it doesn't matter what happened before that. Some users with quickly rebooting VMs put delays in their boot-up process before starting the cluster, to avoid situations like this.

A quick warning, in lpham's specific case (not especially relevant to the BZ): wait_for_all is set to 0 in corosync.conf, which means that you're vulnerable to a split brain before a new corosync membership containing both nodes forms. The rebooted node can go ahead and become quorate and start running resources that should be controlled by the surviving node. That could lead to problems depending on the resource configuration.
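
To make the two practical points above (the boot-time workaround and the wait_for_all warning) concrete, here is a hedged sketch using standard RHEL 8 pcs commands; the node name is a placeholder, and quorum options generally have to be changed while the cluster is stopped:

```sh
# Workaround: keep the cluster from starting automatically at boot, so a
# fast-rebooting node cannot rejoin before the surviving node has reacted.
# Start it manually (or after a deliberate delay) once the node is back up.
pcs cluster disable <node-name>        # or: pcs cluster disable --all

# Split-brain warning: with wait_for_all set to 0 in a two-node cluster, a
# freshly rebooted node can become quorate on its own before it rejoins its
# peer. Re-enabling wait_for_all closes that window:
pcs quorum update wait_for_all=1

# Equivalent corosync.conf quorum section:
#   quorum {
#       provider: corosync_votequorum
#       two_node: 1
#       wait_for_all: 1
#   }
```
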