Created attachment 1908393
crm_report
Description of problem:
The issue may be similar to:
https://bugzilla.redhat.com/show_bug.cgi?id=1956687
We have a two-node setup that uses diskless SBD plus a qdevice for fencing.
During dual-reboot testing, we sometimes hit an issue where one of the hosts ends up with Pacemaker shut down.
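For reference, a minimal sketch of a comparable setup (the cluster name and qnetd host below are illustrative, not ours; the actual configuration is in the attached crm_report):

# pcs host auth db2svtmf-srv-1 db2svtmf-srv-2 -u hacluster
# pcs cluster setup mycluster db2svtmf-srv-1 db2svtmf-srv-2
# pcs stonith sbd enable
# pcs property set stonith-watchdog-timeout=10
# pcs quorum device add model net host=qnetd-host algorithm=ffsplit
# pcs cluster start --all

The failing node's journal from one such test: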
Aug 29 13:06:26 db2svtmf-srv-2 systemd[1]: Started Corosync Cluster Engine.
Aug 29 13:06:26 db2svtmf-srv-2 systemd[1]: Starting Corosync Qdevice daemon...
Aug 29 13:06:35 db2svtmf-srv-2 corosync[1145]: [QUORUM] vsf_quorum.c:log_view_list:160 Sync members[2]: 1 2
Aug 29 13:06:35 db2svtmf-srv-2 corosync[1145]: [QUORUM] vsf_quorum.c:log_view_list:160 Sync joined[1]: 1
Aug 29 13:06:35 db2svtmf-srv-2 corosync[1145]: [VOTEQ ] votequorum.c:votequorum_sync_init:2441 waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Aug 29 13:06:35 db2svtmf-srv-2 corosync[1145]: [TOTEM ] totemsrp.c:memb_state_operational_enter:2138 A new membership (1.2d4) was formed. Members joined: 1
Aug 29 13:06:35 db2svtmf-srv-2 pacemaker-attrd[1460]: notice: Node db2svtmf-srv-1 state is now member
Aug 29 13:06:35 db2svtmf-srv-2 pacemaker-fenced[1458]: notice: Node db2svtmf-srv-1 state is now member
Aug 29 13:06:35 db2svtmf-srv-2 corosync[1145]: [QUORUM] vsf_quorum.c:quorum_api_set_quorum:176 This node is within the primary component and will provide service.
Aug 29 13:06:35 db2svtmf-srv-2 corosync[1145]: [QUORUM] vsf_quorum.c:log_view_list:160 Members[2]: 1 2
Aug 29 13:06:35 db2svtmf-srv-2 corosync[1145]: [MAIN ] main.c:corosync_sync_completed:304 Completed service synchronization, ready to provide service.
Aug 29 13:06:35 db2svtmf-srv-2 pacemaker-controld[1462]: notice: Quorum acquired
Aug 29 13:06:35 db2svtmf-srv-2 pacemaker-controld[1462]: notice: Node db2svtmf-srv-1 state is now member
Aug 29 13:06:35 db2svtmf-srv-2 pacemaker-based[1457]: notice: Node db2svtmf-srv-1 state is now member
Aug 29 13:06:36 db2svtmf-srv-2 pacemaker-attrd[1460]: notice: Detected another attribute writer (db2svtmf-srv-1), starting new election
Aug 29 13:06:36 db2svtmf-srv-2 pacemaker-attrd[1460]: notice: Setting #attrd-protocol[db2svtmf-srv-1]: (unset) -> 3
Aug 29 13:06:36 db2svtmf-srv-2 pacemaker-attrd[1460]: notice: Setting db2_svtdbm_0[db2svtmf-srv-1]: (unset) -> 0
Aug 29 13:06:36 db2svtmf-srv-2 pacemaker-attrd[1460]: notice: Setting db2fs_db2hamf_dev_disk_by-uuid_81304ea2-9c4a-4a90-9df7-e78af006e031[db2svtmf-srv-1]: (unset) -> 0
Aug 29 13:06:36 db2svtmf-srv-2 pacemaker-attrd[1460]: notice: Setting db2fs_db2hamf_db_dev_disk_by-uuid_f3228c4f-1dba-4589-9081-6233b338ae98[db2svtmf-srv-1]: (unset) -> 0
Aug 29 13:06:36 db2svtmf-srv-2 systemd[1]: NetworkManager-dispatcher.service: Succeeded.
Aug 29 13:06:51 db2svtmf-srv-2 corosync[1145]: [KNET ] libknet.h:log_deliver_fn:765 pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Aug 29 13:06:51 db2svtmf-srv-2 corosync[1145]: [KNET ] libknet.h:log_deliver_fn:765 pmtud: Global data MTU changed to: 1397
Aug 29 13:06:54 db2svtmf-srv-2 systemd[1]: systemd-hostnamed.service: Succeeded.
Aug 29 13:07:02 db2svtmf-srv-2 pacemaker-controld[1462]: notice: State transition S_PENDING -> S_NOT_DC
Aug 29 13:07:02 db2svtmf-srv-2 pacemaker-controld[1462]: notice: State transition S_NOT_DC -> S_PENDING
Aug 29 13:07:22 db2svtmf-srv-2 pacemaker-fenced[1458]: notice: Operation 'reboot' targeting db2svtmf-srv-2 by db2svtmf-srv-1 for pacemaker-controld.1470@db2svtmf-srv-1: OK
Aug 29 13:07:22 db2svtmf-srv-2 pacemaker-fenced[1458]: warning: Missing request information for client notifications for operation with result 0 (initiated before we came up?)
Aug 29 13:07:22 db2svtmf-srv-2 pacemaker-controld[1462]: crit: We were allegedly just fenced by db2svtmf-srv-1 for db2svtmf-srv-1!
Aug 29 13:07:22 db2svtmf-srv-2 pacemakerd[1451]: warning: Shutting cluster down because pacemaker-controld[1462] had fatal failure
Aug 29 13:07:22 db2svtmf-srv-2 pacemakerd[1451]: notice: Shutting down Pacemaker
Aug 29 13:07:22 db2svtmf-srv-2 pacemakerd[1451]: notice: Stopping pacemaker-schedulerd
Aug 29 13:07:22 db2svtmf-srv-2 pacemaker-schedulerd[1461]: notice: Caught 'Terminated' signal
Aug 29 13:07:22 db2svtmf-srv-2 pacemakerd[1451]: notice: Stopping pacemaker-attrd
Aug 29 13:07:22 db2svtmf-srv-2 pacemaker-attrd[1460]: notice: Caught 'Terminated' signal
Aug 29 13:07:22 db2svtmf-srv-2 pacemaker-based[1457]: warning: Could not notify client crmd: Broken pipe
Aug 29 13:07:22 db2svtmf-srv-2 pacemakerd[1451]: notice: Stopping pacemaker-execd
Aug 29 13:07:22 db2svtmf-srv-2 pacemaker-execd[1459]: notice: Caught 'Terminated' signal
Aug 29 13:07:22 db2svtmf-srv-2 pacemakerd[1451]: notice: Stopping pacemaker-fenced
Aug 29 13:07:22 db2svtmf-srv-2 pacemaker-fenced[1458]: notice: Caught 'Terminated' signal
Aug 29 13:07:22 db2svtmf-srv-2 pacemakerd[1451]: notice: Stopping pacemaker-based
Aug 29 13:07:22 db2svtmf-srv-2 pacemaker-based[1457]: notice: Caught 'Terminated' signal
Aug 29 13:07:22 db2svtmf-srv-2 pacemaker-based[1457]: notice: Disconnected from Corosync
Aug 29 13:07:22 db2svtmf-srv-2 pacemaker-based[1457]: notice: Disconnected from Corosync
Aug 29 13:07:22 db2svtmf-srv-2 pacemakerd[1451]: notice: Shutdown complete
Aug 29 13:07:22 db2svtmf-srv-2 pacemakerd[1451]: notice: Shutting down and staying down after fatal error
Aug 29 13:07:22 db2svtmf-srv-2 corosync[1145]: [CFG ] cfg.c:message_handler_req_exec_cfg_shutdown:580 Node 2 was shut down by sysadmin
Aug 29 13:07:22 db2svtmf-srv-2 corosync[1145]: [SERV ] service.c:service_exit_schedwrk_handler:373 Unloading all Corosync service engines.
Aug 29 13:07:22 db2svtmf-srv-2 systemd[1]: pacemaker.service: Succeeded.
Aug 29 13:07:22 db2svtmf-srv-2 corosync-qdevice[1370]: Can't dispatch votequorum messages
Aug 29 13:07:22 db2svtmf-srv-2 corosync-qdevice[1370]: Can't call votequorum_qdevice_poll. Error CS_ERR_LIBRARY
Aug 29 13:07:22 db2svtmf-srv-2 corosync-qdevice[1370]: qdevice_model_net_run fatal error. Can't update cast vote timer vote
Aug 29 13:07:22 db2svtmf-srv-2 corosync[1145]: [QB ] ipc_setup.c:qb_ipcs_us_withdraw:595 withdrawing server sockets
Aug 29 13:07:22 db2svtmf-srv-2 corosync[1145]: [SERV ] service.c:corosync_service_unlink_and_exit_priority:240 Service engine unloaded: corosync vote quorum service v1.0
Aug 29 13:07:22 db2svtmf-srv-2 corosync[1145]: [QB ] ipc_setup.c:qb_ipcs_us_withdraw:595 withdrawing server sockets
Aug 29 13:07:22 db2svtmf-srv-2 corosync[1145]: [SERV ] service.c:corosync_service_unlink_and_exit_priority:240 Service engine unloaded: corosync configuration map access
Aug 29 13:07:22 db2svtmf-srv-2 corosync[1145]: [QB ] ipc_setup.c:qb_ipcs_us_withdraw:595 withdrawing server sockets
Aug 29 13:07:22 db2svtmf-srv-2 corosync[1145]: [SERV ] service.c:corosync_service_unlink_and_exit_priority:240 Service engine unloaded: corosync configuration service
Aug 29 13:07:22 db2svtmf-srv-2 sbd[1258]: cluster: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Aug 29 13:07:22 db2svtmf-srv-2 corosync[1145]: [QB ] ipc_setup.c:qb_ipcs_us_withdraw:595 withdrawing server sockets
Aug 29 13:07:22 db2svtmf-srv-2 corosync[1145]: [SERV ] service.c:corosync_service_unlink_and_exit_priority:240 Service engine unloaded: corosync cluster closed process group service v1.01
Aug 29 13:07:22 db2svtmf-srv-2 corosync[1145]: [QB ] ipc_setup.c:qb_ipcs_us_withdraw:595 withdrawing server sockets
Aug 29 13:07:22 db2svtmf-srv-2 corosync[1145]: [SERV ] service.c:corosync_service_unlink_and_exit_priority:240 Service engine unloaded: corosync cluster quorum service v0.1
Aug 29 13:07:22 db2svtmf-srv-2 corosync[1145]: [SERV ] service.c:corosync_service_unlink_and_exit_priority:240 Service engine unloaded: corosync profile loading service
Aug 29 13:07:22 db2svtmf-srv-2 sbd[1258]: cluster: warning: sbd_membership_destroy: Lost connection to corosync
Aug 29 13:07:22 db2svtmf-srv-2 sbd[1258]: cluster: error: set_servant_health: Cluster connection terminated
Aug 29 13:07:22 db2svtmf-srv-2 sbd[1258]: cluster: warning: verify_against_cmap_config: Cannot initialize CMAP service
Aug 29 13:07:22 db2svtmf-srv-2 sbd[1246]: warning: inquisitor_child: cluster health check: UNHEALTHY
Aug 29 13:07:22 db2svtmf-srv-2 sbd[1246]: warning: inquisitor_child: Servant cluster is outdated (age: 59)
Aug 29 13:07:23 db2svtmf-srv-2 corosync[1145]: [MAIN ] util.c:_corosync_exit_error:133 Corosync Cluster Engine exiting normally
Aug 29 13:07:23 db2svtmf-srv-2 systemd[1]: corosync.service: Control process exited, code=exited status=1
Aug 29 13:07:23 db2svtmf-srv-2 systemd[1]: corosync.service: Failed with result 'exit-code'.
Aug 29 13:07:23 db2svtmf-srv-2 sbd[1258]: cluster: warning: verify_against_cmap_config: Cannot initialize CMAP service
Aug 29 13:07:24 db2svtmf-srv-2 corosync-qdevice[1370]: Can't delete cmap totemconfig_reload_in_progress tracking
Aug 29 13:07:24 db2svtmf-srv-2 corosync-qdevice[1370]: Can't delete cmap nodelist tracking
Aug 29 13:07:24 db2svtmf-srv-2 corosync-qdevice[1370]: Can't delete cmap logging tracking
Aug 29 13:07:24 db2svtmf-srv-2 corosync-qdevice[1370]: Can't delete cmap heuristics tracking
Aug 29 13:07:24 db2svtmf-srv-2 corosync-qdevice[1370]: Can't stop tracking votequorum changes. Error CS_ERR_LIBRARY
Aug 29 13:07:24 db2svtmf-srv-2 corosync-qdevice[1370]: Unable to unregister votequorum device. Error CS_ERR_LIBRARY
Aug 29 13:07:24 db2svtmf-srv-2 systemd[1]: corosync-qdevice.service: Main process exited, code=exited, status=1/FAILURE
Aug 29 13:07:24 db2svtmf-srv-2 systemd[1]: corosync-qdevice.service: Failed with result 'exit-code'.
Aug 29 13:07:24 db2svtmf-srv-2 sbd[1258]: cluster: warning: verify_against_cmap_config: Cannot initialize CMAP service
Aug 29 13:07:24 db2svtmf-srv-2 systemd[1]: corosync-qdevice.service: Service RestartSec=100ms expired, scheduling restart.
Aug 29 13:07:24 db2svtmf-srv-2 systemd[1]: corosync-qdevice.service: Scheduled restart job, restart counter is at 1.
Aug 29 13:07:24 db2svtmf-srv-2 systemd[1]: Stopped Corosync Qdevice daemon.
Aug 29 13:07:24 db2svtmf-srv-2 systemd[1]: Starting Corosync Cluster Engine...
Version-Release number of selected component (if applicable):
corosync-3.1.6-2
pacemaker-2.1.2-4
How reproducible:
If both hosts are rebooted and come back at the same time, we seem more likely to hit the issue.
Steps to Reproduce:
1. Set up a 2-node cluster with a qdevice host and diskless SBD.
2. Reboot both hosts at the same time (see the sketch below).
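A minimal way to trigger the dual reboot, assuming passwordless root SSH to both nodes (hostnames as in the logs above):

# Reboot both nodes near-simultaneously:
for h in db2svtmf-srv-1 db2svtmf-srv-2; do
    ssh root@"$h" reboot &
done
wait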
Actual results:
Corosync restarts on one of the nodes and Pacemaker shuts down and stays down.
Expected results:
After a dual reboot, either both hosts should recover, or the node with the higher node ID should get fenced/rebooted again, after which everything recovers.
Additional info:
This is a known timing issue that occurs when a fenced node rejoins the cluster before the fence agent reports success. The workaround is to add a delay at boot before starting Corosync, e.g. via a systemd drop-in:
# systemctl edit corosync.service
(in the editor that opens, add the following and save)
[Service]
ExecStartPre=/bin/sleep 60
# systemctl daemon-reload
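systemctl edit writes the drop-in to /etc/systemd/system/corosync.service.d/override.conf, and the added ExecStartPre runs before corosync is started, delaying the node's rejoin by 60 seconds. To check that the drop-in is active:

# systemctl cat corosync.service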
The problematic sequence is: the node gets fenced, then reboots and rejoins the cluster before the fence agent has reported success, so it receives the notification of its own fencing. To be safe, Pacemaker assumes something went wrong with the fencing and shuts itself down on that node.
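To confirm a node hit this sequence, search its journal for the controller's self-fencing message (the same strings appear in the log excerpt above):

# journalctl -u pacemaker | grep -E 'allegedly just fenced|Shutting cluster down'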
*** This bug has been marked as a duplicate of bug 1956687 ***