Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1947731

Summary: After crashing a node, the cluster doesn't restart on that node once it's back up.
Product: Red Hat Enterprise Linux 8
Reporter: sreekantha <sreekantha>
Component: pacemaker
Assignee: Ken Gaillot <kgaillot>
Status: CLOSED DUPLICATE
QA Contact: cluster-qe <cluster-qe>
Severity: medium
Docs Contact:
Priority: medium
Version: 8.0
CC: agk, cfeist, cluster-maint, fdinitto, rraghotham, sdivya, tutikas
Target Milestone: rc
Keywords: Triaged
Target Release: ---
Flags: pm-rhel: mirror+
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: A Pacemaker node that was fenced can reboot and rejoin the cluster before the fence agent returns success for the fencing.
Consequence: The node receives its own fencing notification and immediately stops Pacemaker.
Fix:
Result:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-05-13 15:30:31 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  message logs of the node crashed (flags: none)
  pcsd-log (flags: none)
  corosync log (flags: none)

Description sreekantha 2021-04-09 04:10:35 UTC
Created attachment 1770487 [details]
message logs of the node crashed

Description of problem:
After crashing a node, the cluster doesn't restart on that node once it's back up.

On a 5-node RHEL 7.9 cluster, with the cluster up and I/O in progress, we crashed the active node. The cluster failed over to another node, but the cluster does not start on the crashed node after it comes back up; pcs status shows "Error: cluster is not currently running on this node".


Version-Release number of selected component (if applicable):
[root@rhel79node3 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.9 (Maipo)

However, this is an intermittent issue seen only on RHEL 7.9; we never observed it on RHEL 8.2.
 
Primary VM: rhel79node3
Active node after failover: rhel79node1
Timestamp of crashing rhel79node3: Sun Feb 28 21:43:08 PST 2021
============================================================
[root@rhel79node3 ~]# date ; sysctl kernel.panic=10; echo 1 > /proc/sys/kernel/sysrq ; echo c > /proc/sysrq-trigger
Sun Feb 28 21:43:08 PST 2021
login as: root
root.74.18's password:
Last login: Sun Feb 28 21:31:21 2021 from 10.104.2.231
ABRT has detected 1 problem(s). For more info run: abrt-cli list --since 1614576681
[root@rhel79node3 ~]# pcs status
Error: cluster is not currently running on this node
[root@rhel79node3 ~]# pcs status
Error: cluster is not currently running on this node
[root@rhel79node3 ~]#
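The error above means pcs found no running corosync/pacemaker on the node at all. Assuming the daemons were stopped deliberately after the fatal failure (as the later logs suggest) and the node is otherwise healthy, a manual `pcs cluster start` on the affected node is what rejoins it. A minimal sketch for detecting this state from `pcs status` output (the `needs_manual_start` helper is hypothetical, not part of pcs):

```shell
#!/bin/sh
# needs_manual_start: hypothetical helper that checks `pcs status` output for
# the "not currently running" error seen on rhel79node3. If it matches, the
# node needs a manual `pcs cluster start` to rejoin the cluster.
needs_manual_start() {
    case "$1" in
        *"cluster is not currently running on this node"*) echo yes ;;
        *) echo no ;;
    esac
}

# Example usage (as root on the affected node):
#   out=$(pcs status 2>&1)
#   [ "$(needs_manual_start "$out")" = yes ] && pcs cluster start
```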

==================================================================================================
pcs status after the crash:

[root@rhel79node1 ~]# pcs status
Cluster name: Redhat79-cls
Stack: corosync
Current DC: rhel79node2 (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
Last updated: Sun Feb 28 21:45:54 2021
Last change: Sat Feb 27 00:25:11 2021 by root via cibadmin on rhel79node3

5 nodes configured
7 resource instances configured

Online: [ rhel79node1 rhel79node2 rhel79node4 rhel79node5 ]
OFFLINE: [ rhel79node3 ]

Full list of resources:

 rhel79fence    (stonith:fence_scsi):   Started rhel79node2
 Resource Group: SQL_Cluster
     my_lvm     (ocf::heartbeat:LVM-activate):  Started rhel79node1
     my_fs      (ocf::heartbeat:Filesystem):    Started rhel79node1
     my_lvm2    (ocf::heartbeat:LVM-activate):  Started rhel79node1
     my_fs2     (ocf::heartbeat:Filesystem):    Started rhel79node1
     virtualIP  (ocf::heartbeat:IPaddr2):       Started rhel79node1
     RHEL79SQLFCI       (ocf::mssql:fci):       Started rhel79node1

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
[root@rhel79node1 ~]#
====================================================================================================

After the crash: pcsd, corosync, and pacemaker status

[root@rhel79node3 ~]# service pcsd status
Redirecting to /bin/systemctl status pcsd.service
● pcsd.service - PCS GUI and remote configuration interface
   Loaded: loaded (/usr/lib/systemd/system/pcsd.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2021-02-28 21:43:42 PST; 42min ago
     Docs: man:pcsd(8)
           man:pcs(8)
 Main PID: 3388 (pcsd)
    Tasks: 6
   CGroup: /system.slice/pcsd.service
           └─3388 /usr/bin/ruby /usr/lib/pcsd/pcsd

Feb 28 21:43:40 rhel79node3 systemd[1]: Starting PCS GUI and remote configuration interface...
Feb 28 21:43:42 rhel79node3 systemd[1]: Started PCS GUI and remote configuration interface.
[root@rhel79node3 ~]#


[root@rhel79node3 ~]# service corosync status
Redirecting to /bin/systemctl status corosync.service
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/usr/lib/systemd/system/corosync.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Sun 2021-02-28 21:43:45 PST; 42min ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
  Process: 4265 ExecStop=/usr/share/corosync/corosync stop (code=exited, status=0/SUCCESS)
  Process: 3390 ExecStart=/usr/share/corosync/corosync start (code=exited, status=0/SUCCESS)
 Main PID: 3465 (code=exited, status=0/SUCCESS)

Feb 28 21:43:45 rhel79node3 corosync[3465]:  [QB    ] withdrawing server sockets
Feb 28 21:43:45 rhel79node3 corosync[3465]:  [SERV  ] Service engine unloaded: corosync configuration map access
Feb 28 21:43:45 rhel79node3 corosync[3465]:  [QB    ] withdrawing server sockets
Feb 28 21:43:45 rhel79node3 corosync[3465]:  [SERV  ] Service engine unloaded: corosync configuration service
Feb 28 21:43:45 rhel79node3 corosync[3465]:  [QB    ] withdrawing server sockets
Feb 28 21:43:45 rhel79node3 corosync[3465]:  [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
Feb 28 21:43:45 rhel79node3 corosync[3465]:  [QB    ] withdrawing server sockets
Feb 28 21:43:45 rhel79node3 corosync[3465]:  [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Feb 28 21:43:45 rhel79node3 corosync[3465]:  [SERV  ] Service engine unloaded: corosync profile loading service
Feb 28 21:43:45 rhel79node3 corosync[3465]:  [MAIN  ] Corosync Cluster Engine exiting normally
[root@rhel79node3 ~]#


[root@rhel79node3 ~]# service pacemaker status
Redirecting to /bin/systemctl status pacemaker.service
● pacemaker.service - Pacemaker High Availability Cluster Manager
   Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Sun 2021-02-28 21:43:45 PST; 42min ago
     Docs: man:pacemakerd
           https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html
  Process: 3591 ExecStart=/usr/sbin/pacemakerd -f (code=exited, status=100)
 Main PID: 3591 (code=exited, status=100)

Feb 28 21:43:45 rhel79node3 attrd[3663]:   notice: Caught 'Terminated' signal
Feb 28 21:43:45 rhel79node3 pacemakerd[3591]:   notice: Stopping lrmd
Feb 28 21:43:45 rhel79node3 pacemakerd[3591]:   notice: Stopping stonith-ng
Feb 28 21:43:45 rhel79node3 stonith-ng[3661]:   notice: Caught 'Terminated' signal
Feb 28 21:43:45 rhel79node3 pacemakerd[3591]:   notice: Stopping cib
Feb 28 21:43:45 rhel79node3 cib[3659]:   notice: Caught 'Terminated' signal
Feb 28 21:43:45 rhel79node3 cib[3659]:   notice: Disconnected from Corosync
Feb 28 21:43:45 rhel79node3 cib[3659]:   notice: Disconnected from Corosync
Feb 28 21:43:45 rhel79node3 pacemakerd[3591]:   notice: Shutdown complete
Feb 28 21:43:45 rhel79node3 pacemakerd[3591]:   notice: Attempting to inhibit respawning after fatal error
[root@rhel79node3 ~]#
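The key detail above is `status=100`: pacemakerd treats the crmd failure as fatal and "attempts to inhibit respawning", so the unit stays dead with the exit code recorded by systemd. A sketch for pulling that recorded code out of `systemctl show pacemaker -p ExecMainStatus` output (the parsing helper is ours, not part of systemd or pcs):

```shell
#!/bin/sh
# exec_main_status: hypothetical helper that extracts the last main-process
# exit code recorded by systemd from output such as:
#   systemctl show pacemaker -p ExecMainStatus   ->  "ExecMainStatus=100"
exec_main_status() {
    printf '%s\n' "$1" | sed -n 's/^ExecMainStatus=//p'
}

# Example usage (on the affected node):
#   code=$(exec_main_status "$(systemctl show pacemaker -p ExecMainStatus)")
#   [ "$code" = 100 ] && echo "pacemakerd exited fatally and will not respawn"
```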

Comment 2 sreekantha 2021-04-09 04:11:49 UTC
<<var log message snip>>
Feb 28 21:43:26 rhel79node3 kernel: Command line: BOOT_IMAGE=/vmlinuz-3.10.0-1160.15.2.el7.x86_64 root=/dev/mapper/rhel-root ro crashkernel=auto spectre_v2=retpoline rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap rhgb quiet LANG=en_US.UTF-8
Feb 28 21:43:26 rhel79node3 kernel: Disabled fast string operations
Feb 28 21:43:26 rhel79node3 kernel: e820: BIOS-provided physical RAM map:
Feb 28 21:43:26 rhel79node3 kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009f3ff] usable
Feb 28 21:43:26 rhel79node3 kernel: BIOS-e820: [mem 0x000000000009f400-0x000000000009ffff] reserved
Feb 28 21:43:26 rhel79node3 kernel: BIOS-e820: [mem 0x00000000000dc000-0x00000000000fffff] reserved
Feb 28 21:43:26 rhel79node3 kernel: BIOS-e820: [mem 0x0000000000100000-0x00000000bfedffff] usable
Feb 28 21:43:26 rhel79node3 kernel: BIOS-e820: [mem 0x00000000bfee0000-0x00000000bfefefff] ACPI data
Feb 28 21:43:26 rhel79node3 kernel: BIOS-e820: [mem 0x00000000bfeff000-0x00000000bfefffff] ACPI NVS
Feb 28 21:43:26 rhel79node3 kernel: BIOS-e820: [mem 0x00000000bff00000-0x00000000bfffffff] usable
Feb 28 21:43:26 rhel79node3 kernel: BIOS-e820: [mem 0x00000000f0000000-0x00000000f7ffffff] reserved
Feb 28 21:43:26 rhel79node3 kernel: BIOS-e820: [mem 0x00000000fec00000-0x00000000fec0ffff] reserved
Feb 28 21:43:26 rhel79node3 kernel: BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
Feb 28 21:43:26 rhel79node3 kernel: BIOS-e820: [mem 0x00000000fffe0000-0x00000000ffffffff] reserved
Feb 28 21:43:26 rhel79node3 kernel: BIOS-e820: [mem 0x0000000100000000-0x000000013fffffff] usable
Feb 28 21:43:26 rhel79node3 kernel: NX (Execute Disable) protection: active
Feb 28 21:43:26 rhel79node3 kernel: SMBIOS 2.7 present.
Feb 28 21:43:26 rhel79node3 kernel: DMI: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
Feb 28 21:43:26 rhel79node3 kernel: Hypervisor detected: VMware
Feb 28 21:43:26 rhel79node3 kernel: vmware: TSC freq read from hypervisor : 2294.609 MHz
Feb 28 21:43:26 rhel79node3 kernel: vmware: Host bus clock speed read from hypervisor : 66000000 Hz
Feb 28 21:43:26 rhel79node3 kernel: vmware: using sched offset of 9246335634 ns
Feb 28 21:43:26 rhel79node3 kernel: e820: last_pfn = 0x140000 max_arch_pfn = 0x400000000
Feb 28 21:43:26 rhel79node3 kernel: PAT configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- UC
Feb 28 21:43:26 rhel79node3 kernel: total RAM covered: 7168M
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
Feb 28 21:43:26 rhel79node3 kernel: acpiphp: Slot [63] registered
Feb 28 21:43:26 rhel79node3 kernel: pci 0000:02:01.0: System wakeup disabled by ACPI
Feb 28 21:43:26 rhel79node3 kernel: pci 0000:00:11.0: PCI bridge to [bus 02] (subtractive decode)
Feb 28 21:43:26 rhel79node3 kernel: pci 0000:03:00.0: 128.000 Gb/s available PCIe bandwidth, limited by 5 GT/s x32 link at 0000:00:15.0 (capable of 252.032 Gb/s with 8 GT/s x32 link)
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::
Feb 28 21:43:32 rhel79node3 kernel: sd 2:0:62:0: Attached scsi generic sg62 type 0
Feb 28 21:43:32 rhel79node3 kernel: sr 4:0:0:0: Attached scsi generic sg63 type 5
Feb 28 21:43:32 rhel79node3 systemd: Created slice system-lvm2\x2dpvscan.slice.
Feb 28 21:43:32 rhel79node3 systemd: Starting LVM2 PV scan on device 8:32...
Feb 28 21:43:32 rhel79node3 systemd: Starting LVM2 PV scan on device 65:16...
Feb 28 21:43:32 rhel79node3 systemd: Starting LVM2 PV scan on device 8:240...
Feb 28 21:43:32 rhel79node3 systemd: Starting LVM2 PV scan on device 8:192...
Feb 28 21:43:32 rhel79node3 systemd: Starting LVM2 PV scan on device 65:32...
::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::
Feb 28 21:43:33 rhel79node3 kernel: XFS (sda1): Ending recovery (logdev: internal)
Feb 28 21:43:33 rhel79node3 systemd: Mounted /boot.
Feb 28 21:43:33 rhel79node3 lvm: pvscan[2346] VG my_vg run autoactivation.
Feb 28 21:43:33 rhel79node3 lvm: pvscan[2330] VG my_vg2 run autoactivation.
Feb 28 21:43:33 rhel79node3 lvm: 1 logical volume(s) in volume group "my_vg" now active
Feb 28 21:43:33 rhel79node3 systemd: Started LVM2 PV scan on device 65:128.
Feb 28 21:43:33 rhel79node3 lvm: pvscan[2348] VG my_vg skip autoactivation.
Feb 28 21:43:33 rhel79node3 systemd: Started LVM2 PV scan on device 65:112.
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::
Feb 28 21:43:37 rhel79node3 smartd[2773]: Device: /dev/sdd, [DGC      VRAID            5006], lu id: 0x6006016080104600073425602cce8583, S/N: APM00181818886, 5.36 GB
Feb 28 21:43:37 rhel79node3 dbus[2733]: [system] Activating via systemd: service name='org.freedesktop.PolicyKit1' unit='polkit.service'
Feb 28 21:43:37 rhel79node3 smartd[2773]: Device: /dev/sdd, IE (SMART) not enabled, skip device
Feb 28 21:43:37 rhel79node3 smartd[2773]: Try 'smartctl -s on /dev/sdd' to turn on SMART features
Feb 28 21:43:37 rhel79node3 smartd[2773]: Device: /dev/sde, opened
Feb 28 21:43:37 rhel79node3 smartd[2773]: Device: /dev/sde, [DGC      VRAID            5006], lu id: 0x60060160801046000834256027a80e43, S/N: APM00181818886, 5.36 GB
Feb 28 21:43:37 rhel79node3 smartd[2773]: Device: /dev/sde, IE (SMART) not enabled, skip device
Feb 28 21:43:37 rhel79node3 smartd[2773]: Try 'smartctl -s on /dev/sde' to turn on SMART features
Feb 28 21:43:37 rhel79node3 smartd[2773]: Device: /dev/sdf, opened
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::
Feb 28 21:43:40 rhel79node3 corosync[3412]: [MAIN  ] Corosync Cluster Engine ('2.4.5'): started and ready to provide service.
Feb 28 21:43:40 rhel79node3 corosync[3412]: [MAIN  ] Corosync built-in features: dbus systemd xmlconf qdevices qnetd snmp libcgroup pie relro bindnow
Feb 28 21:43:41 rhel79node3 corosync[3465]: [TOTEM ] Initializing transport (UDP/IP Unicast).
Feb 28 21:43:41 rhel79node3 corosync[3465]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: none hash: none
Feb 28 21:43:41 rhel79node3 corosync[3465]: [TOTEM ] Initializing transport (UDP/IP Unicast).
Feb 28 21:43:41 rhel79node3 corosync[3465]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: none hash: none
Feb 28 21:43:41 rhel79node3 systemd: Started GNOME Display Manager.
Feb 28 21:43:41 rhel79node3 corosync[3465]: [TOTEM ] The network interface [192.168.10.43] is now up.
Feb 28 21:43:41 rhel79node3 corosync[3465]: [SERV  ] Service engine loaded: corosync configuration map access [0]
Feb 28 21:43:41 rhel79node3 corosync[3465]: [QB    ] server name: cmap
Feb 28 21:43:41 rhel79node3 corosync[3465]: [SERV  ] Service engine loaded: corosync configuration service [1]
Feb 28 21:43:41 rhel79node3 corosync[3465]: [QB    ] server name: cfg
Feb 28 21:43:41 rhel79node3 corosync[3465]: [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Feb 28 21:43:41 rhel79node3 corosync[3465]: [QB    ] server name: cpg
Feb 28 21:43:41 rhel79node3 corosync[3465]: [SERV  ] Service engine loaded: corosync profile loading service [4]
Feb 28 21:43:41 rhel79node3 corosync[3465]: [QUORUM] Using quorum provider corosync_votequorum
Feb 28 21:43:41 rhel79node3 corosync[3465]: [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Feb 28 21:43:41 rhel79node3 corosync[3465]: [QB    ] server name: votequorum
Feb 28 21:43:41 rhel79node3 corosync[3465]: [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Feb 28 21:43:41 rhel79node3 corosync[3465]: [QB    ] server name: quorum
Feb 28 21:43:41 rhel79node3 corosync[3465]: [TOTEM ] adding new UDPU member {192.168.10.41}
Feb 28 21:43:41 rhel79node3 corosync[3465]: [TOTEM ] adding new UDPU member {192.168.10.42}
Feb 28 21:43:41 rhel79node3 corosync[3465]: [TOTEM ] adding new UDPU member {192.168.10.43}
Feb 28 21:43:41 rhel79node3 corosync[3465]: [TOTEM ] adding new UDPU member {192.168.10.44}
Feb 28 21:43:41 rhel79node3 corosync[3465]: [TOTEM ] adding new UDPU member {192.168.10.45}
Feb 28 21:43:41 rhel79node3 corosync[3465]: [TOTEM ] The network interface [192.168.20.43] is now up.
Feb 28 21:43:41 rhel79node3 corosync[3465]: [TOTEM ] adding new UDPU member {192.168.20.41}
Feb 28 21:43:41 rhel79node3 corosync[3465]: [TOTEM ] adding new UDPU member {192.168.20.42}
Feb 28 21:43:41 rhel79node3 corosync[3465]: [TOTEM ] adding new UDPU member {192.168.20.43}
Feb 28 21:43:41 rhel79node3 corosync[3465]: [TOTEM ] adding new UDPU member {192.168.20.44}
Feb 28 21:43:41 rhel79node3 corosync[3465]: [TOTEM ] adding new UDPU member {192.168.20.45}
Feb 28 21:43:41 rhel79node3 corosync[3465]: [TOTEM ] A new membership (192.168.10.43:14936) was formed. Members joined: 3
Feb 28 21:43:41 rhel79node3 corosync[3465]: [CPG   ] downlist left_list: 0 received
Feb 28 21:43:41 rhel79node3 corosync[3465]: [QUORUM] Members[1]: 3
Feb 28 21:43:41 rhel79node3 corosync[3465]: [MAIN  ] Completed service synchronization, ready to provide service.
Feb 28 21:43:41 rhel79node3 corosync[3465]: [TOTEM ] A new membership (192.168.10.41:14940) was formed. Members joined: 1 2 4 5
Feb 28 21:43:41 rhel79node3 corosync[3465]: [CPG   ] downlist left_list: 1 received
Feb 28 21:43:41 rhel79node3 corosync[3465]: [CPG   ] downlist left_list: 1 received
Feb 28 21:43:41 rhel79node3 corosync[3465]: [CPG   ] downlist left_list: 1 received
Feb 28 21:43:41 rhel79node3 corosync[3465]: [CPG   ] downlist left_list: 1 received
Feb 28 21:43:41 rhel79node3 corosync[3465]: [CPG   ] downlist left_list: 0 received
Feb 28 21:43:41 rhel79node3 corosync[3465]: [QUORUM] This node is within the primary component and will provide service.
Feb 28 21:43:41 rhel79node3 corosync[3465]: [QUORUM] Members[5]: 1 2 3 4 5
Feb 28 21:43:41 rhel79node3 corosync[3465]: [MAIN  ] Completed service synchronization, ready to provide service.
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::
Feb 28 21:43:42 rhel79node3 libvirtd: 2021-03-01 05:43:42.040+0000: 3806: info : libvirt version: 4.5.0, package: 36.el7_9.3 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2020-10-20-17:39:55, x86-vm-25.build.eng.bos.redhat.com)
Feb 28 21:43:42 rhel79node3 libvirtd: 2021-03-01 05:43:42.040+0000: 3806: info : hostname: rhel79node3
Feb 28 21:43:42 rhel79node3 libvirtd: 2021-03-01 05:43:42.040+0000: 3806: error : virHostCPUGetTscInfo:1389 : Unable to open /dev/kvm: No such file or directory
Feb 28 21:43:42 rhel79node3 libvirtd: 2021-03-01 05:43:42.069+0000: 3806: error : virHostCPUGetTscInfo:1389 : Unable to open /dev/kvm: No such file or directory
Feb 28 21:43:42 rhel79node3 libvirtd: 2021-03-01 05:43:42.070+0000: 3806: error : virHostCPUGetTscInfo:1389 : Unable to open /dev/kvm: No such file or directory
Feb 28 21:43:42 rhel79node3 NetworkManager[2878]: <info>  [1614577422.0764] manager: (virbr0): new Bridge device (/org/freedesktop/NetworkManager/Devices/5)
Feb 28 21:43:42 rhel79node3 kernel: tun: Universal TUN/TAP device driver, 1.6
Feb 28 21:43:42 rhel79node3 kernel: tun: (C) 1999-2004 Max Krasnyansky <maxk>
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::
Feb 28 21:43:45 rhel79node3 pacemakerd[3591]: warning: Shutting cluster down because crmd[3667] had fatal failure
Feb 28 21:43:45 rhel79node3 pacemakerd[3591]:  notice: Shutting down Pacemaker
Feb 28 21:43:45 rhel79node3 pacemakerd[3591]:  notice: Stopping pengine
Feb 28 21:43:45 rhel79node3 pengine[3666]:  notice: Caught 'Terminated' signal
Feb 28 21:43:45 rhel79node3 pacemakerd[3591]:  notice: Stopping attrd
Feb 28 21:43:45 rhel79node3 cib[3659]: warning: new_event_notification (/dev/shm/qb-3659-3667-11-MhToyE/qb): Broken pipe (32)
Feb 28 21:43:45 rhel79node3 cib[3659]: warning: Notification of client crmd/df83c1d0-4b64-444f-a956-bf5c2f15aaf6 failed
Feb 28 21:43:45 rhel79node3 attrd[3663]:  notice: Caught 'Terminated' signal
Feb 28 21:43:45 rhel79node3 pacemakerd[3591]:  notice: Stopping lrmd
Feb 28 21:43:45 rhel79node3 lrmd[3662]:  notice: Caught 'Terminated' signal
Feb 28 21:43:45 rhel79node3 pacemakerd[3591]:  notice: Stopping stonith-ng
Feb 28 21:43:45 rhel79node3 stonith-ng[3661]:  notice: Caught 'Terminated' signal
Feb 28 21:43:45 rhel79node3 pacemakerd[3591]:  notice: Stopping cib
Feb 28 21:43:45 rhel79node3 cib[3659]:  notice: Caught 'Terminated' signal
Feb 28 21:43:45 rhel79node3 cib[3659]:  notice: Disconnected from Corosync
Feb 28 21:43:45 rhel79node3 cib[3659]:  notice: Disconnected from Corosync
Feb 28 21:43:45 rhel79node3 pacemakerd[3591]:  notice: Shutdown complete
Feb 28 21:43:45 rhel79node3 pacemakerd[3591]:  notice: Attempting to inhibit respawning after fatal error
Feb 28 21:43:45 rhel79node3 corosync[3465]: [CFG   ] Node 3 was shut down by sysadmin
Feb 28 21:43:45 rhel79node3 corosync[3465]: [SERV  ] Unloading all Corosync service engines.
Feb 28 21:43:45 rhel79node3 corosync[3465]: [QB    ] withdrawing server sockets
Feb 28 21:43:45 rhel79node3 corosync[3465]: [SERV  ] Service engine unloaded: corosync vote quorum service v1.0
Feb 28 21:43:45 rhel79node3 corosync[3465]: [QB    ] withdrawing server sockets
Feb 28 21:43:45 rhel79node3 corosync[3465]: [SERV  ] Service engine unloaded: corosync configuration map access
Feb 28 21:43:45 rhel79node3 corosync[3465]: [QB    ] withdrawing server sockets
Feb 28 21:43:45 rhel79node3 corosync[3465]: [SERV  ] Service engine unloaded: corosync configuration service
Feb 28 21:43:45 rhel79node3 corosync[3465]: [QB    ] withdrawing server sockets
Feb 28 21:43:45 rhel79node3 corosync[3465]: [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
Feb 28 21:43:45 rhel79node3 corosync[3465]: [QB    ] withdrawing server sockets
Feb 28 21:43:45 rhel79node3 corosync[3465]: [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Feb 28 21:43:45 rhel79node3 corosync[3465]: [SERV  ] Service engine unloaded: corosync profile loading service
Feb 28 21:43:45 rhel79node3 corosync[3465]: [MAIN  ] Corosync Cluster Engine exiting normally
Feb 28 21:43:45 rhel79node3 org.a11y.Bus: Activating service name='org.a11y.atspi.Registry'
Feb 28 21:43:45 rhel79node3 org.a11y.Bus: Successfully activated service 'org.a11y.atspi.Registry'
Feb 28 21:43:45 rhel79node3 org.a11y.atspi.Registry: SpiRegistry daemon is running with well-known name - org.a11y.atspi.Registry
Feb 28 21:43:45 rhel79node3 gnome-session: generating cookie with syscall
Feb 28 21:43:45 rhel79node3 gnome-session: generating cookie with syscall
Feb 28 21:43:45 rhel79node3 gnome-session: generating cookie with syscall
Feb 28 21:43:45 rhel79node3 gnome-session: generating cookie with syscall
Feb 28 21:43:46 rhel79node3 chronyd[2746]: Selected source 10.166.1.120
Feb 28 21:43:46 rhel79node3 systemd: Configuration file /usr/lib/systemd/system/mssql-server.service is marked executable. Please remove executable permission bits. Proceeding anyway.
Feb 28 21:43:46 rhel79node3 dbus[2733]: [system] Activating via systemd: service name='org.freedesktop.UPower' unit='upower.service'
Feb 28 21:43:46 rhel79node3 systemd: Starting Daemon for power management...
Feb 28 21:43:46 rhel79node3 dbus[2733]: [system] Successfully activated service 'org.freedesktop.UPower'
Feb 28 21:43:46 rhel79node3 systemd: Started Daemon for power management.
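The timeline above (corosync up and full membership at 21:43:41, cluster torn down at 21:43:45) can be confirmed after the fact by scanning the logs for the fatal-failure shutdown line pacemakerd emits. A minimal sketch; the log path is the caller's choice (e.g. /var/log/messages or /var/log/cluster/corosync.log on this RHEL 7 setup), and the helper name is hypothetical:

```shell
#!/bin/sh
# fatal_shutdown_logged: succeed if the given log file contains the pacemakerd
# warning seen in this bug ("Shutting cluster down because crmd[...] had
# fatal failure"), indicating the cluster stop was a fatal abort, not a
# clean shutdown.
fatal_shutdown_logged() {
    grep -q 'Shutting cluster down because' "$1"
}

# Example usage:
#   fatal_shutdown_logged /var/log/messages && echo "fatal crmd failure found"
```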

Comment 3 sreekantha 2021-04-09 04:12:36 UTC
<<corosync.log>>
Feb 28 21:43:45 [3591] rhel79node3 pacemakerd:  warning: pcmk_child_exit:       Shutting cluster down because crmd[3667] had fatal failure
Feb 28 21:43:45 [3591] rhel79node3 pacemakerd:   notice: pcmk_shutdown_worker:  Shutting down Pacemaker
Feb 28 21:43:45 [3591] rhel79node3 pacemakerd:   notice: stop_child:    Stopping pengine | sent signal 15 to process 3666
Feb 28 21:43:45 [3666] rhel79node3    pengine:   notice: crm_signal_dispatch:   Caught 'Terminated' signal | 15 (invoking handler)
Feb 28 21:43:45 [3666] rhel79node3    pengine:     info: qb_ipcs_us_withdraw:   withdrawing server sockets
Feb 28 21:43:45 [3659] rhel79node3        cib:     info: cib_perform_op:        Diff: --- 0.967.15 2
Feb 28 21:43:45 [3659] rhel79node3        cib:     info: cib_perform_op:        Diff: +++ 0.967.16 (null)
Feb 28 21:43:45 [3659] rhel79node3        cib:     info: cib_perform_op:        -- /cib/status/node_state[@id='3']/lrm[@id='3']
Feb 28 21:43:45 [3659] rhel79node3        cib:     info: cib_perform_op:        +  /cib:  @num_updates=16
Feb 28 21:43:45 [3666] rhel79node3    pengine:     info: crm_xml_cleanup:       Cleaning up memory from libxml2
Feb 28 21:43:45 [3591] rhel79node3 pacemakerd:     info: pcmk_child_exit:       pengine[3666] exited with status 0 (OK)
Feb 28 21:43:45 [3591] rhel79node3 pacemakerd:   notice: stop_child:    Stopping attrd | sent signal 15 to process 3663
Feb 28 21:43:45 [3659] rhel79node3        cib:  warning: qb_ipcs_event_sendv:   new_event_notification (/dev/shm/qb-3659-3667-11-MhToyE/qb): Broken pipe (32)
Feb 28 21:43:45 [3659] rhel79node3        cib:  warning: cib_notify_send_one:   Notification of client crmd/df83c1d0-4b64-444f-a956-bf5c2f15aaf6 failed
Feb 28 21:43:45 [3659] rhel79node3        cib:     info: cib_process_request:   Completed cib_delete operation for section //node_state[@uname='rhel79node3']/*: OK (rc=0, origin=rhel79node2/crmd/816, version=0.967.16)
Feb 28 21:43:45 [3659] rhel79node3        cib:     info: cib_process_request:   Completed cib_modify operation for section status: OK (rc=0, origin=rhel79node2/crmd/817, version=0.967.16)
Feb 28 21:43:45 [3591] rhel79node3 pacemakerd:     info: mcp_cpg_deliver:       Ignoring process list sent by peer for local node
Feb 28 21:43:45 [3591] rhel79node3 pacemakerd:     info: mcp_cpg_deliver:       Ignoring process list sent by peer for local node
Feb 28 21:43:45 [3663] rhel79node3      attrd:   notice: crm_signal_dispatch:   Caught 'Terminated' signal | 15 (invoking handler)
Feb 28 21:43:45 [3663] rhel79node3      attrd:     info: attrd_shutdown:        Shutting down
Feb 28 21:43:45 [3663] rhel79node3      attrd:     info: main:  Shutting down attribute manager
Feb 28 21:43:45 [3663] rhel79node3      attrd:     info: qb_ipcs_us_withdraw:   withdrawing server sockets
Feb 28 21:43:45 [3663] rhel79node3      attrd:     info: attrd_cib_destroy_cb:  Connection disconnection complete
Feb 28 21:43:45 [3663] rhel79node3      attrd:     info: crm_xml_cleanup:       Cleaning up memory from libxml2
Feb 28 21:43:45 [3659] rhel79node3        cib:     info: cib_process_request:   Completed cib_delete operation for section //node_state[@uname='rhel79node3']/*: OK (rc=0, origin=rhel79node2/crmd/818, version=0.967.16)
:::::::::::::::::::::::
:::::::::::::::::::::::
Feb 28 21:43:45 [3591] rhel79node3 pacemakerd:     info: mcp_cpg_deliver:       Ignoring process list sent by peer for local node
Feb 28 21:43:45 [3659] rhel79node3        cib:   notice: crm_signal_dispatch:   Caught 'Terminated' signal | 15 (invoking handler)
Feb 28 21:43:45 [3659] rhel79node3        cib:     info: cib_shutdown:  Disconnected 0 clients
Feb 28 21:43:45 [3659] rhel79node3        cib:     info: cib_shutdown:  All clients disconnected (0)
Feb 28 21:43:45 [3659] rhel79node3        cib:     info: initiate_exit: Sending disconnect notification to 5 peers...
Feb 28 21:43:45 [3659] rhel79node3        cib:     info: cib_process_shutdown_req:      Shutdown REQ from rhel79node3
Feb 28 21:43:45 [3659] rhel79node3        cib:     info: cib_process_shutdown_req:      Shutdown ACK from rhel79node3
Feb 28 21:43:45 [3659] rhel79node3        cib:     info: terminate_cib: cib_process_shutdown_req: Exiting from mainloop...
Feb 28 21:43:45 [3659] rhel79node3        cib:     info: crm_cluster_disconnect:        Disconnecting from cluster infrastructure: corosync
Feb 28 21:43:45 [3659] rhel79node3        cib:     info: terminate_cs_connection:       Disconnecting from Corosync
Feb 28 21:43:45 [3659] rhel79node3        cib:     info: terminate_cs_connection:       No Quorum connection
Feb 28 21:43:45 [3659] rhel79node3        cib:   notice: terminate_cs_connection:       Disconnected from Corosync
Feb 28 21:43:45 [3659] rhel79node3        cib:     info: crm_cluster_disconnect:        Disconnected from corosync
Feb 28 21:43:45 [3659] rhel79node3        cib:     info: crm_get_peer:  Created entry 48a43ada-8efe-4c02-b657-77a469ae77dd/0x55aa662b75a0 for node rhel79node3/0 (1 total)
Feb 28 21:43:45 [3659] rhel79node3        cib:     info: cib_peer_update_callback:      No more peers
Feb 28 21:43:45 [3659] rhel79node3        cib:     info: terminate_cib: cib_peer_update_callback: Exiting from mainloop...
Feb 28 21:43:45 [3659] rhel79node3        cib:     info: crm_cluster_disconnect:        Disconnecting from cluster infrastructure: corosync
Feb 28 21:43:45 [3659] rhel79node3        cib:     info: terminate_cs_connection:       Disconnecting from Corosync
Feb 28 21:43:45 [3659] rhel79node3        cib:     info: cluster_disconnect_cpg:        No CPG connection
Feb 28 21:43:45 [3659] rhel79node3        cib:     info: terminate_cs_connection:       No Quorum connection
Feb 28 21:43:45 [3659] rhel79node3        cib:   notice: terminate_cs_connection:       Disconnected from Corosync
Feb 28 21:43:45 [3659] rhel79node3        cib:     info: crm_cluster_disconnect:        Disconnected from corosync
Feb 28 21:43:45 [3659] rhel79node3        cib:     info: qb_ipcs_us_withdraw:   withdrawing server sockets
Feb 28 21:43:45 [3659] rhel79node3        cib:     info: qb_ipcs_us_withdraw:   withdrawing server sockets
Feb 28 21:43:45 [3659] rhel79node3        cib:     info: qb_ipcs_us_withdraw:   withdrawing server sockets
Feb 28 21:43:45 [3659] rhel79node3        cib:     info: crm_xml_cleanup:       Cleaning up memory from libxml2
Feb 28 21:43:45 [3591] rhel79node3 pacemakerd:     info: pcmk_child_exit:       cib[3659] exited with status 0 (OK)
Feb 28 21:43:45 [3591] rhel79node3 pacemakerd:   notice: pcmk_shutdown_worker:  Shutdown complete
Feb 28 21:43:45 [3591] rhel79node3 pacemakerd:   notice: pcmk_shutdown_worker:  Attempting to inhibit respawning after fatal error
Feb 28 21:43:45 [3591] rhel79node3 pacemakerd:     info: pcmk_exit_with_cluster:        Asking Corosync to shut down
[3412] rhel79node3 corosyncnotice  [CFG   ] Node 3 was shut down by sysadmin
[3412] rhel79node3 corosyncnotice  [SERV  ] Unloading all Corosync service engines.
[3412] rhel79node3 corosyncinfo    [QB    ] withdrawing server sockets
Feb 28 21:43:45 [3591] rhel79node3 pacemakerd:     info: crm_xml_cleanup:       Cleaning up memory from libxml2
[3412] rhel79node3 corosyncnotice  [SERV  ] Service engine unloaded: corosync vote quorum service v1.0
[3412] rhel79node3 corosyncinfo    [QB    ] withdrawing server sockets
[3412] rhel79node3 corosyncnotice  [SERV  ] Service engine unloaded: corosync configuration map access
[3412] rhel79node3 corosyncinfo    [QB    ] withdrawing server sockets
[3412] rhel79node3 corosyncnotice  [SERV  ] Service engine unloaded: corosync configuration service
[3412] rhel79node3 corosyncinfo    [QB    ] withdrawing server sockets
[3412] rhel79node3 corosyncnotice  [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
[3412] rhel79node3 corosyncinfo    [QB    ] withdrawing server sockets
[3412] rhel79node3 corosyncnotice  [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
[3412] rhel79node3 corosyncnotice  [SERV  ] Service engine unloaded: corosync profile loading service
[3412] rhel79node3 corosyncnotice  [MAIN  ] Corosync Cluster Engine exiting normally
=====================================================================================
<<pacemaker.log>>
Feb 28 21:43:41 [3591] rhel79node3 pacemakerd:     info: crm_log_init:  Changed active directory to /var/lib/pacemaker/cores
Feb 28 21:43:41 [3591] rhel79node3 pacemakerd:     info: get_cluster_type:      Detected an active 'corosync' cluster
Feb 28 21:43:41 [3591] rhel79node3 pacemakerd:     info: mcp_read_config:       Reading configure for stack: corosync
Feb 28 21:43:41 [3591] rhel79node3 pacemakerd:   notice: crm_add_logfile:       Switching to /var/log/cluster/corosync.log

++++++++++++++++++++++++++++++++
<<pcsd.log>>

I, [2021-02-28T21:43:42.816543 #3388]  INFO -- : Notifying systemd we are running (socket /run/systemd/notify)
[2021-02-28 21:43:42] INFO  WEBrick::HTTPServer#start: pid=3388 port=2224
I, [2021-02-28T21:43:42.817201 #3388]  INFO -- : Config files sync thread started
I, [2021-02-28T21:43:42.817242 #3388]  INFO -- : Running: /usr/sbin/corosync-cmapctl totem.cluster_name
I, [2021-02-28T21:43:42.817260 #3388]  INFO -- : CIB USER: hacluster, groups:
I, [2021-02-28T21:43:42.831707 #3388]  INFO -- : Return Value: 0
I, [2021-02-28T21:43:42.831804 #3388]  INFO -- : Running: /usr/sbin/pcs status nodes corosync
I, [2021-02-28T21:43:42.831826 #3388]  INFO -- : CIB USER: hacluster, groups:
I, [2021-02-28T21:43:42.980974 #3388]  INFO -- : Return Value: 0
I, [2021-02-28T21:43:42.994857 #3388]  INFO -- : SRWT Node: rhel79node1 Request: get_configs
I, [2021-02-28T21:43:42.996190 #3388]  INFO -- : SRWT Node: rhel79node5 Request: get_configs
I, [2021-02-28T21:43:42.996628 #3388]  INFO -- : SRWT Node: rhel79node2 Request: get_configs
I, [2021-02-28T21:43:42.997001 #3388]  INFO -- : SRWT Node: rhel79node4 Request: get_configs
I, [2021-02-28T21:43:42.997335 #3388]  INFO -- : SRWT Node: rhel79node3 Request: get_configs
I, [2021-02-28T21:43:43.178879 #3388]  INFO -- : Running: /usr/sbin/corosync-cmapctl totem.cluster_name
I, [2021-02-28T21:43:43.178944 #3388]  INFO -- : CIB USER: hacluster, groups:
I, [2021-02-28T21:43:43.185008 #3388]  INFO -- : Return Value: 0
::ffff:192.168.10.43 - - [28/Feb/2021:21:43:43 -0800] "GET /remote/get_configs?cluster_name=Redhat79-cls HTTP/1.1" 200 837 0.0218
::ffff:192.168.10.43 - - [28/Feb/2021:21:43:43 -0800] "GET /remote/get_configs?cluster_name=Redhat79-cls HTTP/1.1" 200 837 0.0219
rhel79node3.eng.vmware.com - - [28/Feb/2021:21:43:43 PST] "GET /remote/get_configs?cluster_name=Redhat79-cls HTTP/1.1" 200 837
- -> /remote/get_configs?cluster_name=Redhat79-cls
I, [2021-02-28T21:43:43.238739 #3388]  INFO -- : Config files sync thread finished
I, [2021-02-28T21:43:46.619156 #3388]  INFO -- : Running: /usr/sbin/corosync-cmapctl totem.cluster_name
I, [2021-02-28T21:43:46.619202 #3388]  INFO -- : CIB USER: hacluster, groups:
I, [2021-02-28T21:43:46.623402 #3388]  INFO -- : Return Value: 1
::ffff:192.168.20.41 - - [28/Feb/2021:21:43:46 -0800] "GET /remote/get_configs?cluster_name=Redhat79-cls HTTP/1.1" 200 837 0.0060
::ffff:192.168.20.41 - - [28/Feb/2021:21:43:46 -0800] "GET /remote/get_configs?cluster_name=Redhat79-cls HTTP/1.1" 200 837 0.0061
rhel79node1.eng.vmware.com - - [28/Feb/2021:21:43:46 PST] "GET /remote/get_configs?cluster_name=Redhat79-cls HTTP/1.1" 200 837
- -> /remote/get_configs?cluster_name=Redhat79-cls

Comment 4 sreekantha 2021-04-09 05:07:17 UTC
<<pcs config output>>

[root@rhel79node1 ~]# pcs config
Cluster Name: Redhat79-cls
Corosync Nodes:
 rhel79node1 rhel79node2 rhel79node3 rhel79node4 rhel79node5
Pacemaker Nodes:
 rhel79node1 rhel79node2 rhel79node3 rhel79node4 rhel79node5

Resources:
 Group: SQL_Cluster
  Resource: my_lvm (class=ocf provider=heartbeat type=LVM-activate)
   Attributes: vg_access_mode=system_id vgname=my_vg
   Operations: monitor interval=30s timeout=90s (my_lvm-monitor-interval-30s)
               start interval=0s timeout=90s (my_lvm-start-interval-0s)
               stop interval=0s timeout=90s (my_lvm-stop-interval-0s)
  Resource: my_fs (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/my_vg/my_lv directory=/var/opt/mssql/data fstype=ext4
   Operations: monitor interval=20s timeout=40s (my_fs-monitor-interval-20s)
               notify interval=0s timeout=60s (my_fs-notify-interval-0s)
               start interval=0s timeout=60s (my_fs-start-interval-0s)
               stop interval=0s timeout=60s (my_fs-stop-interval-0s)
               monitor interval=61s OCF_CHECK_LEVEL=20 (my_fs-monitor-interval-61s)
  Resource: my_lvm2 (class=ocf provider=heartbeat type=LVM-activate)
   Attributes: vg_access_mode=system_id vgname=my_vg2
   Operations: monitor interval=30s timeout=90s (my_lvm2-monitor-interval-30s)
               start interval=0s timeout=90s (my_lvm2-start-interval-0s)
               stop interval=0s timeout=90s (my_lvm2-stop-interval-0s)
  Resource: my_fs2 (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/my_vg2/my_lv2 directory=/var/opt/mssql/userDBs fstype=ext4
   Operations: monitor interval=20s timeout=40s (my_fs2-monitor-interval-20s)
               notify interval=0s timeout=60s (my_fs2-notify-interval-0s)
               start interval=0s timeout=60s (my_fs2-start-interval-0s)
               stop interval=0s timeout=60s (my_fs2-stop-interval-0s)
               monitor interval=61s OCF_CHECK_LEVEL=20 (my_fs2-monitor-interval-61s)
  Resource: virtualIP (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: cidr_netmask=24 ip=192.168.10.47
   Operations: monitor interval=10s timeout=20s (virtualIP-monitor-interval-10s)
               start interval=0s timeout=20s (virtualIP-start-interval-0s)
               stop interval=0s timeout=20s (virtualIP-stop-interval-0s)
  Resource: RHEL79SQLFCI (class=ocf provider=mssql type=fci)
   Operations: monitor interval=10 timeout=30 (RHEL79SQLFCI-monitor-interval-10)
               start interval=0s timeout=60s (RHEL79SQLFCI-start-interval-0s)
               stop interval=0s timeout=20 (RHEL79SQLFCI-stop-interval-0s)

Stonith Devices:
 Resource: rhel79fence (class=stonith type=fence_scsi)
  Attributes: devices=/dev/sdb,/dev/sdc,/dev/sdaa,/dev/sdab,/dev/sdac,/dev/sdad,/dev/sdae,/dev/sdaf,/dev/sdag,/dev/sdah,/dev/sdai,/dev/sdaj,/dev/sdak,/dev/sdal,/dev/sdam,/dev/sdan,/dev/sdao,/dev/sdap,/dev/sdaq,/dev/sdar,/dev/sdas,/dev/sdat,/dev/sdau,/dev/sdav,/dev/sdaw,/dev/sdax,/dev/sday,/dev/sdaz,/dev/sdba,/dev/sdbb,/dev/sdbc,/dev/sdbd,/dev/sdbe,/dev/sdbf,/dev/sdbg,/dev/sdbh,/dev/sdbi,/dev/sdbj,/dev/sdbk,/dev/sdd,/dev/sde,/dev/sdf,/dev/sdg,/dev/sdh,/dev/sdi,/dev/sdj,/dev/sdk,/dev/sdl,/dev/sdm,/dev/sdn,/dev/sdo,/dev/sdp,/dev/sdq,/dev/sdr,/dev/sds,/dev/sdt,/dev/sdu,/dev/sdv,/dev/sdw,/dev/sdx,/dev/sdy,/dev/sdz pcmk_host_list="rhel79node1 rhel79node2 rhel79node3 rhel79node4 rhel79node5"
  Meta Attrs: pcmk_host_check=static-list pcmk_off_action=off pcmk_reboot_action=off provides=unfencing
  Operations: monitor interval=60s (rhel79fence-monitor-interval-60s)
Fencing Levels:

Location Constraints:
Ordering Constraints:
Colocation Constraints:
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 resource-stickiness=1
Operations Defaults:
 timeout=60s

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: Redhat79-cls
 dc-version: 1.1.23-1.el7_9.1-9acf116022
 have-watchdog: false
 last-lrm-refresh: 1617883625

Quorum:
  Options:
[root@rhel79node1 ~]#

Comment 5 sreekantha 2021-04-09 05:09:29 UTC
Created attachment 1770492 [details]
pcsd-log

Comment 6 sreekantha 2021-04-09 05:11:37 UTC
Created attachment 1770493 [details]
corosync log

Comment 7 sreekantha 2021-04-15 06:52:40 UTC
Any updates on this PR?

Comment 8 sreekantha 2021-04-26 06:38:41 UTC
Any updates on this Bug?

Comment 9 Ken Gaillot 2021-04-26 14:45:05 UTC
(In reply to sreekantha from comment #8)
> Any updates on this Bug?

Hi,

Apologies for the delay. The key log message is:

Feb 28 21:43:45 [3667] rhel79node3       crmd:     crit: tengine_stonith_notify:        We were allegedly just fenced by rhel79node2 for rhel79node2!

This is a known timing issue. If a fenced node manages to reboot and rejoin the cluster before the fence agent returns its successful result, the node receives the notification of its own fencing and immediately stops pacemaker.

A workaround is to insert a delay before pacemaker start-up at boot, for example with a systemd unit override for pacemaker.service that adds something like ExecStartPre=/bin/sleep 15.
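For reference, that override can be created as a standard systemd drop-in. This is an illustrative sketch only: the drop-in file name and the 15-second delay are arbitrary choices, and the delay should be tuned to how long fencing confirmation takes in a given environment.

```shell
# Create a drop-in directory for pacemaker.service (standard systemd override path)
mkdir -p /etc/systemd/system/pacemaker.service.d

# Add a pre-start delay; 15s is an example value, not a recommendation
cat > /etc/systemd/system/pacemaker.service.d/delay-start.conf <<'EOF'
[Service]
ExecStartPre=/bin/sleep 15
EOF

# Reload unit files so systemd picks up the override
systemctl daemon-reload

# Verify the drop-in is active (it appears appended to the unit listing)
systemctl cat pacemaker.service
```

The delay gives the fence agent on the surviving node time to report success before the fenced node's pacemaker rejoins, avoiding the self-fencing notification described above.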

Since we do not have a bz for the issue yet, we can use this one to track it. There is no time frame for a fix at this point. I am reassigning it to RHEL 8 due to the life cycle phase of RHEL 7.

Comment 10 Ken Gaillot 2021-05-13 15:30:31 UTC
Closing this as a duplicate of Bug 1956687, which has a helpful reproducer.

*** This bug has been marked as a duplicate of bug 1956687 ***