Bug 2103867

Summary: Host in cluster rebooting upon array side controller failover
Product: Red Hat Enterprise Linux 7
Reporter: Govind Kulkarni <govind.kulkarni>
Component: pacemaker
Assignee: Ken Gaillot <kgaillot>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: cluster-qe <cluster-qe>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 7.8
CC: ccaulfie, cluster-maint, jfriesse
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-01-12 20:04:20 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Govind Kulkarni 2022-07-05 06:57:47 UTC
Description of problem:

RHEL 7.8 hosts, configured as an HA active/passive NFS cluster.
Both hosts, host1 and host2, mounted the NFS share volume and started I/O.
A controller failover was then triggered on the storage array side, which led to an I/O drop and a host reboot.

How reproducible:
Always

Steps to Reproduce:
1. Configure hosts in NFS cluster
2. Start IO
3. Trigger Controller failover on Nimble array.

Actual results:
IO drops and host reboots

Expected results:
IO should continue to run without disruption.

Additional info:
Mar 27 14:10:55 iwf-dl360-17 crmd[3606]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Mar 27 14:10:55 iwf-dl360-17 pengine[3605]: notice: Scheduling shutdown of node iwf-dl360-18
Mar 27 14:10:55 iwf-dl360-17 pengine[3605]: notice: * Shutdown iwf-dl360-18
Mar 27 14:10:55 iwf-dl360-17 pengine[3605]: notice: Calculated transition 6, saving inputs in /var/lib/pacemaker/pengine/pe-input-287.bz2
Mar 27 14:10:55 iwf-dl360-17 crmd[3606]: notice: Transition 6 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-287.bz2): Complete
Mar 27 14:10:55 iwf-dl360-17 crmd[3606]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Mar 27 14:10:55 iwf-dl360-17 crmd[3606]: notice: do_shutdown of peer iwf-dl360-18 is complete
Mar 27 14:10:55 iwf-dl360-17 attrd[3604]: notice: Node iwf-dl360-18 state is now lost
Mar 27 14:10:55 iwf-dl360-17 attrd[3604]: notice: Removing all iwf-dl360-18 attributes for peer loss
Mar 27 14:10:55 iwf-dl360-17 attrd[3604]: notice: Purged 1 peer with id=2 and/or uname=iwf-dl360-18 from the membership cache
Mar 27 14:10:55 iwf-dl360-17 stonith-ng[3602]: notice: Node iwf-dl360-18 state is now lost
Mar 27 14:10:55 iwf-dl360-17 stonith-ng[3602]: notice: Purged 1 peer with id=2 and/or uname=iwf-dl360-18 from the membership cache
Mar 27 14:10:55 iwf-dl360-17 cib[3601]: notice: Node iwf-dl360-18 state is now lost
Mar 27 14:10:55 iwf-dl360-17 cib[3601]: notice: Purged 1 peer with id=2 and/or uname=iwf-dl360-18 from the membership cache

Mar 27 14:10:55 iwf-dl360-17 corosync[3174]: [TOTEM ] A new membership (10.201.14.83:547) was formed. Members left: 2
Mar 27 14:10:55 iwf-dl360-17 corosync[3174]: [CPG ] downlist left_list: 1 received
Mar 27 14:10:55 iwf-dl360-17 corosync[3174]: [QUORUM] Members[1]: 1
Mar 27 14:10:55 iwf-dl360-17 corosync[3174]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 27 14:10:55 iwf-dl360-17 crmd[3606]: notice: Node iwf-dl360-18 state is now lost
Mar 27 14:10:55 iwf-dl360-17 pacemakerd[3591]: notice: Node iwf-dl360-18 state is now lost
Mar 27 14:10:55 iwf-dl360-17 crmd[3606]: notice: do_shutdown of peer iwf-dl360-18 is complete
Mar 27 14:11:04 iwf-dl360-17 systemd: Reloading.

Comment 3 Govind Kulkarni 2022-07-05 10:08:51 UTC
Due to network issues, the same bug was filed three times: 2103867, 2103866, and 2103862.
Please close any two of them. Sorry for the inconvenience.

Comment 4 Jan Friesse 2022-07-07 07:08:22 UTC
*** Bug 2103862 has been marked as a duplicate of this bug. ***

Comment 5 Jan Friesse 2022-07-07 07:08:26 UTC
*** Bug 2103866 has been marked as a duplicate of this bug. ***

Comment 6 Jan Friesse 2022-07-07 07:22:38 UTC
From the short log excerpt, I'm pretty sure this is not a corosync bug:
Mar 27 14:10:55 iwf-dl360-17 pengine[3605]: notice: Scheduling shutdown of node iwf-dl360-18
Mar 27 14:10:55 iwf-dl360-17 pengine[3605]: notice: * Shutdown iwf-dl360-18
...

So it was Pacemaker's decision to shut down iwf-dl360-18. For now, I'm assigning this to pacemaker for deeper investigation, but I would recommend taking a look at the NFS volume timeouts.
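One possible way to review NFS timeouts, assuming the recommendation refers to client-side mount options, is to read the `timeo` and `retrans` values from /proc/mounts on a host that mounts the share. The sketch below parses text in that format; the function name `nfs_timeouts` and the sample mount line are hypothetical and not taken from this report. Per nfs(5), `timeo` is expressed in tenths of a second.

```python
def nfs_timeouts(mounts_text):
    """Map each NFS mount point to the timeo/retrans options it lists.

    Expects text in /proc/mounts format:
        device mountpoint fstype options dump pass
    """
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Skip malformed lines and anything that is not an NFS client mount.
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        opts = {}
        for item in fields[3].split(","):
            key, sep, value = item.partition("=")
            if sep and key in ("timeo", "retrans"):
                opts[key] = int(value)
        result[fields[1]] = opts
    return result


if __name__ == "__main__":
    # Hypothetical sample line; on a real host, read /proc/mounts instead.
    sample = "10.0.0.5:/export /mnt/share nfs4 rw,relatime,vers=4.1,hard,timeo=600,retrans=2 0 0\n"
    for mountpoint, opts in nfs_timeouts(sample).items():
        print(mountpoint, opts)
```

On a live system the input would come from `open("/proc/mounts").read()`. Comparing these values with the duration of the array controller failover may show whether I/O could be expected to survive it.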

Comment 7 Ken Gaillot 2022-07-07 20:09:30 UTC
Hi,

It is unclear which component is relevant here without further information.

Further investigation is best done in a support case, which can look at the wider environment and how components work together, rather than here in bugzilla, which focuses on bugs in a single software component. You can initiate a case with Red Hat's Global Support Services group through one of the methods listed at the following link:

  https://access.redhat.com/start/how-to-engage-red-hat-support

From there, we can collect additional information and take a closer look at the specifics of this incident to help resolve the underlying problem.