Bug 1731503

Summary: Cluster nodes take about 30 minutes to rejoin the cluster (32 nodes).
Product: Red Hat Enterprise Linux 7
Component: pacemaker
Version: 7.7
Status: CLOSED NOTABUG
Severity: unspecified
Priority: unspecified
Reporter: michal novacek <mnovacek>
Assignee: Ken Gaillot <kgaillot>
QA Contact: cluster-qe <cluster-qe>
CC: cluster-maint
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Last Closed: 2021-03-15 07:37:42 UTC
Type: Bug
Attachments: 'pcs cluster report'

Description michal novacek 2019-07-19 14:59:15 UTC
Created attachment 1591985 [details]
'pcs cluster report'

Description of problem:

I'm testing a 32-node cluster. The only resources running are stonith resources (one for each node) and cloned clvmd/dlm resources.
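For reference, this corresponds roughly to the standard RHEL 7 dlm/clvmd clone layout sketched below; the fence agent, credentials, and node names are illustrative placeholders, not taken from the attached report:

    # Cloned dlm/clvmd resources (typical RHEL 7 Resilient Storage setup)
    pcs resource create dlm ocf:pacemaker:controld op monitor interval=30s on-fail=fence \
        clone interleave=true ordered=true
    pcs resource create clvmd ocf:heartbeat:clvm op monitor interval=30s on-fail=fence \
        clone interleave=true ordered=true
    pcs constraint order start dlm-clone then clvmd-clone
    pcs constraint colocation add clvmd-clone with dlm-clone

    # One stonith resource per node (agent and parameters are placeholders)
    pcs stonith create fence-node01 fence_ipmilan pcmk_host_list=node01 \
        ipaddr=node01-ipmi login=admin passwd=secret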

After killing 15 nodes with a kernel panic, they are correctly fenced and rebooted, but they seem to stay in the pending state for a _very_ long time before joining the cluster. Altogether it takes about 30 minutes before all nodes are back online in the cluster, including reboot time.

This seems like a lot to me given that these are fairly powerful physical machines (130 GB RAM, 24-core Xeon).

Version-Release number of selected component (if applicable): rhel7.7

How reproducible: always

Steps to Reproduce:
1. create cluster with 32 nodes
2. fence 15 nodes by triggering a kernel panic on them (see the sketch below)
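A kernel panic can be injected on each victim node via sysrq, for example (assuming sysrq is available on the nodes):

    # Trigger an immediate kernel panic so the node gets fenced by the cluster
    echo 1 > /proc/sys/kernel/sysrq    # make sure sysrq is enabled
    echo c > /proc/sysrq-trigger       # crash the kernel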

Actual results: ~30 minutes before all nodes are back in the cluster.

Expected results: much less time from when corosync is started to all nodes being online.


Additional info:

This might be correct behaviour, or it might be a tuning issue; I'd like someone to confirm one or the other.

Comment 4 RHEL Program Management 2021-03-15 07:37:42 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Comment 6 Ken Gaillot 2021-04-20 23:13:27 UTC
Apologies for the ridiculous delay on this one. I've been looking at it off and on over this time without any breakthroughs until today. One difficulty is that pcs did not collect corosync.log from the nodes.

I believe the delay is due to the Pacemaker unit file being ordered after time synchronization, which is failing on these nodes, e.g.:

    Jul 19 14:55:24 f21-h09-000-r620 corosync[1802]: [MAIN  ] Corosync Cluster Engine ('2.4.3'): started and ready to provide service.
    ...
    Jul 19 15:05:14 f21-h09-000-r620 systemd: chrony-wait.service: main process exited, code=exited, status=1/FAILURE
    Jul 19 15:05:14 f21-h09-000-r620 systemd: Failed to start Wait for chrony to synchronize system clock.
    Jul 19 15:05:14 f21-h09-000-r620 systemd: Unit chrony-wait.service entered failed state.
    Jul 19 15:05:14 f21-h09-000-r620 systemd: chrony-wait.service failed.
    Jul 19 15:05:14 f21-h09-000-r620 systemd: Reached target System Time Synchronized.
    Jul 19 15:05:14 f21-h09-000-r620 systemd: Started Pacemaker High Availability Cluster Manager.

The Pacemaker nodes will show as "pending" during this time because corosync is active but pacemaker is not. In this case it's only about 10 minutes, but I do see probe results coming in after that, so the nodes would no longer be "pending" by then. I didn't check more instances to confirm that this is the main issue, but with this being so stale now, I don't think it's worth further investigation unless we can reproduce it on RHEL 8.
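As a rough way to confirm this on an affected node, the standard systemd/chrony tools below should show the ordering and why chrony-wait failed (a sketch only; nothing here is taken from the attached report):

    # Confirm that pacemaker is ordered after the time-sync target
    systemctl cat pacemaker.service | grep -i 'time-sync'

    # See why chrony-wait failed (it delays time-sync.target until the clock
    # synchronizes or it gives up, which in turn delays pacemaker startup)
    systemctl status chrony-wait.service
    journalctl -b -u chrony-wait.service
    chronyc sources -v
    chronyc tracking

    # If NTP cannot be fixed, disabling chrony-wait removes the wait before
    # time-sync.target, at the cost of possibly starting the cluster with an
    # unsynchronized clock
    systemctl disable chrony-wait.service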

There are some unusual errors after this, but nothing that would affect the "pending" display, and the cluster should recover easily from them.