Bug 1731503 - Cluster nodes take about 30 minutes to rejoin cluster (32 nodes).
Summary: Cluster nodes take about 30 minutes to rejoin cluster (32 nodes).
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.7
Hardware: Unspecified
OS: Unspecified
Target Milestone: rc
Assignee: Ken Gaillot
QA Contact: cluster-qe@redhat.com
Depends On:
Reported: 2019-07-19 14:59 UTC by michal novacek
Modified: 2021-04-20 23:13 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2021-03-15 07:37:42 UTC
Target Upstream Version:

Attachments (Terms of Use)
'pcs cluster report' (12.17 MB, application/x-bzip)
2019-07-19 14:59 UTC, michal novacek

Description michal novacek 2019-07-19 14:59:15 UTC
Created attachment 1591985
'pcs cluster report'

Description of problem:

I'm testing a 32-node cluster. Only stonith resources (one for each node) and cloned clvmd/dlm resources are running.

After killing 15 nodes with a kernel panic, they are correctly fenced and rebooted, but they seem to stay in the pending state for a _very_ long time before joining the cluster. Altogether it takes about 30 minutes before all nodes are back online in the cluster. This includes reboot time.

This seems like a lot to me given that these are quite powerful physical machines (130 GB RAM, 24-core Xeon).

Version-Release number of selected component (if applicable): rhel7.7

How reproducible: always

Steps to Reproduce:
1. create cluster with 32 nodes
2. fence 15 nodes

Actual results: ~30 minutes before all nodes are back in cluster.

Expected results: much less time from when corosync is started to all nodes being online.

Additional info:

This might be correct behaviour, or it might be a tuning issue. I'd like someone to confirm one or the other.

Comment 4 RHEL Program Management 2021-03-15 07:37:42 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Comment 6 Ken Gaillot 2021-04-20 23:13:27 UTC
Apologies for the ridiculous delay on this one. I've been looking at it off and on over this time without any breakthroughs until today. One difficulty is that pcs did not collect corosync.log from the nodes.

I believe the delay is due to the Pacemaker unit file being ordered after time synchronization, which is failing on these nodes, e.g.:

    Jul 19 14:55:24 f21-h09-000-r620 corosync[1802]: [MAIN  ] Corosync Cluster Engine ('2.4.3'): started and ready to provide service.
    Jul 19 15:05:14 f21-h09-000-r620 systemd: chrony-wait.service: main process exited, code=exited, status=1/FAILURE
    Jul 19 15:05:14 f21-h09-000-r620 systemd: Failed to start Wait for chrony to synchronize system clock.
    Jul 19 15:05:14 f21-h09-000-r620 systemd: Unit chrony-wait.service entered failed state.
    Jul 19 15:05:14 f21-h09-000-r620 systemd: chrony-wait.service failed.
    Jul 19 15:05:14 f21-h09-000-r620 systemd: Reached target System Time Synchronized.
    Jul 19 15:05:14 f21-h09-000-r620 systemd: Started Pacemaker High Availability Cluster Manager.

The Pacemaker nodes will show as "pending" during this time because corosync is active but pacemaker is not. In this instance the gap is only about 10 minutes, but I do see probe results coming in afterward, so the nodes would no longer have been "pending" past that point. I didn't check more instances to confirm that this is the main issue, but with this being so stale now, I think it's not worth further investigation unless we can reproduce it on RHEL 8.
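For concreteness, the gap between the corosync start and pacemaker start lines in the log excerpt can be computed directly (a trivial sketch; the timestamps are taken verbatim from the excerpt, which uses syslog format without a year):

```python
from datetime import datetime

# Syslog timestamp format as it appears in the excerpt (no year field).
fmt = "%b %d %H:%M:%S"
corosync_start = datetime.strptime("Jul 19 14:55:24", fmt)
pacemaker_start = datetime.strptime("Jul 19 15:05:14", fmt)

# Time the node sat "pending": corosync up, pacemaker not yet started.
gap = pacemaker_start - corosync_start
print(gap)  # 0:09:50
```

Roughly ten minutes per node, which matches the default 10-minute start timeout one would expect chrony-wait to burn through before failing.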

There are some unusual errors after this, but nothing that would affect the "pending" display, and the cluster should recover easily from them.
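If anyone hits this again and strict boot-time clock synchronization is not a hard requirement, one possible mitigation (a sketch only, not something tested here; the drop-in path and timeout value are illustrative) is to cap how long chrony-wait.service may block time-sync.target, and therefore pacemaker.service, on a node where NTP sync is failing:

```ini
# /etc/systemd/system/chrony-wait.service.d/timeout.conf
# Illustrative drop-in: shorten the start timeout so a failing chrony-wait
# releases time-sync.target quickly instead of blocking for its full default.
# Run `systemctl daemon-reload` after creating this file.
[Service]
TimeoutStartSec=30
```

Disabling chrony-wait.service entirely would have a similar effect, at the cost of losing the guarantee that the clock is synchronized before the cluster stack starts.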
