1731503 – Cluster nodes takes about 30 minutes to rejoin cluster (32 nodes).


Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	pacemaker
Sub Component:
Version:	7.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	rc
Target Release:	---
Assignee:	Ken Gaillot
QA Contact:	cluster-qe@redhat.com
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-07-19 14:59 UTC by michal novacek
Modified:	2021-04-20 23:13 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-03-15 07:37:42 UTC
Target Upstream Version:
Embargoed:

Attachments
'pcs cluster report' (12.17 MB, application/x-bzip) 2019-07-19 14:59 UTC, michal novacek	no flags

Description michal novacek 2019-07-19 14:59:15 UTC

Created attachment 1591985
'pcs cluster report'

Description of problem:

I'm testing a 32-node cluster. The only resources running are stonith resources (one for each node) and cloned clvmd/dlm resources.

After killing 15 nodes with a kernel panic, they are correctly fenced and rebooted, but they seem to stay in the pending state for a _very_ long time before joining the cluster. Altogether it takes about 30 minutes before all nodes are back online in the cluster. This includes reboot time.

This seems like a lot to me, given that these are quite powerful physical machines (130 GB RAM, 24-core Xeon).

Version-Release number of selected component (if applicable): rhel7.7

How reproducible: always

Steps to Reproduce:
1. Create a cluster with 32 nodes
2. Fence 15 nodes

Actual results: ~30 minutes before all nodes are back in the cluster.

Expected results: much less time from when corosync is started to all nodes being online.


Additional info:

This might be the correct behaviour or some tuning issue. I'd like someone to confirm one or the other.

Comment 4 RHEL Program Management 2021-03-15 07:37:42 UTC

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Comment 6 Ken Gaillot 2021-04-20 23:13:27 UTC

Apologies for the ridiculous delay on this one. I've been looking at it off and on over this time without any breakthroughs until today. One difficulty is that pcs did not collect corosync.log from the nodes.

I believe the delay is due to the Pacemaker unit file being ordered after time synchronization, which is failing on these nodes, e.g.:

Jul 19 14:55:24 f21-h09-000-r620 corosync[1802]: [MAIN ] Corosync Cluster Engine ('2.4.3'): started and ready to provide service.
...
Jul 19 15:05:14 f21-h09-000-r620 systemd: chrony-wait.service: main process exited, code=exited, status=1/FAILURE
Jul 19 15:05:14 f21-h09-000-r620 systemd: Failed to start Wait for chrony to synchronize system clock.
Jul 19 15:05:14 f21-h09-000-r620 systemd: Unit chrony-wait.service entered failed state.
Jul 19 15:05:14 f21-h09-000-r620 systemd: chrony-wait.service failed.
Jul 19 15:05:14 f21-h09-000-r620 systemd: Reached target System Time Synchronized.
Jul 19 15:05:14 f21-h09-000-r620 systemd: Started Pacemaker High Availability Cluster Manager.

The Pacemaker nodes will show as "pending" during this time because corosync is active but pacemaker is not. In this case it's only about 10 minutes, but I do see probe results coming in after this time, so the nodes would no longer be "pending" at that time. I didn't check more instances to confirm that this is the main issue, but with this being so stale now, I think it's not worth further investigation unless we can reproduce it on RHEL 8.
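To make the "about 10 minutes" figure concrete, here is a minimal sketch that measures the gap between corosync becoming ready and systemd starting Pacemaker, using lines from the excerpt above. The helper names (stamp, pending_window) are hypothetical and written only for this illustration, and the year 2019 is an assumption taken from the report date, since syslog timestamps omit the year.

```python
from datetime import datetime

# Three lines copied from the log excerpt in this comment.
LOG = """\
Jul 19 14:55:24 f21-h09-000-r620 corosync[1802]: [MAIN ] Corosync Cluster Engine ('2.4.3'): started and ready to provide service.
Jul 19 15:05:14 f21-h09-000-r620 systemd: chrony-wait.service: main process exited, code=exited, status=1/FAILURE
Jul 19 15:05:14 f21-h09-000-r620 systemd: Started Pacemaker High Availability Cluster Manager.
"""


def stamp(line, year=2019):
    """Parse the leading syslog timestamp; the year is assumed, not logged."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def pending_window(log_text):
    """Return the time between corosync being ready and Pacemaker starting.

    Returns None if either marker line is missing from the log text.
    """
    corosync_up = pacemaker_up = None
    for line in log_text.splitlines():
        if "Corosync Cluster Engine" in line and "started and ready" in line:
            corosync_up = stamp(line)
        elif "Started Pacemaker High Availability Cluster Manager" in line:
            pacemaker_up = stamp(line)
    if corosync_up is None or pacemaker_up is None:
        return None
    return pacemaker_up - corosync_up


print(pending_window(LOG))  # 0:09:50
```

Running the same check against each rebooted node's messages log would be one way to confirm whether the chrony-wait timeout accounts for most of the time spent "pending" on the other nodes as well.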

There are some unusual errors after this, but nothing that would affect the "pending" display, and the cluster should recover easily from them.
