Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1463327

Summary:

Starting a larger cluster times out

Product:

Red Hat Enterprise Linux 7

Reporter:

Radek Steiger <rsteiger>

Component:

pcs

Assignee:

Tomas Jelinek <tojeline>

Status:

CLOSED ERRATA

QA Contact:

cluster-qe <cluster-qe>

Severity:

unspecified

Docs Contact:

Priority:

high

Version:

7.4

CC:

cfeist, cluster-maint, idevat, omular, tojeline

Target Milestone:

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

pcs-0.9.160-1.el7

Doc Type:

Bug Fix

Doc Text:

Cause: User starts a cluster with larger number of nodes. Consequence: Pcs exits with an error - connection timeout. Fix: Increase timeout when larger number of nodes should be started. Result: Cluster starts successfully.

Story Points:

---

Clone Of:

Environment:

Last Closed:

2018-04-10 15:39:15 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
proposed fix	none

Description Radek Steiger 2017-06-20 15:11:28 UTC

> Description of problem:

No matter what numerical value does one add to the --wait flag it will always get ignored in the case of 'pcs cluster start' commands and possibly others. Two examples follow.

Doesn't wait for the requested 4 minutes, bails out after one minute instead:

[root@virt-270 ~]# time pcs cluster start --wait=240 --all
virt-286: Starting Cluster...
virt-281: Starting Cluster...
virt-280: Starting Cluster...
virt-294: Starting Cluster...
virt-282: Starting Cluster...
virt-284: Starting Cluster...
virt-298: Starting Cluster...
virt-296: Starting Cluster...
virt-299: Starting Cluster...
virt-303: Starting Cluster...
virt-270: Unable to connect to virt-270 (Operation timed out after 60001 milliseconds with 0 out of -1 bytes received)
virt-279: Unable to connect to virt-279 (Operation timed out after 60001 milliseconds with 0 out of -1 bytes received)
virt-295: Unable to connect to virt-295 (Operation timed out after 60000 milliseconds with 0 out of -1 bytes received)
virt-297: Unable to connect to virt-297 (Operation timed out after 60004 milliseconds with 0 out of -1 bytes received)
virt-301: Unable to connect to virt-301 (Operation timed out after 60002 milliseconds with 0 out of -1 bytes received)
virt-302: Unable to connect to virt-302 (Operation timed out after 60001 milliseconds with 0 out of -1 bytes received)
Error: unable to start all nodes
virt-270: Unable to connect to virt-270 (Operation timed out after 60001 milliseconds with 0 out of -1 bytes received)
virt-279: Unable to connect to virt-279 (Operation timed out after 60001 milliseconds with 0 out of -1 bytes received)
virt-295: Unable to connect to virt-295 (Operation timed out after 60000 milliseconds with 0 out of -1 bytes received)
virt-297: Unable to connect to virt-297 (Operation timed out after 60004 milliseconds with 0 out of -1 bytes received)
virt-301: Unable to connect to virt-301 (Operation timed out after 60002 milliseconds with 0 out of -1 bytes received)
virt-302: Unable to connect to virt-302 (Operation timed out after 60001 milliseconds with 0 out of -1 bytes received)
real	1m10.860s
user	0m9.056s
sys	0m1.912s

Doesn't bail out after 20 seconds:

[root@virt-270 ~]# time pcs cluster start --all --wait=20
virt-284: Starting Cluster...
virt-279: Starting Cluster...
virt-286: Starting Cluster...
virt-294: Starting Cluster...
virt-280: Starting Cluster...
virt-296: Starting Cluster...
virt-282: Starting Cluster...
virt-281: Starting Cluster...
virt-270: Starting Cluster...
virt-302: Starting Cluster...
virt-301: Starting Cluster...
virt-299: Starting Cluster...
virt-295: Starting Cluster...
virt-298: Starting Cluster...
virt-303: Starting Cluster...
virt-297: Starting Cluster...
Waiting for node(s) to start...
virt-281: Started
virt-294: Started
virt-284: Started
virt-298: Started
virt-296: Started
virt-282: Started
virt-279: Started
virt-286: Started
virt-302: Started
virt-280: Started
virt-295: Started
virt-299: Started
virt-301: Started
virt-303: Started
virt-297: Started
virt-270: Started
real	0m56.122s
user	0m25.162s
sys	0m5.102s


> Version-Release number of selected component (if applicable):

pcs-0.9.158-6.el7


> How reproducible:

Always


> Steps to Reproduce:

See the description.

Comment 2 Tomas Jelinek 2017-06-30 15:17:15 UTC

This is expected. The --wait flag enables waiting for pacemaker on the nodes to be fully started. Therefore its value does not apply to start-cluster requests. It seems the --request-timeout flag is what you are looking for. See bz1229822 and bz1292858 for details.

Comment 3 Radek Steiger 2017-06-30 15:23:03 UTC

@Tomas: Then we need a way to set timeout for the whole setup process to avoid getting 100% error rate for larger clusters.

Comment 4 Tomas Jelinek 2017-07-21 11:35:47 UTC

(In reply to Radek Steiger from comment #3)
> @Tomas: Then we need a way to set timeout for the whole setup process to
> avoid getting 100% error rate for larger clusters.
--request-timeout


Options:
* set the timeout based on number of nodes being started, so the timeout gets higher in large clusters
* change the error message, make it suggest setting higher timeout via --request-timeout parameter
* improve documentation, make it clear what is the difference between --wait and --request-timeout

Also we should take a look at why is it taking so long to start the cluster.

Comment 6 Tomas Jelinek 2017-09-04 13:27:25 UTC

Created attachment 1321856 [details]
proposed fix

When starting a cluster, pcs now sets timeout for the start request based on number of nodes being started:
* 1 to 8 nodes: 60s
* 9 to 16 nodes: 120s
* 17 to 24 nodes: 180s
* and so on
Users can override this by setting custom timeout in --request-timeout option.

Error messages now suggest using --request-timeout option when connection times out (new pcs architecture) or when unable to connect to a node for whatever reason (old pcs architecture, which does not detect timeouts).

Help and manpage have been improved and --request-timeout options is described explicitly for cluster start and cluster stop commands.

Comment 11 errata-xmlrpc 2018-04-10 15:39:15 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0866