Bug 1463327 - Starting a larger cluster times out
Summary: Starting a larger cluster times out
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pcs
Version: 7.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Tomas Jelinek
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-06-20 15:11 UTC by Radek Steiger
Modified: 2018-04-10 15:40 UTC

Fixed In Version: pcs-0.9.160-1.el7
Doc Type: Bug Fix
Doc Text:
Cause: A user starts a cluster with a large number of nodes. Consequence: pcs exits with a connection timeout error. Fix: The timeout is increased when a larger number of nodes is being started. Result: The cluster starts successfully.
Clone Of:
Environment:
Last Closed: 2018-04-10 15:39:15 UTC
Target Upstream Version:
Embargoed:


Attachments
proposed fix (10.73 KB, patch)
2017-09-04 13:27 UTC, Tomas Jelinek


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Bugzilla | 1229822 | 0 | medium | CLOSED | [RFE] make "cluster setup --start", "cluster start" and "cluster standby" support --wait as well | 2021-02-22 00:41:40 UTC
Red Hat Bugzilla | 1284404 | 0 | medium | CLOSED | make restarting pcsd a synchronous operation | 2021-02-22 00:41:40 UTC
Red Hat Bugzilla | 1292858 | 0 | medium | CLOSED | pcs should timeout during network requests | 2021-02-22 00:41:40 UTC
Red Hat Product Errata | RHBA-2018:0866 | 0 | None | None | None | 2018-04-10 15:40:17 UTC

Internal Links: 1229822 1284404 1292858

Description Radek Steiger 2017-06-20 15:11:28 UTC
> Description of problem:

No matter what numerical value is passed to the --wait flag, it is always ignored by 'pcs cluster start' commands and possibly others. Two examples follow.

Doesn't wait for the requested 4 minutes, bails out after one minute instead:

[root@virt-270 ~]# time pcs cluster start --wait=240 --all
virt-286: Starting Cluster...
virt-281: Starting Cluster...
virt-280: Starting Cluster...
virt-294: Starting Cluster...
virt-282: Starting Cluster...
virt-284: Starting Cluster...
virt-298: Starting Cluster...
virt-296: Starting Cluster...
virt-299: Starting Cluster...
virt-303: Starting Cluster...
virt-270: Unable to connect to virt-270 (Operation timed out after 60001 milliseconds with 0 out of -1 bytes received)
virt-279: Unable to connect to virt-279 (Operation timed out after 60001 milliseconds with 0 out of -1 bytes received)
virt-295: Unable to connect to virt-295 (Operation timed out after 60000 milliseconds with 0 out of -1 bytes received)
virt-297: Unable to connect to virt-297 (Operation timed out after 60004 milliseconds with 0 out of -1 bytes received)
virt-301: Unable to connect to virt-301 (Operation timed out after 60002 milliseconds with 0 out of -1 bytes received)
virt-302: Unable to connect to virt-302 (Operation timed out after 60001 milliseconds with 0 out of -1 bytes received)
Error: unable to start all nodes
virt-270: Unable to connect to virt-270 (Operation timed out after 60001 milliseconds with 0 out of -1 bytes received)
virt-279: Unable to connect to virt-279 (Operation timed out after 60001 milliseconds with 0 out of -1 bytes received)
virt-295: Unable to connect to virt-295 (Operation timed out after 60000 milliseconds with 0 out of -1 bytes received)
virt-297: Unable to connect to virt-297 (Operation timed out after 60004 milliseconds with 0 out of -1 bytes received)
virt-301: Unable to connect to virt-301 (Operation timed out after 60002 milliseconds with 0 out of -1 bytes received)
virt-302: Unable to connect to virt-302 (Operation timed out after 60001 milliseconds with 0 out of -1 bytes received)
real	1m10.860s
user	0m9.056s
sys	0m1.912s

Doesn't bail out after 20 seconds:

[root@virt-270 ~]# time pcs cluster start --all --wait=20
virt-284: Starting Cluster...
virt-279: Starting Cluster...
virt-286: Starting Cluster...
virt-294: Starting Cluster...
virt-280: Starting Cluster...
virt-296: Starting Cluster...
virt-282: Starting Cluster...
virt-281: Starting Cluster...
virt-270: Starting Cluster...
virt-302: Starting Cluster...
virt-301: Starting Cluster...
virt-299: Starting Cluster...
virt-295: Starting Cluster...
virt-298: Starting Cluster...
virt-303: Starting Cluster...
virt-297: Starting Cluster...
Waiting for node(s) to start...
virt-281: Started
virt-294: Started
virt-284: Started
virt-298: Started
virt-296: Started
virt-282: Started
virt-279: Started
virt-286: Started
virt-302: Started
virt-280: Started
virt-295: Started
virt-299: Started
virt-301: Started
virt-303: Started
virt-297: Started
virt-270: Started
real	0m56.122s
user	0m25.162s
sys	0m5.102s


> Version-Release number of selected component (if applicable):

pcs-0.9.158-6.el7


> How reproducible:

Always


> Steps to Reproduce:

See the description.

Comment 2 Tomas Jelinek 2017-06-30 15:17:15 UTC
This is expected. The --wait flag enables waiting for pacemaker on the nodes to be fully started. Therefore its value does not apply to start-cluster requests. It seems the --request-timeout flag is what you are looking for. See bz1229822 and bz1292858 for details.
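
The distinction between the two flags can be illustrated with a toy model. This is only a sketch, not pcs code: the function name, its parameters, and the returned strings are hypothetical, and the 60-second default is taken from the timeouts seen in the description. It assumes the start request is bounded by the request timeout alone, and that --wait only bounds the later phase of waiting for pacemaker to come up.

```python
def start_outcome(request_seconds, startup_seconds, request_timeout=60, wait=None):
    """Toy model of 'pcs cluster start' (hypothetical, not pcs internals).

    request_seconds: how long the start request to a node takes
    startup_seconds: how long pacemaker then needs to be fully started
    request_timeout: limit on the start request (models --request-timeout)
    wait:            limit on waiting for pacemaker (models --wait), or None
    """
    # Phase 1: the start request itself. --wait has no influence here.
    if request_seconds > request_timeout:
        return "request timed out"
    # Phase 2: only entered when waiting was requested.
    if wait is None:
        return "start requested"
    if startup_seconds > wait:
        return "wait timed out"
    return "started"


# First example from the description: --wait=240, yet the request fails at 60s.
print(start_outcome(61, 5, request_timeout=60, wait=240))
# Second example: --wait=20, a slow request does not count against --wait.
print(start_outcome(50, 5, request_timeout=60, wait=20))
```

In this model, raising the wait value cannot rescue a start request that exceeds the request timeout, which matches the behavior reported above.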

Comment 3 Radek Steiger 2017-06-30 15:23:03 UTC
@Tomas: Then we need a way to set timeout for the whole setup process to avoid getting 100% error rate for larger clusters.

Comment 4 Tomas Jelinek 2017-07-21 11:35:47 UTC
(In reply to Radek Steiger from comment #3)
> @Tomas: Then we need a way to set timeout for the whole setup process to
> avoid getting 100% error rate for larger clusters.
--request-timeout


Options:
* set the timeout based on the number of nodes being started, so that the timeout is higher for large clusters
* change the error message so that it suggests setting a higher timeout via the --request-timeout parameter
* improve the documentation to make the difference between --wait and --request-timeout clear

We should also look into why starting the cluster takes so long.

Comment 6 Tomas Jelinek 2017-09-04 13:27:25 UTC
Created attachment 1321856
proposed fix

When starting a cluster, pcs now sets the timeout for the start request based on the number of nodes being started:
* 1 to 8 nodes: 60s
* 9 to 16 nodes: 120s
* 17 to 24 nodes: 180s
* and so on
Users can override this by setting a custom timeout with the --request-timeout option.
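
The scaling rule above can be sketched as follows. This is an illustration only, not the code from the attached patch: the function and parameter names are hypothetical, and it assumes that "and so on" means a further 60 seconds for every additional group of 8 nodes.

```python
def default_start_timeout(node_count, seconds_per_group=60, group_size=8):
    """Return the start-request timeout in seconds for node_count nodes.

    Hypothetical helper: 60s for every started group of 8 nodes,
    assuming the pattern 60s/120s/180s continues linearly.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    # Integer ceiling division: 1-8 -> 1 group, 9-16 -> 2 groups, ...
    groups = (node_count + group_size - 1) // group_size
    return groups * seconds_per_group


def effective_start_timeout(node_count, request_timeout=None):
    """A user-supplied timeout (models --request-timeout) overrides the default."""
    if request_timeout is not None:
        return request_timeout
    return default_start_timeout(node_count)


# The 16-node cluster from the description would get 120 seconds.
print(effective_start_timeout(16))
```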

Error messages now suggest using the --request-timeout option when a connection times out (new pcs architecture) or when pcs is unable to connect to a node for whatever reason (old pcs architecture, which does not detect timeouts).

The help and manpage have been improved, and the --request-timeout option is now described explicitly for the cluster start and cluster stop commands.

Comment 11 errata-xmlrpc 2018-04-10 15:39:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0866

