Bug 1463327 - Starting a larger cluster times out
Summary: Starting a larger cluster times out
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pcs
Version: 7.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Tomas Jelinek
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-06-20 15:11 UTC by Radek Steiger
Modified: 2018-04-10 15:40 UTC (History)

Fixed In Version: pcs-0.9.160-1.el7
Doc Type: Bug Fix
Doc Text:
Cause: A user starts a cluster with a large number of nodes. Consequence: pcs exits with a connection timeout error. Fix: pcs increases the request timeout when a larger number of nodes is being started. Result: The cluster starts successfully.
Clone Of:
Environment:
Last Closed: 2018-04-10 15:39:15 UTC
Target Upstream Version:


Attachments (Terms of Use)
proposed fix (10.73 KB, patch)
2017-09-04 13:27 UTC, Tomas Jelinek


Links
System ID Priority Status Summary Last Updated
Red Hat Bugzilla 1229822 None None None Never
Red Hat Bugzilla 1284404 None None None Never
Red Hat Bugzilla 1292858 None None None Never
Red Hat Product Errata RHBA-2018:0866 None None None 2018-04-10 15:40:17 UTC

Internal Links: 1229822 1284404 1292858

Description Radek Steiger 2017-06-20 15:11:28 UTC
> Description of problem:

Whatever numerical value is passed to the --wait flag, it is always ignored by 'pcs cluster start' and possibly by other commands. Two examples follow.

Doesn't wait for the requested 4 minutes, bails out after one minute instead:

[root@virt-270 ~]# time pcs cluster start --wait=240 --all
virt-286: Starting Cluster...
virt-281: Starting Cluster...
virt-280: Starting Cluster...
virt-294: Starting Cluster...
virt-282: Starting Cluster...
virt-284: Starting Cluster...
virt-298: Starting Cluster...
virt-296: Starting Cluster...
virt-299: Starting Cluster...
virt-303: Starting Cluster...
virt-270: Unable to connect to virt-270 (Operation timed out after 60001 milliseconds with 0 out of -1 bytes received)
virt-279: Unable to connect to virt-279 (Operation timed out after 60001 milliseconds with 0 out of -1 bytes received)
virt-295: Unable to connect to virt-295 (Operation timed out after 60000 milliseconds with 0 out of -1 bytes received)
virt-297: Unable to connect to virt-297 (Operation timed out after 60004 milliseconds with 0 out of -1 bytes received)
virt-301: Unable to connect to virt-301 (Operation timed out after 60002 milliseconds with 0 out of -1 bytes received)
virt-302: Unable to connect to virt-302 (Operation timed out after 60001 milliseconds with 0 out of -1 bytes received)
Error: unable to start all nodes
virt-270: Unable to connect to virt-270 (Operation timed out after 60001 milliseconds with 0 out of -1 bytes received)
virt-279: Unable to connect to virt-279 (Operation timed out after 60001 milliseconds with 0 out of -1 bytes received)
virt-295: Unable to connect to virt-295 (Operation timed out after 60000 milliseconds with 0 out of -1 bytes received)
virt-297: Unable to connect to virt-297 (Operation timed out after 60004 milliseconds with 0 out of -1 bytes received)
virt-301: Unable to connect to virt-301 (Operation timed out after 60002 milliseconds with 0 out of -1 bytes received)
virt-302: Unable to connect to virt-302 (Operation timed out after 60001 milliseconds with 0 out of -1 bytes received)
real	1m10.860s
user	0m9.056s
sys	0m1.912s

Doesn't bail out after 20 seconds:

[root@virt-270 ~]# time pcs cluster start --all --wait=20
virt-284: Starting Cluster...
virt-279: Starting Cluster...
virt-286: Starting Cluster...
virt-294: Starting Cluster...
virt-280: Starting Cluster...
virt-296: Starting Cluster...
virt-282: Starting Cluster...
virt-281: Starting Cluster...
virt-270: Starting Cluster...
virt-302: Starting Cluster...
virt-301: Starting Cluster...
virt-299: Starting Cluster...
virt-295: Starting Cluster...
virt-298: Starting Cluster...
virt-303: Starting Cluster...
virt-297: Starting Cluster...
Waiting for node(s) to start...
virt-281: Started
virt-294: Started
virt-284: Started
virt-298: Started
virt-296: Started
virt-282: Started
virt-279: Started
virt-286: Started
virt-302: Started
virt-280: Started
virt-295: Started
virt-299: Started
virt-301: Started
virt-303: Started
virt-297: Started
virt-270: Started
real	0m56.122s
user	0m25.162s
sys	0m5.102s


> Version-Release number of selected component (if applicable):

pcs-0.9.158-6.el7


> How reproducible:

Always


> Steps to Reproduce:

See the description.

Comment 2 Tomas Jelinek 2017-06-30 15:17:15 UTC
This is expected. The --wait flag enables waiting for pacemaker on the nodes to be fully started. Therefore its value does not apply to start-cluster requests. It seems the --request-timeout flag is what you are looking for. See bz1229822 and bz1292858 for details.

Comment 3 Radek Steiger 2017-06-30 15:23:03 UTC
@Tomas: Then we need a way to set timeout for the whole setup process to avoid getting 100% error rate for larger clusters.

Comment 4 Tomas Jelinek 2017-07-21 11:35:47 UTC
(In reply to Radek Steiger from comment #3)
> @Tomas: Then we need a way to set timeout for the whole setup process to
> avoid getting 100% error rate for larger clusters.
--request-timeout


Options:
* set the timeout based on the number of nodes being started, so the timeout gets higher in large clusters
* change the error message so that it suggests setting a higher timeout via the --request-timeout parameter
* improve the documentation to make the difference between --wait and --request-timeout clear

We should also take a look at why it is taking so long to start the cluster.

Comment 6 Tomas Jelinek 2017-09-04 13:27:25 UTC
Created attachment 1321856 [details]
proposed fix

When starting a cluster, pcs now sets the timeout for the start request based on the number of nodes being started:
* 1 to 8 nodes: 60s
* 9 to 16 nodes: 120s
* 17 to 24 nodes: 180s
* and so on
Users can override this by setting a custom timeout with the --request-timeout option.
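
The scaling rule above can be sketched roughly as follows. This is only an illustration of the arithmetic, not the code from the attached patch: the function name, the constants, and the override parameter are hypothetical, and the values beyond 24 nodes assume that "and so on" continues in steps of 60 seconds per 8 nodes.

```python
from typing import Optional

# Assumed from the list above: 60 seconds for every started group of 8 nodes.
SECONDS_PER_STEP = 60
NODES_PER_STEP = 8


def start_request_timeout(node_count: int,
                          request_timeout: Optional[int] = None) -> int:
    """Return the timeout (in seconds) for a cluster start request.

    Hypothetical helper; `request_timeout` stands in for a value the user
    passed via --request-timeout and takes precedence when given.
    """
    if request_timeout is not None:
        return request_timeout
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    # Integer ceiling division: 1-8 nodes -> 1 step, 9-16 -> 2 steps, ...
    steps = (node_count + NODES_PER_STEP - 1) // NODES_PER_STEP
    return SECONDS_PER_STEP * steps


if __name__ == "__main__":
    for nodes in (1, 8, 9, 16, 17, 24):
        print(nodes, start_request_timeout(nodes))
```

For the 16-node cluster from the description, this rule gives 120 seconds instead of the 60 seconds after which the original requests timed out.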

Error messages now suggest using the --request-timeout option when a connection times out (new pcs architecture) or when pcs is unable to connect to a node for whatever reason (old pcs architecture, which does not detect timeouts).

The help and manpage have been improved, and the --request-timeout option is now described explicitly for the cluster start and cluster stop commands.

Comment 11 errata-xmlrpc 2018-04-10 15:39:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0866

