Bug 1463327

Summary:

Starting a larger cluster times out

Product:

Red Hat Enterprise Linux 7

Reporter:

Radek Steiger <rsteiger>

Component:

pcs

Assignee:

Tomas Jelinek <tojeline>

Status:

CLOSED ERRATA

QA Contact:

cluster-qe <cluster-qe>

Severity:

unspecified

Docs Contact:

Priority:

high

Version:

7.4

CC:

cfeist, cluster-maint, idevat, omular, tojeline

Target Milestone:

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

pcs-0.9.160-1.el7

Doc Type:

Bug Fix

Doc Text:

Cause: The user starts a cluster with a large number of nodes. Consequence: pcs exits with a connection timeout error. Fix: pcs increases the timeout when a larger number of nodes is being started. Result: The cluster starts successfully.

Story Points:

---

Clone Of:

Environment:

Last Closed:

2018-04-10 15:39:15 UTC

Type:

Bug

Attachments:

Description	Flags
proposed fix	none

Description Radek Steiger 2017-06-20 15:11:28 UTC

> Description of problem:

No matter what numerical value is passed to the --wait flag, it is always ignored by 'pcs cluster start' commands and possibly others. Two examples follow.

Doesn't wait for the requested 4 minutes, bails out after one minute instead:

[root@virt-270 ~]# time pcs cluster start --wait=240 --all
virt-286: Starting Cluster...
virt-281: Starting Cluster...
virt-280: Starting Cluster...
virt-294: Starting Cluster...
virt-282: Starting Cluster...
virt-284: Starting Cluster...
virt-298: Starting Cluster...
virt-296: Starting Cluster...
virt-299: Starting Cluster...
virt-303: Starting Cluster...
virt-270: Unable to connect to virt-270 (Operation timed out after 60001 milliseconds with 0 out of -1 bytes received)
virt-279: Unable to connect to virt-279 (Operation timed out after 60001 milliseconds with 0 out of -1 bytes received)
virt-295: Unable to connect to virt-295 (Operation timed out after 60000 milliseconds with 0 out of -1 bytes received)
virt-297: Unable to connect to virt-297 (Operation timed out after 60004 milliseconds with 0 out of -1 bytes received)
virt-301: Unable to connect to virt-301 (Operation timed out after 60002 milliseconds with 0 out of -1 bytes received)
virt-302: Unable to connect to virt-302 (Operation timed out after 60001 milliseconds with 0 out of -1 bytes received)
Error: unable to start all nodes
virt-270: Unable to connect to virt-270 (Operation timed out after 60001 milliseconds with 0 out of -1 bytes received)
virt-279: Unable to connect to virt-279 (Operation timed out after 60001 milliseconds with 0 out of -1 bytes received)
virt-295: Unable to connect to virt-295 (Operation timed out after 60000 milliseconds with 0 out of -1 bytes received)
virt-297: Unable to connect to virt-297 (Operation timed out after 60004 milliseconds with 0 out of -1 bytes received)
virt-301: Unable to connect to virt-301 (Operation timed out after 60002 milliseconds with 0 out of -1 bytes received)
virt-302: Unable to connect to virt-302 (Operation timed out after 60001 milliseconds with 0 out of -1 bytes received)
real	1m10.860s
user	0m9.056s
sys	0m1.912s

Doesn't bail out after 20 seconds:

[root@virt-270 ~]# time pcs cluster start --all --wait=20
virt-284: Starting Cluster...
virt-279: Starting Cluster...
virt-286: Starting Cluster...
virt-294: Starting Cluster...
virt-280: Starting Cluster...
virt-296: Starting Cluster...
virt-282: Starting Cluster...
virt-281: Starting Cluster...
virt-270: Starting Cluster...
virt-302: Starting Cluster...
virt-301: Starting Cluster...
virt-299: Starting Cluster...
virt-295: Starting Cluster...
virt-298: Starting Cluster...
virt-303: Starting Cluster...
virt-297: Starting Cluster...
Waiting for node(s) to start...
virt-281: Started
virt-294: Started
virt-284: Started
virt-298: Started
virt-296: Started
virt-282: Started
virt-279: Started
virt-286: Started
virt-302: Started
virt-280: Started
virt-295: Started
virt-299: Started
virt-301: Started
virt-303: Started
virt-297: Started
virt-270: Started
real	0m56.122s
user	0m25.162s
sys	0m5.102s


> Version-Release number of selected component (if applicable):

pcs-0.9.158-6.el7


> How reproducible:

Always


> Steps to Reproduce:

See the description.

Comment 2 Tomas Jelinek 2017-06-30 15:17:15 UTC

This is expected. The --wait flag enables waiting for pacemaker on the nodes to be fully started. Therefore its value does not apply to start-cluster requests. It seems the --request-timeout flag is what you are looking for. See bz1229822 and bz1292858 for details.
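
To illustrate the distinction, here is a simplified model of the two phases, not the actual pcs code. The function name `start_cluster` and the callbacks `send_start` and `is_started` are hypothetical stand-ins. The sketch assumes the start requests are bounded only by the request timeout, and that the --wait value starts counting only once every start request has succeeded:

```python
import time


def start_cluster(nodes, send_start, is_started, request_timeout, wait_timeout,
                  clock=time.monotonic, sleep=time.sleep):
    """Simplified two-phase model of a cluster start (hypothetical, not pcs code)."""
    # Phase 1: send the start request to every node. Each request is bounded
    # by request_timeout; wait_timeout plays no role here.
    failed = [node for node in nodes if not send_start(node, request_timeout)]
    if failed:
        return "unable to start all nodes: " + ", ".join(failed)

    # Phase 2: only now does wait_timeout begin counting, while polling
    # until every node reports that it is fully started.
    deadline = clock() + wait_timeout
    pending = set(nodes)
    while pending:
        pending = {node for node in pending if not is_started(node)}
        if not pending:
            break
        if clock() >= deadline:
            return "timed out waiting for: " + ", ".join(sorted(pending))
        sleep(1)
    return "started"


def slow_request(node, timeout):
    # Simulated node whose start request needs 90 seconds to complete.
    return timeout >= 90


# A large wait value does not help when the request itself times out ...
print(start_cluster(["a", "b"], slow_request, lambda n: True, 60, 240))
# ... and a small wait value does not shorten the request phase.
print(start_cluster(["a", "b"], slow_request, lambda n: True, 120, 20))
```

Under these assumptions, the first call fails in phase 1 despite the 240-second wait value, which matches the first example in the description. The second call succeeds even though the request phase alone is allowed to exceed the 20-second wait value.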

Comment 3 Radek Steiger 2017-06-30 15:23:03 UTC

@Tomas: Then we need a way to set timeout for the whole setup process to avoid getting 100% error rate for larger clusters.

Comment 4 Tomas Jelinek 2017-07-21 11:35:47 UTC

(In reply to Radek Steiger from comment #3)
> @Tomas: Then we need a way to set timeout for the whole setup process to
> avoid getting 100% error rate for larger clusters.
--request-timeout


Options:
* set the timeout based on the number of nodes being started, so the timeout gets higher in large clusters
* change the error message so that it suggests setting a higher timeout via the --request-timeout parameter
* improve the documentation to make the difference between --wait and --request-timeout clear

We should also look into why it takes so long to start the cluster.

Comment 6 Tomas Jelinek 2017-09-04 13:27:25 UTC

Created attachment 1321856
proposed fix

When starting a cluster, pcs now sets the timeout for the start request based on the number of nodes being started:
* 1 to 8 nodes: 60s
* 9 to 16 nodes: 120s
* 17 to 24 nodes: 180s
* and so on
Users can override this by setting a custom timeout with the --request-timeout option.
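
The scaling rule above can be sketched as follows. This is a minimal illustration, not the code from the attached patch: the function name `start_request_timeout` and its parameters are hypothetical, and it assumes the "and so on" continues the same pattern of 60 seconds per started group of 8 nodes.

```python
def start_request_timeout(node_count, base_timeout=60, nodes_per_step=8,
                          user_timeout=None):
    """Return the start request timeout in seconds (hypothetical helper)."""
    # A user-supplied value (as with --request-timeout) overrides the
    # computed default.
    if user_timeout is not None:
        return user_timeout
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    # Integer ceiling division: 1-8 nodes -> 1 step, 9-16 -> 2 steps, ...
    steps = (node_count + nodes_per_step - 1) // nodes_per_step
    return base_timeout * steps


for count in (1, 8, 9, 16, 17, 24):
    print(count, start_request_timeout(count))
```

With this rule, the 16-node cluster from the description would get a 120-second start request timeout instead of the previous 60 seconds.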

Error messages now suggest using the --request-timeout option when a connection times out (new pcs architecture) or when pcs is unable to connect to a node for whatever reason (old pcs architecture, which does not detect timeouts).

The help and man page have been improved, and the --request-timeout option is now described explicitly for the cluster start and cluster stop commands.

Comment 11 errata-xmlrpc 2018-04-10 15:39:15 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0866