Cause:
A user starts a cluster with a large number of nodes.
Consequence:
pcs exits with a connection timeout error.
Fix:
The timeout is increased when a large number of nodes is being started.
Result:
Cluster starts successfully.
> Description of problem:
No matter what numerical value is passed to the --wait flag, it is always ignored by 'pcs cluster start' commands and possibly others. Two examples follow.
Doesn't wait for the requested 4 minutes, bails out after one minute instead:
[root@virt-270 ~]# time pcs cluster start --wait=240 --all
virt-286: Starting Cluster...
virt-281: Starting Cluster...
virt-280: Starting Cluster...
virt-294: Starting Cluster...
virt-282: Starting Cluster...
virt-284: Starting Cluster...
virt-298: Starting Cluster...
virt-296: Starting Cluster...
virt-299: Starting Cluster...
virt-303: Starting Cluster...
virt-270: Unable to connect to virt-270 (Operation timed out after 60001 milliseconds with 0 out of -1 bytes received)
virt-279: Unable to connect to virt-279 (Operation timed out after 60001 milliseconds with 0 out of -1 bytes received)
virt-295: Unable to connect to virt-295 (Operation timed out after 60000 milliseconds with 0 out of -1 bytes received)
virt-297: Unable to connect to virt-297 (Operation timed out after 60004 milliseconds with 0 out of -1 bytes received)
virt-301: Unable to connect to virt-301 (Operation timed out after 60002 milliseconds with 0 out of -1 bytes received)
virt-302: Unable to connect to virt-302 (Operation timed out after 60001 milliseconds with 0 out of -1 bytes received)
Error: unable to start all nodes
virt-270: Unable to connect to virt-270 (Operation timed out after 60001 milliseconds with 0 out of -1 bytes received)
virt-279: Unable to connect to virt-279 (Operation timed out after 60001 milliseconds with 0 out of -1 bytes received)
virt-295: Unable to connect to virt-295 (Operation timed out after 60000 milliseconds with 0 out of -1 bytes received)
virt-297: Unable to connect to virt-297 (Operation timed out after 60004 milliseconds with 0 out of -1 bytes received)
virt-301: Unable to connect to virt-301 (Operation timed out after 60002 milliseconds with 0 out of -1 bytes received)
virt-302: Unable to connect to virt-302 (Operation timed out after 60001 milliseconds with 0 out of -1 bytes received)
real 1m10.860s
user 0m9.056s
sys 0m1.912s
Doesn't bail out after 20 seconds:
[root@virt-270 ~]# time pcs cluster start --all --wait=20
virt-284: Starting Cluster...
virt-279: Starting Cluster...
virt-286: Starting Cluster...
virt-294: Starting Cluster...
virt-280: Starting Cluster...
virt-296: Starting Cluster...
virt-282: Starting Cluster...
virt-281: Starting Cluster...
virt-270: Starting Cluster...
virt-302: Starting Cluster...
virt-301: Starting Cluster...
virt-299: Starting Cluster...
virt-295: Starting Cluster...
virt-298: Starting Cluster...
virt-303: Starting Cluster...
virt-297: Starting Cluster...
Waiting for node(s) to start...
virt-281: Started
virt-294: Started
virt-284: Started
virt-298: Started
virt-296: Started
virt-282: Started
virt-279: Started
virt-286: Started
virt-302: Started
virt-280: Started
virt-295: Started
virt-299: Started
virt-301: Started
virt-303: Started
virt-297: Started
virt-270: Started
real 0m56.122s
user 0m25.162s
sys 0m5.102s
> Version-Release number of selected component (if applicable):
pcs-0.9.158-6.el7
> How reproducible:
Always
> Steps to Reproduce:
See the description.
This is expected. The --wait flag enables waiting for pacemaker on the nodes to be fully started. Therefore its value does not apply to start-cluster requests. It seems the --request-timeout flag is what you are looking for. See bz1229822 and bz1292858 for details.
(In reply to Radek Steiger from comment #3)
> @Tomas: Then we need a way to set timeout for the whole setup process to
> avoid getting 100% error rate for larger clusters.
--request-timeout
Options:
* set the timeout based on number of nodes being started, so the timeout gets higher in large clusters
* change the error message so that it suggests setting a higher timeout via the --request-timeout parameter
* improve the documentation to make the difference between --wait and --request-timeout clear
We should also look into why it takes so long to start the cluster.
Created attachment 1321856
proposed fix
When starting a cluster, pcs now sets the timeout for the start request based on the number of nodes being started:
* 1 to 8 nodes: 60s
* 9 to 16 nodes: 120s
* 17 to 24 nodes: 180s
* and so on
Users can override this by setting a custom timeout with the --request-timeout option.
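The scaling rule above can be sketched in Python as follows. This is only an illustration of the arithmetic, not the code from the attached patch: the function name, the constants, and the extrapolation of "and so on" to 60 seconds per further group of 8 nodes are assumptions.

```python
import math

# Assumption: step sizes are inferred from the list above (60 s per 8 nodes).
NODES_PER_STEP = 8
SECONDS_PER_STEP = 60


def start_request_timeout(node_count, request_timeout=None):
    """Return the start request timeout in seconds (hypothetical helper)."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    if request_timeout is not None:
        # A value given by the user via --request-timeout takes precedence.
        return request_timeout
    # 1-8 nodes -> 60, 9-16 nodes -> 120, 17-24 nodes -> 180, ...
    return math.ceil(node_count / NODES_PER_STEP) * SECONDS_PER_STEP
```

Under this sketch, a 16-node cluster such as the one in the second example of the description would get a 120-second timeout unless the user passes a different value.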
Error messages now suggest using the --request-timeout option when a connection times out (new pcs architecture) or when pcs is unable to connect to a node for whatever reason (old pcs architecture, which does not detect timeouts).
The help and man page have been improved, and the --request-timeout option is now described explicitly for the cluster start and cluster stop commands.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2018:0866