Bug 1463327
Summary: | Starting a larger cluster times out | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Radek Steiger <rsteiger> | ||||
Component: | pcs | Assignee: | Tomas Jelinek <tojeline> | ||||
Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> | ||||
Severity: | unspecified | Docs Contact: | |||||
Priority: | high | ||||||
Version: | 7.4 | CC: | cfeist, cluster-maint, idevat, omular, tojeline | ||||
Target Milestone: | rc | ||||||
Target Release: | --- | ||||||
Hardware: | Unspecified | ||||||
OS: | Unspecified | ||||||
Whiteboard: | |||||||
Fixed In Version: | pcs-0.9.160-1.el7 | Doc Type: | Bug Fix | ||||
Doc Text: |
Cause:
The user starts a cluster with a large number of nodes.
Consequence:
Pcs exits with an error - connection timeout.
Fix:
Pcs increases the timeout when a larger number of nodes is to be started.
Result:
Cluster starts successfully.
|
Story Points: | --- | ||||
Clone Of: | Environment: | ||||||
Last Closed: | 2018-04-10 15:39:15 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Attachments: |
|
Description
Radek Steiger
2017-06-20 15:11:28 UTC
This is expected. The --wait flag enables waiting for pacemaker on the nodes to be fully started. Therefore its value does not apply to start-cluster requests. It seems the --request-timeout flag is what you are looking for. See bz1229822 and bz1292858 for details.

@Tomas: Then we need a way to set timeout for the whole setup process to avoid getting 100% error rate for larger clusters.

(In reply to Radek Steiger from comment #3)
> @Tomas: Then we need a way to set timeout for the whole setup process to
> avoid getting 100% error rate for larger clusters.

--request-timeout

Options:
* set the timeout based on the number of nodes being started, so the timeout gets higher in large clusters
* change the error message, make it suggest setting a higher timeout via the --request-timeout parameter
* improve documentation, make it clear what the difference is between --wait and --request-timeout
Also we should take a look at why it is taking so long to start the cluster.

Created attachment 1321856
proposed fix
When starting a cluster, pcs now sets the timeout for the start request based on the number of nodes being started:
* 1 to 8 nodes: 60s
* 9 to 16 nodes: 120s
* 17 to 24 nodes: 180s
* and so on
Users can override this by setting a custom timeout with the --request-timeout option.
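For illustration only, the timeout scheme described above can be sketched as a small function. This is not the code from the attached patch: the function name, the constants, and the parameter names are hypothetical, and the values beyond 24 nodes assume that "and so on" means the same 60 seconds per 8 nodes pattern continues.

```python
from typing import Optional

# Assumed from the list above: the timeout grows by one step for every 8 nodes.
NODES_PER_STEP = 8
# Assumed from the list above: each step adds 60 seconds.
SECONDS_PER_STEP = 60


def start_request_timeout(node_count: int,
                          request_timeout: Optional[int] = None) -> int:
    """Return a timeout in seconds for a cluster start request.

    Hypothetical helper, not the pcs implementation. A value passed as
    request_timeout stands in for a user-supplied --request-timeout and
    takes precedence over the computed default.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    if request_timeout is not None:
        return request_timeout
    # Round the node count up to the next multiple of NODES_PER_STEP
    # using integer arithmetic, then scale by the step length.
    steps = (node_count + NODES_PER_STEP - 1) // NODES_PER_STEP
    return steps * SECONDS_PER_STEP
```

With these assumptions, 8 nodes give 60 seconds, 9 nodes give 120 seconds, and a custom value always wins regardless of cluster size.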
Error messages now suggest using the --request-timeout option when a connection times out (new pcs architecture) or when pcs is unable to connect to a node for whatever reason (old pcs architecture, which does not detect timeouts).
The help and man page have been improved, and the --request-timeout option is described explicitly for the cluster start and cluster stop commands.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2018:0866