Bug 1919005 - Large deployment seems to timeout after it says complete even when ansible still running
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-tripleoclient
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: z6
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Alex Schultz
QA Contact: David Rosenfeld
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-01-21 21:42 UTC by Jeremy
Modified: 2024-03-25 17:56 UTC (History)

Fixed In Version: python-tripleoclient-12.3.2-1.20210407123431.ae58329.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-26 13:50:38 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 773187 0 None MERGED Add auth_token_lifetime to undercloud.conf 2021-02-20 20:02:48 UTC
OpenStack gerrit 777326 0 None NEW Add auth_token_lifetime to undercloud.conf 2021-02-24 18:44:58 UTC
Red Hat Issue Tracker OSP-161 0 None None None 2022-10-03 14:34:09 UTC
Red Hat Product Errata RHBA-2021:2097 0 None None None 2021-05-26 13:51:10 UTC

Comment 3 Alex Schultz 2021-01-21 23:28:42 UTC
The default timeout value is defined in the workbook.
https://opendev.org/openstack/tripleo-common/src/branch/stable/train/workbooks/deployment.yaml#L124

This is what gets passed into the deployment workflow.
https://opendev.org/openstack/tripleo-common/src/branch/stable/train/workbooks/deployment.yaml#L211

However if you specify --config-download-timeout, this should be used as the value for the deployment timeout.
https://opendev.org/openstack/python-tripleoclient/src/branch/stable/train/tripleoclient/v1/overcloud_deploy.py#L1076

If you don't specify --config-download-timeout, the remainder of the time specified from the --timeout value should be used.
https://opendev.org/openstack/python-tripleoclient/src/branch/stable/train/tripleoclient/v1/overcloud_deploy.py#L1098

The config_download_timeout is specified:
https://opendev.org/openstack/python-tripleoclient/src/branch/stable/train/tripleoclient/workflows/deployment.py#L355

The default is 14400 seconds (240 minutes):
https://opendev.org/openstack/tripleo-common/src/branch/stable/train/workbooks/deployment.yaml#L376

This should be used when invoking ansible-playbook:
https://opendev.org/openstack/tripleo-common/src/branch/stable/train/workbooks/deployment.yaml#L526

In theory --timeout 480 and --config-download-timeout 28800 should extend the overall deployment and config download timeouts.

Comment 4 Alex Schultz 2021-01-21 23:50:32 UTC
Sorry, --config-download-timeout 480 should be enough, because we do the necessary math: timeout = parsed_args.config_download_timeout * 60
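
The timeout selection described in Comments 3 and 4 can be sketched roughly as follows. This is an illustration only, not the tripleoclient code: the function name, its parameters, and the elapsed_seconds argument are hypothetical, and it assumes that "the remainder of the time" means the --timeout budget minus the time already spent.

```python
def config_download_timeout_seconds(timeout_min, config_download_timeout_min=None,
                                    elapsed_seconds=0):
    """Return the config-download timeout in seconds (hypothetical helper)."""
    if config_download_timeout_min:
        # An explicit --config-download-timeout is given in minutes,
        # so it is multiplied by 60 to get seconds.
        return config_download_timeout_min * 60
    # Otherwise fall back to what is left of the overall --timeout budget
    # (assumption: budget in minutes minus seconds already used).
    return timeout_min * 60 - elapsed_seconds


# --config-download-timeout 480 already yields 28800 seconds.
print(config_download_timeout_seconds(240, config_download_timeout_min=480))  # 28800
# Only --timeout 480 given, one hour already spent before config download.
print(config_download_timeout_seconds(480, elapsed_seconds=3600))  # 25200
```

Under these assumptions, passing 28800 to --config-download-timeout would be multiplied by 60 again, which is why 480 is the intended value.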

Comment 5 Alex Schultz 2021-01-25 20:59:16 UTC
I tracked down where the timeout issue is actually happening. While the timeouts are configurable, the overall deployment process is still at the mercy of the keystone auth token timeout. The deployment workflow in mistral is running and continually posts its output to zaqar so that the client can follow along. The problem comes when mistral fails to post a message to zaqar (failed: Error response from Zaqar. Code: 401): the client quits with an error while the ansible execution may still be running.

environments/undercloud.yaml:  TokenExpiration: 14400

So the default token expiration is 14400 seconds (240 minutes). The TokenExpiration needs to be larger than the longest deployment time. Providing an updated value via an environment file referenced in undercloud.conf and re-running the undercloud installation should increase this.
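
As a rough check of the arithmetic above, the following sketch compares the default TokenExpiration with a deployment timeout given in minutes. The function name is hypothetical, and the 480-minute figure is simply the --timeout value used as an example in Comment 3.

```python
TOKEN_EXPIRATION_SECONDS = 14400  # default from environments/undercloud.yaml


def token_covers_deployment(token_expiration_seconds, deploy_timeout_minutes):
    """True if the token stays valid longer than the deployment may run."""
    return token_expiration_seconds > deploy_timeout_minutes * 60


print(TOKEN_EXPIRATION_SECONDS // 60)                          # 240
print(token_covers_deployment(TOKEN_EXPIRATION_SECONDS, 480))  # False
print(token_covers_deployment(TOKEN_EXPIRATION_SECONDS, 200))  # True
```

With the default value, any deployment allowed to run past 240 minutes can outlive its token, regardless of the --timeout and --config-download-timeout settings.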

Comment 6 David Rosenfeld 2021-01-28 21:00:27 UTC
Set qe_test_coverage to - because large deployments aren't appropriate for automation in CI. Also, it looks like the fix may be to change existing configuration parameters.

Comment 7 Allan Greentree 2021-02-07 16:36:16 UTC
Latest comment in the case from the customer:

So the latest advice we got from engineering was to run /var/lib/mistral/overcloud/ansible-playbook-command.sh directly and reduce the number of forks from 480 to 25. This method somehow works, but it is extremely slow. We're hitting playbook errors as we go, which we clear along the way. However, the last error happened more than 16 hours into the playbook run, which means that we have to rerun and wait at least another 16 hours for the playbook to finish.

Sunday 07 February 2021  03:21:57 +0000 (0:00:08.434)       16:55:46.307 ******
===============================================================================
tripleo-hosts-entries : Render out the hosts entries ----------------------------------------------------------------------------------------------------------------------------------------------------- 504.35s
Render all_nodes data as group_vars for overcloud -------------------------------------------------------------------------------------------------------------------------------------------------------- 465.34s
redhat-subscription : Manage Red Hat subscription -------------------------------------------------------------------------------------------------------------------------------------------------------- 341.54s
redhat-subscription : Configure repository subscriptions ------------------------------------------------------------------------------------------------------------------------------------------------- 250.51s
redhat-subscription : Manage Red Hat subscription -------------------------------------------------------------------------------------------------------------------------------------------------------- 162.71s
redhat-subscription : Manage Red Hat subscription -------------------------------------------------------------------------------------------------------------------------------------------------------- 133.45s
redhat-subscription : Manage Red Hat subscription -------------------------------------------------------------------------------------------------------------------------------------------------------- 132.44s
tripleo-hieradata : Render hieradata from template ------------------------------------------------------------------------------------------------------------------------------------------------------- 123.87s
include_tasks -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 120.23s
Ensure ansible_managed hieradata file exists ------------------------------------------------------------------------------------------------------------------------------------------------------------- 118.00s
Hieradata from vars -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 117.82s
include_role : tripleo-ssh-known-hosts ------------------------------------------------------------------------------------------------------------------------------------------------------------------- 116.98s
Hiera config --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 116.80s
redhat-subscription : Configure repository subscriptions ------------------------------------------------------------------------------------------------------------------------------------------------- 115.70s
Configure Hosts Entries ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 114.73s
include_role : tripleo-bootstrap ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 113.16s
redhat-subscription : Configure repository subscriptions ------------------------------------------------------------------------------------------------------------------------------------------------- 103.10s
include_role : tuned ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 101.40s
redhat-subscription : Configure repository subscriptions -------------------------------------------------------------------------------------------------------------------------------------------------- 98.88s
Install, Configure and Run Chrony ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 98.54s

I'm going to include the /var/lib/mistral/overcloud directory, which, among other things, contains ansible.log and ansible.cfg.

Comment 8 Alex Schultz 2021-02-16 21:07:28 UTC
The upstream patch allows a user to specify the keystone token expiration in undercloud.conf, since it needs to be easy to configure for deployments at scale. Efforts to address ansible issues in the OSP deployment process are being tracked in Bug 1911891.

Comment 9 Alex Schultz 2021-02-24 18:44:58 UTC
We'll be using this bug to track the ability to configure the keystone token lifetime via undercloud.conf. The additional issues with execution time will be tracked via Bug 1911891.

Comment 17 David Rosenfeld 2021-04-20 20:33:04 UTC
Used a 1cont_1comp_3ceph topology:

With auth_token_lifetime = 500 in undercloud.conf, overcloud deploy times out.
With auth_token_lifetime = 1000 in undercloud.conf, overcloud deploy is successful.

That means auth_token_lifetime may be specified in undercloud.conf and the specified value is used.
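
For checking what value an undercloud.conf would supply, a minimal sketch using Python's configparser is shown below. It rests on several assumptions: that the option lives in the [DEFAULT] section, that the fallback matches the 14400 TokenExpiration default mentioned in Comment 5, and that the sample text stands in for the real file. The helper name is hypothetical.

```python
import configparser

# Stand-in for the contents of undercloud.conf (section placement assumed).
SAMPLE_CONF = """
[DEFAULT]
auth_token_lifetime = 1000
"""


def read_auth_token_lifetime(conf_text, default=14400):
    """Return auth_token_lifetime from the given text, or the default."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    # getint falls back to the default when the option is not set.
    return parser.getint("DEFAULT", "auth_token_lifetime", fallback=default)


print(read_auth_token_lifetime(SAMPLE_CONF))    # 1000
print(read_auth_token_lifetime("[DEFAULT]\n"))  # 14400
```

After changing the value, the undercloud installation has to be re-run for it to take effect, as noted in Comment 5.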

Comment 24 errata-xmlrpc 2021-05-26 13:50:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.6 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2097

