Summary: Config-download: Ansible completed but overcloud deploy command hangs
Product: Red Hat OpenStack
Reporter: Sai Sindhur Malleni <smalleni>
Component: openstack-tripleo-heat-templates
Assignee: Luke Short <lshort>
Status: CLOSED ERRATA
QA Contact: Alexander Chuzhoy <sasha>
Severity: high
Priority: high
Version: 16.1 (Train)
CC: bdobreli, cjeanner, dmacpher, drosenfe, emacchi, gregraka, jamsmith, kecarter, lshort, mburns, mivollme, moddi, morazi, nchandek, psahoo, ravsingh, sclewis, slinaber, spower
Target Milestone: z2
Keywords: Documentation, Reopened, Triaged
Target Release: 16.1 (Train on RHEL 8.2)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-tripleo-heat-templates-11.3.2-1.20200811043436.02c63d2.el8
Doc Type: Bug Fix
Doc Text:
Inadequate timeout values can cause an overcloud deployment to fail after four hours. To prevent these timeout failures, set the following undercloud and overcloud timeout parameters:
** Undercloud timeouts (seconds):
+
.Example
+
----
parameter_defaults:
TokenExpiration: 86400
ZaqarWsTimeout: 86400
----
** Overcloud deploy timeouts (minutes):
+
.Example
+
----
$ openstack overcloud deploy --timeout 1440
----
+
The timeouts are now set.
Last Closed: 2021-08-17 07:21:30 UTC
Type: Bug
Bug Blocks: 1823324, 1823334, 1823352
Description
Sai Sindhur Malleni
2020-01-17 18:25:55 UTC
Eventually the deploy command exits (at the keystone token expiration time of 300m):

+ openstack overcloud deploy --templates -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/podman.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/docker-ha.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/services/neutron-ovn-ha.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml -e /home/stack/containers-prepare-parameter.yaml -e /home/stack/templates/network-environment.yaml -e /home/stack/templates/deploy.yaml -e /home/stack/templates/ceph.yaml -r /home/stack/templates/roles_data.yaml --ntp-server clock1.rdu2.redhat.com,clock.redhat.com,clock2.redhat.com --timeout 600 --config-download-only --config-download-timeout 600

Waiting for messages on queue 'tripleo' with no timeout.
Exception occured while running the command
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/tripleoclient/command.py", line 32, in run
    super(Command, self).run(parsed_args)
  File "/usr/lib/python3.6/site-packages/osc_lib/command/command.py", line 41, in run
    return super(Command, self).run(parsed_args)
  File "/usr/lib/python3.6/site-packages/cliff/command.py", line 185, in run
    return_code = self.take_action(parsed_args) or 0
  File "/usr/lib/python3.6/site-packages/tripleoclient/v1/overcloud_deploy.py", line 973, in take_action
    in_flight_validations=parsed_args.inflight)
  File "/usr/lib/python3.6/site-packages/tripleoclient/workflows/deployment.py", line 370, in config_download
    for payload in base.wait_for_messages(workflow_client, ws, execution):
  File "/usr/lib/python3.6/site-packages/tripleoclient/workflows/base.py", line 61, in wait_for_messages
    for payload in websocket.wait_for_messages(timeout=timeout):
  File "/usr/lib/python3.6/site-packages/tripleoclient/plugin.py", line 153, in wait_for_messages
    message = self.recv()
  File "/usr/lib/python3.6/site-packages/tripleoclient/plugin.py", line 131, in recv
    return json.loads(self._ws.recv())
  File "/usr/lib64/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Expecting value: line 1 column 1 (char 0)

real    300m2.725s
user    0m50.602s
sys     0m3.900s

Looking at the mistral log, we indeed see:

2020-01-17 15:13:18.780 8 DEBUG mistral.engine.workflows [-] Workflow completed [id=f0fb634c-6982-44b8-bc08-2674f28efc6d] check_and_complete /usr/lib/python3.6/site-packages/mistral/engine/workflows.py:454
2020-01-17 15:13:18.805 8 DEBUG mistral.expressions.yaql_expression [-] Start to evaluate YAQL expression. [expression='<% $.deployment_status %>', context={}] evaluate /usr/lib/python3.6/site-packages/mistral/expressions/yaql_expression.py:149
2020-01-17 15:13:18.807 8 DEBUG mistral.expressions.yaql_expression [-] Finished evaluation. [expression='<% $.deployment_status %>', result: {'deployment_status': 'DEPLOY_SUCCESS', 'workflow_status': {'payload': {'deployment_status': 'DEPLOY_SUCCESS', 'execution_id': 'c3b65341-208a-47ad-b66c-8721046facfa', 'message': 'Ansible passed.', 'plan_name': 'overcloud', 'root_execution_id': None, 'status': 'SUCCESS'}, 'type': 'tripleo.deployment.v1.config_download_deploy'}}] evaluate /usr/lib/python3.6/site-packages/mistral/expressions/yaql_expression.py:168

Performance patching of ansible.cfg did not help to mitigate the overcloud hang issue at 150+ node counts. The issue still persists. Please refer to comment #4 with the linked doc for further review.

BR,
Pradipta

I have reviewed the information and have not found a solution yet. The current workaround is to manually run the config-download playbooks. At face value, this looks to be an issue with TripleO's usage of the config-download Mistral workflow.

Created attachment 1655739 [details]
Related log files during a hang that happens around 1:09
Attached are the relevant log files. I personally could not find any evidence in the logs to support why this may be happening.
According to others, there have been some rare issues like this in the past that have not been seen in master (Ussuri) since switching to Ansible Runner for running Ansible commands. That means this might be python-tripleoclient related.
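A side note on the traceback in the description: "Expecting value: line 1 column 1 (char 0)" is simply what Python's json.loads() raises when it is handed an empty string. The most likely sequence, then, is that the Zaqar websocket connection was dropped at token expiry, self._ws.recv() returned an empty frame, and tripleoclient tried to decode it as JSON. A quick, standalone illustration of that error (it only reproduces the message, it is not the tripleoclient code path):

```
# Not the tripleoclient code path; just showing what json.loads("") raises:
$ python3 -c 'import json; json.loads("")'
Traceback (most recent call last):
  ...
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
```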
Just to clarify things related to Mistral: the limitation that you are hitting is likely the 4 hour execution limit (limited by the length of the Keystone token). Mistral doesn't refresh the token and will start to error. The only workaround that I know of is to increase the timeout in Keystone, although I can understand why that might not be popular. This limitation has been around for a long time, and fixing it in Mistral would likely be non-trivial. There is some discussion about doing it in the following bug, but I am not aware of any serious effort to tackle it: https://bugs.launchpad.net/mistral/+bug/1595084

But it looks like the command hangs well before the keystone timeout (we configured it to 300m), with no output while ansible is running in the background, and eventually exits at the token expiration time.

(In reply to Sai Sindhur Malleni from comment #16)
> But it looks like the command hangs well before the keystone timeout (we
> configured it to 300m), with no output while ansible is running in the
> background, and eventually exits at the token expiration time.

"the command hangs" - I assume that means tripleoclient hangs? That usually happens when it is waiting for a message from Zaqar, so we need to figure out what Mistral is doing at that point. I'd be happy to help debug a live env if that would be useful.

*** Bug 1818866 has been marked as a duplicate of this bug. ***

The fixes are available in the latest puddle. Use these parameters on the undercloud (not the overcloud) to increase the lifetime of Keystone authentication tokens and the HAProxy timeout for Zaqar. Be sure to include the full path to the YAML file in the `custom_env_files` option of `undercloud.conf`, then run `openstack undercloud install` again to load the changes. The example below sets these timeouts to 24 hours, which prevents the deployment from hanging after 4 hours.

```
---
parameter_defaults:
  TokenExpiration: 86400
  ZaqarWsTimeout: 86400
```

On the next overcloud deployment, the command should also have an increased timeout (in minutes) set via the CLI:

```
$ openstack overcloud deploy --timeout 1440
```

Verification: created a custom_env_files entry in undercloud.conf that pointed to a YAML file containing:

parameter_defaults:
  TokenExpiration: 86400
  ZaqarWsTimeout: 86400

The undercloud was successfully deployed. In addition, the scale team was able to successfully test the patch in a scale environment.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4284
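Putting the pieces above together, a minimal end-to-end sketch of the workaround. The file name /home/stack/undercloud-timeouts.yaml is only an illustrative choice, and the trailing [...] stands in for the rest of the deploy arguments shown in the description; everything else follows the comments above (TokenExpiration and ZaqarWsTimeout in seconds on the undercloud, --timeout in minutes on the overcloud deploy):

```
# Illustrative file name; any custom environment file readable by the stack user works.
$ cat /home/stack/undercloud-timeouts.yaml
parameter_defaults:
  TokenExpiration: 86400   # Keystone token lifetime on the undercloud, in seconds (24h)
  ZaqarWsTimeout: 86400    # Zaqar websocket (HAProxy) timeout, in seconds (24h)

# Reference the file from undercloud.conf so it is picked up on install:
$ grep custom_env_files ~/undercloud.conf
custom_env_files = /home/stack/undercloud-timeouts.yaml

# Reload the undercloud with the new values, then deploy the overcloud
# with a longer client-side timeout (minutes, not seconds):
$ openstack undercloud install
$ openstack overcloud deploy --timeout 1440 [...]
```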