Bug 1792500

Summary: Config-download: Ansible completed but overcloud deploy command hangs
Product: Red Hat OpenStack Reporter: Sai Sindhur Malleni <smalleni>
Component: openstack-tripleo-heat-templatesAssignee: Luke Short <lshort>
Status: CLOSED ERRATA QA Contact: Alexander Chuzhoy <sasha>
Severity: high Docs Contact:
Priority: high    
Version: 16.1 (Train)CC: bdobreli, cjeanner, dmacpher, drosenfe, emacchi, gregraka, jamsmith, kecarter, lshort, mburns, mivollme, moddi, morazi, nchandek, psahoo, ravsingh, sclewis, slinaber, spower
Target Milestone: z2Keywords: Documentation, Reopened, Triaged
Target Release: 16.1 (Train on RHEL 8.2)   
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version: openstack-tripleo-heat-templates-11.3.2-1.20200811043436.02c63d2.el8 Doc Type: Bug Fix
Doc Text:
Inadequate timeout values can cause an overcloud deployment to fail after four hours. To prevent these timeout failures, set the following undercloud and overcloud timeout parameters: ** Undercloud timeouts (seconds): + .Example + ---- parameter_defaults: TokenExpiration: 86400 ZaqarWsTimeout: 86400 ---- ** Overcloud deploy timeouts (minutes): + .Example + ---- $ openstack overcloud deploy --timeout 1440 ---- + The timeouts are now set.
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-08-17 07:21:30 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 1823324, 1823334, 1823352    
Description Flags
Related log files during a hang that happens around 1:09 none

Description Sai Sindhur Malleni 2020-01-17 18:25:55 UTC
Description of problem:
When scaling the overcloud from 100 to 150 nodes, we encountered a failure after stack create and during running ansible. We re-kicked the deployment using openstack overcloud deploy --config-download-only at around 10:50 UTC Jan 17 2020.

The deployment seems to have completed based on looking at the ansible.log but the overcloud deploy command hands and eventually times out at the keystone expiration timeout.

Deploy command
 openstack overcloud deploy --templates -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/podman.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/docker-ha.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/services/neutron-ovn-ha.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml -e /home/stack/containers-prepare-parameter.yaml -e /home/stack/templates/network-environment.yaml -e /home/stack/templates/deploy.yaml -e /home/stack/templates/ceph.yaml -r /home/stack/templates/roles_data.yaml --ntp-server clock1.rdu2.redhat.com,clock.redhat.com,clock2.redhat.com --timeout 600 --config-download-only --config-download-timeout 600     

Version-Release number of selected component (if applicable):

How reproducible:
100% at this scale

Steps to Reproduce:
1. Deploy overcloud ~100 nodes
2. Overcloud deploy command is stuck but ansible completes

Actual results:
Ansible completed but we do not see the Ansible Passed and overcloudrc updation message

Expected results:
Overcloud deploy command should also update overcloudrc and return successfully once ansible completes

Additional info:

Comment 2 Sai Sindhur Malleni 2020-01-17 19:10:22 UTC
Eventually the deploy command exits (at the keystone token expiration time of 300m)
+ openstack overcloud deploy --templates -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/podman.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/docker-ha.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/services/neutron-ovn-ha.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml -e /home/stack/containers-prepare-parameter.yaml -e /home/stack/templates/network-environment.yaml -e /home/stack/templates/deploy.yaml -e /home/stack/templates/ceph.yaml -r /home/stack/templates/roles_data.yaml --ntp-server clock1.rdu2.redhat.com,clock.redhat.com,clock2.redhat.com --timeout 600 --config-download-only --config-download-timeout 600                                                                    
Waiting for messages on queue 'tripleo' with no timeout.

Exception occured while running the command
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/tripleoclient/command.py", line 32, in run
    super(Command, self).run(parsed_args)
  File "/usr/lib/python3.6/site-packages/osc_lib/command/command.py", line 41, in run
    return super(Command, self).run(parsed_args)
  File "/usr/lib/python3.6/site-packages/cliff/command.py", line 185, in run
    return_code = self.take_action(parsed_args) or 0
  File "/usr/lib/python3.6/site-packages/tripleoclient/v1/overcloud_deploy.py", line 973, in take_action
  File "/usr/lib/python3.6/site-packages/tripleoclient/workflows/deployment.py", line 370, in config_download
    for payload in base.wait_for_messages(workflow_client, ws, execution):
  File "/usr/lib/python3.6/site-packages/tripleoclient/workflows/base.py", line 61, in wait_for_messages
    for payload in websocket.wait_for_messages(timeout=timeout):
  File "/usr/lib/python3.6/site-packages/tripleoclient/plugin.py", line 153, in wait_for_messages
    message = self.recv()
  File "/usr/lib/python3.6/site-packages/tripleoclient/plugin.py", line 131, in recv
    return json.loads(self._ws.recv())
  File "/usr/lib64/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Expecting value: line 1 column 1 (char 0)

real    300m2.725s
user    0m50.602s
sys     0m3.900s

Comment 3 Sai Sindhur Malleni 2020-01-17 19:12:10 UTC
Looking at the mistral log, we indeed see
2020-01-17 15:13:18.780 8 DEBUG mistral.engine.workflows [-] Workflow completed [id=f0fb634c-6982-44b8-bc08-2674f28efc6d] check_and_complete /usr/lib/python3.6/site-packages/mistral/engine/workflows.py:454
2020-01-17 15:13:18.805 8 DEBUG mistral.expressions.yaql_expression [-] Start to evaluate YAQL expression. [expression='<% $.deployment_status %>', context={}] evaluate /usr/lib/python3.6/site-packages/mistral/expressions/yaql_expression.py:149
2020-01-17 15:13:18.807 8 DEBUG mistral.expressions.yaql_expression [-] Finished evaluation. [expression='<% $.deployment_status %>', result: {'deployment_status': 'DEPLOY_SUCCESS', 'workflow_status': {'payload': {'deployment_status': 'DEPLOY_SUCCESS', 'execution_id': 'c3b65341-208a-47ad-b66c-8721046facfa', 'message': 'Ansible passed.', 'plan_name': 'overcloud', 'root_execution_id': None, 'status': 'SUCCESS'}, 'type': 'tripleo.deployment.v1.config_download_deploy'}}] evaluate /usr/lib/python3.6/site-packages/mistral/expressions/yaql_expression.py:168

Comment 4 Pradipta Kumar Sahoo 2020-01-19 18:31:03 UTC
Performance patching for ansible.cfg did not help to mitigate the overcloud hang issue for after 150+ node count. The issue still persists.
Please refer to comment #4 with the linked doc for further review.


Comment 6 Luke Short 2020-01-21 17:44:04 UTC
I have reviewed the information and have not found a solution yet. The current workaround is to manually run the config-download playbooks. At face value, this looks to be an issue with TripleO's usage of the config-download Mistral workflow.

Comment 12 Luke Short 2020-01-27 18:55:21 UTC
Created attachment 1655739 [details]
Related log files during a hang that happens around 1:09

Attached are the relevant log files. I personally could not find any evidence in the logs to support why this may be happening.

According to others, there have been some rare issues like this in the past that have not been seen in master (Ussuri) since switching to Ansible Runner for running Ansible commands. That means this might be python-tripleoclient related.

Comment 15 Dougal Matthews 2020-01-28 16:01:27 UTC
Just to clarify things related to Mistral. The limitation that you are hitting is likely the 4 hour execution limit (limited by the length of the Keystone token). Mistral doesn't refresh the token and will start to error. 

The only workaround that I know of is to increase the timeout in keystone. However, I can understand why that might not be popular.

This limitation has been around for a long time. Fixing it in Mistral would likely be non-trivial. There is some discussion about doing it the following bug but I am not aware of any serious effort to tackle it; https://bugs.launchpad.net/mistral/+bug/1595084

Comment 16 Sai Sindhur Malleni 2020-01-28 16:08:03 UTC
But looks like the command hangs well before the keystone timeout (we configured it to 300m) with no output while ansible is running in the background, and eventually exits at the token expiration time.

Comment 17 Dougal Matthews 2020-01-28 16:24:08 UTC
(In reply to Sai Sindhur Malleni from comment #16)
> But looks like the command hangs well before the keystone timeout (we
> configured it to 300m) with no output while ansible is running in the
> background, and eventually exits at the token expiration time.

"the command hangs" - I assume that means tripleoclient hangs? Which usually happens when it is waiting for a message from Zaqar. So we need to figure out what Mistral is doing at that point.

I'd be happy to help debug a live env if that would be useful.

Comment 19 Luke Short 2020-04-14 20:34:12 UTC
*** Bug 1818866 has been marked as a duplicate of this bug. ***

Comment 26 Luke Short 2020-08-11 18:38:23 UTC
The fixes are available in the latest Puddle. Use these parameters for the Undercloud (not Overcloud) to increase the timeout of Keystone authentication tokens and the HAProxy timeout for Zaqar. Be sure to include the full path to the YAML file in the `undercloud.conf` file for the `custom_env_files` option before running an `openstack undercloud install` again to load the changes. The example below sets these timeouts to 24 hours. This will prevent the deployment from hanging after 4 hours.

  TokenExpiration: 86400
  ZaqarWsTimeout: 86400

Comment 27 Luke Short 2020-08-11 18:47:44 UTC
On the next Overcloud deployment, the command should also have an increased timeout (in minutes) set via the CLI:

$ openstack overcloud deploy --timeout 1440

Comment 30 David Rosenfeld 2020-09-21 18:28:00 UTC
Created a custom_env_files in undercloud.conf that pointed to a yaml containing:

  TokenExpiration: 86400
  ZaqarWsTimeout: 86400

Undercloud was successfully deployed. In addition that scale team was able to successfully test patch in a scale environment.

Comment 40 errata-xmlrpc 2020-10-28 15:36:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.