Bug 1927356 - [rhosp16.1] overcloud deployment hangs however ansible is complete
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-tripleoclient
Version: 16.1 (Train)
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: z4
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Rabi Mishra
QA Contact: David Rosenfeld
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-02-10 15:35 UTC by Ketan Mehta
Modified: 2024-06-14 00:14 UTC (History)

Fixed In Version: python-tripleoclient-12.3.2-1.20201114043247.1.el8ost tripleo-ansible-0.5.1-1.20201114030848.2.el8ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-17 15:36:53 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
OpenStack gerrit | 776612 | 0 | None | MERGED | [Train] Bump queue subscription ttl for WebsocketClient | 2021-03-03 12:16:23 UTC
OpenStack gerrit | 776878 | 0 | None | MERGED | Revert "Remove json_error callback" | 2021-03-03 12:16:24 UTC
Red Hat Issue Tracker | OSP-3100 | 0 | None | None | None | 2022-08-23 18:52:25 UTC
Red Hat Product Errata | RHBA-2021:0817 | 0 | None | None | None | 2021-03-17 15:37:15 UTC

Description Ketan Mehta 2021-02-10 15:35:09 UTC
Description of problem:

Hello,

For large clouds with 130+ nodes, the stack create/update either finishes with a spurious error or hangs without any output and without generating the overcloudrc file.

Part 1

The issue reported earlier looked like:

(undercloud) $ openstack stack list
+--------------------------------------+------------+----------------------------------+-----------------+----------------------+--------------+
| ID                                   | Stack Name | Project                          | Stack Status    | Creation Time        | Updated Time |
+--------------------------------------+------------+----------------------------------+-----------------+----------------------+--------------+
| 9b4c9e80-d605-442a-8945-3691d60e3eb9 | overcloud  | 06823da80cc04fab952e0a0d03f5b344 | CREATE_COMPLETE | 2021-01-27T21:45:54Z | None         |
+--------------------------------------+------------+----------------------------------+-----------------+----------------------+--------------+
(undercloud) $ openstack overcloud status
+-----------+-------------------+
| Plan Name | Deployment Status |
+-----------+-------------------+
| overcloud |   DEPLOY_SUCCESS  |
+-----------+-------------------+

So the deployment is actually complete, but the deploy command ends with:

Timed out waiting for messages from Execution (ID: 9a5e1227-be08-4902-a17d-4ee31a392ae3, State: RUNNING). The WebSocket timed out before the Workflow completed.
Host <> not found in /home/stack/.ssh/known_hosts
Overcloud Endpoint: https://<url>:13000
Overcloud Horizon Dashboard URL: https://<url>:443/dashboard
Overcloud rc file: /home/stack/overcloudrc
Overcloud Deployed with error

The openstack overcloud failures command listed:

 $ openstack overcloud failures
|-> Failures for host: undercloud
|--> Task: Run tripleo-container-image-prepare logged to: /var/log/tripleo-container-image-prepare.log
|---> censored: "the output has been hidden due to the fact that 'no_log: true' was specified for this result"
|---> changed: true

We later found that there were a lot of stale executions in ERROR & RUNNING state which were cleaned up before running another deploy.

Timed out waiting for messages from Execution (ID: d8029597-3fe9-4425-8d37-d85e5003c574, State: RUNNING). The WebSocket timed out before the Workflow completed.
Overcloud Endpoint: https://<>:13000
Overcloud Horizon Dashboard URL: https://<>:443/dashboard
Overcloud rc file: /home/stack/overcloudrc
Overcloud Deployed with error
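
As an illustration only, the stale executions in ERROR and RUNNING state mentioned above could be picked out of a workflow execution listing with a small filter like the one below. This is a sketch, not the procedure used in this case: it assumes the listing has been exported as JSON with "ID" and "State" keys (for example from `openstack workflow execution list -f json`; the key names are an assumption), and every sample entry other than the first ID is hypothetical.

```python
import json

# States treated as stale in this report: ERROR and RUNNING.
STALE_STATES = {"ERROR", "RUNNING"}


def stale_execution_ids(executions_json):
    """Return the IDs of executions whose state is ERROR or RUNNING.

    Assumes a JSON list of objects with "ID" and "State" keys
    (the key names are an assumption, not confirmed by the report).
    """
    executions = json.loads(executions_json)
    return [e["ID"] for e in executions if e.get("State") in STALE_STATES]


# Sample data: the first ID is taken from the error message in the report,
# the other two entries are hypothetical placeholders.
sample = json.dumps([
    {"ID": "9a5e1227-be08-4902-a17d-4ee31a392ae3", "State": "RUNNING"},
    {"ID": "hypothetical-0001", "State": "ERROR"},
    {"ID": "hypothetical-0002", "State": "SUCCESS"},
])

for execution_id in stale_execution_ids(sample):
    print(execution_id)
```

The filter only lists candidates; whether an execution is really stale, and safe to delete, still needs to be checked by hand before cleaning it up.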

Part 2:

We then increased the keystone token, zaqar WebSocket and overcloud deploy timeouts as per [1], to 86400s, 86400s and 1440m respectively, and ran the update again.

This time, Ansible completed, but the overcloud deploy command was still hung when I last checked the status on a call; the 24-hour timeout had not yet expired.
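
The three values above are quoted in different units, so a quick check that they all describe the same 24-hour window may be useful. The snippet below is just that arithmetic; the variable names mirror the parameters quoted in this report and the helper function is illustrative.

```python
# Timeout values quoted in the report, in the units given there.
TOKEN_EXPIRATION_S = 86400   # keystone token lifetime, seconds
ZAQAR_WS_TIMEOUT_S = 86400   # zaqar WebSocket timeout, seconds
DEPLOY_TIMEOUT_MIN = 1440    # openstack overcloud deploy --timeout, minutes


def to_hours(value, unit):
    """Convert a timeout given in seconds ("s") or minutes ("m") to hours."""
    per_hour = {"s": 3600, "m": 60}
    return value / per_hour[unit]


timeouts_h = {
    "TokenExpiration": to_hours(TOKEN_EXPIRATION_S, "s"),
    "ZaqarWsTimeout": to_hours(ZAQAR_WS_TIMEOUT_S, "s"),
    "deploy --timeout": to_hours(DEPLOY_TIMEOUT_MIN, "m"),
}

for name, hours in timeouts_h.items():
    print(f"{name}: {hours:g} h")
```

All three come out at 24 hours, which matches the 24-hour timeout referred to in the description.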

Version-Release number of selected component (if applicable):

openstack-tripleo-heat-templates-11.3.2-1.20200914170177.el8ost.noarch

Comment 1 Ketan Mehta 2021-02-10 15:36:30 UTC
[1] - https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html/release_notes/chap-technical_notes#rhea_2020_4284_red_hat_openstack_platform_16_1_2_general_availability_advisory

On the undercloud node, this env file was included and the undercloud install was run again.

parameter_defaults:
  TokenExpiration: 86400
  ZaqarWsTimeout: 86400

Overcloud deploy:

openstack overcloud deploy --timeout 1440 <with-other-env-files>

Comment 7 David Rosenfeld 2021-02-17 21:10:36 UTC
Set the qe_test_coverage flag to '-' because the problem was seen with a 130-node deployment, which is larger than what can be included in Phase 3 regression. Also, the solution may be to use the --quiet workaround.

Comment 39 David Rosenfeld 2021-03-08 13:33:11 UTC
A scale test with 130 nodes can't be performed, but the failure and the fix can be simulated. The ttl value in /usr/lib/python3.6/site-packages/tripleoclient/plugin.py was lowered from 43200 to 1000 and a deploy was performed. In this case:

openstack stack list - showed CREATE_COMPLETE for stack overcloud
openstack overcloud status - showed DEPLOY_SUCCESS
overcloud_install.log - Overcloud Deployed with error

That verifies that overcloud deploy uses the defined ttl and would use the new value, 'ttl': 43200, that is now included in plugin.py.
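
The reasoning behind this simulation can be sketched as a toy model: if the client's queue subscription expires before the deployment finishes, the client reports an error even though the stack itself completes. The code below is a simplified illustration and not the tripleoclient implementation; the function name, the 2-hour deployment duration and the success label are hypothetical, and only the ttl values and the error message come from this report.

```python
def client_outcome(deploy_seconds, ttl_seconds):
    """Toy model of what the deploy client reports.

    Assumption: the client stops receiving workflow messages once the
    subscription ttl has elapsed, so a deployment that runs longer than
    the ttl looks like a failure to the client, regardless of the
    actual stack status.
    """
    if deploy_seconds > ttl_seconds:
        return "Overcloud Deployed with error"
    return "client saw the workflow complete"  # hypothetical label


DEPLOY_SECONDS = 2 * 3600  # hypothetical deployment duration: 2 hours

for ttl in (1000, 43200):
    print(f"ttl={ttl}: {client_outcome(DEPLOY_SECONDS, ttl)}")
```

With the lowered ttl of 1000 the model reproduces the "Overcloud Deployed with error" result seen in overcloud_install.log, while the ttl of 43200 covers the hypothetical deployment with room to spare.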


In addition the openstack overcloud failures command no longer shows any errors:


(undercloud) [stack@undercloud-0 ~]$ openstack overcloud failures
(undercloud) [stack@undercloud-0 ~]$

Comment 44 errata-xmlrpc 2021-03-17 15:36:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.4 director bug fix advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0817

