Description of problem: Hello, For large clouds with more than 130+ nodes the stack create/update is either complete erroneously or hangs without any output or generation of overcloudrc file. Part 1 The issue reported earlier looked like: +--------------------------------------+-----------------+--------+------------------------+----------------+-----------+ (undercloud) $ openstack stack list +--------------------------------------+------------+----------------------------------+-----------------+----------------------+--------------+ | ID | Stack Name | Project | Stack Status | Creation Time | Updated Time | +--------------------------------------+------------+----------------------------------+-----------------+----------------------+--------------+ | 9b4c9e80-d605-442a-8945-3691d60e3eb9 | overcloud | 06823da80cc04fab952e0a0d03f5b344 | CREATE_COMPLETE | 2021-01-27T21:45:54Z | None | +--------------------------------------+------------+----------------------------------+-----------------+----------------------+--------------+ (undercloud) $ openstack overcloud status +-----------+-------------------+ | Plan Name | Deployment Status | +-----------+-------------------+ | overcloud | DEPLOY_SUCCESS | +-----------+-------------------+ so, in actual the deployment is complete but it ends with: Timed out waiting for messages from Execution (ID: 9a5e1227-be08-4902-a17d-4ee31a392ae3, State: RUNNING). The WebSocket timed out before the Workflow completed. Host <> not found in /home/stack/.ssh/known_hosts Overcloud Endpoint: https://<url>:13000 Overcloud Horizon Dashboard URL: https://<url>:443/dashboard Overcloud rc file: /home/stack/overcloudrc Overcloud Deployed with error The openstack overcloud failures command listed: $ openstack overcloud failures |-> Failures for host: undercloud |--> Task: Run tripleo-container-image-prepare logged to: /var/log/tripleo-container-image-prepare.log |---> censored: "the output has been hidden due to the fact that 'no_log: true' was specified for this result" |---> changed: true We later found that there were a lot of stale executions in ERROR & RUNNING state which were cleaned up before running another deploy. ~Timed out waiting for messages from Execution (ID: d8029597-3fe9-4425-8d37-d85e5003c574, State: RUNNING). The WebSocket timed out before the Workflow completed. Overcloud Endpoint: https://<>:13000 Overcloud Horizon Dashboard URL: https://<>:443/dashboard Overcloud rc file: /home/stack/overcloudrc Overcloud Deployed with error Part 2: We then increased the keystone, zaqar and overcloud deploy timeout as per [1] and ran the update again with respective values of 86400s, 86400s and 1440m. This time however the ansible got completed but the overcloud deploy command was in hung state when I last checked the status on call, 24 hours timeout was yet to be complete. Version-Release number of selected component (if applicable): openstack-tripleo-heat-templates-11.3.2-1.20200914170177.el8ost.noarch
[1] - https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html/release_notes/chap-technical_notes#rhea_2020_4284_red_hat_openstack_platform_16_1_2_general_availability_advisory On the undercloud node, included this env file and ran undercloud install again. parameter_defaults: TokenExpiration: 86400 ZaqarWsTimeout: 86400 Overcloud deploy: openstack overcloud deploy --timeout 1440 <with-other-env-files>
Set qe_test_coverage flag to - because problem was seen with 130 node deployment that is larger than can be included in Phase 3 regression. Also, the solution may be to use --quiet workaround.
Scale test with 130 nodes can't be performed, but failure and fix can be simulated. The ttl value in /usr/lib/python3.6/site-packages/tripleoclient/plugin.py was lowered to from 43200 to 1000 and a deploy performed. In this case: openstack stack list - showed CREATE_COMPLETE for stack overcloud openstack overcloud status - showed DEPLOY_SUCCESS overcloud_install.log - Overcloud Deployed with error That verifies overcloud deploy uses the defined ttl and would use the new value: 'ttl': 43200 that is now included in plugin.py In addition the openstack overcloud failures command no longer shows any errors: (undercloud) [stack@undercloud-0 ~]$ openstack overcloud failures (undercloud) [stack@undercloud-0 ~]$
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1.4 director bug fix advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:0817