Bug 2213408

Summary: wait_for_message_queue() in heat_launcher.py sets an aggressive timeout.
Product: Red Hat OpenStack
Reporter: Keigo Noha <knoha>
Component: python-tripleoclient
Assignee: James Slagle <jslagle>
Status: CLOSED ERRATA
QA Contact: David Rosenfeld <drosenfe>
Severity: high
Priority: medium
Version: 17.0 (Wallaby)
CC: cmuresan, dhughes, jslagle, ltoscano, mariel, mburns, ramishra
Target Milestone: z4
Keywords: Triaged
Target Release: 17.1
Hardware: x86_64
OS: Linux
Fixed In Version: python-tripleoclient-16.5.1-17.1.20240913100806.f3599d0.el9ost
Doc Type: If docs needed, set a value
Last Closed: 2024-11-21 09:32:13 UTC
Type: Bug
Bug Blocks: 2222869

Description Keigo Noha 2023-06-08 05:19:26 UTC
Description of problem:
wait_for_message_queue() in heat_launcher.py sets an aggressive timeout.

In one environment, ephemeral-heat took about 30 seconds to launch its heat-engine workers.
~~~
2023-06-05 13:51:06.526 1 DEBUG heat-api-noauth [-] ******************************************************************************** log_opt_values /usr/lib/python3.9/site-packages/oslo_config/cfg.py:2593
2023-06-05 13:51:06.527 1 DEBUG heat-api-noauth [-] Configuration options gathered from: log_opt_values /usr/lib/python3.9/site-packages/oslo_config/cfg.py:2594
2023-06-05 13:51:06.527 1 DEBUG heat-api-noauth [-] command line args: ['--config-file', '/etc/heat/heat.conf'] log_opt_values /usr/lib/python3.9/site-packages/oslo_config/cfg.py:2595
2023-06-05 13:51:06.527 1 DEBUG heat-api-noauth [-] config files: ['/etc/heat/heat.conf'] log_opt_values /usr/lib/python3.9/site-packages/oslo_config/cfg.py:2596
2023-06-05 13:51:06.527 1 DEBUG heat-api-noauth [-] ================================================================================ log_opt_values /usr/lib/python3.9/site-packages/oslo_config/cfg.py:2598
:
2023-06-05 13:51:06.558 1 INFO heat-api [-] Starting Heat REST API on 0.0.0.0:8006
2023-06-05 13:51:06.558 1 INFO heat.common.wsgi [-] Starting single process server
2023-06-05 13:51:06.559 1 INFO eventlet.wsgi.server [-] (1) wsgi starting up on http://0.0.0.0:8006
2023-06-05 13:51:06.599 1 WARNING heat.common.config [-] stack_user_domain_id or stack_user_domain_name 
:
2023-06-05 13:51:07.110 1 DEBUG oslo_concurrency.lockutils [-] Acquired lock "singleton_lock" lock /usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py:266
2023-06-05 13:51:07.110 1 DEBUG oslo_concurrency.lockutils [-] Releasing lock "singleton_lock" lock /usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py:282
2023-06-05 13:51:07.110 1 INFO oslo_service.service [-] Starting 16 workers
:
2023-06-05 13:51:07.195 1 DEBUG oslo_service.service [-] ******************************************************************************** log_opt_values /usr/lib/python3.9/site-packages/oslo_config/cfg.py:2617
2023-06-05 13:51:37.152 2 INFO heat.engine.worker [-] Starting engine_worker (1.4) in engine UUID1.
2023-06-05 13:51:37.158 5 INFO heat.engine.worker [-] Starting engine_worker (1.4) in engine UUID2.
2023-06-05 13:51:37.163 6 INFO heat.engine.worker [-] Starting engine_worker (1.4) in engine UUID3.
2023-06-05 13:51:37.166 7 INFO heat.engine.worker [-] Starting engine_worker (1.4) in engine UUID4.
2023-06-05 13:51:37.168 3 INFO heat.engine.worker [-] Starting engine_worker (1.4) in engine UUID5.
2023-06-05 13:51:37.169 8 INFO heat.engine.worker [-] Starting engine_worker (1.4) in engine UUID6.
2023-06-05 13:51:37.169 9 INFO heat.engine.worker [-] Starting engine_worker (1.4) in engine UUID7.
2023-06-05 13:51:37.173 10 INFO heat.engine.worker [-] Starting engine_worker (1.4) in engine UUID8.
2023-06-05 13:51:37.173 11 INFO heat.engine.worker [-] Starting engine_worker (1.4) in engine UUID9.
2023-06-05 13:51:37.179 4 INFO heat.engine.worker [-] Starting engine_worker (1.4) in engine UUID10.
2023-06-05 13:51:37.181 13 INFO heat.engine.worker [-] Starting engine_worker (1.4) in engine UUID11.
2023-06-05 13:51:37.182 14 INFO heat.engine.worker [-] Starting engine_worker (1.4) in engine UUID12.
2023-06-05 13:51:37.183 12 INFO heat.engine.worker [-] Starting engine_worker (1.4) in engine UUID13.
2023-06-05 13:51:37.184 15 INFO heat.engine.worker [-] Starting engine_worker (1.4) in engine UUID14.
2023-06-05 13:51:37.187 16 INFO heat.engine.worker [-] Starting engine_worker (1.4) in engine UUID15.
2023-06-05 13:51:37.188 17 INFO heat.engine.worker [-] Starting engine_worker (1.4) in engine UUID16.
~~~

However, wait_for_message_queue() assumes that ephemeral-heat will launch and create its queue within 10 seconds.
~~~
    # The retry decorators below come from the tenacity library;
    # subprocess is from the Python standard library.
    @retry(retry=retry_if_exception_type(HeatPodMessageQueueException),
           reraise=True,
           stop=(stop_after_delay(10) | stop_after_attempt(10)),
           wait=wait_fixed(0.5))
    def wait_for_message_queue(self):
        queue_name = 'engine.' + EPHEMERAL_HEAT_POD_NAME
        output = subprocess.check_output([
            'sudo', 'podman', 'exec', 'rabbitmq',
            'rabbitmqctl', 'list_queues'])
        if str(output).count(queue_name) < 1:
            msg = "Message queue for ephemeral heat not created in time."
            raise HeatPodMessageQueueException(msg)
~~~

Also, the fixed wait of 0.5 seconds seems aggressive: with wait_fixed(0.5), ten attempts finish in roughly five seconds, so stop_after_attempt(10) gives up well before the ten-second delay limit is even reached.
I think we should increase the values passed to wait_fixed() and stop_after_delay(), or make them configurable.
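Comment 1 below proposes a 60-second window with a retry every second. As a minimal sketch of that relaxed polling policy (plain Python rather than the tenacity decorator; the function name, signature, and injectable clock/sleep are illustrative, not the actual tripleoclient code):

~~~python
import time

def wait_for_condition(check, timeout=60.0, interval=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Call check() every `interval` seconds until it returns True or
    `timeout` seconds have elapsed.  Returns True on success, False if
    the deadline passes first.  The 60s/1s defaults mirror the values
    proposed in the upstream patch."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
~~~

With `check` wired to the `rabbitmqctl list_queues` lookup, an environment that needs ~30 seconds to create the queue would succeed, whereas the current 10-attempt/0.5-second policy gives up after about five seconds.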

Version-Release number of selected component (if applicable):
OSP17.0

How reproducible:
Every time, in environments where the ephemeral-heat rabbitmq queue takes longer than 10 seconds to appear.

Steps to Reproduce:
1. Create an undercloud and run an overcloud deploy.

Actual results:
The overcloud deploy fails.

Expected results:
The overcloud deploy succeeds.

Additional info:

Comment 1 James Slagle 2023-06-08 11:03:58 UTC
I started an upstream patch to change the wait to 60s, retrying every 1s.
https://review.opendev.org/c/openstack/python-tripleoclient/+/885580

Comment 2 Keigo Noha 2023-06-09 02:05:17 UTC
Hi James,

Thank you for your work on this bugzilla and upstream. From the support side, the same inquiry may be raised by many customers once we ship OSP17.1.
Could you please add the change to OSP17.1 GA as an exception?

Best regards,
Keigo Noha

Comment 24 errata-xmlrpc 2024-11-21 09:32:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: RHOSP 17.1.4 (openstack-tripleo-common and python-tripleoclient) security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:9990