Bug 2213408 - wait_for_message_queue() in heat_launcher.py sets an aggressive timeout.
Summary: wait_for_message_queue() in heat_launcher.py sets an aggressive timeout.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-tripleoclient
Version: 17.0 (Wallaby)
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: z4
Target Release: 17.1
Assignee: James Slagle
QA Contact: David Rosenfeld
URL:
Whiteboard:
Depends On:
Blocks: 2222869
 
Reported: 2023-06-08 05:19 UTC by Keigo Noha
Modified: 2024-11-21 09:32 UTC
CC: 7 users

Fixed In Version: python-tripleoclient-16.5.1-17.1.20240913100806.f3599d0.el9ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-11-21 09:32:13 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 885580 0 None ABANDONED Increase retry time to ephemeral Heat message queue 2024-03-04 12:22:15 UTC
Red Hat Issue Tracker OSP-25678 0 None None None 2023-06-08 05:23:51 UTC
Red Hat Product Errata RHSA-2024:9990 0 None None None 2024-11-21 09:32:19 UTC

Description Keigo Noha 2023-06-08 05:19:26 UTC
Description of problem:
wait_for_message_queue() in heat_launcher.py sets an aggressive timeout.

In some environments, ephemeral Heat takes about 30 seconds to launch its heat-engine workers.
~~~
2023-06-05 13:51:06.526 1 DEBUG heat-api-noauth [-] ******************************************************************************** log_opt_values /usr/lib/python3.9/site-packages/oslo_config/cfg.py:2593
2023-06-05 13:51:06.527 1 DEBUG heat-api-noauth [-] Configuration options gathered from: log_opt_values /usr/lib/python3.9/site-packages/oslo_config/cfg.py:2594
2023-06-05 13:51:06.527 1 DEBUG heat-api-noauth [-] command line args: ['--config-file', '/etc/heat/heat.conf'] log_opt_values /usr/lib/python3.9/site-packages/oslo_config/cfg.py:2595
2023-06-05 13:51:06.527 1 DEBUG heat-api-noauth [-] config files: ['/etc/heat/heat.conf'] log_opt_values /usr/lib/python3.9/site-packages/oslo_config/cfg.py:2596
2023-06-05 13:51:06.527 1 DEBUG heat-api-noauth [-] ================================================================================ log_opt_values /usr/lib/python3.9/site-packages/oslo_config/cfg.py:2598
:
2023-06-05 13:51:06.558 1 INFO heat-api [-] Starting Heat REST API on 0.0.0.0:8006
2023-06-05 13:51:06.558 1 INFO heat.common.wsgi [-] Starting single process server
2023-06-05 13:51:06.559 1 INFO eventlet.wsgi.server [-] (1) wsgi starting up on http://0.0.0.0:8006
2023-06-05 13:51:06.599 1 WARNING heat.common.config [-] stack_user_domain_id or stack_user_domain_name 
:
2023-06-05 13:51:07.110 1 DEBUG oslo_concurrency.lockutils [-] Acquired lock "singleton_lock" lock /usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py:266
2023-06-05 13:51:07.110 1 DEBUG oslo_concurrency.lockutils [-] Releasing lock "singleton_lock" lock /usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py:282
2023-06-05 13:51:07.110 1 INFO oslo_service.service [-] Starting 16 workers
:
2023-06-05 13:51:07.195 1 DEBUG oslo_service.service [-] ******************************************************************************** log_opt_values /usr/lib/python3.9/site-packages/oslo_config/cfg.py:2617
2023-06-05 13:51:37.152 2 INFO heat.engine.worker [-] Starting engine_worker (1.4) in engine UUID1.
2023-06-05 13:51:37.158 5 INFO heat.engine.worker [-] Starting engine_worker (1.4) in engine UUID2.
2023-06-05 13:51:37.163 6 INFO heat.engine.worker [-] Starting engine_worker (1.4) in engine UUID3.
2023-06-05 13:51:37.166 7 INFO heat.engine.worker [-] Starting engine_worker (1.4) in engine UUID4.
2023-06-05 13:51:37.168 3 INFO heat.engine.worker [-] Starting engine_worker (1.4) in engine UUID5.
2023-06-05 13:51:37.169 8 INFO heat.engine.worker [-] Starting engine_worker (1.4) in engine UUID6.
2023-06-05 13:51:37.169 9 INFO heat.engine.worker [-] Starting engine_worker (1.4) in engine UUID7.
2023-06-05 13:51:37.173 10 INFO heat.engine.worker [-] Starting engine_worker (1.4) in engine UUID8.
2023-06-05 13:51:37.173 11 INFO heat.engine.worker [-] Starting engine_worker (1.4) in engine UUID9.
2023-06-05 13:51:37.179 4 INFO heat.engine.worker [-] Starting engine_worker (1.4) in engine UUID10.
2023-06-05 13:51:37.181 13 INFO heat.engine.worker [-] Starting engine_worker (1.4) in engine UUID11.
2023-06-05 13:51:37.182 14 INFO heat.engine.worker [-] Starting engine_worker (1.4) in engine UUID12.
2023-06-05 13:51:37.183 12 INFO heat.engine.worker [-] Starting engine_worker (1.4) in engine UUID13.
2023-06-05 13:51:37.184 15 INFO heat.engine.worker [-] Starting engine_worker (1.4) in engine UUID14.
2023-06-05 13:51:37.187 16 INFO heat.engine.worker [-] Starting engine_worker (1.4) in engine UUID15.
2023-06-05 13:51:37.188 17 INFO heat.engine.worker [-] Starting engine_worker (1.4) in engine UUID16.
~~~

However, wait_for_message_queue() assumes that ephemeral Heat will launch and create its queue within 10 seconds.
~~~
    @retry(retry=retry_if_exception_type(HeatPodMessageQueueException),
           reraise=True,
           stop=(stop_after_delay(10) | stop_after_attempt(10)),
           wait=wait_fixed(0.5))
    def wait_for_message_queue(self):
        queue_name = 'engine.' + EPHEMERAL_HEAT_POD_NAME
        output = subprocess.check_output([
            'sudo', 'podman', 'exec', 'rabbitmq',
            'rabbitmqctl', 'list_queues'])
        if str(output).count(queue_name) < 1:
            msg = "Message queue for ephemeral heat not created in time."
            raise HeatPodMessageQueueException(msg)
~~~
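
For context, these settings bound the total sleep time well below the ~30 seconds the engine workers needed in the log above. A quick back-of-the-envelope check in plain Python, using the values from the decorator:

```python
# stop_after_attempt(10) allows at most 10 calls, so wait_fixed(0.5)
# sleeps at most 9 times between them. Ignoring the time spent in the
# `podman exec` calls themselves, the decorator waits at most:
attempts = 10
wait_s = 0.5
max_total_wait = (attempts - 1) * wait_s  # 4.5 seconds
```

So even before stop_after_delay(10) can trigger, the attempt cap gives up after roughly 4.5 seconds of waiting, which cannot cover a 30-second worker startup.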

Also, the fixed wait of 0.5 seconds seems aggressive.
We should increase the values passed to wait_fixed() and stop_after_delay(), or make them configurable.
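
A minimal, dependency-free sketch of the relaxed policy (the 60 s total / 1 s wait figures come from the upstream patch mentioned in Comment 1). The flaky() function is a hypothetical stand-in for the rabbitmqctl query, and the demo timings are scaled down so the example runs quickly:

```python
import time

def retry_fixed(fn, *, max_delay=60.0, max_attempts=60, wait=1.0):
    """Call fn until it succeeds, giving up when either max_attempts
    calls have been made or the total elapsed time exceeds max_delay.
    This mirrors tenacity's stop_after_delay | stop_after_attempt
    combined with wait_fixed and reraise=True."""
    start = time.monotonic()
    attempt = 0
    while True:
        attempt += 1
        try:
            return fn()
        except Exception:
            if attempt >= max_attempts or time.monotonic() - start >= max_delay:
                raise  # reraise=True behaviour: surface the last error
            time.sleep(wait)

# Demo with scaled-down timings: the "queue" appears on the 3rd check.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("queue not created yet")
    return "engine.ephemeral-heat"

result = retry_fixed(flaky, max_delay=2.0, max_attempts=10, wait=0.01)
```

With a 60-attempt / 60-second budget and a 1-second wait, a 30-second worker startup fits comfortably inside the retry window.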

Version-Release number of selected component (if applicable):
OSP17.0

How reproducible:
Every time overcloud deploy runs before the RabbitMQ queue has been created.

Steps to Reproduce:
1. Create an undercloud and run overcloud deploy.

Actual results:
overcloud deploy fails.

Expected results:
overcloud deploy succeeds.

Additional info:

Comment 1 James Slagle 2023-06-08 11:03:58 UTC
I started an upstream patch to change the wait to 60s with a retry every 1s.
https://review.opendev.org/c/openstack/python-tripleoclient/+/885580

Comment 2 Keigo Noha 2023-06-09 02:05:17 UTC
Hi James,

Thank you for your work on this bugzilla and upstream. From the support side, the same inquiry may be raised by many customers once we ship OSP 17.1.
Could you please add the change into OSP17.1 GA as an exception?

Best regards,
Keigo Noha

Comment 24 errata-xmlrpc 2024-11-21 09:32:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: RHOSP 17.1.4 (openstack-tripleo-common and python-tripleoclient) security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:9990

