Description of problem:
Overcloud deployment received a 504 Gateway Time-out.

Steps to Reproduce:
1. deploy undercloud
2. upload images
3. introspect nodes, tagging
4. prepare overcloud, etc. (in this case 40+ hours passed before kicking off the overcloud deploy, at which point we got the 504 timeout)
5. deploy overcloud

Looked at the undercloud containers and found none had failed.

Deploy command:

#!/bin/bash
openstack overcloud deploy \
  --timeout 100 \
  --templates /usr/share/openstack-tripleo-heat-templates \
  --stack overcloud \
  --libvirt-type kvm \
  --ntp-server clock.redhat.com \
  -e /home/stack/virt/config_lvm.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
  -e /home/stack/virt/network/network-environment.yaml \
  -e /home/stack/virt/inject-trust-anchor.yaml \
  -e /home/stack/virt/hostnames.yml \
  -e /home/stack/virt/debug.yaml \
  -e /home/stack/virt/nodes_data.yaml \
  -e /home/stack/virt/docker-images.yaml \
  --log-file overcloud_deployment_46.log

Actual results:

bash overcloud_deploy.sh
<html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>

Expected results:
Overcloud deployed successfully
I'd start by checking the haproxy logs and re-running the deployment command with --debug, which should show which service the client is talking to when the 504 is thrown. Otherwise, I'd suggest the heat, keystone, mistral, and swift logs as a first pass.
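For example, something along these lines (a rough sketch; the grep patterns are my own, and on this undercloud haproxy logs to /var/log/messages, as the next comment shows):

# re-run the deploy with client-side debug output and keep a copy
openstack overcloud deploy --debug ... 2>&1 | tee overcloud_deploy_debug.log

# look for 504s emitted by haproxy and note which backend they mention
sudo grep haproxy /var/log/messages | grep ' 504 '

(The "..." stands for the rest of the arguments in overcloud_deploy.sh.)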
I did reproduce this error during one of my osp14 deployments. Full traceback is:

clean_up DeployOvercloud: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/osc_lib/shell.py", line 135, in run
    ret_val = super(OpenStackShell, self).run(argv)
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 281, in run
    result = self.run_subcommand(remainder)
  File "/usr/lib/python2.7/site-packages/osc_lib/shell.py", line 175, in run_subcommand
    ret_value = super(OpenStackShell, self).run_subcommand(argv)
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 402, in run_subcommand
    result = cmd.run(parsed_args)
  File "/usr/lib/python2.7/site-packages/tripleoclient/command.py", line 25, in run
    super(Command, self).run(parsed_args)
  File "/usr/lib/python2.7/site-packages/osc_lib/command/command.py", line 41, in run
    return super(Command, self).run(parsed_args)
  File "/usr/lib/python2.7/site-packages/cliff/command.py", line 184, in run
    return_code = self.take_action(parsed_args) or 0
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 867, in take_action
    self._deploy_tripleo_heat_templates_tmpdir(stack, parsed_args)
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 358, in _deploy_tripleo_heat_templates_tmpdir
    new_tht_root, tht_root)
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 451, in _deploy_tripleo_heat_templates
    parsed_args.plan_environment_file)
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 466, in _try_overcloud_deploy_with_compat_yaml
    plan_env_file)
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 215, in _heat_deploy
    stack_name, env, moved_files, tht_root)
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 291, in _process_and_upload_environment
    parameters=params)
  File "/usr/lib/python2.7/site-packages/tripleoclient/workflows/parameters.py", line 20, in update_parameters
    **input_)
  File "/usr/lib/python2.7/site-packages/tripleoclient/workflows/base.py", line 25, in call_action
    save_result=True, run_sync=True)
  File "/usr/lib/python2.7/site-packages/mistralclient/api/v2/action_executions.py", line 44, in create
    dump_json=True
  File "/usr/lib/python2.7/site-packages/mistralclient/api/base.py", line 97, in _create
    self._raise_api_exception(ex.response)
  File "/usr/lib/python2.7/site-packages/mistralclient/api/base.py", line 160, in _raise_api_exception
    error_message=error_data)
APIException: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>
END return value: 1

Looking at /var/log/messages, we see a few occurrences of 504 errors from haproxy logging:

1425406:Sep 20 08:11:46 osp14 haproxy[9]: 192.168.24.2:44778 [20/Sep/2018:08:09:46.797] mistral~ mistral/osp14.internalapi.localdomain 7/0/0/-1/120007 504 194 - - sH-- 64/0/0/0/0 0/0 "POST /v2/action_executions HTTP/1.1"

The "sH" is the log code for the session state:
  s : the server-side timeout expired while waiting for the server to send or receive data.
  H : the proxy was waiting for complete, valid response HEADERS from the server (HTTP only).

We hit an haproxy timeout waiting for the mistral server backend to respond. (The timing fields bear this out: the request was accepted at 08:09:46.797 and the 504 was logged at 08:11:46, i.e. roughly 120 seconds later the server timeout expired without mistral returning response headers.)
I just hit this, too, with a tripleo-quickstart deployment of rocky. The initial deployment went fine, but failed (504 gateway timeout) when I tried doing an update a couple days later. Rebooting the undercloud (I reached quickly for a big hammer) resolved the problem for me.
I was working on reproducing this to get logs. Alan, if you have logs, perhaps you can add them to this bug report, as James suggested. Thanks!
The issue in my environment, at least, was due to an undersized undercloud. I was using a virtual undercloud with only 2 vCPUs and 16 GB of RAM. Once I bumped that up to 4 vCPUs I didn't see it again. Please let me know the specs of any environment where you are able to reproduce this.
Note that our documented minimum is 8 cores and 16 GB of RAM. Let me know if you were using at least that when the issue occurred.
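A quick sanity check against that minimum, run on the undercloud itself (just a sketch using standard tools):

nproc    # documented minimum: 8 cores
free -g  # documented minimum: 16 GB of RAM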
My undercloud VM has 6 cores and 24 GB of RAM. Unfortunately, I don't have logs, and my setup is in a delicate state while I finish work on something else. If the 504 gateway timeout occurs again, I'll spend some time poking at it and collect logs.
Whoops, didn't mean to clear the needinfo request for mlammon's specs.
Our deployments have been using 4 CPUs for the undercloud. We will need to test with this newer requirement.
Infrared patch to bump the undercloud vCPU number to 8: https://review.gerrithub.io/#/c/redhat-openstack/infrared/+/426387/
This appears to be a problem with mistral, as restarting the mistral containers fixes the issue. We've seen the same upstream as well: https://bugs.launchpad.net/tripleo/+bug/1789680
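For reference, bouncing the mistral containers on an OSP 14 / Rocky undercloud looks roughly like this (a sketch; it assumes the containers run under docker and match a "mistral" name filter, which may differ per release):

# list the mistral containers, then restart them
sudo docker ps --filter name=mistral
sudo docker restart $(sudo docker ps -q --filter name=mistral)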
Reproduced the issue again today after the environment sat idle for over 2 days, with a higher number of CPUs (8).

RESP: [504] Cache-Control: no-cache Connection: close Content-Type: text/html
RESP BODY: Omitted, Content-Type is set to text/html. Only application/json responses have their bodies logged.
Request returned failure status: 504
<html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 402, in run_subcommand
    result = cmd.run(parsed_args)
  File "/usr/lib/python2.7/site-packages/tripleoclient/command.py", line 25, in run
    super(Command, self).run(parsed_args)
  File "/usr/lib/python2.7/site-packages/osc_lib/command/command.py", line 41, in run
    return super(Command, self).run(parsed_args)
  File "/usr/lib/python2.7/site-packages/cliff/command.py", line 184, in run
    return_code = self.take_action(parsed_args) or 0
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 867, in take_action
    self._deploy_tripleo_heat_templates_tmpdir(stack, parsed_args)
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 358, in _deploy_tripleo_heat_templates_tmpdir
    new_tht_root, tht_root)
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 370, in _deploy_tripleo_heat_templates
    plans = plan_management.list_deployment_plans(self.clients)
  File "/usr/lib/python2.7/site-packages/tripleoclient/workflows/plan_management.py", line 130, in list_deployment_plans
    workflow_input=workflow_input
  File "/usr/lib/python2.7/site-packages/tripleoclient/workflows/base.py", line 39, in start_workflow
    workflow_input=workflow_input
  File "/usr/lib/python2.7/site-packages/mistralclient/api/v2/executions.py", line 65, in create
    return self._create('/executions', data)
  File "/usr/lib/python2.7/site-packages/mistralclient/api/base.py", line 97, in _create
    self._raise_api_exception(ex.response)
  File "/usr/lib/python2.7/site-packages/mistralclient/api/base.py", line 160, in _raise_api_exception
    error_message=error_data)
APIException: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>
clean_up DeployOvercloud: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>
END return value: 1
Here's one example of the traceback in the mistral API logs:

2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base [req-25311392-35f5-4132-ba1a-4bfe85cd4f4c 0dab91a92ca04d8095546db82b23af59 46018b30cd6a456b9f337e2f6fb74e38 - default default] Caught a messaging remote error. See details of the original exception.: MessagingTimeout: Timed out waiting for a reply to message ID ab6192e8d76343359806d97aa8ea732b
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base Traceback (most recent call last):
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base   File "/usr/lib/python2.7/site-packages/mistral/rpc/base.py", line 99, in decorator
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base     return method(*args, **kwargs)
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base   File "/usr/lib/python2.7/site-packages/mistral/rpc/clients.py", line 141, in start_workflow
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base     params=params
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base   File "/usr/lib/python2.7/site-packages/mistral/rpc/oslo/oslo_client.py", line 38, in sync_call
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base     **kwargs
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 179, in call
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base     retry=self.retry)
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 133, in _send
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base     retry=retry)
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 584, in send
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base     call_monitor_timeout, retry=retry)
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 573, in _send
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base     call_monitor_timeout)
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in wait
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base     message = self.waiters.get(msg_id, timeout=timeout)
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 336, in get
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base     'to message ID %s' % msg_id)
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base MessagingTimeout: Timed out waiting for a reply to message ID ab6192e8d76343359806d97aa8ea732b
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base
2018-09-25 05:07:52.019 17110 ERROR wsme.api [req-25311392-35f5-4132-ba1a-4bfe85cd4f4c 0dab91a92ca04d8095546db82b23af59 46018b30cd6a456b9f337e2f6fb74e38 - default default] Server-side error: "MessagingTimeout: Timed out waiting for a reply to message ID ab6192e8d76343359806d97aa8ea732b".
Detail:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/wsmeext/pecan.py", line 85, in callfunction
    result = f(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/mistral/api/controllers/v2/execution.py", line 277, in post
    **result_exec_dict.get('params', {})
  File "/usr/lib/python2.7/site-packages/mistral/rpc/base.py", line 115, in decorator
    _wrap_exception_and_reraise(e)
  File "/usr/lib/python2.7/site-packages/mistral/rpc/base.py", line 84, in _wrap_exception_and_reraise
    raise exc.MistralException(message)
MistralException: MessagingTimeout: Timed out waiting for a reply to message ID ab6192e8d76343359806d97aa8ea732b
: MistralException: MessagingTimeout: Timed out waiting for a reply to message ID ab6192e8d76343359806d97aa8ea732b
Oh.

2018-09-25 05:05:52.006 17110 DEBUG mistral.api.controllers.v2.execution [req-25311392-35f5-4132-ba1a-4bfe85cd4f4c 0dab91a92ca04d8095546db82b23af59 46018b30cd6a456b9f337e2f6fb74e38 - default default] Create execution [execution=Execution [id='Unset', workflow_id='Unset', workflow_name='tripleo.plan_management.v1.list_plans', workflow_namespace='Unset', description='', params='Unset', task_execution_id='Unset', root_execution_id='Unset', source_execution_id='Unset', state='Unset', state_info='Unset', input='Unset', output='Unset', created_at='Unset', updated_at='Unset', project_id='Unset']] post /usr/lib/python2.7/site-packages/mistral/api/controllers/v2/execution.py:225
2018-09-25 05:05:52.006 17110 DEBUG mistral.api.controllers.v2.execution [req-25311392-35f5-4132-ba1a-4bfe85cd4f4c 0dab91a92ca04d8095546db82b23af59 46018b30cd6a456b9f337e2f6fb74e38 - default default] Generated execution id [exec_id=1ee833f2-3083-49e5-8464-fca0f0148dba] post /usr/lib/python2.7/site-packages/mistral/api/controllers/v2/execution.py:233
2018-09-25 05:05:52.008 17110 WARNING oslo.messaging._drivers.impl_rabbit [req-25311392-35f5-4132-ba1a-4bfe85cd4f4c 0dab91a92ca04d8095546db82b23af59 46018b30cd6a456b9f337e2f6fb74e38 - default default] Process forked after connection established! This can result in unpredictable behavior. See: https://docs.openstack.org/oslo.messaging/latest/reference/transport.html
2018-09-25 05:05:52.073 1 INFO oslo_messaging._drivers.amqpdriver [-] No calling threads waiting for msg_id : ab6192e8d76343359806d97aa8ea732b

In other words: the API worker (pid 17110) forked after its RabbitMQ connection was established, and the reply for that message ID was picked up by a different process (pid 1) with no caller waiting for it, so the original request hung until haproxy returned the 504.
I think I found the root cause. https://review.openstack.org/605633 for a potential fix.
*** Bug 1636412 has been marked as a duplicate of this bug. ***
Mike, can you verify this bug? The patch has landed and was cherry-picked to rocky, queens, and pike.
This can now be marked verified.

Environment:

(undercloud) [stack@undercloud-0 ~]$ yum list | grep openstack-mistral
openstack-mistral-all.noarch            7.0.3-0.20180919133910.4f3792f.el7ost
openstack-mistral-api.noarch            7.0.3-0.20180919133910.4f3792f.el7ost
openstack-mistral-common.noarch         7.0.3-0.20180919133910.4f3792f.el7ost
openstack-mistral-engine.noarch         7.0.3-0.20180919133910.4f3792f.el7ost
openstack-mistral-event-engine.noarch
openstack-mistral-executor.noarch

(undercloud) [stack@undercloud-0 ~]$ yum info openstack-mistral-all
Loaded plugins: search-disabled-repos
Available Packages
Name        : openstack-mistral-all
Arch        : noarch
Version     : 7.0.3
Release     : 0.20180919133910.4f3792f.el7ost
Size        : 3.6 k
Repo        : rhelosp-14.0-puddle/x86_64
Summary     : OpenStack Mistral All-in-one daemon
URL         : https://launchpad.net/mistral
License     : ASL 2.0
Description : OpenStack Mistral All service.
            :
            : This package contains the mistral api, engine, and executor service as
            : an all-in-one process.

Tested fix and no longer able to reproduce.

Steps:
1. deploy undercloud
2. upload images
3. introspect nodes, tagging
4. prepare overcloud, etc. (in this case a period of 24 hours transpired before kicking off the overcloud deploy to see if we get the error)
5. deploy overcloud
Seeing this with OSP 14 puddle 2018-10-30.6. Has the fix landed yet?
I checked the code on the mistral_api container and it contained the patch referenced in this BZ, yet I am seeing the following errors in the mistral API logs. Any pointers?

2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base Traceback (most recent call last):
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base   File "/usr/lib/python2.7/site-packages/mistral/rpc/base.py", line 99, in decorator
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base     return method(*args, **kwargs)
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base   File "/usr/lib/python2.7/site-packages/mistral/rpc/clients.py", line 163, in start_action
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base     params=params
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base   File "/usr/lib/python2.7/site-packages/mistral/rpc/oslo/oslo_client.py", line 38, in sync_call
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base     **kwargs
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 179, in call
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base     retry=self.retry)
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 133, in _send
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base     retry=retry)
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 584, in send
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base     call_monitor_timeout, retry=retry)
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 573, in _send
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base     call_monitor_timeout)
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in wait
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base     message = self.waiters.get(msg_id, timeout=timeout)
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 336, in get
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base     'to message ID %s' % msg_id)
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base MessagingTimeout: Timed out waiting for a reply to message ID 202190faccde450a98ce863ad1c9ff36
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base
2018-11-04 22:11:28.909 26 ERROR wsme.api [req-15dd9128-e5f0-4016-840a-313c5171aff4 f75430b12a2247cf990a7a1e25e81b5c d9b2c711458e44eca60f592ebb81ed31 - default default] Server-side error: "MessagingTimeout: Timed out waiting for a reply to message ID 202190faccde450a98ce863ad1c9ff36".

Detail:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/wsmeext/pecan.py", line 85, in callfunction
    result = f(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/mistral/api/controllers/v2/action_execution.py", line 167, in post
    **params
  File "/usr/lib/python2.7/site-packages/mistral/rpc/base.py", line 115, in decorator
    _wrap_exception_and_reraise(e)
  File "/usr/lib/python2.7/site-packages/mistral/rpc/base.py", line 84, in _wrap_exception_and_reraise
    raise exc.MistralException(message)
MistralException: MessagingTimeout: Timed out waiting for a reply to message ID 202190faccde450a98ce863ad1c9ff36
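For anyone comparing environments, this is roughly how the mistral version inside the container can be checked (a sketch; the container name mistral_api is taken from the comment above, and docker may be podman on other releases):

sudo docker exec mistral_api rpm -q openstack-mistral-api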
Based on the above comments, can we reopen this BZ?
Thanks for the clarification, Brad. How do I verify whether it is the same underlying issue? Is this what I should be looking for in the logs: "No calling threads waiting for msg_id"?
Thanks to therve, it is a different issue with a misleading error. I filed https://bugzilla.redhat.com/show_bug.cgi?id=1646415 for that.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:0045
My workaround:

1. Edit /etc/hosts on the director and add the director hostname to the 127.0.0.1 entry, like this:
   "127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4 osp16-director-01 osp16-director-01.example.com"
2. Restart RabbitMQ:
   systemctl restart tripleo_rabbitmq.service
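As commands, that workaround would look roughly like this (a sketch; the hostnames come from the example above and the sed expression assumes a single existing 127.0.0.1 line, so adjust for your director):

sudo sed -i 's/^127\.0\.0\.1.*/127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4 osp16-director-01 osp16-director-01.example.com/' /etc/hosts
sudo systemctl restart tripleo_rabbitmq.service
sudo systemctl status tripleo_rabbitmq.service   # confirm the service came back up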
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days