Bug 1628319
| Summary: | overcloud deployment received 504 Gateway Time-out from mistral | |||
|---|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | mlammon | |
| Component: | openstack-mistral | Assignee: | RHOS Maint <rhos-maint> | |
| Status: | CLOSED ERRATA | QA Contact: | mlammon | |
| Severity: | urgent | Docs Contact: | ||
| Priority: | urgent | |||
| Version: | 14.0 (Rocky) | CC: | abishop, agurenko, brad, bschmaus, jinzha, jjoyce, jlibosva, jschluet, jslagle, jtaleric, mbarnett, mburns, mcornea, mlammon, ohochman, rbrady, rhos-maint, slinaber, smalleni, therve, tvignaud, yprokule | |
| Target Milestone: | beta | Keywords: | Triaged | |
| Target Release: | 14.0 (Rocky) | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | openstack-mistral-7.0.3-0.20180919133910.4f3792f.el7ost | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1638350 1638474 (view as bug list) | Environment: | ||
| Last Closed: | 2019-01-11 11:53:06 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1638474 | |||
Description
mlammon
2018-09-12 18:12:52 UTC
I'd start by checking the haproxy logs, and by re-running the deployment command with --debug, which will hopefully show which service it is attempting to communicate with when the 504 error is thrown. Otherwise, I'd suggest the heat, keystone, mistral, and swift logs as a first pass.
I did reproduce this error during one of my OSP 14 deployments.
Full traceback is:
clean_up DeployOvercloud: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/osc_lib/shell.py", line 135, in run
ret_val = super(OpenStackShell, self).run(argv)
File "/usr/lib/python2.7/site-packages/cliff/app.py", line 281, in run
result = self.run_subcommand(remainder)
File "/usr/lib/python2.7/site-packages/osc_lib/shell.py", line 175, in run_subcommand
ret_value = super(OpenStackShell, self).run_subcommand(argv)
File "/usr/lib/python2.7/site-packages/cliff/app.py", line 402, in run_subcommand
result = cmd.run(parsed_args)
File "/usr/lib/python2.7/site-packages/tripleoclient/command.py", line 25, in run
super(Command, self).run(parsed_args)
File "/usr/lib/python2.7/site-packages/osc_lib/command/command.py", line 41, in run
return super(Command, self).run(parsed_args)
File "/usr/lib/python2.7/site-packages/cliff/command.py", line 184, in run
return_code = self.take_action(parsed_args) or 0
File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 867, in take_action
self._deploy_tripleo_heat_templates_tmpdir(stack, parsed_args)
File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 358, in _deploy_tripleo_heat_templates_tmpdir
new_tht_root, tht_root)
File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 451, in _deploy_tripleo_heat_templates
parsed_args.plan_environment_file)
File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 466, in _try_overcloud_deploy_with_compat_yaml
plan_env_file)
File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 215, in _heat_deploy
stack_name, env, moved_files, tht_root)
File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 291, in _process_and_upload_environment
parameters=params)
File "/usr/lib/python2.7/site-packages/tripleoclient/workflows/parameters.py", line 20, in update_parameters
**input_)
File "/usr/lib/python2.7/site-packages/tripleoclient/workflows/base.py", line 25, in call_action
save_result=True, run_sync=True)
File "/usr/lib/python2.7/site-packages/mistralclient/api/v2/action_executions.py", line 44, in create
dump_json=True
File "/usr/lib/python2.7/site-packages/mistralclient/api/base.py", line 97, in _create
self._raise_api_exception(ex.response)
File "/usr/lib/python2.7/site-packages/mistralclient/api/base.py", line 160, in _raise_api_exception
error_message=error_data)
APIException: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>
END return value: 1
Looking at /var/log/messages, we see a few occurrences of 504 errors from haproxy logging:
1425406:Sep 20 08:11:46 osp14 haproxy[9]: 192.168.24.2:44778 [20/Sep/2018:08:09:46.797] mistral~ mistral/osp14.internalapi.localdomain 7/0/0/-1/120007 504 194 - - sH-- 64/0/0/0/0 0/0 "POST /v2/action_executions HTTP/1.1"
The "sH" is the log code for the session state:
s : the server-side timeout expired while waiting for the server to
send or receive data.
H : the proxy was waiting for complete, valid response HEADERS from the
server (HTTP only).
We hit an haproxy timeout waiting for the mistral server backend to respond.
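For reference, the timer that expired here is haproxy's server-side response timeout ("timeout server"). Below is a minimal, illustrative excerpt of where that knob lives; the section name, addresses, and values are hypothetical and not the shipped undercloud configuration, and raising the timeout would only mask a stuck mistral backend rather than fix it:

listen mistral
    bind 192.168.24.2:8989
    # "timeout server" is the timer behind the "s" in the "sH" termination state
    timeout server 2m
    server undercloud 192.168.24.1:8989 check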
I just hit this, too, with a tripleo-quickstart deployment of rocky. The initial deployment went fine, but failed (504 gateway timeout) when I tried doing an update a couple days later. Rebooting the undercloud (I reached quickly for a big hammer) resolved the problem for me. I was working on re-producing this to get logs. Alan, if you have logs as James , perhaps you can add them to this bug report. Thanks! The issue in my environment at least was due to an undersized undercloud. I was using a virtual undercloud with only 2vcpus and 16G ram. Once I bumped that up to 4vcpus I didn't see it again. Please let me know those specs on any environment you are able to reproduce. note that our documented minimum is 8 cores and 16GB ram. Let me know if you were using at least that when the issue occurred. My undercloud VM has 6 cores and 24GB of RAM. Unfortunately, I don't have logs and my setup is in a delicate state while I finish work on something else. If the 504 gateway timeout occurs again, then I'll spend time to poke at it a bit and collect logs. Whoops, didn't mean to clear the needinfo request for mlammon's specs. Our deployments have been using (4) cpu for undercloud. will need to test with this newer requirement. Infrared patch to bump the undercloud vCPU number to 8: https://review.gerrithub.io/#/c/redhat-openstack/infrared/+/426387/ this appears to be a problem with mistral as restarting the mistral containers fixes the issue. we've seen the same upstream as well: https://bugs.launchpad.net/tripleo/+bug/1789680 reproduced issue again today after sitting idle for over 2 days, with higher number of cpu (8)
RESP: [504] Cache-Control: no-cache Connection: close Content-Type: text/html
RESP BODY: Omitted, Content-Type is set to text/html. Only application/json responses have their bodies logged.
Request returned failure status: 504
<html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/cliff/app.py", line 402, in run_subcommand
result = cmd.run(parsed_args)
File "/usr/lib/python2.7/site-packages/tripleoclient/command.py", line 25, in run
super(Command, self).run(parsed_args)
File "/usr/lib/python2.7/site-packages/osc_lib/command/command.py", line 41, in run
return super(Command, self).run(parsed_args)
File "/usr/lib/python2.7/site-packages/cliff/command.py", line 184, in run
return_code = self.take_action(parsed_args) or 0
File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 867, in take_action
self._deploy_tripleo_heat_templates_tmpdir(stack, parsed_args)
File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 358, in _deploy_tripleo_heat_templates_tmpdir
new_tht_root, tht_root)
File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 370, in _deploy_tripleo_heat_templates
plans = plan_management.list_deployment_plans(self.clients)
File "/usr/lib/python2.7/site-packages/tripleoclient/workflows/plan_management.py", line 130, in list_deployment_plans
workflow_input=workflow_input
File "/usr/lib/python2.7/site-packages/tripleoclient/workflows/base.py", line 39, in start_workflow
workflow_input=workflow_input
File "/usr/lib/python2.7/site-packages/mistralclient/api/v2/executions.py", line 65, in create
return self._create('/executions', data)
File "/usr/lib/python2.7/site-packages/mistralclient/api/base.py", line 97, in _create
self._raise_api_exception(ex.response)
File "/usr/lib/python2.7/site-packages/mistralclient/api/base.py", line 160, in _raise_api_exception
error_message=error_data)
APIException: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>
clean_up DeployOvercloud: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>
END return value: 1
Here's one example of the traceback in mistral API logs:
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base [req-25311392-35f5-4132-ba1a-4bfe85cd4f4c 0dab91a92ca04d8095546db82b23af59 46018b30cd6a456b9f337e2f6fb74e38 - default default] Caught a messaging remote error. See details of the original exception.: MessagingTimeout: Timed out waiting for a reply to message ID ab6192e8d76343359806d97aa8ea732b
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base Traceback (most recent call last):
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base File "/usr/lib/python2.7/site-packages/mistral/rpc/base.py", line 99, in decorator
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base return method(*args, **kwargs)
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base File "/usr/lib/python2.7/site-packages/mistral/rpc/clients.py", line 141, in start_workflow
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base params=params
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base File "/usr/lib/python2.7/site-packages/mistral/rpc/oslo/oslo_client.py", line 38, in sync_call
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base **kwargs
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 179, in call
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base retry=self.retry)
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 133, in _send
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base retry=retry)
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 584, in send
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base call_monitor_timeout, retry=retry)
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 573, in _send
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base call_monitor_timeout)
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in wait
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base message = self.waiters.get(msg_id, timeout=timeout)
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 336, in get
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base 'to message ID %s' % msg_id)
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base MessagingTimeout: Timed out waiting for a reply to message ID ab6192e8d76343359806d97aa8ea732b
2018-09-25 05:07:52.017 17110 ERROR mistral.rpc.base
2018-09-25 05:07:52.019 17110 ERROR wsme.api [req-25311392-35f5-4132-ba1a-4bfe85cd4f4c 0dab91a92ca04d8095546db82b23af59 46018b30cd6a456b9f337e2f6fb74e38 - default default] Server-side error: "MessagingTimeout: Timed out waiting for a reply to message ID ab6192e8d76343359806d97aa8ea732b". Detail:
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/wsmeext/pecan.py", line 85, in callfunction
result = f(self, *args, **kwargs)
File "/usr/lib/python2.7/site-packages/mistral/api/controllers/v2/execution.py", line 277, in post
**result_exec_dict.get('params', {})
File "/usr/lib/python2.7/site-packages/mistral/rpc/base.py", line 115, in decorator
_wrap_exception_and_reraise(e)
File "/usr/lib/python2.7/site-packages/mistral/rpc/base.py", line 84, in _wrap_exception_and_reraise
raise exc.MistralException(message)
MistralException: MessagingTimeout: Timed out waiting for a reply to message ID ab6192e8d76343359806d97aa8ea732b
: MistralException: MessagingTimeout: Timed out waiting for a reply to message ID ab6192e8d76343359806d97aa8ea732b
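The server-side trace above shows mistral-api timing out on a synchronous RPC call to the mistral engine over RabbitMQ. As a rough, hypothetical Python sketch of the failure mode discussed in the next comment (this is not mistral's actual code; the topic, method, and argument names are made up for illustration), the problem pattern is a process that creates its oslo.messaging connection and then forks:

import os

import oslo_messaging as messaging
from oslo_config import cfg

conf = cfg.CONF
# Connection and reply-queue state get created in the parent process here.
transport = messaging.get_rpc_transport(conf)
client = messaging.RPCClient(transport, messaging.Target(topic='mistral_engine'))

pid = os.fork()  # e.g. a web server forking API workers after import
if pid == 0:
    # The child inherits the parent's AMQP connection, so the reply to this
    # call can be consumed by the parent (which logs "No calling threads
    # waiting for msg_id ...") while the child blocks until MessagingTimeout.
    client.call({}, 'start_workflow',
                workflow_identifier='tripleo.plan_management.v1.list_plans')

When that happens, haproxy eventually gives up waiting on mistral-api and returns the 504 seen by the client.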
Oh.
2018-09-25 05:05:52.006 17110 DEBUG mistral.api.controllers.v2.execution [req-25311392-35f5-4132-ba1a-4bfe85cd4f4c 0dab91a92ca04d8095546db82b23af59 46018b30cd6a456b9f337e2f6fb74e38 - default default] Create execution [execution=Execution [id='Unset', workflow_id='Unset', workflow_name='tripleo.plan_management.v1.list_plans', workflow_namespace='Unset', description='', params='Unset', task_execution_id='Unset', root_execution_id='Unset', source_execution_id='Unset', state='Unset', state_info='Unset', input='Unset', output='Unset', created_at='Unset', updated_at='Unset', project_id='Unset']] post /usr/lib/python2.7/site-packages/mistral/api/controllers/v2/execution.py:225
2018-09-25 05:05:52.006 17110 DEBUG mistral.api.controllers.v2.execution [req-25311392-35f5-4132-ba1a-4bfe85cd4f4c 0dab91a92ca04d8095546db82b23af59 46018b30cd6a456b9f337e2f6fb74e38 - default default] Generated execution id [exec_id=1ee833f2-3083-49e5-8464-fca0f0148dba] post /usr/lib/python2.7/site-packages/mistral/api/controllers/v2/execution.py:233
2018-09-25 05:05:52.008 17110 WARNING oslo.messaging._drivers.impl_rabbit [req-25311392-35f5-4132-ba1a-4bfe85cd4f4c 0dab91a92ca04d8095546db82b23af59 46018b30cd6a456b9f337e2f6fb74e38 - default default] Process forked after connection established! This can result in unpredictable behavior. See: https://docs.openstack.org/oslo.messaging/latest/reference/transport.html
2018-09-25 05:05:52.073 1 INFO oslo_messaging._drivers.amqpdriver [-] No calling threads waiting for msg_id : ab6192e8d76343359806d97aa8ea732b
I think I found the root cause. See https://review.openstack.org/605633 for a potential fix.
*** Bug 1636412 has been marked as a duplicate of this bug. ***
Mike, can you verify this bug? The patch has landed and was cherry-picked to rocky, queens, and pike.
This can now be marked verified. Environment:
(undercloud) [stack@undercloud-0 ~]$ yum list | grep openstack-mistral
openstack-mistral-all.noarch 7.0.3-0.20180919133910.4f3792f.el7ost
openstack-mistral-api.noarch 7.0.3-0.20180919133910.4f3792f.el7ost
openstack-mistral-common.noarch 7.0.3-0.20180919133910.4f3792f.el7ost
openstack-mistral-engine.noarch 7.0.3-0.20180919133910.4f3792f.el7ost
openstack-mistral-event-engine.noarch
openstack-mistral-executor.noarch
(undercloud) [stack@undercloud-0 ~]$ yum info openstack-mistral-all
Loaded plugins: search-disabled-repos
Available Packages
Name : openstack-mistral-all
Arch : noarch
Version : 7.0.3
Release : 0.20180919133910.4f3792f.el7ost
Size : 3.6 k
Repo : rhelosp-14.0-puddle/x86_64
Summary : OpenStack Mistral All-in-one daemon
URL : https://launchpad.net/mistral
License : ASL 2.0
Description : OpenStack Mistral All service.
: .
: This package contains the mistral api, engine, and executor service as
: an all-in-one process.
Tested fix and no longer able to reproduce. Steps:
1. deploy undercloud
2. upload images
3. introspect nodes, tagging
4. prepare overcloud, etc. (In this case a period of 24 hours transpired before kicking off the overcloud deploy, to see if we would get the error.)
5. deploy overcloud
Seeing this with OSP 14 puddle 2018-10-30.6. Has the fix landed yet? I checked the code on the mistral_api container and it contained the patch referenced in this BZ, yet I am seeing the following errors in the mistral API logs. Any pointers?
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base Traceback (most recent call last):
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base File "/usr/lib/python2.7/site-packages/mistral/rpc/base.py", line 99, in decorator
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base return method(*args, **kwargs)
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base File "/usr/lib/python2.7/site-packages/mistral/rpc/clients.py", line 163, in start_action
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base params=params
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base File "/usr/lib/python2.7/site-packages/mistral/rpc/oslo/oslo_client.py", line 38, in sync_call
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base **kwargs
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 179, in call
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base retry=self.retry)
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 133, in _send
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base retry=retry)
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 584, in send
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base call_monitor_timeout, retry=retry)
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 573, in _send
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base call_monitor_timeout)
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in wait
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base message = self.waiters.get(msg_id, timeout=timeout)
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 336, in get
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base 'to message ID %s' % msg_id)
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base MessagingTimeout: Timed out waiting for a reply to message ID 202190faccde450a98ce863ad1c9ff36
2018-11-04 22:11:28.905 26 ERROR mistral.rpc.base
2018-11-04 22:11:28.909 26 ERROR wsme.api [req-15dd9128-e5f0-4016-840a-313c5171aff4 f75430b12a2247cf990a7a1e25e81b5c d9b2c711458e44eca60f592ebb81ed31 - default default] Server-side error: "MessagingTimeout: Timed out waiting for a reply to message ID 202190faccde450a98ce863ad1c9ff36". Detail:
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/wsmeext/pecan.py", line 85, in callfunction
result = f(self, *args, **kwargs)
File "/usr/lib/python2.7/site-packages/mistral/api/controllers/v2/action_execution.py", line 167, in post
**params
File "/usr/lib/python2.7/site-packages/mistral/rpc/base.py", line 115, in decorator
_wrap_exception_and_reraise(e)
File "/usr/lib/python2.7/site-packages/mistral/rpc/base.py", line 84, in _wrap_exception_and_reraise
raise exc.MistralException(message)
MistralException: MessagingTimeout: Timed out waiting for a reply to message ID 202190faccde450a98ce863ad1c9ff36
Based on the above comments, can we reopen the BZ?

Thanks for the clarification, Brad. How do I verify whether it is the same underlying issue? Is this what I should be looking for in the logs: "No calling threads waiting for msg_id"?

Thanks to therve, it is a different issue with a misleading error. I filed https://bugzilla.redhat.com/show_bug.cgi?id=1646415 for that.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHEA-2019:0045

My workaround:
1. Change /etc/hosts on the director. Attach the director hostname to 127.0.0.1, like this:
"127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4 osp16-director-01 osp16-director-01.example.com"
2. Restart RabbitMQ: systemctl restart tripleo_rabbitmq.service

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.