Bug 1647566

Summary: overcloud deployment received 504 Gateway Time-out
Product: Red Hat OpenStack
Component: openstack-tripleo-common
Version: 14.0 (Rocky)
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Keywords: TestOnly, Triaged, ZStream
Target Milestone: ---
Target Release: 14.0 (Rocky)
Reporter: Benjamin Schmaus <bschmaus>
Assignee: Adriano Petrich <apetrich>
QA Contact: Artem Hrechanychenko <ahrechan>
Docs Contact:
CC: bschmaus, ccopello, emacchi, jjoyce, jschluet, mburns, mcornea, slinaber, tvignaud
Whiteboard:
Fixed In Version: openstack-tripleo-common-9.4.1-0.20181012010880.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Type: Bug
Last Closed: 2019-03-18 13:03:13 UTC
Attachments:
  Deploy Logs (no flags)
  Templates used (no flags)
  Mistral Logs (no flags)

Description Benjamin Schmaus 2018-11-07 19:39:16 UTC
Description of problem: When trying to deploy overcloud with latest puddle: RH7-RHOS-14.0 2018-11-06.3  I end up with a gateway timeout error.




Version-Release number of selected component (if applicable):
OSP14

How reproducible:
100% in a DCI Red Hat hardware lab; can be reproduced for anyone on the fly.

Steps to Reproduce:
1. Install undercloud with TLS
2. Deploy overcloud with TLS using the following:
openstack overcloud deploy --templates --stack overcloud -e /usr/share/openstack-tripleo-heat-templates/environments/docker.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/docker-ha.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible-external.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/tls-endpoints-public-ip.yaml -e /home/stack/templates/custom.yaml -e /home/stack/templates/docker_registry.yaml -e /home/stack/templates/enable-tls.yaml -e /home/stack/templates/inject-trust-anchor.yaml -t 120 --debug 2>&1 | tee deploy.log


Actual results:
Error 504 and timeout. The overcloud plan gets created, but no Heat stack is started and hence no overcloud is deployed.

Expected results:
Overcloud deployment should succeed

Additional info:

Comment 1 Benjamin Schmaus 2018-11-07 19:39:34 UTC
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server [req-a860645d-f26e-478d-924a-5e16d7351164 711cecebd3364fff84095d0d4e851397 b9de7c02071b4437879b5a00e8608329 - default default] Exception during message handling: MessagingTimeout: Timed out waiting for a reply to message ID c9a09aba5df04446851c10277f6a7c38
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 163, in _process_incoming
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 265, in dispatch
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 194, in _do_dispatch
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/mistral/engine/engine_server.py", line 137, in start_action
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server     **params
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/mistral/db/utils.py", line 88, in decorate
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server     return retry.call(_with_auth_context, auth_ctx, func, *args, **kw)
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/tenacity/__init__.py", line 330, in call
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server     start_time=start_time)
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/tenacity/__init__.py", line 279, in iter
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server     return fut.result()
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/concurrent/futures/_base.py", line 422, in result
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server     return self.__get_result()
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/tenacity/__init__.py", line 333, in call
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server     result = fn(*args, **kwargs)
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/mistral/db/utils.py", line 45, in _with_auth_context
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server     return func(*args, **kw)
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/mistral/engine/action_queue.py", line 94, in decorate
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server     res = func(*args, **kw)
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/mistral/engine/default_engine.py", line 104, in start_action
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server     timeout=timeout)
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 159, in wrapper
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server     result = f(*args, **kwargs)
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/mistral/engine/actions.py", line 301, in run
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server     timeout=timeout
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 159, in wrapper
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server     result = f(*args, **kwargs)
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/mistral/rpc/clients.py", line 384, in run_action
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server     return rpc_client_method(auth_ctx.ctx(), 'run_action', **rpc_kwargs)
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/mistral/rpc/oslo/oslo_client.py", line 38, in sync_call
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server     **kwargs
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 179, in call
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server     retry=self.retry)
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 133, in _send
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server     retry=retry)
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 584, in send
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server     call_monitor_timeout, retry=retry)
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 573, in _send
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server     call_monitor_timeout)
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in wait
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server     message = self.waiters.get(msg_id, timeout=timeout)
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 336, in get
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server     'to message ID %s' % msg_id)
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server MessagingTimeout: Timed out waiting for a reply to message ID c9a09aba5df04446851c10277f6a7c38
mistral/engine.log:2018-11-07 19:24:55.321 1 ERROR oslo_messaging.rpc.server 
mistral/api.log:2018-11-07 19:24:55.236 26 ERROR mistral.rpc.base [req-a860645d-f26e-478d-924a-5e16d7351164 711cecebd3364fff84095d0d4e851397 b9de7c02071b4437879b5a00e8608329 - default default] Caught a messaging remote error. See details of the original exception.: MessagingTimeout: Timed out waiting for a reply to message ID 166a68eb61cf41efbfbdf9cc13de5438
mistral/api.log:2018-11-07 19:24:55.236 26 ERROR mistral.rpc.base Traceback (most recent call last):
mistral/api.log:2018-11-07 19:24:55.236 26 ERROR mistral.rpc.base   File "/usr/lib/python2.7/site-packages/mistral/rpc/base.py", line 99, in decorator
mistral/api.log:2018-11-07 19:24:55.236 26 ERROR mistral.rpc.base     return method(*args, **kwargs)
mistral/api.log:2018-11-07 19:24:55.236 26 ERROR mistral.rpc.base   File "/usr/lib/python2.7/site-packages/mistral/rpc/clients.py", line 163, in start_action
mistral/api.log:2018-11-07 19:24:55.236 26 ERROR mistral.rpc.base     params=params
mistral/api.log:2018-11-07 19:24:55.236 26 ERROR mistral.rpc.base   File "/usr/lib/python2.7/site-packages/mistral/rpc/oslo/oslo_client.py", line 38, in sync_call
mistral/api.log:2018-11-07 19:24:55.236 26 ERROR mistral.rpc.base     **kwargs
mistral/api.log:2018-11-07 19:24:55.236 26 ERROR mistral.rpc.base   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 179, in call
mistral/api.log:2018-11-07 19:24:55.236 26 ERROR mistral.rpc.base     retry=self.retry)
mistral/api.log:2018-11-07 19:24:55.236 26 ERROR mistral.rpc.base   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 133, in _send
mistral/api.log:2018-11-07 19:24:55.236 26 ERROR mistral.rpc.base     retry=retry)
mistral/api.log:2018-11-07 19:24:55.236 26 ERROR mistral.rpc.base   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 584, in send
mistral/api.log:2018-11-07 19:24:55.236 26 ERROR mistral.rpc.base     call_monitor_timeout, retry=retry)
mistral/api.log:2018-11-07 19:24:55.236 26 ERROR mistral.rpc.base   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 573, in _send
mistral/api.log:2018-11-07 19:24:55.236 26 ERROR mistral.rpc.base     call_monitor_timeout)
mistral/api.log:2018-11-07 19:24:55.236 26 ERROR mistral.rpc.base   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in wait
mistral/api.log:2018-11-07 19:24:55.236 26 ERROR mistral.rpc.base     message = self.waiters.get(msg_id, timeout=timeout)
mistral/api.log:2018-11-07 19:24:55.236 26 ERROR mistral.rpc.base   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 336, in get
mistral/api.log:2018-11-07 19:24:55.236 26 ERROR mistral.rpc.base     'to message ID %s' % msg_id)
mistral/api.log:2018-11-07 19:24:55.236 26 ERROR mistral.rpc.base MessagingTimeout: Timed out waiting for a reply to message ID 166a68eb61cf41efbfbdf9cc13de5438
mistral/api.log:2018-11-07 19:24:55.236 26 ERROR mistral.rpc.base 
mistral/api.log:2018-11-07 19:24:55.245 26 ERROR wsme.api [req-a860645d-f26e-478d-924a-5e16d7351164 711cecebd3364fff84095d0d4e851397 b9de7c02071b4437879b5a00e8608329 - default default] Server-side error: "MessagingTimeout: Timed out waiting for a reply to message ID 166a68eb61cf41efbfbdf9cc13de5438". Detail:

Comment 2 Benjamin Schmaus 2018-11-07 19:43:08 UTC
Created attachment 1503104 [details]
Deploy Logs

Comment 3 Benjamin Schmaus 2018-11-07 19:43:33 UTC
Created attachment 1503105 [details]
Templates used

Comment 4 Benjamin Schmaus 2018-11-07 19:43:56 UTC
Created attachment 1503106 [details]
Mistral Logs

Comment 5 Thomas Hervé 2018-11-07 22:56:19 UTC
The issue is with the tripleo.parameters.update action. It's run as a synchronous action, so it's subject to Mistral's default RPC timeout (120s), but in this case it takes a little longer (126s to be precise, in the logs we see).

It's hard to know why it takes so long. The vast majority of the time seems to be spent in the action itself and in the Swift calls. We can see a log line:

Template: overcloud.yaml

and then nothing until 75s later, which means we spend this time in plan_utils.build_env_paths; that sounds suspicious.

We need to either investigate where the time goes and cut it down, or stop calling the action synchronously.
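
For anyone who wants to confirm they are hitting this, a minimal sketch for inspecting the slow action execution on the undercloud (assuming the stack user's credentials and the Mistral OSC plugin are available; the execution ID is a placeholder):

  source ~/stackrc
  # a tripleo.parameters.update entry stuck in RUNNING, or whose
  # created_at/updated_at delta exceeds ~120s, matches this bug
  openstack action execution list
  openstack action execution show <execution-id>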

Comment 6 Benjamin Schmaus 2018-11-08 12:29:59 UTC
Just a note: the undercloud in this case has 14 cores and 32 GB of memory.

I'll be testing puddle RH7-RHOS-14.0 2018-11-07.2 this morning to see if I get different results.

Also, is there any way I can work around this in the meantime?

Comment 7 Thomas Hervé 2018-11-08 13:57:49 UTC
As a short-term workaround you can raise rpc_response_timeout in mistral.conf to 240. Would it be possible to access your environment to debug it further?
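
A rough sketch of that workaround on the containerized undercloud (assuming the Mistral config is kept under /var/lib/config-data/puppet-generated/ as for other OSP 14 containerized services; adjust the path if your layout differs):

  # /var/lib/config-data/puppet-generated/mistral/etc/mistral/mistral.conf
  [DEFAULT]
  rpc_response_timeout = 240

  # restart the Mistral containers so they pick up the change
  docker restart mistral_api mistral_engine mistral_executor mistral_event_engine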

Comment 8 Thomas Hervé 2018-11-08 20:55:58 UTC
Alright, I've pushed a bunch of fixes for that bug.

The big hammer is https://review.openstack.org/616556, which turns the action into a workflow, so we won't care as much about the execution time.

https://review.openstack.org/616679 and https://review.openstack.org/616683 are two performance enhancements which ought to improve things. I wasn't able to test them on the environment (the issue needs enough services enabled to appear).

Comment 9 Benjamin Schmaus 2018-11-08 23:17:53 UTC
I attempted to set the rpc_response_timeout in /usr/share/openstack-tripleo-heat-templates/puppet/services/mistral-base.yaml

sed -i -e 's/mistral::rpc_response_timeout: 120/mistral::rpc_response_timeout: 720/g' /usr/share/openstack-tripleo-heat-templates/puppet/services/mistral-base.yaml

Then I initiated the undercloud install, moved on to the overcloud deployment, and still ended up with the gateway timeout.

It appears to be set:

[root@ks-oseosp2-ospd ~]# docker ps|grep mistral
5729c89ed5c6        10.19.139.251:5000/rhosp14/openstack-mistral-api:latest                 "kolla_start"            About an hour ago   Up About an hour (healthy)                         mistral_api
f72d59309e99        10.19.139.251:5000/rhosp14/openstack-mistral-engine:latest              "kolla_start"            About an hour ago   Up About an hour (healthy)                         mistral_engine
deab24685fd6        10.19.139.251:5000/rhosp14/openstack-mistral-event-engine:latest        "kolla_start"            About an hour ago   Up About an hour (healthy)                         mistral_event_engine
266454b91577        10.19.139.251:5000/rhosp14/openstack-mistral-executor:latest            "kolla_start"            About an hour ago   Up About an hour (healthy)                         mistral_executor
[root@ks-oseosp2-ospd ~]# docker exec -ti f72d59309e99 cat /etc/mistral/mistral.conf|grep rpc
rpc_response_timeout=720
[root@ks-oseosp2-ospd ~]# docker exec -ti deab24685fd6 cat /etc/mistral/mistral.conf|grep rpc
rpc_response_timeout=720
[root@ks-oseosp2-ospd ~]# docker exec -ti 266454b91577 cat /etc/mistral/mistral.conf|grep rpc
rpc_response_timeout=720
[root@ks-oseosp2-ospd ~]# docker exec -ti 5729c89ed5c6 cat /etc/mistral/mistral.conf|grep rpc
rpc_response_timeout=720
[root@ks-oseosp2-ospd ~]#
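
For reference, an alternative that avoids editing the packaged template would presumably be an undercloud hieradata override, using the same hiera key the sed command above touches (file name hypothetical):

  # /home/stack/undercloud-hiera.yaml
  mistral::rpc_response_timeout: 720

  # undercloud.conf, before re-running the undercloud install
  [DEFAULT]
  hieradata_override = /home/stack/undercloud-hiera.yaml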

Comment 10 Thomas Hervé 2018-11-09 07:48:45 UTC
OK, I don't know then. I'd need to look with the change applied.

Comment 11 Benjamin Schmaus 2018-11-09 16:08:37 UTC
Thomas, the patch you applied on the undercloud seems to have resolved the issue. The deploy failed for another reason, but that is outside of Mistral :)

Can you provide the code bits you updated?

Comment 14 Artem Hrechanychenko 2019-01-28 10:49:02 UTC
Hi Benjamin,
can you provide a link describing how you configured TLS for the undercloud? I know how to configure TLS everywhere for the overcloud nodes, but I don't know anything about TLS for the undercloud.
Thanks!

Comment 15 Marius Cornea 2019-02-01 15:54:48 UTC
I wasn't able to reproduce this issue with the latest OSP 14 build (openstack-tripleo-common-9.4.1-0.20181012010888.el7ost.noarch). Please re-open if you hit this issue again.
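
A quick way to check whether an undercloud already carries the fix is to compare the installed package against the Fixed In Version noted above:

  rpm -q openstack-tripleo-common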

Comment 17 Mikey Ariel 2019-02-20 12:44:16 UTC
If this bug requires doc text for errata release, please set the 'Doc Type' and provide draft text according to the template in the 'Doc Text' field. The documentation team will review, edit, and approve the text.

If this bug does not require doc text, please set the 'requires_doc_text' flag to -.

Comment 20 errata-xmlrpc 2019-03-18 13:03:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0446