Bug 1727947

Summary: [OSP13] Sporadic error 504 when running a deploy on large clouds
Product: Red Hat OpenStack
Reporter: Alex Schultz <aschultz>
Component: python-tripleoclient
Assignee: Alex Schultz <aschultz>
Status: CLOSED ERRATA
QA Contact: Sasha Smolyak <ssmolyak>
Severity: medium
Docs Contact:
Priority: medium
Version: 13.0 (Queens)
CC: hbrock, jslagle, mburns
Target Milestone: ---
Keywords: Triaged, ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: python-tripleoclient-9.2.7-11.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-09-03 16:55:34 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Alex Schultz 2019-07-08 15:59:12 UTC
This bug was initially created as a copy of Bug #1674054

I am copying this bug because: 



Description of problem:
During a deploy, heatclient's poll_for_events is called periodically. In large environments, retrieving the events can take quite a long time [1]; once a request exceeds haproxy's 2-minute timeout, haproxy returns a 504 and tripleoclient quits.

We can work around this issue by raising the haproxy timeout, but that is not desirable because the change is easily forgotten when managing multiple clouds.
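
For reference, the workaround amounts to raising the Heat API timeouts in haproxy on the undercloud. This is only an illustrative sketch: the `listen heat_api` section name, the file path, and the 10-minute value are assumptions about a typical TripleO undercloud layout, not a recommended setting.

~~~
# /etc/haproxy/haproxy.cfg on the undercloud -- illustrative only
listen heat_api
  # existing bind/server lines unchanged
  timeout client 10m
  timeout server 10m
~~~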

We believe tripleoclient shouldn't fail when it gets a 504; it should retry instead. Either that, or we should optimize the way we poll for events.
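
A minimal sketch of the retry idea, not the fix that actually shipped: wrap the event-polling call and retry on a 504 instead of exiting. `poll_stack_events` is a hypothetical stand-in for whatever helper tripleoclient uses to poll Heat, and the 504 check is deliberately defensive because heatclient may surface the error as a generic HTTPException.

~~~
import time

from heatclient import exc as heat_exc

RETRIES = 5    # illustrative values only
WAIT = 30      # seconds between attempts


def call_with_504_retry(fn):
    """Call fn(); if haproxy answers with a 504, wait and try again."""
    for attempt in range(1, RETRIES + 1):
        try:
            return fn()
        except heat_exc.HTTPException as err:
            # heatclient may not have a dedicated 504 class, so check both the
            # code attribute and the message, and re-raise anything that is
            # not a gateway timeout (or the final failed attempt).
            code = getattr(err, 'code', None)
            is_504 = code == 504 or '504' in str(err)
            if not is_504 or attempt == RETRIES:
                raise
            time.sleep(WAIT)


# Hypothetical usage while waiting for the overcloud stack:
# call_with_504_retry(lambda: poll_stack_events(orchestration_client, 'overcloud'))
~~~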


Version-Release number of selected component (if applicable):
python-heatclient-1.5.2-1.el7ost.noarch                     Thu May 24 18:44:01 2018
python-tripleoclient-5.4.6-1.el7ost.noarch                  Mon Feb  4 18:51:14 2019


How reproducible:
All the time

Steps to Reproduce:
1. Have hundreds of compute nodes
2. Deploy

Actual results:
tripleoclient will exit with a 504 because GET /v1/19235c66a3cc45c4a58349e1448a9d40/stacks/overcloud/4a40a99e-a258-440a-a5fe-ad3e276e30b1/resources?nested_depth=5 takes more than 2 minutes to complete. 

[1]
~~~
heat-api.log:2019-02-06 06:43:16.157 33257 DEBUG oslo_policy._cache_handler [req-19e6935e-cd27-4b68-9aa8-637de9226ac4 51eefe5b76b2405f990106af93c1c252 19235c66a3cc45c4a58349e1448a9d40 - default default] Reloading cached file /etc/heat/policy.json read_cached_file /usr/lib/python2.7/site-packages/oslo_policy/_cache_handler.py:38
heat-api.log:2019-02-06 06:43:16.202 33257 DEBUG oslo_policy.policy [req-19e6935e-cd27-4b68-9aa8-637de9226ac4 51eefe5b76b2405f990106af93c1c252 19235c66a3cc45c4a58349e1448a9d40 - default default] Reloaded policy file: /etc/heat/policy.json _load_policy_file /usr/lib/python2.7/site-packages/oslo_policy/policy.py:584
heat-api.log:2019-02-06 06:43:16.203 33257 DEBUG heat.common.wsgi [req-19e6935e-cd27-4b68-9aa8-637de9226ac4 51eefe5b76b2405f990106af93c1c252 19235c66a3cc45c4a58349e1448a9d40 - default default] Calling <heat.api.openstack.v1.resources.ResourceController object at 0x7f6b71bde290> : index __call__ /usr/lib/python2.7/site-packages/heat/common/wsgi.py:839
heat-api.log:2019-02-06 06:43:16.204 33257 DEBUG oslo_messaging._drivers.amqpdriver [req-19e6935e-cd27-4b68-9aa8-637de9226ac4 51eefe5b76b2405f990106af93c1c252 19235c66a3cc45c4a58349e1448a9d40 - default default] CALL msg_id: 3baa80f7d602487da77b5be8daf14887 exchange 'heat' topic 'engine' _send /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:568
heat-api.log:2019-02-06 06:45:41.162 33257 INFO eventlet.wsgi.server [req-19e6935e-cd27-4b68-9aa8-637de9226ac4 51eefe5b76b2405f990106af93c1c252 19235c66a3cc45c4a58349e1448a9d40 - default default] 192.168.8.1 - - [06/Feb/2019 06:45:41] "GET /v1/19235c66a3cc45c4a58349e1448a9d40/stacks/overcloud/4a40a99e-a258-440a-a5fe-ad3e276e30b1/resources?nested_depth=5 HTTP/1.1" 200 8436480 145.009166
heat-api.log:2019-02-06 06:45:41.900 33257 DEBUG heat.api.middleware.version_negotiation [req-19e6935e-cd27-4b68-9aa8-637de9226ac4 51eefe5b76b2405f990106af93c1c252 19235c66a3cc45c4a58349e1448a9d40 - default default] Processing request: GET /v1/19235c66a3cc45c4a58349e1448a9d40/stacks/4a40a99e-a258-440a-a5fe-ad3e276e30b1/events Accept: application/json process_request /usr/lib/python2.7/site-packages/heat/api/middleware/version_negotiation.py:50
heat-api.log:2019-02-06 06:45:41.900 33257 DEBUG heat.api.middleware.version_negotiation [req-19e6935e-cd27-4b68-9aa8-637de9226ac4 51eefe5b76b2405f990106af93c1c252 19235c66a3cc45c4a58349e1448a9d40 - default default] Matched versioned URI. Version: 1.0 process_request /usr/lib/python2.7/site-packages/heat/api/middleware/version_negotiation.py:65
~~~

Expected results:
We should either optimize this call, or tripleoclient should retry instead of failing.

Additional info:

Comment 10 errata-xmlrpc 2019-09-03 16:55:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2624