Bug 1372751 - Gateway Timeout (HTTP 504) requiring httpd restart
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 10.0 (Newton)
Assignee: Pradeep Kilambi
QA Contact: Arik Chernetsky
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-09-02 14:40 UTC by Alex Krzos
Modified: 2016-12-14 15:56 UTC (History)

Fixed In Version: openstack-tripleo-heat-templates-5.0.0-0.20161003064637.d636e3a.1.1.el7ost
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-12-14 15:56:10 UTC
Target Upstream Version:
Embargoed:


Attachments
Graph displaying all three controllers' counts of httpd processes (when the overcloud has no load applied) (67.27 KB, image/png)
2016-09-07 13:49 UTC, Alex Krzos


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHEA-2016:2948 | 0 | normal | SHIPPED_LIVE | Red Hat OpenStack Platform 10 enhancement update | 2016-12-14 19:55:27 UTC

Description Alex Krzos 2016-09-02 14:40:31 UTC
Description of problem:
I deployed a brand-new overcloud in a job that ran overnight and found a gateway timeout on the overcloud in the morning. Restarting httpd on all controllers was required to get a working overcloud again. If I deploy a new overcloud and am available to begin testing it right away, I typically do not see this issue.

If a cloud is left unused for several hours, the issue reappears. Clouds under load (with some sort of Browbeat config running against them) do not exhibit this issue.

[stack@undercloud ~]$ . overcloudrc
[stack@undercloud ~]$ openstack endpoint list
Discovering versions from the identity service failed when creating the password plugin. Attempting to determine version from URL.
Gateway Timeout (HTTP 504)

...
[root@overcloud-controller-0 log]# systemctl restart httpd
...
[root@overcloud-controller-1 log]# systemctl restart httpd
...
[root@overcloud-controller-2 log]# systemctl restart httpd
...

[stack@undercloud ~]$ openstack endpoint list
+----------------------------------+-----------+--------------+-----------------+
| ID                               | Region    | Service Name | Service Type    |
+----------------------------------+-----------+--------------+-----------------+
| 2ee2c5cbd0754634b301be50dd05628c | regionOne | keystone     | identity        |
....
+----------------------------------+-----------+--------------+-----------------+



Version-Release number of selected component (if applicable):
builds 2016-08-29.1, 2016-08-30.1
Possibly other builds as other members of the performance team have also run into this.

How reproducible:
This appears to take some time to occur. It always occurs when I build a cloud overnight so that I have a fresh one to test against in the morning. I initially suspected something was wrong with the haproxy configuration; however, I am not an haproxy expert.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
Note: I filed this against tripleo because I suspect it is a configuration issue with the overcloud's haproxy or httpd. It could be something else, so please adjust accordingly.

Comment 2 Alex Krzos 2016-09-06 17:43:08 UTC
> Clouds under load (with some sort of Browbeat config running against them) do not exhibit this issue.

This is no longer true.  I have now seen this appear on clouds under load as well.

Comment 3 Alex Krzos 2016-09-07 13:48:36 UTC
Reviewing more data on this issue shows that httpd slowly grows its number of processes until it hits MaxClients/ServerLimit (which defaults to 256). A graph of the httpd process counts makes it pretty clear that the processes are not closing or terminating.
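For anyone who wants to watch for the same growth on a controller, here is a minimal Python sketch of how the httpd process count could be sampled and compared against the limit. It is not the tooling used to produce the graph in this bug; the function names are hypothetical, the 256 limit is simply the default mentioned above, and it assumes a Linux `ps` that accepts `-eo comm=`.

```python
import subprocess

# Default MaxClients/ServerLimit cited in this comment; an assumption here,
# adjust it to whatever the controller's httpd configuration actually sets.
SERVER_LIMIT = 256


def count_processes(ps_output, name="httpd"):
    """Count lines of `ps -eo comm=` output that exactly match a command name."""
    return sum(1 for line in ps_output.splitlines() if line.strip() == name)


def usage_ratio(count, limit=SERVER_LIMIT):
    """Return the fraction of the worker limit currently in use."""
    return count / limit


def sample_httpd_count():
    """Run ps on the local host and return the current number of httpd processes."""
    out = subprocess.run(
        ["ps", "-eo", "comm="], capture_output=True, text=True, check=True
    ).stdout
    return count_processes(out)
```

Calling `sample_httpd_count()` periodically on each controller (for example from cron) and logging `usage_ratio()` of the result should show whether the count keeps climbing toward the limit while the cloud is idle.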

In another test on ceilometer with mongo as a backend, this issue does not occur, so I suspect it has something to do with gnocchi being the configured backend without being fully configured to run (see this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1372508).

I have since disabled the gnocchi api and begun an intensive keystone test, one that was always rudely interrupted by this timeout after several hours of benchmarking, to see whether this still occurs.

Comment 4 Alex Krzos 2016-09-07 13:49:40 UTC
Created attachment 1198749 [details]
Graph displaying all three controllers' counts of httpd processes (when the overcloud has no load applied)

Comment 5 Alex Krzos 2016-09-07 17:53:28 UTC
As a workaround, disabling the gnocchi api prevents the condition where httpd grows in processes. There are also several errors in the gnocchi api log, indicating that the service is most likely not completely set up and is thus causing the condition in this bug.

Comment 6 Sai Sindhur Malleni 2016-09-08 17:10:08 UTC
+1. This is affecting my OSP 10 environment too; it needs an httpd restart every few hours.

Comment 9 Julien Danjou 2016-10-20 13:48:32 UTC
So this may have been fixed with https://bugzilla.redhat.com/show_bug.cgi?id=1372821 but due to the scale and load testing Alex has used, we can't test that at our level of QA.

Alex, can you retest this and move it to VERIFIED if everything is OK now?

Comment 10 Alex Krzos 2016-10-21 11:54:40 UTC
This bug is actually a symptom of the haproxy config missing the AUTH statement (https://bugzilla.redhat.com/show_bug.cgi?id=1371657). Fixing the haproxy config or disabling gnocchi-api results in httpd no longer growing in processes on the builds that exhibited this issue.

Regardless of the above, with the haproxy config fixed, I have not been able to reproduce this issue with the latest OSP10 builds that I have worked with: 
2016-09-20.2
2016-10-06.1
2016-10-18.2

Thus I am marking it as verified.

Comment 12 errata-xmlrpc 2016-12-14 15:56:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html

