Description of problem:

Deployed a brand new overcloud in a job running overnight, only to discover a gateway timeout on the overcloud in the morning, requiring a restart of httpd on all controllers to get a working overcloud again. If I deploy a new overcloud and am available to begin testing it right away, I typically do not see this issue. If a cloud sits unused for several hours, the issue reappears. Clouds under load (with some sort of browbeat config running against them) do not exhibit this issue.

[stack@undercloud ~]$ . overcloudrc
[stack@undercloud ~]$ openstack endpoint list
Discovering versions from the identity service failed when creating the password plugin. Attempting to determine version from URL.
Gateway Timeout (HTTP 504)
...
[root@overcloud-controller-0 log]# systemctl restart httpd
...
[root@overcloud-controller-1 log]# systemctl restart httpd
...
[root@overcloud-controller-2 log]# systemctl restart httpd
...
[stack@undercloud ~]$ openstack endpoint list
+----------------------------------+-----------+--------------+-----------------+
| ID                               | Region    | Service Name | Service Type    |
+----------------------------------+-----------+--------------+-----------------+
| 2ee2c5cbd0754634b301be50dd05628c | regionOne | keystone     | identity        |
....
+----------------------------------+-----------+--------------+-----------------+

Version-Release number of selected component (if applicable):
Builds 2016-08-29.1 and 2016-08-30.1, and possibly other builds, as other members of the performance team have also run into this.

How reproducible:
This appears to take some time to occur. It always occurs whenever I build a cloud overnight so that I have a fresh one in the morning to test against. I initially suspected something wrong with the haproxy configuration, but I am not an haproxy expert.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Note that I filed this against tripleo because I suspect it is a configuration issue with the overcloud's haproxy or httpd. It could be something else, so please adjust accordingly.
> Clouds under load (with some sort of browbeat config running against them) do not exhibit this issue.

This is no longer true. I have now seen this appear on clouds under load as well.
Reviewing more data on this issue shows that httpd slowly grows its number of processes until it hits MaxClients/ServerLimit (which defaults to 256). A graph of the httpd process counts makes it pretty clear that the processes are not closing or terminating.

In another test with mongo as the ceilometer backend, this issue does not occur, so I suspect it is something to do with gnocchi being the configured backend while not being fully configured to run (see https://bugzilla.redhat.com/show_bug.cgi?id=1372508). I have since disabled the gnocchi api and begun an intensive keystone test, one that was previously always rudely interrupted by this timeout after several hours of benchmarking, to see if the issue still occurs.
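A minimal sketch (mine, not from the report) of how the growth can be watched on a controller: count the httpd processes and compare against the 256 MaxClients/ServerLimit ceiling mentioned above. Run it periodically (e.g. from cron or a watch loop) to reproduce the graph attached to this bug.

```shell
# Count processes by command name; emits 0 when none are running.
count_procs() {
    # ps -C matches on the command name; wc -l counts the matching lines
    ps -C "$1" --no-headers 2>/dev/null | wc -l
}

limit=256   # default MaxClients/ServerLimit per the report
current=$(count_procs httpd)
echo "httpd processes: ${current}/${limit}"
if [ "$current" -ge "$limit" ]; then
    echo "httpd is at its worker ceiling; new requests will queue and time out"
fi
```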
Created attachment 1198749 [details] Graph displaying all three controller's counts of httpd processes (when the overcloud has no load applied)
As a workaround, disabling the gnocchi api prevents the condition where httpd grows its process count. There are also several errors in the gnocchi api log indicating the service is most likely not completely set up, which is probably what causes the condition in this bug.
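A hedged sketch of applying that workaround across the controllers. The service name openstack-gnocchi-api and the heat-admin ssh user are assumptions for this release (a pacemaker-managed service would need pcs instead), so the script only prints the per-node commands for review rather than executing them:

```shell
# Emit (do not run) the command to stop and disable the gnocchi API on
# each node passed in. Service name openstack-gnocchi-api is an assumption.
gnocchi_disable_cmds() {
    for node in "$@"; do
        echo "ssh heat-admin@${node} 'sudo systemctl stop openstack-gnocchi-api && sudo systemctl disable openstack-gnocchi-api'"
    done
}

gnocchi_disable_cmds overcloud-controller-0 overcloud-controller-1 overcloud-controller-2
```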
+1. This is affecting my OSP 10 environment too, needing an httpd restart every few hours.
So this may have been fixed with https://bugzilla.redhat.com/show_bug.cgi?id=1372821, but due to the scale and load testing Alex has used, we can't test that at our level of QA. Alex, can you retest this and move it to VERIFIED if everything is OK now?
This bug is actually a symptom of the haproxy config missing the AUTH statement (https://bugzilla.redhat.com/show_bug.cgi?id=1371657). Fixing the haproxy config or disabling gnocchi-api results in httpd no longer growing its process count on the builds that exhibited this issue.

Regardless of the above, with the haproxy config fixed, I have not been able to reproduce this issue with the latest OSP 10 builds that I have worked with:

2016-09-20.2
2016-10-06.1
2016-10-18.2

Thus I am marking it as verified.
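For context on what the haproxy fix looks like: assuming the referenced bug concerns a password-protected redis backend whose health check never authenticates, a corrected check sequence would follow haproxy's standard tcp-check pattern, something like the fragment below. The addresses and password are placeholders and this is an illustration of the haproxy syntax, not the actual tripleo template:

```
listen redis
  bind 192.0.2.10:6379
  option tcp-check
  # Authenticate first so the check succeeds against a password-protected redis
  tcp-check send AUTH\ replace-with-redis-password\r\n
  tcp-check expect string +OK
  tcp-check send PING\r\n
  tcp-check expect string +PONG
  server overcloud-controller-0 192.0.2.11:6379 check inter 1s
```

Without the AUTH line, every health check fails against an auth-enabled redis, which is consistent with backends being marked down and requests piling up behind them.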
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-2948.html