Description of problem:
After deploying an OSP 10 HA configuration with Jenkins and leaving it overnight, all overcloud API calls time out.

Version-Release number of selected component (if applicable):
10 (deployed on September 5th)

How reproducible:
Deploy with Jenkins' OSPD-Customized-Deployment HA configuration (controller:3, compute:2, ceph:3) with version 10:default:latest. After 12 hours the APIs stop responding.

Steps to Reproduce:
1. ssh to undercloud-0
2. source overcloudrc
3. nova list

Actual results:
[stack@undercloud-0 ~]$ nova list
No handlers could be found for logger "keystoneauth.identity.generic.base"
ERROR (GatewayTimeout): Gateway Timeout (HTTP 504)

Expected results:
Get a list of nova nodes

Additional info:
httpd seems to be stuck; ps axu | grep httpd | wc -l shows 263, which looks like it is being limited by the MaxClients parameter:

grep -Ri MaxClients /etc/httpd/conf*
/etc/httpd/conf.modules.d/prefork.conf: MaxClients 256

There is an upstream bug that seems to cover this issue.
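The check from the additional info above (httpd process count compared against MaxClients) could be scripted roughly as follows. This is only a sketch: the function names max_clients and check_saturation are made up for illustration, and it assumes the limit is set on a single "MaxClients <n>" line as in the prefork.conf shown in the report.

```shell
# Hypothetical helpers, not part of any OSP or httpd tooling.

# Print the first MaxClients value found in the given Apache config file.
# Assumes a line of the form "MaxClients <number>" (leading whitespace is fine).
max_clients() {
    awk '$1 == "MaxClients" { print $2; exit }' "$1"
}

# Print "saturated" if the process count has reached the limit, else "ok".
# $1 = number of httpd processes, $2 = MaxClients limit
check_saturation() {
    if [ "$1" -ge "$2" ]; then
        echo "saturated"
    else
        echo "ok"
    fi
}

# Example invocation on a controller (commented out so that sourcing this
# file has no side effects):
# check_saturation "$(pgrep -c httpd)" \
#     "$(max_clients /etc/httpd/conf.modules.d/prefork.conf)"
```

Using pgrep -c instead of ps axu | grep httpd | wc -l avoids counting the grep process itself. The count still includes the httpd parent process and any non-worker httpd processes, so a value slightly above MaxClients is expected when the worker pool is full.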
carlos, can you take a look at this one?
please move to ASSIGNED when you're able to start work on it
Reproduced. Was able to interact with overcloud after bouncing httpd on all controllers.
Provided upstream fix: https://review.openstack.org/#/c/374136/
(In reply to Carlos Camacho from comment #5)
> Provided upstream fix: https://review.openstack.org/#/c/374136/

carlos, thanks for the fix. can you add a link to the patch under External Trackers towards the top of this bug? and please move the bug to the ON_DEV state. thanks.
to summarize, this issue is isolated to low memory environments (8GB RAM). This is unlikely to cause issues in production.
(In reply to James Slagle from comment #9)
> to summarize, this issue is isolated to low memory environments (8GB RAM).
> This is unlikely to cause issues in production.

Not sure how this was assumed, but I can reproduce this problem on controllers with 32 GB of RAM, which is our minimum recommendation. It seems that the processes spawned by httpd are hung. strace shows the following for most of the spawned httpd processes:

Process 32595 attached
connect(24, {sa_family=AF_LOCAL, sun_path="/var/run/wsgi.9929.0.2.sock"}, 110

Setting MaxClients to 32 just makes this happen faster.
Hi,

Last week I tried to reproduce this bug in my local environment without luck. I did hit the issue before the upstream fix was submitted, but not anymore (following the official docs from tripleo.org). Can you give me some feedback on any additional configuration you might be deploying that deviates from the default parameters?
The issue has been gone for the last few builds. Last verified on the 2016-10-25.2 build.
Hi,

The issue is seen on many QE setups.
This is usually seen after about 12-24 hours.
Both BM & Virt setups.

Ofer
please add your output in the bug
Cannot reproduce the issue on 2016-11-2.2 build. Tried it for 2 days now, it looks stable to me. Had this issue on 2016-10-31.2 though.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-2948.html