Bug 1373395

Summary: overcloud APIs do not respond after some time
Product: Red Hat OpenStack
Component: openstack-tripleo-heat-templates
Version: 10.0 (Newton)
Target Release: 10.0 (Newton)
Target Milestone: beta
Hardware: x86_64
OS: Linux
Reporter: Gurenko Alex <agurenko>
Assignee: Carlos Camacho <ccamacho>
QA Contact: Gurenko Alex <agurenko>
Status: CLOSED ERRATA
Severity: urgent
Priority: unspecified
Keywords: Triaged
Type: Bug
CC: agurenko, ahirshbe, ccamacho, dbecker, jcoufal, jschluet, jslagle, mburns, mcornea, mkrcmari, morazi, oblaut, rduartes, rhel-osp-director-maint, sasha, sclewis
Fixed In Version: openstack-tripleo-heat-templates-5.0.0-0.20160907212643.90c852e.2.el7ost
Clones: 1406417 (view as bug list)
Last Closed: 2016-12-14 15:57:03 UTC

Description Gurenko Alex 2016-09-06 07:47:08 UTC
Description of problem: After deploying an OSP 10 HA configuration with Jenkins and leaving it overnight, all overcloud API calls time out.


Version-Release number of selected component (if applicable):
10 (deployed on September 5th)

How reproducible: 
 Deploy the Jenkins OSPD-Customized-Deployment HA configuration (controller: 3, compute: 2, ceph: 3) with version 10:default:latest. After about 12 hours the APIs stop responding.

Steps to Reproduce:
 1. ssh to undercloud-0
 2. source overcloudrc
 3. nova list

Actual results:
[stack@undercloud-0 ~]$ nova list
No handlers could be found for logger "keystoneauth.identity.generic.base"
ERROR (GatewayTimeout): Gateway Timeout (HTTP 504)


Expected results:
 
 A list of nova instances is returned.

Additional info:
 httpd seems to be stuck; ps aux | grep httpd | wc -l shows 263 processes, which looks like it is being capped by the MaxClients parameter:

grep -Ri MaxClients /etc/httpd/conf*
/etc/httpd/conf.modules.d/prefork.conf:  MaxClients          256
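A quick way to check whether httpd has hit this ceiling is to compare the running worker count against the configured limit. A minimal sketch (the config path comes from the grep above; the pgrep/awk usage is illustrative, not part of any tooling in this report):

```shell
#!/bin/sh
# Sketch: compare running httpd workers against the prefork MaxClients limit.
# The config path is taken from the grep output above; adapt as needed.
conf=/etc/httpd/conf.modules.d/prefork.conf
limit=$(awk '/MaxClients/ {print $2; exit}' "$conf" 2>/dev/null)
workers=$(pgrep -c httpd)
echo "workers=${workers:-0} limit=${limit:-unset}"
# At or near the limit, new requests queue and API calls eventually time out.
if [ -n "$limit" ] && [ "${workers:-0}" -ge "$limit" ]; then
    echo "httpd appears saturated (worker count at MaxClients)"
fi
```

On the environment described above this would report 263 workers against a limit of 256, consistent with the saturation the reporter observed.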

There is an upstream bug that seems to cover this issue.

Comment 2 James Slagle 2016-09-13 20:27:36 UTC
carlos, can you take a look at this one?

Comment 3 James Slagle 2016-09-13 20:28:07 UTC
please move to ASSIGNED when you're able to start work on it

Comment 4 Alexander Chuzhoy 2016-09-14 13:35:07 UTC
Reproduced.
Was able to interact with overcloud after bouncing httpd on all controllers.
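For reference, the workaround above amounts to restarting httpd on each controller. A hedged sketch, where the controller hostnames and the heat-admin user are assumptions for a typical director deployment (the leading echo makes it a dry run):

```shell
#!/bin/sh
# Sketch of the workaround: bounce httpd on every controller.
# Hostnames and the heat-admin user are assumptions; remove the leading
# "echo" to actually run the restarts.
for host in controller-0 controller-1 controller-2; do
    echo ssh "heat-admin@${host}" "sudo systemctl restart httpd"
done
```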

Comment 5 Carlos Camacho 2016-09-21 15:07:04 UTC
Provided upstream fix: https://review.openstack.org/#/c/374136/

Comment 6 James Slagle 2016-09-22 13:47:24 UTC
(In reply to Carlos Camacho from comment #5)
> Provided upstream fix: https://review.openstack.org/#/c/374136/

carlos, thanks for the fix. can you add a link to the patch under External Trackers towards the top of this bug? and please move the bug to the ON_DEV state. thanks.

Comment 9 James Slagle 2016-09-22 16:21:31 UTC
to summarize, this issue is isolated to low memory environments (8GB RAM). This is unlikely to cause issues in production.

Comment 10 Marian Krcmarik 2016-09-23 00:17:00 UTC
(In reply to James Slagle from comment #9)
> to summarize, this issue is isolated to low memory environments (8GB RAM).
> This is unlikely to cause issues in production.

Not sure how this was assumed but I can reproduce this problem on controllers with 32 GB of RAM which is our minimal recommendation.
It seems that the httpd-spawned processes are hung.
strace shows the following for most of the spawned httpd processes:
Process 32595 attached
connect(24, {sa_family=AF_LOCAL, sun_path="/var/run/wsgi.9929.0.2.sock"}, 110
Setting MaxClients to 32 just makes this happen faster.
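One way to confirm which workers are wedged is to strace each httpd PID briefly and flag those blocked in connect() on a mod_wsgi socket. A sketch under assumptions (the socket-path pattern follows the strace line above; strace, timeout, and root access are assumed):

```shell
#!/bin/sh
# Sketch: flag httpd workers blocked in connect() on a mod_wsgi socket.
# Requires root and strace; the wsgi socket pattern matches the strace
# output quoted above.
for pid in $(pgrep httpd); do
    if timeout 2 strace -p "$pid" -e trace=connect 2>&1 \
            | grep -q 'wsgi.*\.sock'; then
        echo "PID ${pid} blocked connecting to a wsgi socket"
    fi
done
```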

Comment 11 Carlos Camacho 2016-10-10 13:01:32 UTC
Hi,

Last week I tried to reproduce this bug in my local environment without luck. I had the issue before the submitted upstream fix, but not anymore (following the official docs from tripleo.org). Can you provide some feedback about any additional configuration you might be deploying beyond the default parameters?

Comment 12 Gurenko Alex 2016-10-27 08:51:41 UTC
The issue is gone for the last few builds. Last verified on the 2016-10-25.2 build.

Comment 13 Ofer Blaut 2016-11-03 06:53:17 UTC
Hi

The issue is seen on many QE setups. It usually appears after about 12-24 hours, on both bare-metal and virt setups.

Ofer

Comment 14 Ofer Blaut 2016-11-03 16:15:22 UTC
please add your output in the bug

Comment 15 Gurenko Alex 2016-11-06 06:16:15 UTC
Cannot reproduce the issue on the 2016-11-2.2 build. Tried it for 2 days now; it looks stable to me. I did have this issue on the 2016-10-31.2 build, though.

Comment 17 errata-xmlrpc 2016-12-14 15:57:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html