Bug 1470363

Summary: [Neutron] Keystone authentication issues
Product: Red Hat OpenStack
Reporter: Joe Talerico <jtaleric>
Component: openstack-tripleo
Assignee: Brent Eagles <beagles>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Arik Chernetsky <achernet>
Severity: high
Priority: unspecified
Version: 12.0 (Pike)
CC: amuller, beagles, jtaleric, mburns, racedoro, rhel-osp-director-maint, rohara
Target Milestone: ---
Target Release: ---
Keywords: Reopened
Hardware: All
OS: All
Whiteboard: PerfScale
Last Closed: 2018-04-26 18:18:17 UTC
Type: Bug

Description Joe Talerico 2017-07-12 19:53:58 UTC
I set the component to openstack-tripleo, as I am not sure whether this is an issue with:

A) How Neutron is creating tokens
B) How HAProxy is configured
C) How Keystone is configured (not enough workers?)

Also note that with Pike, Fernet tokens are in use instead of the UUID tokens used previously.
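
For context, the token provider is selected in keystone.conf; a minimal sketch, assuming the default Fernet key repository location (values are illustrative, not taken from this deployment):

[token]
provider = fernet

[fernet_tokens]
key_repository = /etc/keystone/fernet-keys/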

Description of problem:

Running the same performance scenarios that we ran against the previous version(s), we see:
2017-07-12 18:20:31.007 122345 ERROR keystonemiddleware.auth_token [-] Bad response code while validating token: 408: RequestTimeout: Request Timeout (HTTP 408)
2017-07-12 18:20:31.007 122345 WARNING keystonemiddleware.auth_token [-] Identity response: <html><body><h1>408 Request Time-out</h1>
Your browser didn't send a complete request in time.
</body></html>
: RequestTimeout: Request Timeout (HTTP 408)
2017-07-12 18:20:31.007 122345 CRITICAL keystonemiddleware.auth_token [-] Unable to validate token: Failed to fetch token data from identity server: ServiceError: Failed to fetch token data from identity server
2017-07-12 18:20:31.008 122345 INFO neutron.wsgi [-] 172.16.0.19 "GET /v2.0/networks.json HTTP/1.1" status: 503  len: 362 time: 13.8438971


Version-Release number of selected component (if applicable):
[heat-admin@overcloud-controller-0 ~]$ rpm -qa | grep neutron
python-neutron-lib-1.7.0-0.20170529134801.0ee4f4a.el7ost.noarch
python-neutron-lbaas-11.0.0-0.20170627061233.2c054e0.el7ost.noarch
openstack-neutron-metering-agent-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
python-neutronclient-6.3.0-0.20170601203754.ba535c6.el7ost.noarch
openstack-neutron-common-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
openstack-neutron-lbaas-11.0.0-0.20170627061233.2c054e0.el7ost.noarch
openstack-neutron-l2gw-agent-10.1.0-0.20170619104111.4bb9806.el7ost.noarch
openstack-neutron-sriov-nic-agent-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
python-neutron-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
openstack-neutron-ml2-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
openstack-neutron-linuxbridge-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
puppet-neutron-11.2.0-0.20170626053011.862f130.el7ost.noarch
openstack-neutron-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
openstack-neutron-openvswitch-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
[heat-admin@overcloud-controller-0 ~]$ rpm -qa | grep keystone
puppet-keystone-11.2.0-0.20170628054214.e7840bf.el7ost.noarch
python-keystonemiddleware-4.16.0-0.20170608235114.5bd1474.el7ost.noarch
python-keystoneclient-3.11.0-0.20170613145143.daa99ed.el7ost.noarch
python-keystoneauth1-2.21.0-0.20170613144756.0be1403.el7ost.noarch
openstack-keystone-12.0.0-0.20170628062203.e4c0c8d.el7ost.noarch
python-keystone-12.0.0-0.20170628062203.e4c0c8d.el7ost.noarch


How reproducible:
100%

Steps to Reproduce:
1. Execute the Rally create-list-network scenario with times: 1000, concurrency: 64 (see the task sketch below).
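
A minimal Rally task sketch for this step, assuming the stock NeutronNetworks.create_and_list_networks scenario; the context values are illustrative, not copied from the actual task file used here:

---
  NeutronNetworks.create_and_list_networks:
    -
      args: {}
      runner:
        type: "constant"
        times: 1000
        concurrency: 64
      context:
        users:
          tenants: 1
          users_per_tenant: 1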

Actual results:
+---------------------------------------------------------------------------------------------------------------------------+
|                                                   Response Times (sec)                                                    |
+------------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+
| Action                 | Min (sec) | Median (sec) | 90%ile (sec) | 95%ile (sec) | Max (sec) | Avg (sec) | Success | Count |
+------------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+
| neutron.create_network | 0.965     | 1.447        | 39.376       | 63.021       | 117.748   | 11.164    | 98.0%   | 1000  |
| neutron.list_networks  | 0.535     | 11.913       | 31.061       | 42.132       | 119.78    | 15.354    | 99.8%   | 982   |
| total                  | 2.35      | 16.511       | 63.54        | 85.171       | 192.939   | 26.519    | 98.0%   | 1000  |
+------------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+



Expected results:
100% success -- as we had with Ocata: http://kibana.scalelab.redhat.com/goto/a1fba39fcf85294f46be49656dcf5f45

Additional info:

I have tried bumping the following values so far (a config sketch follows the list):

haproxy:
- timeout http-request 20s (was 10s) -- problem still exists.

keystone-admin:
- processes: 24 (was 12) -- problem still exists.
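
For reference, a minimal sketch of the two changes, assuming the stock TripleO file locations; the paths and surrounding options are illustrative, not copied from this deployment:

# /etc/haproxy/haproxy.cfg -- defaults section (or the keystone_admin listen section)
timeout http-request 20s

# keystone admin WSGI vhost under httpd (e.g. 10-keystone_wsgi_admin.conf)
WSGIDaemonProcess keystone_admin processes=24 threads=1 user=keystone group=keystone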

Comment 1 Joe Talerico 2017-07-13 14:19:56 UTC
keystone-admin: 24 processes (was 12)
keystone-main: 24 processes (was 12)
haproxy: timeout http-request 20s (was 10s) 

^ This configuration helps -- far fewer BADREQs (about 75% fewer).

Additional information: prior to the changes, I would see:
Jul 12 15:35:44 localhost haproxy[709736]: 192.168.24.54:58020 [12/Jul/2017:19:35:24.951] keystone_admin keystone_admin/<NOSRV> -1/-1/-1/-1/20000 408 212 - - cR-- 910/42/3/0/3 0/0 "<BADREQ>"

Comment 2 Ryan O'Hara 2017-07-17 14:37:03 UTC
I talked with Joe quite a bit about this and here are my thoughts:

The 408 error here is happening because of the http-request timeout. It seems that increasing this to 30s makes the 408 errors go away, but I am very reluctant to raise this timeout because it seems to just mask the actual problem.

The neutron client issues a request, which goes through haproxy. The neutron server that receives this request in turn creates a keystone request, which also goes through haproxy, and that is where things go badly. Say that 'timeout http-request' is set to 20s. That means that the client (in this case neutron itself is the "client", since it is sending the request to keystone) has 20s to send a complete HTTP request to haproxy before a timeout occurs. Read all about it here [1].

Note that one of the more interesting things about this timeout is that it only applies to the header part of the request. From [1]:

"Note that this timeout only applies to the header part of the request, and
not to any data. As soon as the empty line is received, this timeout is not
used anymore."

So it seems that neutron's HTTP request to keystone (via haproxy) is not even getting its headers delivered within the timeout. That seems strange. (A sketch of the relevant haproxy section follows the link below.)

1.  http://cbonte.github.io/haproxy-dconv/1.5/configuration.html#timeout%20http-request
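
To make the failure mode concrete, here is a hedged sketch of where this directive typically lives in the haproxy.cfg fronting keystone; the section name, addresses, and ports are illustrative, not copied from this deployment:

# /etc/haproxy/haproxy.cfg (illustrative)
listen keystone_admin
    bind 192.168.24.15:35357    # keystone admin VIP (illustrative)
    timeout http-request 20s    # bounds only receipt of the request headers
    server overcloud-controller-0 172.17.1.10:35357 check    # illustrative backend

# If the headers do not arrive within the timeout, haproxy returns 408 and logs the
# request as "<BADREQ>" with termination flags "cR", matching the log line in comment 1.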

Comment 3 Joe Talerico 2017-07-17 20:26:29 UTC
(In reply to Joe Talerico from comment #0)
> Expected results:
> 100% success -- as we had with Ocata :
> http://kibana.scalelab.redhat.com/goto/a1fba39fcf85294f46be49656dcf5f45
Rethinking the results in the kibana link above... 

Ocata TripleO deployed Neutron incorrectly [1] -- so my Ocata deployment had 32 Neutron workers (not 12).

After increasing the worker count for Neutron to 32 across all 3 controllers, I see the results in [2]: no errors. However, list_networks is taking roughly double the time it previously did (I am not 100% sure of this, since I only had a single run to compare against; I am gathering more data). A sketch of the relevant worker setting follows the links below.

[1] https://review.openstack.org/#/c/481587/2
[2] http://kibana.scalelab.redhat.com/goto/aaa4ff5ba8be7a4137ad0bc2f50438dc
[3] http://kibana.scalelab.redhat.com/goto/452b7a3998cc110e336358f3ea0da324
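
For reference, a hedged sketch of the worker setting in question; in a TripleO deployment this is normally driven by a deployment parameter (typically NeutronWorkers) rather than edited by hand, and the value below is simply the one described above:

# /etc/neutron/neutron.conf on each controller (illustrative)
[DEFAULT]
api_workers = 32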

Comment 5 Brent Eagles 2018-01-24 15:05:17 UTC
This appears to be a worker-count-related issue, so I'm closing this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1468018, which is currently ON_DEV with a patch posted upstream (see https://review.openstack.org/536957).

*** This bug has been marked as a duplicate of bug 1468018 ***

Comment 6 Brent Eagles 2018-01-24 17:40:32 UTC
Re-opening. Joe pointed out that this is probably not related to the worker count issue in Ocata; this was found against Pike.

Comment 8 Brent Eagles 2018-02-20 04:20:19 UTC
@Joe, what are the chances of repeating this scenario? I would like to see if we can capture a "slice" of the logs across the system from when this is happening.

Comment 9 Assaf Muller 2018-04-26 18:18:17 UTC
Let's re-open when anyone has access to HW and can reproduce, then we'll have folks jump on the live system.