Bug 1470363
Summary: | [Neutron] Keystone authentication issues | |
---|---|---|---
Product: | Red Hat OpenStack | Reporter: | Joe Talerico <jtaleric>
Component: | openstack-tripleo | Assignee: | Brent Eagles <beagles>
Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Arik Chernetsky <achernet>
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | 12.0 (Pike) | CC: | amuller, beagles, jtaleric, mburns, racedoro, rhel-osp-director-maint, rohara
Target Milestone: | --- | Keywords: | Reopened
Target Release: | --- | |
Hardware: | All | |
OS: | All | |
Whiteboard: | PerfScale | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2018-04-26 18:18:17 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Joe Talerico
2017-07-12 19:53:58 UTC
Configuration changes made so far:

- keystone-admin: 24 processes (was 12)
- keystone-main: 24 processes (was 12)
- haproxy: timeout http-request 20s (was 10s)

This configuration helps -- far fewer BADREQs (75% less). Prior to the changes, I would see entries like:

Jul 12 15:35:44 localhost haproxy[709736]: 192.168.24.54:58020 [12/Jul/2017:19:35:24.951] keystone_admin keystone_admin/<NOSRV> -1/-1/-1/-1/20000 408 212 - - cR-- 910/42/3/0/3 0/0 "<BADREQ>"

(Illustrative config sketches for these settings are included at the end of this report.)

I talked with Joe quite a bit about this and here are my thoughts: the 408 error here is happening because of the http-request timeout. It seems like increasing it to 30s would make the 408 errors go away, but I am very hesitant to raise this timeout because that just masks the actual problem.

The neutron client issues a request, which goes through haproxy. The neutron server that receives this request in turn issues a keystone request, which also goes through haproxy, and that is where things go badly. Say that 'timeout http-request' is set to 20s. That means the client (in this case neutron itself is the "client", since it is the one sending the request to keystone) has 20s to send a complete HTTP request to haproxy before a timeout occurs. Read all about it here [1]. One of the more interesting things about this timeout is that it only applies to the HTTP header. From [1]: "Note that this timeout only applies to the header part of the request, and not to any data. As soon as the empty line is received, this timeout is not used anymore." So it seems like neutron's HTTP request to keystone (via haproxy) is not even sending the header within the timeout. That seems strange.

[1] http://cbonte.github.io/haproxy-dconv/1.5/configuration.html#timeout%20http-request

(In reply to Joe Talerico from comment #0)
> Expected results:
> 100% success -- as we had with Ocata :
> http://kibana.scalelab.redhat.com/goto/a1fba39fcf85294f46be49656dcf5f45

Rethinking the results in the kibana link above... Ocata TripleO deployed Neutron incorrectly [1] -- so my Ocata deployment had 32 Neutron workers (not 12). After increasing the worker count to 32 for neutron across all 3 controllers I see [2]: no errors, however list_networks is taking roughly double the time it previously did (not 100% sure on this, since I only had a single run to compare against; I am gathering more data). A sketch of the worker settings in question is also included at the end of this report.

[1] https://review.openstack.org/#/c/481587/2
[2] http://kibana.scalelab.redhat.com/goto/aaa4ff5ba8be7a4137ad0bc2f50438dc
[3] http://kibana.scalelab.redhat.com/goto/452b7a3998cc110e336358f3ea0da324

This appears to be a worker-count-related issue, so I'm closing this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1468018, which is currently ON_DEV with a patch posted upstream (see https://review.openstack.org/536957).

*** This bug has been marked as a duplicate of bug 1468018 ***

Re-opening. Joe pointed out that this is probably not related to the worker count issue in Ocata -- this was found against Pike.

@Joe, what are the chances of repeating this scenario? I would like to see if we can capture a "slice" of the logs across the system for when this is happening.

Let's re-open when anyone has access to HW and can reproduce, then we'll have folks jump on the live system.
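For context on the keystone process counts mentioned above: in a typical TripleO/OSP deployment of this era, keystone runs under httpd with mod_wsgi, so the process count is set by a WSGIDaemonProcess directive. The following is a minimal sketch only; the file path, thread count, and other arguments are illustrative assumptions rather than values captured from this environment.

```
# Sketch of the keystone admin vhost under httpd/mod_wsgi.
# Path and extra arguments are illustrative; only processes= reflects the
# 12 -> 24 change described above.
# /etc/httpd/conf.d/10-keystone_wsgi_admin.conf
<VirtualHost *:35357>
    WSGIDaemonProcess keystone_admin processes=24 threads=1 user=keystone group=keystone
    WSGIProcessGroup keystone_admin
    WSGIScriptAlias / /usr/bin/keystone-wsgi-admin
</VirtualHost>
```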
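The haproxy change refers to the `timeout http-request` directive on the keystone frontends. Below is a minimal sketch of the relevant haproxy.cfg stanza; the bind and server addresses are placeholders, not the real addresses from this deployment.

```
# /etc/haproxy/haproxy.cfg (sketch -- addresses are placeholders)
listen keystone_admin
    mode http
    bind 192.0.2.10:35357
    # Time allowed for the client (here: neutron-server) to send the complete
    # HTTP request *headers*; raised from 10s to 20s during this investigation.
    # When it expires first, haproxy logs a 408 with "<BADREQ>" as seen above.
    timeout http-request 20s
    server controller-0 192.0.2.11:35357 check
    server controller-1 192.0.2.12:35357 check
    server controller-2 192.0.2.13:35357 check
```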
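Finally, the 32-worker change discussed in the Ocata/Pike comparison is normally expressed through neutron's worker options (in TripleO typically driven by a deployment parameter rather than edited by hand). A sketch of the resulting neutron.conf values follows; the rpc_workers line is an assumption included for completeness.

```
# /etc/neutron/neutron.conf (sketch)
[DEFAULT]
# Number of neutron-server API worker processes per controller.
# Ocata effectively ran with 32 due to the deployment issue in
# https://review.openstack.org/#/c/481587/2; Pike was raised to match.
api_workers = 32
# RPC workers (agent traffic); shown only as an assumption for completeness.
rpc_workers = 32
```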