Bug 1572789 - [rfe] A way for keystone to blacklist a server from the ldap pool for a set amount of time to reduce the time it takes to get a response from the ldap.
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-keystone
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: John Dennis
QA Contact: nlevinki
Reported: 2018-04-27 20:58 UTC by rosingh
Modified: 2018-06-12 17:31 UTC (History)
Last Closed: 2018-05-17 13:48:35 UTC

Description rosingh 2018-04-27 20:58:01 UTC
Description of problem:
RFE for Keystone to blacklist a server from the LDAP pool for a set amount of time, to reduce the time it takes to get a response from LDAP in OSP 12. Related to case 02088311.

Comment 1 Mircea Vutcovici 2018-04-27 21:19:07 UTC
One solution would be to use HAProxy with an LDAP health check, something similar to https://gist.github.com/kevin39/3db2cb05e79fb752c80d

Comment 3 John Dennis 2018-05-14 21:33:23 UTC
Answering the question in the customer case.

Currently the only option available is to set one or more of the following keystone ldap options to lower values:

    pool_retry_max
    pool_retry_delay
    pool_connection_timeout
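
As a rough illustration of why lowering these values helps, the sketch below estimates an upper bound on the time spent failing to obtain one pooled connection when the target server is unresponsive. It assumes that each attempt can block for up to the connection timeout and is followed by the retry delay, and that a negative timeout means "never time out". This is a simplification of the pool's retry loop, not a description of the exact Keystone behavior. The function name and the sample values are hypothetical.

```python
def worst_case_connect_seconds(retry_max, retry_delay, connection_timeout):
    """Rough upper bound on time spent failing to get one pooled connection.

    Assumption: each of the retry_max attempts may block for up to
    connection_timeout seconds and is followed by a retry_delay sleep.
    """
    if connection_timeout < 0:
        # Assumed convention: a negative timeout means "no timeout",
        # so the wait is unbounded at this layer.
        return float("inf")
    return retry_max * (connection_timeout + retry_delay)


# Hypothetical values, chosen only for illustration.
print(worst_case_connect_seconds(retry_max=3, retry_delay=0.1, connection_timeout=10))
print(worst_case_connect_seconds(retry_max=1, retry_delay=0.1, connection_timeout=2))
```

Under these assumptions the bound scales with both the retry count and the timeout, so reducing either one shortens the delay seen when a server is down.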

The original customer case did not specify whether they were passing multiple LDAP URIs in the LDAP URL config value. I'm assuming yes, because passing multiple URIs is the only way to have Keystone try multiple LDAP servers.

Note, the LDAP pool implementation in Keystone *ONLY* applies to a two-tuple of (bind_principal, url). Therefore the pool applies to the entire list of URIs. The pool DOES NOT ITERATE over the URI list trying servers in succession.

Rather, the comma-separated list of URIs is passed to the python-ldap module, which does not touch it. python-ldap passes it unmodified to the underlying LDAP C library (e.g. OpenLDAP). OpenLDAP performs the URI iteration, and the ordering of the iteration cannot be specified.

In summary, the choice of which server in the list is tried is made well below Keystone. Keystone has no functionality to manage individual servers in a pool, nor can it detect when a server goes offline and adjust the pool entries.

Adding this logic to Keystone would be a significant effort. This is not something Red Hat would do unilaterally; it would have to be an upstream effort with a blueprint.

Comment 4 John Dennis 2018-05-15 14:03:33 UTC
To make sure I understand what the customer is trying to achieve, let me restate my interpretation.

They have multiple LDAP servers, each serving the same content. At some point one of the servers may be down. If one of the servers is down, they do not want to pay a performance penalty when Keystone attempts to access the offline server and has to time out before proceeding to another server. They specify the pool of LDAP servers by including them in a comma-separated list in the Keystone configuration file. Did I get this correct?

By the way, this is the type of critical information that should always be included in a problem report so we don't have to guess. Also, rather than proposing a solution and hoping for an RFE that implements that solution, it's much better to state the original requirement; then we can see what possible solutions might resolve the issue.

On the assumption that the above is correct, this is a classic High Availability (HA) problem, and there are many existing tools dedicated to solving it that can be used outside the context of Keystone (i.e. load balancers). The best-known tool, and one that Director uses in an HA configuration, is HAProxy. HAProxy is capable of managing a set of servers in a pool; it can detect when one is down and remove it from the pool. It can run periodic health checks on the server and bring it back into the pool automatically. For all running servers in the pool, it dispatches requests among the servers according to a variety of strategies (round-robin being the default). To coordinate Keystone with HAProxy, one would specify just one LDAP address in the Keystone configuration file; that address would be the front-end (public) address presented by HAProxy. Internally, HAProxy will route the request to one of the LDAP servers in the pool, and this will be invisible to Keystone.
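
To make the load balancer behavior described above concrete, here is a minimal Python sketch of the two ideas involved: a health probe that marks backends up or down, and round-robin dispatch that skips backends whose last probe failed. It only models the concept; it is not how HAProxy is implemented or configured. The names `tcp_alive` and `RoundRobinPool` and the example hostnames are hypothetical, and a plain TCP connect is assumed as the health check (a real LDAP check would go further and issue a bind or search).

```python
import itertools
import socket


def tcp_alive(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


class RoundRobinPool:
    """Dispatch over backends, skipping those whose last probe failed."""

    def __init__(self, backends, probe=tcp_alive):
        self.backends = list(backends)
        self.probe = probe
        self.healthy = set(self.backends)
        self._cycle = itertools.cycle(self.backends)

    def run_health_checks(self):
        # A real balancer runs this periodically in the background, which is
        # also how a recovered server is brought back into rotation.
        self.healthy = {b for b in self.backends if self.probe(*b)}

    def next_backend(self):
        # Look at each backend at most once per call.
        for _ in range(len(self.backends)):
            backend = next(self._cycle)
            if backend in self.healthy:
                return backend
        raise RuntimeError("no healthy backend available")


if __name__ == "__main__":
    # Hypothetical LDAP servers, for illustration only.
    pool = RoundRobinPool([("ldap1.example.com", 389), ("ldap2.example.com", 389)])
    pool.run_health_checks()
    print(sorted(pool.healthy))
```

The point of the design is that the client never pays the timeout for a dead server: the probe absorbs that cost out of band, and requests are only sent to servers already known to respond.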

Comment 5 John Dennis 2018-05-15 14:39:31 UTC
Adding notes to myself here based on my investigation of the LDAP pool implementation and what would need to be done to achieve the original RFE.

The first thing I looked for in Keystone was any code that trapped the ldap.TIMEOUT exception, because that would be necessary to fail over to the next server in the Keystone pool. In addition, Keystone would need to know the individual servers in the list specified in the ldap.url configuration value (currently they are comma-separated) so that it could rotate to the next server in the list. But Keystone does not manage ldap.url as a list; rather, it treats it as an opaque string that it passes into ldap.initialize(), which in turn passes it unmodified to the OpenLDAP initialize() function. Therefore I conclude that Keystone, despite managing a pool of LDAP servers, is NOT capable of handling failover; rather, the pool is just a way to spread the load.

To implement ldap pool failover in Keystone I believe the blueprint would be composed of 3 implementation stages.

Stage 1:

* Split the ldap.url config item into a list of servers and manage them as a list of individual servers.

* Every LDAP function would need to trap TIMEOUT exceptions. The failed server would be removed from the pool and placed on a quarantine list for some duration. The request would be resubmitted using the next server in the pool. There are two ways I can think of to add this extra functionality to each LDAP call: override the method in the LDAP pool class, or write a decorator that wraps the function.

* A new config item would need to be added that controls the quarantine duration.

* Pool management would need to be augmented to manage the quarantine and active server list.
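
The Stage 1 items above can be sketched roughly as follows, using the decorator approach. This is an illustration of the idea only, not Keystone code: `QuarantinePool`, `with_failover` and `ServerTimeout` are hypothetical names, `ServerTimeout` stands in for the LDAP library's timeout exception (ldap.TIMEOUT in python-ldap), and the wrapped function is assumed to take the server URI as its first argument.

```python
import functools
import time


class ServerTimeout(Exception):
    """Stand-in for the LDAP library's timeout exception."""


class QuarantinePool:
    """Track individual servers and quarantine the ones that time out."""

    def __init__(self, url, quarantine_seconds, clock=time.monotonic):
        # Split the comma-separated config value into individual servers.
        self.servers = [u.strip() for u in url.split(",") if u.strip()]
        # Would come from the new (hypothetical) quarantine config item.
        self.quarantine_seconds = quarantine_seconds
        self.clock = clock
        self._released_at = {}  # server -> time at which its quarantine ends

    def active(self):
        """Servers that are not currently quarantined, in config order."""
        now = self.clock()
        return [s for s in self.servers
                if self._released_at.get(s, float("-inf")) <= now]

    def quarantine(self, server):
        self._released_at[server] = self.clock() + self.quarantine_seconds


def with_failover(pool):
    """Decorator: on timeout, quarantine the server and try the next one."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for server in pool.active():
                try:
                    return func(server, *args, **kwargs)
                except ServerTimeout:
                    pool.quarantine(server)
            raise ServerTimeout("no active LDAP server responded")
        return wrapper
    return decorator
```

A wrapped call such as `search(query)` would then be retried transparently against the remaining servers, and a quarantined server would rejoin the active list once its quarantine period elapses.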

Stage 2:

You don't want to stop and restart Keystone to manage the pool. Admins will need to know the state of the pool in a running Keystone, and there will be a desire to blacklist a server from consideration, or to move a server from the blacklist back to the active pool, while Keystone is running. It will therefore be necessary to add an API call that returns the state of the pool, and a way to set flags on individual servers in the pool (e.g. mark them as active or inactive).
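
A minimal sketch of the in-memory state such a management API would read and update might look like the following. The class and method names are hypothetical, and the HTTP layer that would expose them is deliberately left out. A lock is used on the assumption that API handlers and LDAP requests may run in different threads.

```python
import threading


class ManagedPool:
    """Pool state that a management API handler could read and update."""

    def __init__(self, servers):
        self._lock = threading.Lock()
        # Every configured server starts out active.
        self._active = {server: True for server in servers}

    def state(self):
        """Return a snapshot: server -> True (active) or False (blacklisted)."""
        with self._lock:
            return dict(self._active)

    def set_active(self, server, active):
        """Mark one server as active or inactive while the service is running."""
        with self._lock:
            if server not in self._active:
                raise KeyError(server)
            self._active[server] = bool(active)

    def active_servers(self):
        with self._lock:
            return [s for s, ok in self._active.items() if ok]
```

Note that this state lives only in one process, so it is lost on restart and is not shared with other Keystone servers.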

Stage 3:

The management functionality in Stage 2 is limited because it applies to only one Keystone server, and any values assigned to a pool member through the management API would be lost on Keystone restart. It would be better to store the pool data in some type of persistent storage that would persist across restarts and could be shared among cooperating Keystone servers. However, this would conflict with the existing specification of the server list in the config file; something would need to be done to resolve the two sources of authoritative information on who is a member of the pool.
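
One possible way to resolve the two sources, shown here purely as an assumption and not as a proposed design, is to let the config file decide pool membership and let the persistent store supply only the per-server flags. The sketch uses a JSON file as a stand-in for the unspecified persistent storage, and all function names are hypothetical.

```python
import json


def load_state(path):
    """Read persisted per-server flags; an absent file means no flags yet."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {}


def save_state(path, state):
    with open(path, "w") as handle:
        json.dump(state, handle)


def reconcile(config_servers, stored_state):
    """Config decides membership; stored state only supplies the flags.

    Servers missing from the store default to active, and stored entries
    for servers no longer in the config are dropped.
    """
    return {server: bool(stored_state.get(server, True))
            for server in config_servers}
```

A plain file does not by itself give sharing among cooperating Keystone servers; that would need a store they can all reach, which is part of the complexity this stage adds.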

Given that the above adds a fair amount of complexity, and that it essentially re-implements functionality already available in other tools, I don't see much value in attempting to do this in Keystone.

Comment 6 John Dennis 2018-05-17 13:48:35 UTC
Closing this as WONTFIX. There are better ways to address this problem outside of Keystone, and there is no easy way to implement the request in the existing Keystone code base.

Comment 7 John Dennis 2018-06-12 17:31:26 UTC
Note: bug #1561070 contains some info on how to obtain LDAP connection diagnostics. This would be useful because, rather than trying to blacklist a server, we need to understand why a non-responsive server is delaying LDAP operations. That can only happen if we understand what is happening in the python-ldap ReconnectLDAPObject and the OpenStack ldappool manager.

