Bug 1899127 - Failover mechanisms for keystone LDAP backend are causing huge delays for some Horizon operations if one LDAP server is down
Summary: Failover mechanisms for keystone LDAP backend are causing huge delays for some Horizon operations if one LDAP server is down
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-keystone
Version: 13.0 (Queens)
Hardware: All
OS: All
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Grzegorz Grasza
QA Contact: Jeremy Agee
URL:
Whiteboard:
Depends On: 2024602
Blocks:
 
Reported: 2020-11-18 15:11 UTC by Alex Stupnikov
Modified: 2023-08-15 08:27 UTC
CC: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Cloned To: 2024602
Environment:
Last Closed: 2022-08-11 18:37:38 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Issue Tracker OSP-151 (last updated 2021-11-17 16:04:48 UTC)

Description Alex Stupnikov 2020-11-18 15:11:28 UTC
Description of problem:

The customer tested the LDAP failover mechanism for setups where the "url" parameter contains multiple LDAP URIs and was unhappy with the CLI delays, and especially the Horizon delays, when the first URI is no longer available. We tried to troubleshoot this problem and did some reverse engineering; please find our observations below.
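For context, the configuration in question looks something like the following keystone domain config; the file path, hostnames, and credentials here are placeholders, not the customer's actual values:

    # /etc/keystone/domains/keystone.LDAP.conf (path and values assumed)
    [ldap]
    url = ldap://ldap1.example.com,ldap://ldap2.example.com
    user = cn=reader,dc=example,dc=com
    password = secret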

First of all, we have reported bug #1891821 because keystone is now containerized and it is impossible to adjust the NETWORK_TIMEOUT setting for keystone's LDAP backend without tuning ldap.conf inside the container.
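In other words, the current workaround is editing the OpenLDAP client configuration inside the container; a minimal sketch, assuming the default OpenLDAP client config path:

    # /etc/openldap/ldap.conf inside the keystone container (path assumed).
    # NETWORK_TIMEOUT caps how long libldap waits for a TCP connect
    # before moving on to the next URI in the list.
    NETWORK_TIMEOUT 1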

We also found out that ldappool likely follows a fixed order when initiating new connections from the configured LDAP URIs: the first URI is always tried first, and the second one is only used if the first is unavailable. As a result, an extra NETWORK_TIMEOUT delay is added to every LDAP request once the first URI is no longer available. There may be reasons for this kind of behavior, but it causes significant degradation when the first LDAP server is down.
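The ordering effect can be illustrated with python-ldap alone; this is a minimal sketch with placeholder hostnames and credentials, not keystone code:

    import ldap

    # libldap accepts a space-separated URI list and tries it left to
    # right, so when ldap1 is down every new connection waits out
    # NETWORK_TIMEOUT before falling back to ldap2.
    conn = ldap.initialize("ldap://ldap1.example.com ldap://ldap2.example.com")
    conn.set_option(ldap.OPT_NETWORK_TIMEOUT, 1)
    conn.simple_bind_s("cn=reader,dc=example,dc=com", "secret")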

We used the "openstack token issue" command to measure pure API performance (a timing example follows the list) and issued tokens for:

- admin user stored in SQL DB (took 1.781s)
- user from LDAP user DB when all LDAP servers are up and running (took 4.797s for first call and 1.927s for the call made shortly after the first one)
- user from LDAP user DB when the first LDAP server is down and NETWORK_TIMEOUT 1 is configured (took 8.008s for the first call and 3.941s for the call made shortly after the first one)
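A timing of this kind can be captured as follows; the user, domain, and project names are placeholders, and the remaining credentials are assumed to come from the environment:

    # Wall-clock timing of a single token issue (placeholder values).
    time openstack token issue --os-username ldapuser \
        --os-user-domain-name LDAP --os-project-name demo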


The situation is much worse for Horizon: it issues scoped tokens for every login attempt, which requires extra API calls. As a result, it takes about 10 seconds longer to log in after the first LDAP server goes down:

- admin user stored in SQL DB (took ~2.5s)
- user from LDAP user DB when all LDAP servers are up and running (took ~5s)
- user from LDAP user DB when the first LDAP server is down and NETWORK_TIMEOUT 1 is configured (took ~15s)


I would like to report this bug to ask for a second look from keystone developers. The failover algorithm for the LDAP backend looks suboptimal for situations when the first LDAP server is down: from the code it looks like keystone re-uses a pool of existing LDAP connections, but the extra penalty is still added to requests made right after one another. I am also wondering if we could introduce an option to randomize the order of LDAP servers when passing them to ldappool (a sketch of the idea follows).
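A minimal sketch of that randomization idea, using an illustrative helper name rather than actual keystone code:

    import random

    def randomized_url(url_option):
        """Shuffle a comma-separated LDAP URI list so a dead first
        server is not tried first by every new connection.
        Illustrative only; mirrors keystone's comma-separated
        [ldap] url format."""
        uris = [u.strip() for u in url_option.split(",")]
        random.shuffle(uris)
        # libldap accepts a space-separated list and tries it in order.
        return " ".join(uris)

    print(randomized_url("ldap://ldap1.example.com,ldap://ldap2.example.com"))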

Please let me know if I am missing something.

Comment 1 Raildo Mascena de Sousa Filho 2020-12-08 15:19:47 UTC
We're currently working on reproducing this BZ in an OSP13 environment with two LDAP servers set up against it, to see whether we can reproduce the issue and determine the best approach to fix it, if we find a problem in the keystone+LDAP code.

