Description of problem:

A customer tested the LDAP fail-over mechanism for setups where the "url" parameter contains multiple LDAP URIs and was not happy about the CLI and especially Horizon delays when the first URI is no longer available. We have tried to troubleshoot this problem and do some reverse engineering; please find our observations below.

First of all, we have reported bug #1891821 because keystone is now containerized and it is impossible to adjust the NETWORK_TIMEOUT setting for keystone's LDAP backend without tuning ldap.conf inside the container.

We also found that ldappool apparently uses a fixed order when initiating new connections from the configured LDAP URIs: the first URI is always tried first, and the second URI is used only if the first one is no longer available. As a result, an extra NETWORK_TIMEOUT delay is added to every LDAP request when the first URI is down. There may be reasons for this behavior, but it causes significant degradation when the first LDAP server is down.

We used the "openstack token issue" command to measure pure API performance and issued tokens for:
- the admin user stored in the SQL DB (took 1.781s)
- a user from the LDAP user DB when all LDAP servers are up and running (took 4.797s for the first call and 1.927s for a call made shortly after the first one)
- a user from the LDAP user DB when the first LDAP server is down and NETWORK_TIMEOUT 1 is configured (took 8.008s for the first call and 3.941s for a call made shortly after the first one)

The situation is much worse for Horizon: it issues scoped tokens for every login attempt, which requires extra API calls. As a result, it takes 10 seconds longer to log in after the first LDAP server goes down:
- admin user stored in the SQL DB (took ~2.5s)
- user from the LDAP user DB when all LDAP servers are up and running (took ~5s)
- user from the LDAP user DB when the first LDAP server is down and NETWORK_TIMEOUT 1 is configured (took ~15s)

I would like to report this bug to ask for a second look from the keystone developers. It looks like the failover behavior of the LDAP backend is suboptimal when the first LDAP server is down: from the code it appears that keystone re-uses the pool of existing LDAP connections, but an extra penalty is still added for requests made one right after another. I am also wondering whether we could introduce an option to randomize the order of LDAP servers when passing them to ldappool (see the sketch below). Please let me know if I am missing something.
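To illustrate the randomization idea, here is a minimal Python sketch. This is not keystone code; it only shows how the space- or comma-separated URI list from the [ldap] "url" option could be shuffled before it is handed to the connection pool, so the connect-timeout penalty is not always paid by whichever server happens to be listed first. The helper name and the example URIs are made up for illustration.

import random


def randomize_ldap_uris(url_option):
    """Shuffle the LDAP URI list taken from the keystone [ldap] 'url' option.

    The value is treated as a space- or comma-separated list of URIs;
    shuffling it spreads the "first server is down" timeout penalty across
    all configured servers instead of always hitting the same one.
    """
    uris = [u for u in url_option.replace(',', ' ').split() if u]
    random.shuffle(uris)
    return ' '.join(uris)


# Hypothetical two-server value as it might appear in keystone.conf:
print(randomize_ldap_uris('ldap://ldap1.example.com,ldap://ldap2.example.com'))

Note that shuffling per new connection would only balance which requests pay the timeout; removing the penalty entirely would still require something like a health check or skipping URIs that are known to be down.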
We're currently working on reproducing this BZ in an OSP13 environment, setting up two ldaps servers for it, to see what we can do to reproduce the issue and determine the best approach to fix it, if we find an issue in the keystone+LDAP code.
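For the reproduction, a small timing helper along the following lines could complement the manual "openstack token issue" runs, comparing latency with both LDAP servers up versus the first one down. It is only a sketch: it assumes keystoneauth1 is installed, and the auth URL, user name, password and domain below are placeholders.

import time

from keystoneauth1 import session
from keystoneauth1.identity import v3


def time_token_issue(auth_url, username, password, user_domain, runs=3):
    """Issue a token several times and print the latency of each call,
    mirroring what 'openstack token issue' measures."""
    for i in range(runs):
        auth = v3.Password(auth_url=auth_url,
                           username=username,
                           password=password,
                           user_domain_name=user_domain)
        sess = session.Session(auth=auth)
        start = time.monotonic()
        sess.get_token()  # triggers the actual authentication request
        print('run %d: %.3fs' % (i + 1, time.monotonic() - start))


if __name__ == '__main__':
    time_token_issue('http://keystone.example.com:5000/v3',
                     'ldapuser', 'secret', 'ldap_domain')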