Bug 1350877
| Summary: | ns-slapd (FreeIPA) process high CPU load up to 200% | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Valentin Bajrami <valentin.bajrami> |
| Component: | 389-ds-base | Assignee: | Noriko Hosoi <nhosoi> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Viktor Ashirov <vashirov> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.2 | CC: | emre.sokullu, lkrispen, nkinder, rmeggins, tbordaz, valentin.bajrami, wibrown |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-08-24 17:27:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Please provide this output:

    # rpm -qa | egrep "389-ds-base|ipa"

The output of the above command:

    ipa-server-dns-4.2.0-15.sl7_2.6.x86_64
    sssd-ipa-1.13.0-40.el7_2.1.x86_64
    ipa-python-4.2.0-15.sl7_2.6.x86_64
    389-ds-base-debuginfo-1.3.4.0-26.el7_2.x86_64
    ipa-client-4.2.0-15.sl7_2.6.x86_64
    389-ds-base-libs-1.3.4.0-26.el7_2.x86_64
    389-ds-base-1.3.4.0-26.el7_2.x86_64
    ipa-server-4.2.0-15.sl7_2.6.x86_64
    libipa_hbac-1.13.0-40.el7_2.1.x86_64
    python-libipa_hbac-1.13.0-40.el7_2.1.x86_64
    ipa-admintools-4.2.0-15.sl7_2.6.x86_64
    python-iniparse-0.4-9.el7.noarch

From the Directory Server's point of view, it looks like it is just doing its job.
I see 22 threads in the virtual attribute code, reached from the ACL check via the deref plugin. The deref plugin checks access rights with deref_check_access; the check is done against every attribute, including the virtual attributes. All the busy threads are competing for the mutex in vattr_map_lookup.
I'm wondering how IPA configures the Directory Server for pssh.
Thierry and Ludwig, do you have any clue?
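The "22 threads in vattr_map_lookup" observation can be reproduced mechanically from a pstack dump by counting how many threads share the same caller frame. This is a minimal sketch; the dump below is a fabricated, simplified stand-in for real ns-slapd output (the addresses and the `connection_threadmain` frame are illustrative, not taken from this bug):

```shell
# Count threads per frame #1 (the caller of the blocking call) in a pstack
# dump. A large cluster on one function suggests contention around it.
cat > /tmp/pstack.sample <<'EOF'
Thread 3 (Thread 0x7f01):
#0  0x00000001 in pthread_mutex_lock ()
#1  0x00000002 in vattr_map_lookup ()
Thread 2 (Thread 0x7f02):
#0  0x00000001 in pthread_mutex_lock ()
#1  0x00000002 in vattr_map_lookup ()
Thread 1 (Thread 0x7f03):
#0  0x00000003 in poll ()
#1  0x00000004 in connection_threadmain ()
EOF
# On a live server you would feed it the real dump:
#   pstack "$(pgrep -x ns-slapd)" > /tmp/pstack.sample
awk '/^#1 /{print $4}' /tmp/pstack.sample | sort | uniq -c | sort -rn
```

The busiest frame sorts to the top, which is how the 22-thread cluster stands out in a 30-thread dump.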
Valentin, did you have a chance to look at the Directory Server's log files (/var/log/dirsrv/slapd-INSTANCE/{access,errors})? They may be useful for further analysis.
Also, the latest 389-ds-base on rhel-7.2 is 1.3.4.0-32, although the bugs fixed since 1.3.4.0-26 do not look related to this issue. Could IPA have a newer version available (e.g., ipa-4.2.0-15.el7_2.17)?
I agree with Noriko, the DS seems just to be doing its work, but there are multiple searches with deref controls, which can result in heavy load. I think this is/was a problem of sssd and it is either fixed or can be configured not to do so many deref searches, but this would have to be asked of the sssd experts.

The vattr_map lock is a RW lock and there is no contention on it (it consumes CPU, but there are no calls to update the map). Several backtraces showing lock acquisition are likely just a side effect of the way pstack stops the process. I agree it looks like DS is doing normal deref processing. There is a way to prevent sssd from using the deref control, but that may flood DS with SRCH requests just to do what deref is doing. AFAIK using deref was a gain for both SSSD and DS. This high CPU consumption is likely the consequence of a high load coming from pssh. I agree with Noriko, access/error logs would really help. Did the test pass before without such high DS consumption?

Created attachment 1174427 [details]
/var/log/dirsrv/slapd-INTRA-DOMAIN-TLD-NL/{access,errors} log files
Noriko,
As requested, attached you'll find the access and errors logs. Thanks for having a look at the issue.
While I'm waiting for a reply, I did a few other tests using pssh (Parallel SSH). Substituting `passwd` with e.g. `dig` doesn't seem to trigger the high CPU usage. Why would `passwd` cause such a high load? What processes are involved when `passwd` is invoked, and what is the relationship to /etc/pam.d/passwd, /etc/pam.d/system-auth, /etc/nsswitch.conf and /etc/sssd/sssd.conf?

While the issue was persisting, I went and asked the question on the IRC channel #freeipa. ab suggested that I change the nsslapd-threadnumber global configuration attribute in the cn=config entry. After changing the nsslapd-threadnumber attribute, the problem seems to be solved: the ns-slapd service no longer consumes 200% of the CPU when the 'pssh' and 'passwd' processes are executed.

To change the nsslapd-threadnumber attribute, I used the following command:

    # ldapmodify -x -W -h localhost -p 389 -D "cn=Directory Manager"
    Enter LDAP Password:
    dn: cn=config
    changetype: modify
    replace: nsslapd-threadnumber
    nsslapd-threadnumber: 70

    modifying entry "cn=config"

To see if the change took place, I used:

    # ldapsearch -x -W -h localhost -p 389 -D "cn=Directory Manager" -b "cn=config" | grep nsslapd-threadnumber
    Enter LDAP Password:
    nsslapd-threadnumber: 70

Thank you for your updates, Valentin! So, it was thread starvation... It looks to me like a good candidate for our knowledge base. As for this bug, may we close it as NOTABUG?

Hi Noriko, it indeed seems to be thread starvation. It would be a good idea to make this attribute dynamic so the thread count is adjusted based on actual need. You can close this report and label it as NOTABUG or whatever you think the proper label should be. Thanks again for having a look into it.
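The interactive ldapmodify session above can also be scripted by writing the modification to an LDIF file first. This is a minimal sketch using the same DN and value from this report; the `/tmp/threadnumber.ldif` path is an arbitrary choice, and the actual ldapmodify call is left commented out since it needs a live server and the Directory Manager password:

```shell
# Generate the LDIF for raising nsslapd-threadnumber, then (on a live
# server) apply it non-interactively with -f.
cat > /tmp/threadnumber.ldif <<'EOF'
dn: cn=config
changetype: modify
replace: nsslapd-threadnumber
nsslapd-threadnumber: 70
EOF
# ldapmodify -x -W -h localhost -p 389 -D "cn=Directory Manager" -f /tmp/threadnumber.ldif
grep '^nsslapd-threadnumber:' /tmp/threadnumber.ldif
```

Keeping the change in a file makes it easy to review and to replay on replicas.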
That is right, there is evidence of starvation:

    [30/Jun/2016:11:14:22 +0200] conn=34617 op=0 SRCH base="" scope=0 filter="(objectClass=*)" attrs="* altServer namingContexts supportedControl supportedExtension supportedFeatures supportedLDAPVersion supportedSASLMechanisms domaincontrollerfunctionality defaultnamingcontext lastusn highestcommittedusn aci"
    [30/Jun/2016:11:14:22 +0200] conn=34617 op=0 RESULT err=0 tag=101 nentries=1 etime=2
    ...
    [30/Jun/2016:11:14:36 +0200] conn=34630 op=0 SRCH base="" scope=0 filter="(objectClass=*)" attrs="* altServer namingContexts supportedControl supportedExtension supportedFeatures supportedLDAPVersion supportedSASLMechanisms domaincontrollerfunctionality defaultnamingcontext lastusn highestcommittedusn aci"
    [30/Jun/2016:11:14:36 +0200] conn=34630 op=0 RESULT err=0 tag=101 nentries=1 etime=2

Those requests spent 2 seconds in the work queue waiting for workers.

Now I have some difficulty understanding how to correlate starvation with 200% CPU, and why increasing the number of threads reduces the CPU consumption. If CPU is at 200% during starvation, then reducing starvation allows more threads to run and CPU should be higher. A possibility is that increasing the number of threads eliminates the backlog of requests; it is this backlog that creates the 200% when starvation ends.

Thierry, you have a good point there. The high CPU load consumed by the ns-slapd service appears *after* the command execution (pssh and passwd) has ended. This indeed confirms (what you just said) that the backlog is causing the problem.

Thanks Valentin for the confirmation. So this high-CPU bug, triggered by starvation, is now fixed by tuning. It also reveals a performance issue in ACI evaluation. By any chance, do you still have how you configured IPA (hosts, groups, users...)?

Created attachment 1177605 [details]
Full ldap configuration
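The starvation signature above (trivial rootDSE searches with etime=2) can be pulled out of a 389-ds access log with a one-liner over the RESULT lines. A minimal sketch; the sample file below is a shortened stand-in built from the log excerpts quoted in this comment:

```shell
# Histogram the etime values in RESULT lines of an access log.
# Nonzero etimes on cheap searches indicate time spent queued for a worker.
cat > /tmp/access.sample <<'EOF'
[30/Jun/2016:11:14:22 +0200] conn=34617 op=0 RESULT err=0 tag=101 nentries=1 etime=2
[30/Jun/2016:11:14:30 +0200] conn=34620 op=1 RESULT err=0 tag=101 nentries=1 etime=0
[30/Jun/2016:11:14:36 +0200] conn=34630 op=0 RESULT err=0 tag=101 nentries=1 etime=2
EOF
# On a live server, point this at /var/log/dirsrv/slapd-INSTANCE/access
grep -o 'etime=[0-9]*' /tmp/access.sample | cut -d= -f2 | sort -n | uniq -c
```

Each output row is "count value"; a tail of nonzero values during the pssh run is the thing to look for.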
Created attachment 1177606 [details]
print screen 1 IPA configuration
Ipa configuration print screen 1
Created attachment 1177607 [details]
print screen 2 IPA configuration
Ipa configuration print screen 2
> Hi Noriko, It indeed seems to be a thread starvation. It would be a good
> idea to have this attribute dynamic so the threads are adjusted based on the
> actual needs.
This is actually insanely hard to achieve in practice. It adds contention and complexity to the server for managing the number of threads needed and their startup/shutdown. It's better to just over-allocate threads and let them idle instead.
I think that it would be interesting to see the output from lscpu of your host to see how many threads / cores it has.
We have some tuning advice that says hardware threads * 2 == the thread count you should set, i.e., a 16-thread system would run with 32 DS threads.
Perhaps given modern systems this number could be adjusted? I think it could be worth exploring automatic thread tuning by default.
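The sizing rule above can be sketched as a one-liner, assuming GNU coreutils' `nproc` is available to count logical CPUs (this is an illustration of the advice in this comment, not a value 389-ds computes itself):

```shell
# Apply the "2 x hardware threads" rule of thumb for nsslapd-threadnumber.
cpus=$(nproc)          # logical CPUs visible to this process
threads=$((cpus * 2))  # tuning advice: twice the hardware thread count
echo "suggested nsslapd-threadnumber: $threads"
```

On the 16-thread box from the example this prints 32; the value would then be applied via ldapmodify against cn=config.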
Created attachment 1220809 [details]
ipamaster cpuinfo physical machine
I've attached the output of /proc/cpuinfo from both machines
ipamaster <- is a physical machine
ipamaster-1 <- is a virtual machine
Created attachment 1220810 [details]
ipamaster-1 cpuinfo
How did you solve this problem? Our system has the same problem.
Created attachment 1173490 [details] pstack $(pgrep ns-slapd) > ns-slapd log

Description of problem:

FreeIPA VERSION: 4.2.0, API_VERSION: 2.156 on SL7. When running the `pssh` command to lock/unlock local accounts for the remote virgo-hosts, the ns-slapd process on the IPA server consumes 200% of the CPU time.

How reproducible:

Steps to Reproduce:
1. pssh -O "StrictHostKeyChecking=no" -h /Software/users/management/virgo-hosts -p 1 -l root -i 'passwd -l "$(hostname -s)"' | grep -A 2 '[0-9]'
2. virgo-hosts file content:

    virgo02 virgo03 virgo04 virgo05 virgo06 virgo07 virgo08 virgo09 virgo10 virgo11 virgo12 virgo13 virgo14 virgo15 virgo16 virgo17 virgo18 virgo19 virgo20 virgo21 virgo22 virgo23 virgo24 virgo26 virgo27

3. /etc/nsswitch.conf on the virgo machines:

    passwd:     files sss
    shadow:     files sss
    group:      files sss
    hosts:      files dns myhostname
    bootparams: nisplus [NOTFOUND=return] files
    ethers:     files
    netmasks:   files
    networks:   files
    protocols:  files
    rpc:        files
    services:   files sss
    netgroup:   files sss
    publickey:  nisplus
    automount:  files ldap
    aliases:    files nisplus
    sudoers:    files sss

4. Run `top` on the IPA server.
5. ns-slapd shows 200% CPU usage.

Actual results:

    PID   USER   PR NI VIRT    RES    SHR   S %CPU   %MEM TIME+     COMMAND
    32301 dirsrv 20 0  1756036 565552 23344 S 200.1% 18.6 178:56.88 ns-slapd

Expected results: the ns-slapd process shouldn't consume 200% of the CPU time.

Additional info: