Bug 1350877
| Summary: | ns-slapd (FreeIPA) process high CPU load up to 200% | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Valentin Bajrami <valentin.bajrami> |
| Component: | 389-ds-base | Assignee: | Noriko Hosoi <nhosoi> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Viktor Ashirov <vashirov> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.2 | CC: | emre.sokullu, lkrispen, nkinder, rmeggins, tbordaz, valentin.bajrami, wibrown |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-08-24 17:27:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Please provide this output:

    # rpm -qa | egrep "389-ds-base|ipa"

The output of the above command:

    ipa-server-dns-4.2.0-15.sl7_2.6.x86_64
    sssd-ipa-1.13.0-40.el7_2.1.x86_64
    ipa-python-4.2.0-15.sl7_2.6.x86_64
    389-ds-base-debuginfo-1.3.4.0-26.el7_2.x86_64
    ipa-client-4.2.0-15.sl7_2.6.x86_64
    389-ds-base-libs-1.3.4.0-26.el7_2.x86_64
    389-ds-base-1.3.4.0-26.el7_2.x86_64
    ipa-server-4.2.0-15.sl7_2.6.x86_64
    libipa_hbac-1.13.0-40.el7_2.1.x86_64
    python-libipa_hbac-1.13.0-40.el7_2.1.x86_64
    ipa-admintools-4.2.0-15.sl7_2.6.x86_64
    python-iniparse-0.4-9.el7.noarch

From the Directory Server's point of view, it looks like it is just doing its job.
I see 22 threads in the virtual attribute code, reached from the ACL check via the deref plugin. The deref plugin checks access rights with deref_check_access; the check is done against every attribute, including the virtual attributes. All the busy threads are competing for the mutex in vattr_map_lookup.
I'm wondering how IPA configures the Directory Server for pssh.
Thierry and Ludwig, do you have any clue?
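The "22 threads in vattr_map_lookup" observation can be reproduced mechanically from a pstack dump by counting how many threads share the same caller frame. This is a minimal sketch; the dump below is a fabricated, simplified stand-in for real ns-slapd output (the addresses and the `connection_threadmain` frame are illustrative, not taken from this bug):

```shell
# Count threads per frame #1 (the caller of the blocking call) in a pstack
# dump. A large cluster on one function suggests contention around it.
cat > /tmp/pstack.sample <<'EOF'
Thread 3 (Thread 0x7f01):
#0  0x00000001 in pthread_mutex_lock ()
#1  0x00000002 in vattr_map_lookup ()
Thread 2 (Thread 0x7f02):
#0  0x00000001 in pthread_mutex_lock ()
#1  0x00000002 in vattr_map_lookup ()
Thread 1 (Thread 0x7f03):
#0  0x00000003 in poll ()
#1  0x00000004 in connection_threadmain ()
EOF
# On a live server you would feed it the real dump:
#   pstack "$(pgrep -x ns-slapd)" > /tmp/pstack.sample
awk '/^#1 /{print $4}' /tmp/pstack.sample | sort | uniq -c | sort -rn
```

The busiest frame sorts to the top, which is how the 22-thread cluster stands out in a 30-thread dump.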
Valentin, did you have a chance to look at the Directory Server's log files (/var/log/dirsrv/slapd-INSTANCE/{access,errors})? They may be useful for further analysis.
Also, the latest 389-ds-base on rhel-7.2 is 1.3.4.0-32, although the bugs fixed since 1.3.4.0-26 do not look related to this issue. Could IPA have a newer version available (e.g., ipa-4.2.0-15.el7_2.17)?
I agree with Noriko, the DS seems just to be doing its work, but there are multiple searches with deref controls, which can result in heavy load. I think this is/was a problem of sssd and it is either fixed or can be configured not to do so many deref searches, but this would have to be asked of the sssd experts.

The vattr_map lock is a RW lock and there is no contention on it (it consumes CPU, but there are no calls to update the map). Several backtraces showing lock acquisition are likely just a side effect of the way pstack stops the process. I agree it looks like DS is doing normal deref processing. There is a way to prevent sssd from using the deref control, but that may flood DS with SRCH requests just to do what deref is doing. AFAIK using deref was a gain for both SSSD and DS. This high CPU consumption is likely the consequence of a high load coming from pssh. I agree with Noriko, access/error logs would really help. Did the test pass before without such high DS consumption?

Created attachment 1174427 [details]
/var/log/dirsrv/slapd-INTRA-DOMAIN-TLD-NL/{access,errors} log files
Noriko,
As requested, attached you'll find the access and errors logs. Thanks for having a look at the issue.
While I'm waiting for a reply, I did a few other tests using pssh (Parallel SSH). Substituting `passwd` with e.g. `dig` doesn't seem to trigger the high CPU usage. Why would `passwd` cause such a high load? What processes are involved when `passwd` is invoked, and what is the relationship to /etc/pam.d/passwd, /etc/pam.d/system-auth, /etc/nsswitch.conf and /etc/sssd/sssd.conf?

While the issue was persisting, I went and asked the question on the IRC channel #freeipa. ab suggested that I change the nsslapd-threadnumber global configuration attribute in the cn=config entry. After changing the nsslapd-threadnumber attribute, the problem seems to be solved: the ns-slapd service no longer consumes 200% of the CPU when the 'pssh' and 'passwd' processes are executed.

To change the nsslapd-threadnumber attribute, I used the following command:

    # ldapmodify -x -W -h localhost -p 389 -D "cn=Directory Manager"
    Enter LDAP Password:
    dn: cn=config
    changetype: modify
    replace: nsslapd-threadnumber
    nsslapd-threadnumber: 70

    modifying entry "cn=config"

To see if the change took place, I used:

    # ldapsearch -x -W -h localhost -p 389 -D "cn=Directory Manager" -b "cn=config" | grep nsslapd-threadnumber
    Enter LDAP Password:
    nsslapd-threadnumber: 70

Thank you for your updates, Valentin! So, it was thread starvation... It looks to me like a good candidate for our knowledge base. As for this bug, may we close it as NOTABUG?

Hi Noriko, it indeed seems to be thread starvation. It would be a good idea to make this attribute dynamic so the thread count is adjusted based on actual need. You can close this report and label it as NOTABUG or whatever you think the proper label should be. Thanks again for having a look into it.
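The interactive ldapmodify session above can also be scripted by writing the modification to an LDIF file first. This is a minimal sketch using the same DN and value from this report; the `/tmp/threadnumber.ldif` path is an arbitrary choice, and the actual ldapmodify call is left commented out since it needs a live server and the Directory Manager password:

```shell
# Generate the LDIF for raising nsslapd-threadnumber, then (on a live
# server) apply it non-interactively with -f.
cat > /tmp/threadnumber.ldif <<'EOF'
dn: cn=config
changetype: modify
replace: nsslapd-threadnumber
nsslapd-threadnumber: 70
EOF
# ldapmodify -x -W -h localhost -p 389 -D "cn=Directory Manager" -f /tmp/threadnumber.ldif
grep '^nsslapd-threadnumber:' /tmp/threadnumber.ldif
```

Keeping the change in a file makes it easy to review and to replay on replicas.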
That is right, there is evidence of starvation:

    [30/Jun/2016:11:14:22 +0200] conn=34617 op=0 SRCH base="" scope=0 filter="(objectClass=*)" attrs="* altServer namingContexts supportedControl supportedExtension supportedFeatures supportedLDAPVersion supportedSASLMechanisms domaincontrollerfunctionality defaultnamingcontext lastusn highestcommittedusn aci"
    [30/Jun/2016:11:14:22 +0200] conn=34617 op=0 RESULT err=0 tag=101 nentries=1 etime=2
    ...
    [30/Jun/2016:11:14:36 +0200] conn=34630 op=0 SRCH base="" scope=0 filter="(objectClass=*)" attrs="* altServer namingContexts supportedControl supportedExtension supportedFeatures supportedLDAPVersion supportedSASLMechanisms domaincontrollerfunctionality defaultnamingcontext lastusn highestcommittedusn aci"
    [30/Jun/2016:11:14:36 +0200] conn=34630 op=0 RESULT err=0 tag=101 nentries=1 etime=2

Those requests spent 2 seconds in the work queue waiting for workers.

Now I have some difficulty understanding how to correlate starvation with 200% CPU, and why increasing the number of threads reduces the CPU consumption. If CPU is at 200% during starvation, then reducing starvation allows more threads to run and CPU should be higher. A possibility is that increasing the number of threads eliminates the backlog of requests; it is this backlog that creates the 200% when starvation ends.

Thierry, you have a good point there. The high CPU load consumed by the ns-slapd service appears *after* the command execution (pssh and passwd) has ended. This indeed confirms (what you just said) that the backlog is causing the problem.

Thanks Valentin for the confirmation. So this high-CPU bug, triggered by starvation, is now fixed by tuning. It also reveals a performance issue in ACI evaluation. By any chance, do you still have how you configured IPA (hosts, groups, users...)?

Created attachment 1177605 [details]
Full ldap configuration
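The starvation signature above (trivial rootDSE searches with etime=2) can be pulled out of a 389-ds access log with a one-liner over the RESULT lines. A minimal sketch; the sample file below is a shortened stand-in built from the log excerpts quoted in this comment:

```shell
# Histogram the etime values in RESULT lines of an access log.
# Nonzero etimes on cheap searches indicate time spent queued for a worker.
cat > /tmp/access.sample <<'EOF'
[30/Jun/2016:11:14:22 +0200] conn=34617 op=0 RESULT err=0 tag=101 nentries=1 etime=2
[30/Jun/2016:11:14:30 +0200] conn=34620 op=1 RESULT err=0 tag=101 nentries=1 etime=0
[30/Jun/2016:11:14:36 +0200] conn=34630 op=0 RESULT err=0 tag=101 nentries=1 etime=2
EOF
# On a live server, point this at /var/log/dirsrv/slapd-INSTANCE/access
grep -o 'etime=[0-9]*' /tmp/access.sample | cut -d= -f2 | sort -n | uniq -c
```

Each output row is "count value"; a tail of nonzero values during the pssh run is the thing to look for.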
Created attachment 1177606 [details]
print screen 1 IPA configuration
Ipa configuration print screen 1
Created attachment 1177607 [details]
print screen 2 IPA configuration
Ipa configuration print screen 2
> Hi Noriko, It indeed seems to be a thread starvation. It would be a good
> idea to have this attribute dynamic so the threads are adjusted based on the
> actual needs.
This is actually insanely hard to achieve in practice. It adds contention and complexity to the server for managing the number of threads needed and their startup/shutdown. It's better to just over-allocate threads and let them idle instead.
I think that it would be interesting to see the output from lscpu of your host to see how many threads / cores it has.
We have some tuning advice that says hardware threads * 2 == the thread count you should set, i.e., a 16-thread system would run with 32 DS threads.
Perhaps given modern systems this number could be adjusted? I think it could be worth exploring automatic thread tuning by default.
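The sizing rule above can be sketched as a one-liner, assuming GNU coreutils' `nproc` is available to count logical CPUs (this is an illustration of the advice in this comment, not a value 389-ds computes itself):

```shell
# Apply the "2 x hardware threads" rule of thumb for nsslapd-threadnumber.
cpus=$(nproc)          # logical CPUs visible to this process
threads=$((cpus * 2))  # tuning advice: twice the hardware thread count
echo "suggested nsslapd-threadnumber: $threads"
```

On the 16-thread box from the example this prints 32; the value would then be applied via ldapmodify against cn=config.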
Created attachment 1220809 [details]
ipamaster cpuinfo physical machine
I've attached the output of /proc/cpuinfo from both machines
ipamaster <- is a physical machine
ipamaster-1 <- is a virtual machine
Created attachment 1220810 [details]
ipamaster-1 cpuinfo
How did you solve this problem? Our system has the same problem.
Created attachment 1173490 [details] pstack $(pgrep ns-slapd) > ns-slapd log

Description of problem:

FreeIPA VERSION: 4.2.0, API_VERSION: 2.156 on SL7. When running the `pssh` command to lock/unlock local accounts for the remote virgo-hosts, the ns-slapd process on the IPA server consumes 200% of the CPU time.

How reproducible:

Steps to Reproduce:
1. pssh -O "StrictHostKeyChecking=no" -h /Software/users/management/virgo-hosts -p 1 -l root -i 'passwd -l "$(hostname -s)"' | grep -A 2 '[0-9]'
2. virgo-hosts file content:

    virgo02 virgo03 virgo04 virgo05 virgo06 virgo07 virgo08 virgo09 virgo10 virgo11 virgo12 virgo13 virgo14 virgo15 virgo16 virgo17 virgo18 virgo19 virgo20 virgo21 virgo22 virgo23 virgo24 virgo26 virgo27

3. /etc/nsswitch.conf on the virgo machines:

    passwd:     files sss
    shadow:     files sss
    group:      files sss
    hosts:      files dns myhostname
    bootparams: nisplus [NOTFOUND=return] files
    ethers:     files
    netmasks:   files
    networks:   files
    protocols:  files
    rpc:        files
    services:   files sss
    netgroup:   files sss
    publickey:  nisplus
    automount:  files ldap
    aliases:    files nisplus
    sudoers:    files sss

4. Run `top` on the IPA server.
5. ns-slapd shows 200% CPU usage.

Actual results:

    PID   USER   PR NI VIRT    RES    SHR   S %CPU   %MEM TIME+     COMMAND
    32301 dirsrv 20 0  1756036 565552 23344 S 200.1% 18.6 178:56.88 ns-slapd

Expected results: the ns-slapd process shouldn't consume 200% of the CPU time.

Additional info: