Bug 1094277
Summary: IPA Server Slow Performance, high CPU usage of ns-slapd

| Field | Value | Field | Value |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Juan Cunanan <juan_cunanan> |
| Component: | 389-ds-base | Assignee: | Noriko Hosoi <nhosoi> |
| Status: | CLOSED ERRATA | QA Contact: | Sankar Ramalingam <sramling> |
| Severity: | high | Priority: | unspecified |
| Version: | 6.0 | CC: | abokovoy, jgalipea, juan_cunanan, lkrispen, nhosoi, nkinder, nsoman, rcritten, rmeggins |
| Target Milestone: | pre-dev-freeze | Target Release: | --- |
| Hardware: | x86_64 | OS: | Linux |
| Fixed In Version: | 389-ds-base-1.2.11.15-34.el6 | Doc Type: | Bug Fix |
| Last Closed: | 2014-10-14 07:55:00 UTC | Type: | Bug |
Description
Juan Cunanan
2014-05-05 11:55:27 UTC
Was the hardware also changed with the upgrade? Could you provide the output of `top -H -p <slapd-pid>` and a sequence of pstacks for the time of high CPU usage? Could you please also have the customer run:

```
logconv.pl -V /var/log/dirsrv/slapd-DOMAIN-TLD/access.20* /var/log/dirsrv/slapd-DOMAIN-TLD/access > logconv.rpt 2>&1
```

and attach logconv.rpt to the bug/ticket.

Yes, the hardware was changed and upgraded when the 1.1 to 3.0 upgrade was done. The new machine is more powerful: the old machine was a 4-core/16 GB RAM server, the new machine is an 8-core/48 GB RAM server. top output:

```
top - 08:16:22 up 54 days, 23:20, 3 users, load average: 2.79, 2.10, 1.66
Tasks: 58 total, 4 running, 54 sleeping, 0 stopped, 0 zombie
Cpu(s): 68.0%us, 1.5%sy, 0.0%ni, 30.3%id, 0.0%wa, 0.0%hi, 0.2%si, 0.0%st
Mem: 49417172k total, 11090812k used, 38326360k free, 158300k buffers
Swap: 8388600k total, 0k used, 8388600k free, 6571640k cached

  PID USER   PR NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
19593 dirsrv 20  0 4903m 429m  18m R 99.5  0.9   2837:06 ns-slapd
19661 dirsrv 20  0 4903m 429m  18m S 32.6  0.9 105:54.68 ns-slapd
19655 dirsrv 20  0 4903m 429m  18m S 27.0  0.9 106:09.62 ns-slapd
19667 dirsrv 20  0 4903m 429m  18m S 27.0  0.9 106:12.66 ns-slapd
19668 dirsrv 20  0 4903m 429m  18m S 27.0  0.9 104:55.10 ns-slapd
19658 dirsrv 20  0 4903m 429m  18m S 21.0  0.9 106:02.75 ns-slapd
19657 dirsrv 20  0 4903m 429m  18m S 16.6  0.9 105:20.17 ns-slapd
19645 dirsrv 20  0 4903m 429m  18m S 16.3  0.9 104:52.36 ns-slapd
19670 dirsrv 20  0 4903m 429m  18m S 16.3  0.9 105:16.58 ns-slapd
19671 dirsrv 20  0 4903m 429m  18m S 16.3  0.9 106:13.81 ns-slapd
19650 dirsrv 20  0 4903m 429m  18m S 16.0  0.9 105:15.90 ns-slapd
19663 dirsrv 20  0 4903m 429m  18m R 15.6  0.9 105:48.50 ns-slapd
19664 dirsrv 20  0 4903m 429m  18m R 13.3  0.9 106:08.43 ns-slapd
19672 dirsrv 20  0 4903m 429m  18m S 13.3  0.9 105:01.23 ns-slapd
19666 dirsrv 20  0 4903m 429m  18m S 11.3  0.9 105:55.95 ns-slapd
19651 dirsrv 20  0 4903m 429m  18m S 11.0  0.9 106:13.80 ns-slapd
19652 dirsrv 20  0 4903m 429m  18m S 11.0  0.9 105:55.29 ns-slapd
19660 dirsrv 20  0 4903m 429m  18m S 11.0  0.9 106:05.64 ns-slapd
19662 dirsrv 20  0 4903m 429m  18m S 11.0  0.9 105:01.57 ns-slapd
19646 dirsrv 20  0 4903m 429m  18m S 10.6  0.9 104:53.32 ns-slapd
19648 dirsrv 20  0 4903m 429m  18m S 10.6  0.9 106:17.17 ns-slapd
19673 dirsrv 20  0 4903m 429m  18m S 10.6  0.9 105:57.96 ns-slapd
19654 dirsrv 20  0 4903m 429m  18m R  9.3  0.9 105:35.55 ns-slapd
19598 dirsrv 20  0 4903m 429m  18m S  0.3  0.9   0:20.58 ns-slapd
19644 dirsrv 20  0 4903m 429m  18m S  0.3  0.9 106:01.19 ns-slapd
19659 dirsrv 20  0 4903m 429m  18m S  0.3  0.9 105:53.09 ns-slapd
19665 dirsrv 20  0 4903m 429m  18m S  0.3  0.9 104:33.70 ns-slapd
19669 dirsrv 20  0 4903m 429m  18m S  0.3  0.9 106:18.68 ns-slapd
19596 dirsrv 20  0 4903m 429m  18m S  0.0  0.9   0:32.26 ns-slapd
19597 dirsrv 20  0 4903m 429m  18m S  0.0  0.9   0:19.50 ns-slapd
19599 dirsrv 20  0 4903m 429m  18m S  0.0  0.9   0:03.76 ns-slapd
19620 dirsrv 20  0 4903m 429m  18m S  0.0  0.9   0:00.00 ns-slapd
19621 dirsrv 20  0 4903m 429m  18m S  0.0  0.9   0:00.00 ns-slapd
19622 dirsrv 20  0 4903m 429m  18m S  0.0  0.9   0:13.30 ns-slapd
19623 dirsrv 20  0 4903m 429m  18m S  0.0  0.9   0:11.33 ns-slapd
19624 dirsrv 20  0 4903m 429m  18m S  0.0  0.9   0:12.87 ns-slapd
19625 dirsrv 20  0 4903m 429m  18m S  0.0  0.9   0:12.54 ns-slapd
```

Is the new hardware a NUMA machine? We have seen problems on NUMA architectures, but we need to see pstacks to confirm this. From the top output it looks like thread 19593 is consuming most of the CPU; judging from the thread numbers this could be the main thread, which does the connection handling. It would be good to get a sequence of pstacks for the thread with the highest CPU consumption and for some of the other threads as well.

Created attachment 893633 [details]
.toprc - copy to the user's $HOME directory before running top

Copy this file to $HOME; if you are running top as root, copy it to /root. Then run top like this:

```
top -b -p $(cat /var/run/dirsrv/slapd-INSTANCE.pid) -d 1 -n 300 > top.out
```

where INSTANCE is the name of the dirsrv instance (usually the hostname or the domain name). Then attach top.out.
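The same per-thread view that `top -H` gives can also be captured one-shot with ps(1). A minimal sketch (using the current shell's PID, `$$`, as a stand-in for the ns-slapd PID from the pid file):

```shell
# One-shot per-thread listing, similar in spirit to `top -H -p <pid>`.
# $$ (this shell's PID) stands in for the ns-slapd PID.
# LWP is the kernel thread ID, the same number top -H shows per thread,
# which is how a hot thread like 19593 can be matched to a pstack.
ps -L -o lwp,pcpu,comm -p $$
```

Note that ps reports %CPU averaged over the thread's lifetime rather than the instantaneous value top samples, so batch-mode top remains the better tool for catching spikes.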
Ludwig -- no, this is not a NUMA box:

```
# dmesg | grep -i numa
No NUMA configuration found
```

Attaching the top and pstack outputs, as requested. Our clients mostly run nscd/nslcd, not yet sssd; we are introducing sssd on newer clients only. I'm not sure how IPA v3 reacts to a huge number of clients not yet using sssd. I also ran a tcpdump of clients connecting, and I see a good number of LDAP clients hitting :389 that are FreeBSD boxes not yet using nscd. I'd like to ask for advice on whether this is a scaling issue; in IPA v1 this wasn't a problem.

Created attachment 893640 [details]
pstack
Created attachment 893641 [details]
top
Created attachment 893646 [details]
partial output of logconv.rpt.
This is partial output of logconv.rpt; we had to stop the run because our monitors lit up red for the IPA box in the middle of the logconv run.
sorry, haven't asked this before. You upgraded IPA from 1.1 to 3.0 and you upgraded the hardware; what about the OS versions? And what version of 389 was used with IPA 1.1, and which version is used now?

(In reply to Juan Cunanan from comment #11)
> Created attachment 893641 [details]
> top

Sorry, I forgot the -H; try it again like this:

```
top -H -b -p $(cat /var/run/dirsrv/slapd-INSTANCE.pid) -d 1 -n 300 > top.out
```

(In reply to Juan Cunanan from comment #12)
> Created attachment 893646 [details]
> partial output of logconv.rpt.
>
> this is partial output of logconv.rpt as we had to stop the run due to our
> monitors lit red for the IPA box in the middle of the logconv run

The NOT filters are not indexed, nor can they be made indexed:

```
Search Filter: (&(objectclass=posixgroup)(cn=*)(&(gidnumber=*)(!(gidnumber=0)))(entryusn>=2502534)(!(entryusn=2502534)))
```

Can the (!(attr=value)) parts be eliminated?

(In reply to Juan Cunanan from comment #10)
> Created attachment 893640 [details]
> pstack

I find that using gdb gives a lot more information. Please review the following information about setting up the system to collect useful stack traces using gdb: http://port389.org/wiki/FAQ#Debugging_Hangs

Created attachment 894885 [details]
top v2
Updated top output
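As an aside on the unindexed NOT filter quoted earlier: for an integer-valued attribute such as entryUSN, the pair `(&(entryusn>=N)(!(entryusn=N)))` selects exactly the same entries as the index-friendly `(entryusn>=N+1)`, which eliminates the `(!(attr=value))` part. A sketch of the equivalence, with shell arithmetic standing in for the two filters:

```shell
# For integer usn values around N, the unindexable filter
#   (&(entryusn>=N)(!(entryusn=N)))
# matches exactly the same values as the indexable
#   (entryusn>=N+1)
N=2502534
for usn in $((N-1)) $N $((N+1)) $((N+2)); do
  a=no; if [ "$usn" -ge "$N" ] && [ "$usn" -ne "$N" ]; then a=match; fi
  b=no; if [ "$usn" -ge $((N+1)) ]; then b=match; fi
  echo "usn=$usn not-filter=$a ge-filter=$b"
done
```

This only shows the filters are logically equivalent; whether the client generating that search can be changed to emit the simpler form is a separate question.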
(In reply to Rich Megginson from comment #14)
> (In reply to Juan Cunanan from comment #11)
> > Created attachment 893641 [details]
> > top
>
> Sorry, forgot the -H, try it again like this:
>
> top -H -b -p $(cat /var/run/dirsrv/slapd-INSTANCE.pid) -d 1 -n 300 > top.out

Rich, I've attached the top output in this thread.

(In reply to Ludwig from comment #13)
> sorry, haven't asked this before.
>
> you upgraded IPA from 1.1 to 3.0, and you upgraded the hardware,
> what about OS versions. And what version of 389 was used with ipa1.1 and
> which version is used now ?

Ludwig, we're using this 389:

```
389-ds-base-1.2.11.15-30.el6_5.x86_64
389-ds-base-libs-1.2.11.15-30.el6_5.x86_64
```

On the IPA v1 box, we were using: redhat-ds-8.1.0-1.el5dsrv.x86_64

Created attachment 894897 [details]
gdb_stacktrace1
Created attachment 894898 [details]
gdb_stacktrace2
Looks like the reslimit code, for the anonymous resource limits. There is a tremendous amount of time spent in the main daemon polling thread. Unfortunately, the reslimit code is doing an internal search inside the polling thread. This is fixed upstream, targeted for RHEL 7.1: https://fedorahosted.org/389/ticket/47426

In the meantime, try turning off limits for anonymous users to see if that helps: https://access.redhat.com/site/documentation/en-US/Red_Hat_Directory_Server/9.0/html/Administration_Guide/Finding_Directory_Entries.html#Setting_Resource_Limits_for-anonymous-binds

I've read the guide on removing this; can you please guide us on the proper values that we need to place in this configuration?

```
ldapmodify -a -D "cn=directory manager" -w secret -p 389 -h server.example.com -x
dn: cn=anon template,ou=people,dc=example,dc=com
changetype: add
objectclass: person
objectclass: top
cn: anon template
sn: template
nsSizeLimit: 250
nsLookThroughLimit: 1000
nsTimeLimit: 60
```

you need to turn this off in cn=config

(In reply to Rich Megginson from comment #24)
> you need to turn this off in cn=config

How do we turn off the limits for the anonymous query? If you can give the exact configs needed, that would be appreciated, as the box is in production. Thanks.

I'm not sure what Rich exactly meant; I thought you had to remove the anon limits entry. Could you post your current dse.ldif? In my tests I could trigger the calls to reslimit_update_from_dn() by setting/removing the "nsslapd-anonlimitsdn" attribute in cn=config. Unfortunately the removal was effective only after a restart, so you could stop the server, remove the nsslapd-anonlimitsdn line in the dse.ldif, and restart.

Created attachment 895389 [details]
dse.ldif
Attaching the current dse.ldif for the Master box. This doesn't have the nsslapd-anonlimitsdn line removed yet.
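For reference, the change Ludwig describes amounts to deleting a single attribute from cn=config. A hedged sketch of the equivalent ldapmodify input (not taken from the thread; per Ludwig's comment the removal only takes effect after a restart, so stopping the server and editing dse.ldif directly is the safer route):

```ldif
dn: cn=config
changetype: modify
delete: nsslapd-anonlimitsdn
```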
Created attachment 895410 [details]
top v3 - after remove
Okay, removed the nsslapd-anonlimitsdn line.
First observations:
- query performance from clients side has improved across the board.
======================================================================
Client: CentOS 6.2, no sssd, only using LDAP with nscd/nslcd:
before:
[root ~]# time id a-juanc
uid=10474(a-juanc) gid=10474(a-juanc) groups=1101,1114,1151,1152,1172,1217,1218,1224,10474
real 0m23.818s
user 0m0.000s
sys 0m0.004s
[root ~]# time id a-juanc
uid=10474(a-juanc) gid=10474(a-juanc) groups=1101,1114(dcs-rd),1151(ap-rd),1152(ap-rd-rw),1172(dcs-soc-l2),1217(dcs-shared),1218(dcs-splunk),1224(dcs-zadmin),10474(a-juanc)
real 0m13.759s
user 0m0.001s
sys 0m0.003s
after:
[root ~]# time id a-juanc
uid=10474(a-juanc) gid=10474(a-juanc) groups=1101(dcs-soc),1114(dcs-rd),1151,1152(ap-rd-rw),1172(dcs-soc-l2),1217(dcs-shared),1218(dcs-splunk),1224(dcs-zadmin),10474(a-juanc)
real 0m0.002s
user 0m0.000s
sys 0m0.002s
[root ~]# time id ztrend
uid=10837(ztrend) gid=10837(ztrend) groups=25000(admins),25002(editors),1101(dcs-soc),24990(service_account),10837(ztrend)
real 0m0.020s
user 0m0.001s
sys 0m0.001s
[root ~]# time id a-nick_gregory
id: a-nick_gregory: No such user
real 0m0.004s
user 0m0.000s
sys 0m0.001s
======================================================================
- Overall user experience, with and without the aid of sssd/nscd/nslcd caching, is better
- server-side ns-slapd CPU usage is lower (top attached), although spikes of >100%, sometimes reaching 200-700%, are still visible
  * we're graphing the CPU usage of the ns-slapd process to check this
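The per-process graphing mentioned in the last point can be sketched with a simple ps(1) poll; the interval and sample count here are placeholders, and `$$` (this shell) stands in for the ns-slapd PID:

```shell
# Sample a process's %CPU at a fixed interval, one line per sample,
# suitable for feeding a time-series graph.
# $$ stands in for the ns-slapd PID.
pid=$$
for i in 1 2 3; do
  cpu=$(ps -o pcpu= -p "$pid")
  echo "sample $i pid=$pid cpu=$(echo $cpu)"
  sleep 1
done
```

One caveat: ps reports %CPU as cumulative CPU time divided by process lifetime, not the instantaneous value top shows, so for catching short 200-700% spikes the batch-mode `top -b -d 1` output requested earlier is the closer match.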
Ludwig or Juan, what would you like to do with this bug? Should I move it to 389-ds-base and link to ticket 47426?

Yes, the issue is fixed in 47426, so you can reference it.

Ok, thanks. Moving the bug to 389-ds-base and referencing upstream ticket https://fedorahosted.org/389/ticket/47426

I'm good with this, our setup is stable again. The master PID thread is no longer consuming the CPU, and resolution with or without caching aid is satisfactory.

Recent reliab15 execution against build 1.2.11.15-45 is running fine, with no issues reported from the execution. Hence, marking the bug as Verified, Sanity only. Beaker job - https://beaker.engineering.redhat.com/jobs/749819

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-1385.html