Bug 1094277
Summary: IPA Server Slow Performance, high CPU usage of ns-slapd

| Field | Value | Field | Value |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Juan Cunanan <juan_cunanan> |
| Component: | 389-ds-base | Assignee: | Noriko Hosoi <nhosoi> |
| Status: | CLOSED ERRATA | QA Contact: | Sankar Ramalingam <sramling> |
| Severity: | high | Priority: | unspecified |
| Version: | 6.0 | CC: | abokovoy, jgalipea, juan_cunanan, lkrispen, nhosoi, nkinder, nsoman, rcritten, rmeggins |
| Target Milestone: | pre-dev-freeze | Target Release: | --- |
| Hardware: | x86_64 | OS: | Linux |
| Fixed In Version: | 389-ds-base-1.2.11.15-34.el6 | Doc Type: | Bug Fix |
| Last Closed: | 2014-10-14 07:55:00 UTC | Type: | Bug |
Description
Juan Cunanan
2014-05-05 11:55:27 UTC
Was the hardware also changed with the upgrade? Could you provide the output of `top -H -p <slapd-pid>` and a sequence of pstacks for the time of high CPU usage? Could you please also have the customer run:

```
logconv.pl -V /var/log/dirsrv/slapd-DOMAIN-TLD/access.20* /var/log/dirsrv/slapd-DOMAIN-TLD/access > logconv.rpt 2>&1
```

and attach logconv.rpt to the bug/ticket.

Yes, the hardware was changed and upgraded when the 1.1 to 3.0 upgrade was done. The new machine is more powerful: the old machine was a 4-core/16 GB RAM server, the new machine is an 8-core/48 GB RAM server. top output:

```
top - 08:16:22 up 54 days, 23:20, 3 users, load average: 2.79, 2.10, 1.66
Tasks: 58 total, 4 running, 54 sleeping, 0 stopped, 0 zombie
Cpu(s): 68.0%us, 1.5%sy, 0.0%ni, 30.3%id, 0.0%wa, 0.0%hi, 0.2%si, 0.0%st
Mem: 49417172k total, 11090812k used, 38326360k free, 158300k buffers
Swap: 8388600k total, 0k used, 8388600k free, 6571640k cached

  PID USER   PR NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
19593 dirsrv 20  0 4903m 429m  18m R 99.5  0.9   2837:06 ns-slapd
19661 dirsrv 20  0 4903m 429m  18m S 32.6  0.9 105:54.68 ns-slapd
19655 dirsrv 20  0 4903m 429m  18m S 27.0  0.9 106:09.62 ns-slapd
19667 dirsrv 20  0 4903m 429m  18m S 27.0  0.9 106:12.66 ns-slapd
19668 dirsrv 20  0 4903m 429m  18m S 27.0  0.9 104:55.10 ns-slapd
19658 dirsrv 20  0 4903m 429m  18m S 21.0  0.9 106:02.75 ns-slapd
19657 dirsrv 20  0 4903m 429m  18m S 16.6  0.9 105:20.17 ns-slapd
19645 dirsrv 20  0 4903m 429m  18m S 16.3  0.9 104:52.36 ns-slapd
19670 dirsrv 20  0 4903m 429m  18m S 16.3  0.9 105:16.58 ns-slapd
19671 dirsrv 20  0 4903m 429m  18m S 16.3  0.9 106:13.81 ns-slapd
19650 dirsrv 20  0 4903m 429m  18m S 16.0  0.9 105:15.90 ns-slapd
19663 dirsrv 20  0 4903m 429m  18m R 15.6  0.9 105:48.50 ns-slapd
19664 dirsrv 20  0 4903m 429m  18m R 13.3  0.9 106:08.43 ns-slapd
19672 dirsrv 20  0 4903m 429m  18m S 13.3  0.9 105:01.23 ns-slapd
19666 dirsrv 20  0 4903m 429m  18m S 11.3  0.9 105:55.95 ns-slapd
19651 dirsrv 20  0 4903m 429m  18m S 11.0  0.9 106:13.80 ns-slapd
19652 dirsrv 20  0 4903m 429m  18m S 11.0  0.9 105:55.29 ns-slapd
19660 dirsrv 20  0 4903m 429m  18m S 11.0  0.9 106:05.64 ns-slapd
19662 dirsrv 20  0 4903m 429m  18m S 11.0  0.9 105:01.57 ns-slapd
19646 dirsrv 20  0 4903m 429m  18m S 10.6  0.9 104:53.32 ns-slapd
19648 dirsrv 20  0 4903m 429m  18m S 10.6  0.9 106:17.17 ns-slapd
19673 dirsrv 20  0 4903m 429m  18m S 10.6  0.9 105:57.96 ns-slapd
19654 dirsrv 20  0 4903m 429m  18m R  9.3  0.9 105:35.55 ns-slapd
19598 dirsrv 20  0 4903m 429m  18m S  0.3  0.9   0:20.58 ns-slapd
19644 dirsrv 20  0 4903m 429m  18m S  0.3  0.9 106:01.19 ns-slapd
19659 dirsrv 20  0 4903m 429m  18m S  0.3  0.9 105:53.09 ns-slapd
19665 dirsrv 20  0 4903m 429m  18m S  0.3  0.9 104:33.70 ns-slapd
19669 dirsrv 20  0 4903m 429m  18m S  0.3  0.9 106:18.68 ns-slapd
19596 dirsrv 20  0 4903m 429m  18m S  0.0  0.9   0:32.26 ns-slapd
19597 dirsrv 20  0 4903m 429m  18m S  0.0  0.9   0:19.50 ns-slapd
19599 dirsrv 20  0 4903m 429m  18m S  0.0  0.9   0:03.76 ns-slapd
19620 dirsrv 20  0 4903m 429m  18m S  0.0  0.9   0:00.00 ns-slapd
19621 dirsrv 20  0 4903m 429m  18m S  0.0  0.9   0:00.00 ns-slapd
19622 dirsrv 20  0 4903m 429m  18m S  0.0  0.9   0:13.30 ns-slapd
19623 dirsrv 20  0 4903m 429m  18m S  0.0  0.9   0:11.33 ns-slapd
19624 dirsrv 20  0 4903m 429m  18m S  0.0  0.9   0:12.87 ns-slapd
19625 dirsrv 20  0 4903m 429m  18m S  0.0  0.9   0:12.54 ns-slapd
```

Is the new hardware a NUMA machine? We have seen problems on NUMA architectures, but we need to see pstacks to confirm this. From the top output it looks like thread 19593 is consuming most of the CPU; judging from the thread numbers this could be the main thread, which does the connection handling. It would be good to get a sequence of pstacks for the thread with the highest CPU consumption and for some of the other threads as well.

Created attachment 893633 [details]
.toprc - copy to the user's $HOME directory before running top

Copy this file to $HOME; if you are running top as root, copy it to /root. Then run top like this:

```
top -b -p $(cat /var/run/dirsrv/slapd-INSTANCE.pid) -d 1 -n 300 > top.out
```

where INSTANCE is the name of the dirsrv instance (usually the hostname or the domain name). Then attach top.out.
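The same per-thread view that `top -H` gives can also be captured one-shot with ps(1). A minimal sketch (using the current shell's PID, `$$`, as a stand-in for the ns-slapd PID from the pid file):

```shell
# One-shot per-thread listing, similar in spirit to `top -H -p <pid>`.
# $$ (this shell's PID) stands in for the ns-slapd PID.
# LWP is the kernel thread ID, the same number top -H shows per thread,
# which is how a hot thread like 19593 can be matched to a pstack.
ps -L -o lwp,pcpu,comm -p $$
```

Note that ps reports %CPU averaged over the thread's lifetime rather than the instantaneous value top samples, so batch-mode top remains the better tool for catching spikes.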
Ludwig -- no, this is not a NUMA box:

```
# dmesg | grep -i numa
No NUMA configuration found
```

Attaching the top and pstack outputs, as requested. Our clients mostly run nscd/nslcd, not yet sssd; we are introducing sssd on newer clients only. I'm not sure how IPA v3 reacts to a huge number of clients not yet using sssd. I also ran a tcpdump of clients connecting, and I see a good number of LDAP clients hitting :389 that are FreeBSD boxes not yet using nscd. I'd like to ask for advice on whether this is a scaling issue; in IPA v1 this wasn't a problem.

Created attachment 893640 [details]
pstack
Created attachment 893641 [details]
top
Created attachment 893646 [details]
partial output of logconv.rpt.
This is partial output of logconv.rpt; we had to stop the run because our monitors lit up red for the IPA box in the middle of the logconv run.
sorry, haven't asked this before. You upgraded IPA from 1.1 to 3.0 and you upgraded the hardware; what about the OS versions? And what version of 389 was used with IPA 1.1, and which version is used now?

(In reply to Juan Cunanan from comment #11)
> Created attachment 893641 [details]
> top

Sorry, I forgot the -H; try it again like this:

```
top -H -b -p $(cat /var/run/dirsrv/slapd-INSTANCE.pid) -d 1 -n 300 > top.out
```

(In reply to Juan Cunanan from comment #12)
> Created attachment 893646 [details]
> partial output of logconv.rpt.
>
> this is partial output of logconv.rpt as we had to stop the run due to our
> monitors lit red for the IPA box in the middle of the logconv run

The NOT filters are not indexed, nor can they be made indexed:

```
Search Filter: (&(objectclass=posixgroup)(cn=*)(&(gidnumber=*)(!(gidnumber=0)))(entryusn>=2502534)(!(entryusn=2502534)))
```

Can the (!(attr=value)) parts be eliminated?

(In reply to Juan Cunanan from comment #10)
> Created attachment 893640 [details]
> pstack

I find that using gdb gives a lot more information. Please review the following information about setting up the system to collect useful stack traces using gdb: http://port389.org/wiki/FAQ#Debugging_Hangs

Created attachment 894885 [details]
top v2
Updated top output
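As an aside on the unindexed NOT filter quoted earlier: for an integer-valued attribute such as entryUSN, the pair `(&(entryusn>=N)(!(entryusn=N)))` selects exactly the same entries as the index-friendly `(entryusn>=N+1)`, which eliminates the `(!(attr=value))` part. A sketch of the equivalence, with shell arithmetic standing in for the two filters:

```shell
# For integer usn values around N, the unindexable filter
#   (&(entryusn>=N)(!(entryusn=N)))
# matches exactly the same values as the indexable
#   (entryusn>=N+1)
N=2502534
for usn in $((N-1)) $N $((N+1)) $((N+2)); do
  a=no; if [ "$usn" -ge "$N" ] && [ "$usn" -ne "$N" ]; then a=match; fi
  b=no; if [ "$usn" -ge $((N+1)) ]; then b=match; fi
  echo "usn=$usn not-filter=$a ge-filter=$b"
done
```

This only shows the filters are logically equivalent; whether the client generating that search can be changed to emit the simpler form is a separate question.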
(In reply to Rich Megginson from comment #14)
> (In reply to Juan Cunanan from comment #11)
> > Created attachment 893641 [details]
> > top
>
> Sorry, forgot the -H, try it again like this:
>
> top -H -b -p $(cat /var/run/dirsrv/slapd-INSTANCE.pid) -d 1 -n 300 > top.out

Rich, I've attached the top output in this thread.

(In reply to Ludwig from comment #13)
> sorry, haven't asked this before.
>
> you upgraded IPA from 1.1 to 3.0, and you upgraded the hardware,
> what about OS versions. And what version of 389 was used with ipa1.1 and
> which version is used now ?

Ludwig, we're using this 389:

```
389-ds-base-1.2.11.15-30.el6_5.x86_64
389-ds-base-libs-1.2.11.15-30.el6_5.x86_64
```

On the IPA v1 box, we were using: redhat-ds-8.1.0-1.el5dsrv.x86_64

Created attachment 894897 [details]
gdb_stacktrace1
Created attachment 894898 [details]
gdb_stacktrace2
Looks like the reslimit code, for the anonymous resource limits. There is a tremendous amount of time spent in the main daemon polling thread. Unfortunately, the reslimit code is doing an internal search inside the polling thread. This is fixed upstream, targeted for RHEL 7.1: https://fedorahosted.org/389/ticket/47426

In the meantime, try turning off limits for anonymous users to see if that helps: https://access.redhat.com/site/documentation/en-US/Red_Hat_Directory_Server/9.0/html/Administration_Guide/Finding_Directory_Entries.html#Setting_Resource_Limits_for-anonymous-binds

I've read the guide on removing this; can you please guide us on the proper values that we need to place in this configuration?

```
ldapmodify -a -D "cn=directory manager" -w secret -p 389 -h server.example.com -x
dn: cn=anon template,ou=people,dc=example,dc=com
changetype: add
objectclass: person
objectclass: top
cn: anon template
sn: template
nsSizeLimit: 250
nsLookThroughLimit: 1000
nsTimeLimit: 60
```

you need to turn this off in cn=config

(In reply to Rich Megginson from comment #24)
> you need to turn this off in cn=config

How do we turn off the limits for the anonymous query? If you can give the exact configs needed, that would be appreciated, as the box is in production. Thanks.

I'm not sure what Rich exactly meant; I thought you had to remove the anon limits entry. Could you post your current dse.ldif? In my tests I could trigger the calls to reslimit_update_from_dn() by setting/removing the "nsslapd-anonlimitsdn" attribute in cn=config. Unfortunately the removal was effective only after a restart, so you could stop the server, remove the nsslapd-anonlimitsdn line in the dse.ldif, and restart.

Created attachment 895389 [details]
dse.ldif
Attaching the current dse.ldif for the Master box. This doesn't have the nsslapd-anonlimitsdn line removed yet.
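For reference, the change Ludwig describes amounts to deleting a single attribute from cn=config. A hedged sketch of the equivalent ldapmodify input (not taken from the thread; per Ludwig's comment the removal only takes effect after a restart, so stopping the server and editing dse.ldif directly is the safer route):

```ldif
dn: cn=config
changetype: modify
delete: nsslapd-anonlimitsdn
```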
Created attachment 895410 [details]
top v3 - after remove
Okay, removed the nsslapd-anonlimitsdn line.
First observations:
- query performance from clients side has improved across the board.
======================================================================
Client: CentOS 6.2, no sssd, only using LDAP with nscd/nslcd:
before:
[root ~]# time id a-juanc
uid=10474(a-juanc) gid=10474(a-juanc) groups=1101,1114,1151,1152,1172,1217,1218,1224,10474
real 0m23.818s
user 0m0.000s
sys 0m0.004s
[root ~]# time id a-juanc
uid=10474(a-juanc) gid=10474(a-juanc) groups=1101,1114(dcs-rd),1151(ap-rd),1152(ap-rd-rw),1172(dcs-soc-l2),1217(dcs-shared),1218(dcs-splunk),1224(dcs-zadmin),10474(a-juanc)
real 0m13.759s
user 0m0.001s
sys 0m0.003s
after:
[root ~]# time id a-juanc
uid=10474(a-juanc) gid=10474(a-juanc) groups=1101(dcs-soc),1114(dcs-rd),1151,1152(ap-rd-rw),1172(dcs-soc-l2),1217(dcs-shared),1218(dcs-splunk),1224(dcs-zadmin),10474(a-juanc)
real 0m0.002s
user 0m0.000s
sys 0m0.002s
[root ~]# time id ztrend
uid=10837(ztrend) gid=10837(ztrend) groups=25000(admins),25002(editors),1101(dcs-soc),24990(service_account),10837(ztrend)
real 0m0.020s
user 0m0.001s
sys 0m0.001s
[root ~]# time id a-nick_gregory
id: a-nick_gregory: No such user
real 0m0.004s
user 0m0.000s
sys 0m0.001s
======================================================================
- Overall user experience, with and without the aid of sssd/nscd/nslcd caching, is better
- server-side ns-slapd CPU usage is lower (top attached), although spikes of >100%, sometimes reaching 200-700%, are still visible
  * we're graphing the CPU usage of the ns-slapd process to check this
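The per-process graphing mentioned in the last point can be sketched with a simple ps(1) poll; the interval and sample count here are placeholders, and `$$` (this shell) stands in for the ns-slapd PID:

```shell
# Sample a process's %CPU at a fixed interval, one line per sample,
# suitable for feeding a time-series graph.
# $$ stands in for the ns-slapd PID.
pid=$$
for i in 1 2 3; do
  cpu=$(ps -o pcpu= -p "$pid")
  echo "sample $i pid=$pid cpu=$(echo $cpu)"
  sleep 1
done
```

One caveat: ps reports %CPU as cumulative CPU time divided by process lifetime, not the instantaneous value top shows, so for catching short 200-700% spikes the batch-mode `top -b -d 1` output requested earlier is the closer match.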
Ludwig or Juan, what would you like to do with this bug? Should I move it to 389-ds-base and link to ticket 47426?

Yes, the issue is fixed in 47426, so you can reference it.

Ok, thanks. Moving the bug to 389-ds-base and referencing upstream ticket https://fedorahosted.org/389/ticket/47426

I'm good with this, our setup is stable again. The master PID thread is no longer consuming the CPU, and resolution with or without caching aid is satisfactory.

Recent reliab15 execution against build 1.2.11.15-45 is running fine, with no issues reported from the execution. Hence, marking the bug as Verified, Sanity only. Beaker job - https://beaker.engineering.redhat.com/jobs/749819

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-1385.html