Bug 1292559 - ns-slapd high cpu in ipa context and stack traces reviews [NEEDINFO]
ns-slapd high cpu in ipa context and stack traces reviews
Status: CLOSED DUPLICATE of bug 1244957
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: sssd (Show other bugs)
7.2
Unspecified Unspecified
unspecified Severity urgent
: rc
: ---
Assigned To: SSSD Maintainers
Steeve Goveas
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2015-12-17 14:18 EST by Marc Sauton
Modified: 2016-02-16 10:40 EST (History)
12 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-02-16 10:40:07 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
jhrozek: needinfo? (msauton)


Attachments (Terms of Use)

  None (edit)
Description Marc Sauton 2015-12-17 14:18:07 EST
Description of problem:

high cpu use report after RHEL 7.2 errata installed, 2 ipa masters:
"
Yes they are both showing high utilization.  Right now it seems to be alternating.  I restart the daemon on the server showing high load and the high load flips to the other server after a while.
"

this is a bug report for stack trace reviews.

related to salesforce case number 01550492

the customer's attachments are on the system 
  fubar.gsslab.rdu2.redhat.com
ssh using your uid/pwd credentials as with krb

then, for example:
less /fubar/01550492/140-sf.ipa02.01550492.stacktrace.1450374312.txt


not sure if relevant to this case with the 7.2 update, or SSSD clients update, because this is not really new, but the sudohost LDAP search filters from SSSD look super inefficient to me, could be a cause of high cpu in ns-slapd, as those filters appear in the very top list of a logconv on the access log, this is about the part with the "scary" substring searches like this one:

(|(!(sudohost=*))(sudohost=all)(sudohost=mdhixdevaimt01.mhbenp.lin)(sudohost=mdhixdevaimt01)(sudohost=10.161.84.85)(sudohost=10.161.84.64/26)(sudohost=+*)(|(sudohost=*\5c*)(sudohost=*?*)(sudohost=*\2a*)(sudohost=*[*]*))))

see details below.


the first stack trace provided did not show anything relevant, in 
110-sf.01550492.stacktrace.1450278263.txt
Per Noriko's comment "thread 2 and 5 used by sync plug-in, which is in sync_send_results -- sending the search result to the client."


pstack outpus on ipa01 during high cpu:
150-pstack.ipa01.20151217-134210.out
160-pstack.ipa01.20151217-134644.out


the following attachment was during a high cpu instance:
140-sf.ipa02.01550492.stacktrace.1450374312.txt

8GB ram and 5GB free

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                 
16527 dirsrv    20   0 2951388 345764  84776 S 200.0  4.3 102:25.01 ns-slapd


the following 2 attachments were very likely taken on ipa01 and 02 with NO high cpu use:
130-sf.ipa02.01550492.stacktrace.1450366710.txt
120-sf.ipa01.01550492.stacktrace.1450366724.txt



Version-Release number of selected component (if applicable):

Red Hat Enterprise Linux Server release 7.2 (Maipo)
Linux mdhixnpipa01.mhbenp.lin 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux
389-ds-base-1.3.4.0-21.el7_2.x86_64
ipa-server-4.2.0-15.el7_2.3.x86_64


How reproducible:
N/A

Steps to Reproduce:
1. N/A
2.
3.

Actual results:


Expected results:


Additional info:

about the 7.2 update rpm versions:
attachment
90-sosreport-mdhixnpipa01.mhbenp.lin.01550492-20151211071420.tar.xz
file
var/log/yum.log 
Dec 01 06:53:05 Updated: redhat-release-server-7.2-9.el7.x86_64
Dec 01 06:55:08 Updated: 389-ds-base-1.3.4.0-19.el7.x86_64
Dec 01 06:55:09 Updated: sssd-1.13.0-40.el7.x86_64
Dec 01 06:55:40 Installed: ipa-server-4.2.0-15.el7.x86_64

and
Dec 08 22:29:45 Updated: slapi-nis-0.54-6.el7_2.x86_64



other:

would like to point out the following pattern in the access log from SSSD client searches with paged results, I do not know if this is related to the high cpu, but in case, part of the "investigation":

in attachment
90-sosreport-mdhixnpipa01.mhbenp.lin.01550492-20151211071420.tar.xz
file
var/log/dirsrv/slapd-MHBENP-LIN/access

out of
Paged Searches:               18735
Searches:                     23573         (26.19/sec)  (1571.53/min)


file
var/log/dirsrv/slapd-MHBENP-LIN/access

IPA traffic patterns:

Searches:                     23573         (26.19/sec)  (1571.53/min)
Modifications:                1             (0.00/sec)  (0.07/min)
Extended Operations:          1138
Paged Searches:               18735

Number of Unique Search Bases: 143

18682           ou=sudoers,dc=mhbenp,dc=lin
1989            cn=default trust view,cn=views,cn=accounts,dc=mhbenp,dc=lin
985             dc=mhbenp,dc=lin
534             cn=mhbenp.lin,cn=kerberos,dc=mhbenp,dc=lin
490             root dse
346             cn=global_policy,cn=mhbenp.lin,cn=kerberos,dc=mhbenp,dc=lin
102             cn=accounts,dc=mhbenp,dc=lin


see the top filters:

Number of Unique Search Filters: 412

5795            (&(&(objectclass=sudorole)(entryusn>=6127114)(!(entryusn=6127114)))(|(!(sudohost=*))(sudohost=all)(sudohost=mdhixdevaimt01.mhbenp.lin)(sudohost=mdhixdevaimt01)(sudohost=10.161.84.85)(sudohost=10.161.84.64/26)(sudohost=+*)(|(sudohost=*\5c*)(sudohost=*?*)(sudohost=*\2a*)(sudohost=*[*]*))))
2283            (&(&(objectclass=sudorole)(entryusn>=6127114)(!(entryusn=6127114)))(|(!(sudohost=*))(sudohost=all)(sudohost=mdhixuatitam04.mhbenp.lin)(sudohost=mdhixuatitam04)(sudohost=10.161.81.206)(sudohost=10.161.81.192/26)(sudohost=+*)(|(sudohost=*\5c*)(sudohost=*?*)(sudohost=*\2a*)(sudohost=*[*]*))))
1445            (&(&(objectclass=sudorole)(entryusn>=6127114)(!(entryusn=6127114)))(|(!(sudohost=*))(sudohost=all)(sudohost=mdhixdevsvn01.mhbenp.lin)(sudohost=mdhixdevsvn01)(sudohost=10.161.84.104)(sudohost=10.161.84.64/26)(sudohost=+*)(|(sudohost=*\5c*)(sudohost=*?*)(sudohost=*\2a*)(sudohost=*[*]*))))
1114            (objectclass=*)
621             (&(&(objectclass=sudorole)(entryusn>=6127114)(!(entryusn=6127114)))(|(!(sudohost=*))(sudohost=all)(sudohost=mdhixuatitamrep04.mhbenp.lin)(sudohost=mdhixuatitamrep04)(sudohost=10.161.81.210)(sudohost=10.161.81.192/26)(sudohost=+*)(|(sudohost=*\5c*)(sudohost=*?*)(sudohost=*\2a*)(sudohost=*[*]*))))
534             (objectclass=krbticketpolicyaux)
438             (&(|(objectclass=krbprincipalaux)(objectclass=krbprincipal)(objectclass=ipakrbprincipal))(|(ipakrbprincipalalias=krbtgt/mhbenp.lin@mhbenp.lin)(krbprincipalname=krbtgt/mhbenp.lin@mhbenp.lin)))
369             (&(&(objectclass=sudorole)(entryusn>=6127114)(!(entryusn=6127114)))(|(!(sudohost=*))(sudohost=all)(sudohost=mdhixuatsecdb01.mhbenp.lin)(sudohost=mdhixuatsecdb01)(sudohost=10.161.81.145)(sudohost=10.161.81.128/26)(sudohost=+*)(|(sudohost=*\5c*)(sudohost=*?*)(sudohost=*\2a*)(sudohost=*[*]*))))



the non etime=0 of paged results are those:

[11/Dec/2015:07:01:07 -0500] conn=79232 op=44 RESULT err=0 tag=101 nentries=0 etime=6 notes=P pr_idx=12
[11/Dec/2015:07:01:07 -0500] conn=78997 op=154 RESULT err=0 tag=101 nentries=9 etime=6 notes=P pr_idx=126
[11/Dec/2015:07:01:07 -0500] conn=79021 op=113 RESULT err=0 tag=101 nentries=0 etime=6 notes=P pr_idx=95
[11/Dec/2015:07:01:07 -0500] conn=78946 op=2581 RESULT err=0 tag=101 nentries=0 etime=3 notes=P pr_idx=2489
[11/Dec/2015:07:01:07 -0500] conn=78916 op=142 RESULT err=0 tag=101 nentries=0 etime=1 notes=P pr_idx=126
[11/Dec/2015:07:01:07 -0500] conn=78911 op=262 RESULT err=0 tag=101 nentries=9 etime=3 notes=P pr_idx=240
[11/Dec/2015:07:07:53 -0500] conn=79163 op=217 RESULT err=0 tag=101 nentries=0 etime=1 notes=P pr_idx=199
[11/Dec/2015:07:12:48 -0500] conn=79246 op=333 RESULT err=0 tag=101 nentries=9 etime=1 notes=P pr_idx=239
[11/Dec/2015:07:12:59 -0500] conn=79502 op=1246 RESULT err=0 tag=101 nentries=0 etime=1 notes=P pr_idx=1200


not sure if related to high cpu use or coincidence, but the 3 highest etime for paged results are showing the same SUDO SSSD filters:

Dec/2015:07:01:07 -0500] conn=79232 op=44 SRCH base="ou=SUDOers,dc=mhbenp,dc=lin" scope=2 filter="(&(&(objectClass=sudoRole)(entryusn>=6127114)(!(entryusn=6127114)))(|(!(sudoHost=*))(sudoHost=ALL)(sudoHost=mdhixtrnisim02.mhbenp.lin)(sudoHost=mdhixtrnisim02)(sudoHost=10.161.80.144)(sudoHost=10.161.80.128/26)(sudoHost=+*)(|(sudoHost=*\5C*)(sudoHost=*?*)(sudoHost=*\2A*)(sudoHost=*[*]*))))" attrs="objectClass cn sudoCommand sudoHost sudoUser sudoOption sudoRunAs sudoRunAsUser sudoRunAsGroup sudoNotBefore sudoNotAfter sudoOrder entryusn"
[11/Dec/2015:07:01:07 -0500] conn=79232 op=44 RESULT err=0 tag=101 nentries=0 etime=6 notes=P pr_idx=12

and
[11/Dec/2015:07:01:07 -0500] conn=78997 op=154 SRCH base="ou=SUDOers,dc=mhbenp,dc=lin" scope=2 filter="(&(objectClass=sudoRole)(|(!(sudoHost=*))(sudoHost=ALL)(sudoHost=mdhixuatsmtp01.mhbenp.lin)(sudoHost=mdhixuatsmtp01)(sudoHost=10.161.81.100)(sudoHost=10.161.81.64/26)(sudoHost=+*)(|(sudoHost=*\5C*)(sudoHost=*?*)(sudoHost=*\2A*)(sudoHost=*[*]*))))" attrs="objectClass cn sudoCommand sudoHost sudoUser sudoOption sudoRunAs sudoRunAsUser sudoRunAsGroup sudoNotBefore sudoNotAfter sudoOrder entryusn"
[11/Dec/2015:07:01:07 -0500] conn=78997 op=154 RESULT err=0 tag=101 nentries=9 etime=6 notes=P pr_idx=126

and
[11/Dec/2015:07:01:07 -0500] conn=79021 op=113 SRCH base="ou=SUDOers,dc=mhbenp,dc=lin" scope=2 filter="(&(&(objectClass=sudoRole)(entryusn>=6127114)(!(entryusn=6127114)))(|(!(sudoHost=*))(sudoHost=ALL)(sudoHost=mdhixtrnsecweb02.mhbenp.lin)(sudoHost=mdhixtrnsecweb02)(sudoHost=10.161.80.16)(sudoHost=10.161.80.0/26)(sudoHost=+*)(|(sudoHost=*\5C*)(sudoHost=*?*)(sudoHost=*\2A*)(sudoHost=*[*]*))))" attrs="objectClass cn sudoCommand sudoHost sudoUser sudoOption sudoRunAs sudoRunAsUser sudoRunAsGroup sudoNotBefore sudoNotAfter sudoOrder entryusn"
[11/Dec/2015:07:01:07 -0500] conn=79021 op=113 RESULT err=0 tag=101 nentries=0 etime=6 notes=P pr_idx=95


only errors are
err=0                 23588    Successful Operations
err=32                 1212    No Such Object
err=14                  184    SASL Bind in Progress
err=48                    1    Inappropriate Authentication (No password presented, etc)


and there are nearly no indexes for SUDO, I wonder if we should have an index for sudoHost or why we do not have any.



other:

high cpu seem to appear right after the RHEL 7.2 update and ns-slapd start:

attachment
90-sosreport-mdhixnpipa01.mhbenp.lin.01550492-20151211071420.tar.xz

per file
var/log/yum.log
the 7.2 update was done on
Dec 01 06:53:05 Updated: redhat-release-server-7.2-9.el7.x86_64

file
var/log/dirsrv/slapd-MHBENP-LIN/errors.20151127-130020
[01/Dec/2015:06:55:09 -0500] - 389-Directory/1.3.4.0 B2015.278.1946 starting up

file
var/log/sa/sar01 
Linux 3.10.0-229.14.1.el7.x86_64 (mdhixnpipa01.mhbenp.lin)      2015-12-01      _x86_64_        (2 CPU)

12:00:01 AM     CPU      %usr     %nice      %sys   %iowait    %steal      %irq     %soft    %guest    %gnice     %idle

06:50:01 AM       0      3.51      0.00      0.43      0.17      0.00      0.00      0.03      0.00      0.00     95.86
06:50:01 AM       1      3.57      0.00      0.44      0.28      0.00      0.00      0.02      0.00      0.00     95.68
06:40:47 AM     all     24.65      0.00      9.02      7.30      0.00      0.00      0.29      0.00      0.00     58.73
06:40:47 AM       0     24.83      0.00      9.20      7.69      0.00      0.00      0.43      0.00      0.00     57.85
06:40:47 AM       1     24.47      0.00      8.84      6.92      0.00      0.00      0.16      0.00      0.00     59.61
07:00:01 AM     all     37.41      0.00      7.57      1.64      0.00      0.00      0.14      0.00      0.00     53.24
07:00:01 AM       0     37.46      0.00      8.23      1.64      0.00      0.00      0.09      0.00      0.00     52.58
07:00:01 AM       1     37.36      0.00      6.91      1.65      0.00      0.00      0.19      0.00      0.00     53.90
...snip...


then

Linux 3.10.0-327.el7.x86_64 (mdhixnpipa01.mhbenp.lin)   2015-12-02      _x86_64_        (2 CPU)

12:00:02 AM     CPU      %usr     %nice      %sys   %iowait    %steal      %irq     %soft    %guest    %gnice     %idle
12:10:02 AM     all     67.09      0.00     32.88      0.00      0.00      0.00      0.02      0.00      0.00      0.00
12:10:02 AM       0     67.06      0.00     32.92      0.00      0.00      0.00      0.02      0.00      0.00      0.00
...
08:10:01 AM     all     50.22      0.00     15.05      0.47      0.00      0.00      0.02      0.00      0.00     34.23
08:10:01 AM       0     48.89      0.00     15.01      0.36      0.00      0.00      0.02      0.00      0.00     35.73
08:10:01 AM       1     51.56      0.00     15.10      0.59      0.00      0.00      0.02      0.00      0.00     32.73
08:20:01 AM     all      0.30      0.00      0.22      0.02      0.00      0.00      0.01      0.00      0.00     99.44
Comment 1 Marc Sauton 2015-12-17 14:25:27 EST
changing the priority to "urgent", as the customer case 01550492 is a sev1
Comment 3 Marc Sauton 2015-12-17 14:39:19 EST
there is something else:

attachment

90-sosreport-mdhixnpipa01.mhbenp.lin.01550492-20151211071420.tar.xz
file
var/log/dirsrv/slapd-MHBENP-LIN/errors

[11/Dec/2015:07:13:01 -0500] NSMMReplicationPlugin - agmt="cn=meTomdhixnpipa02.mhbenp.lin" (mdhixnpipa02:389): Replication bind with GSSAPI auth failed: LDAP error -1 (Can't contact LDAP server) ()

over time, could a broken replication from no longer valid Krb LDAP service principal, contribute to this high ns-slapd issue?
Comment 4 Marc Sauton 2015-12-17 16:27:25 EST
relaying some customer notes:
"
Clients are all RHEL6.7 in this environment running sssd-1.12.4-47.el6.x86_6.  They were all patched when I patched the IPA servers the first of the month.
"

then I noticed this

dn: cn=changelog,cn=ldbm database,cn=plugins,cn=config
nsslapd-cachememsize: 512000
nsslapd-directory: /var/lib/dirsrv/slapd-MHBENP-LIN/db/changelog

dn: cn=config,cn=ldbm database,cn=plugins,cn=config
nsslapd-lookthroughlimit: 100000
nsslapd-idlistscanlimit: 100000
nsslapd-directory: /var/lib/dirsrv/slapd-MHBENP-LIN/db
nsslapd-dbcachesize: 1073741824

dn: cn=userRoot,cn=ldbm database,cn=plugins,cn=config
nsslapd-cachememsize: 2147483648
nsslapd-directory: /var/lib/dirsrv/slapd-MHBENP-LIN/db/userRoot

dn: cn=ipaca,cn=ldbm database,cn=plugins,cn=config
nsslapd-cachememsize: 1073741824
nsslapd-directory: /var/lib/dirsrv/slapd-MHBENP-LIN/db/ipaca


while the db looks like this, only:
ls -lh /var/lib/dirsrv/slapd-*/db/userRoot/
...snip...
-rw-------. 1 dirsrv dirsrv 2.0M Dec 17 13:57 id2entry.db
...snip...

so the entry and db cache seem high for what is needed.
should we reduce them?
Comment 5 Marc Sauton 2015-12-17 16:31:03 EST
with the sudoHost filters requested by the SSSD RHEL 6.7 clients, should we tune down the following:

nsslapd-lookthroughlimit: 100000
nsslapd-idlistscanlimit: 100000
?
Comment 14 Jakub Hrozek 2016-01-14 11:06:41 EST
This should be solved by https://bugzilla.redhat.com/show_bug.cgi?id=1244957
Comment 15 Jakub Hrozek 2016-01-18 16:18:30 EST
We are very close to pushing the patches for the native IPA sudo support to master. Would the customer be interested in testing?

If not, can I go ahead and close this bugzilla?
Comment 16 Jakub Hrozek 2016-02-16 10:40:07 EST
This BZ has been needinfo's for a month and the problem (even though solved by config changes for the original customer) would best be solved by using the native IPA sudo provider which was added to 6.8 and will be added to 7.3.

Closing.

*** This bug has been marked as a duplicate of bug 1244957 ***

Note You need to log in before you can comment on or make changes to this bug.