Bug 1482231

Summary: sssd_nss consumes more memory until restarted or machine swaps
Product: Red Hat Enterprise Linux 7 Reporter: aheverle
Component: sssdAssignee: SSSD Maintainers <sssd-maint>
Status: CLOSED ERRATA QA Contact: Niranjan Mallapadi Raghavender <mniranja>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 7.3CC: aheverle, andrewg.gray, atolani, cobrown, cparadka, daniel.hammann, ddas, ekeck, enewland, fidencio, gparente, grajaiya, imntreal, jhrozek, jvilicic, jwooten, knweiss, lslebodn, mkosek, mmilgram, mzidek, pbrezina, pdhamdhe, pkulkarn, serge.savard, sgoveas, sssd-maint, striker, tscherf, vmishra
Target Milestone: rcKeywords: ZStream
Target Release: 7.4   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: sssd-1.16.0-9.el7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1546064 (view as bug list) Environment:
Last Closed: 2018-04-10 17:13:24 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1546064    
Attachments:
Description Flags
valgrind output
none
sssd-valgrind-logs-rhel75
none
sssd-valgrind-logs-rhel75 with debugsymbols
none
valgrind logs
none
valgrind logs 1-18-2018
none
sssd debug log 1-18-2018 none

Description aheverle 2017-08-16 19:53:24 UTC
Description of problem:
sssd_nss consumes more memory until restarted or machine swaps.  The issue was supposed to be resolved in the sssd v1.15 package, but customer is still seeing the issue.

Version-Release number of selected component (if applicable):
sssd-1.15.2-50.el7.x86_64

How reproducible:
Everytime

[root@DC-Nagios1 ~]# date; ps -e -o 'vsz pid ruser cpu time args' |sort -nr|head -1
Wed Aug 16 06:36:21 CDT 2017
1861260 22139 root      - 00:12:49 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --debug-to-files
[root@DC-Nagios1 ~]# systemctl restart sssd
[root@DC-Nagios1 ~]# date; ps -e -o 'vsz pid ruser cpu time args' |sort -nr|grep sssd[_]
Wed Aug 16 06:36:59 CDT 2017
484696 23399 root       - 00:00:00 /usr/libexec/sssd/sssd_be --domain ulinedm.com --uid 0 --gid 0 --debug-to-files
276368 23403 root       - 00:00:00 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --debug-to-files
255768 23404 root       - 00:00:00 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --debug-to-files
[root@DC-Nagios1 ~]# date; ps -e -o 'vsz pid ruser cpu time args' |sort -nr|grep sssd[_]
Wed Aug 16 09:29:06 CDT 2017
542372 23403 root       - 00:02:17 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --debug-to-files
502092 23399 root       - 00:00:02 /usr/libexec/sssd/sssd_be --domain ulinedm.com --uid 0 --gid 0 --debug-to-files
255768 23404 root       - 00:00:00 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --debug-to-files

[root@DC-Nagios1 ~]# date; ps -e -o 'vsz pid ruser cpu time args' |sort -nr|grep sssd[_]
Wed Aug 16 13:59:50 CDT 2017
962408 23403 root       - 00:05:38 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --debug-to-files
517120 23399 root       - 00:00:05 /usr/libexec/sssd/sssd_be --domain ulinedm.com --uid 0 --gid 0 --debug-to-files
255768 23404 root       - 00:00:00 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --debug-to-files

Here is the pstack for pid 23403:
[root@DC-Nagios1 ~]# pstack 23403
#0  0x00007f9ae15eacf3 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1  0x00007f9ae1ed6c13 in epoll_event_loop_once () from /lib64/libtevent.so.0
#2  0x00007f9ae1ed5257 in std_event_loop_once () from /lib64/libtevent.so.0
#3  0x00007f9ae1ed140d in _tevent_loop_once () from /lib64/libtevent.so.0
#4  0x00007f9ae1ed15ab in tevent_common_loop_wait () from /lib64/libtevent.so.0
#5  0x00007f9ae1ed51f7 in std_event_loop_wait () from /lib64/libtevent.so.0
#6  0x00007f9ae5a66b33 in server_loop () from /usr/lib64/sssd/libsss_util.so
#7  0x00007f9ae62d204d in main ()
[root@DC-Nagios1 ~]#

Comment 4 aheverle 2017-08-18 12:50:24 UTC
Created attachment 1315186 [details]
valgrind output

Comment 7 Lukas Slebodnik 2017-08-18 13:04:45 UTC
(In reply to aheverle from comment #4)
> Created attachment 1315186 [details]
> valgrind output

How long did they run sssd_nss with valgrind ?
I might be wrong but based on valgrind output it was only 59.70 seconds.
It would be good to test it at least an hour.

And please ask user to install debug symbols for packages sssd libtalloc libtevent libldb

Comment 8 Jakub Hrozek 2017-08-18 13:08:48 UTC
(In reply to Lukas Slebodnik from comment #7)
> (In reply to aheverle from comment #4)
> > Created attachment 1315186 [details]
> > valgrind output
> 
> How long did they run sssd_nss with valgrind ?
> I might be wrong but based on valgrind output it was only 59.70 seconds.
> It would be good to test it at least an hour.
> 

Yes, but if we leaked 1.3MB in a minute, that's not good either..

> And please ask user to install debug symbols for packages sssd libtalloc
> libtevent libldb

for the lib* packages, I hope that debuginfo-install would do the right thing also on RHEL. For the sssd package, since it's not a released one, the sssd-debuginfo must be grabbed from the brew link.

Comment 35 Serge Savard 2017-11-14 14:42:06 UTC
Hello,

I've been experiencing the memory leak issue. One thing I have found out is that doing a lot of sudo commands seems to cause/accelerate the memory leak in my case.

I have a monitoring daemon which launch around 200 sudo commands per minute. As soon as I disable it, the memory usage growth rate is much slower.

Maybe that information can help you identify the source of the leak. 

Best Regards,

Comment 41 Lukas Slebodnik 2017-11-23 13:03:06 UTC
(In reply to Deepak Das from comment #28)
> Created attachment 1337572 [details]
> sssd_nss.log
> 
> Sosreport in attached in the ticket.

This one seems to be the same case as in comment#33

==00:01:01:29.871 10088== 934,414 bytes in 8,731 blocks are definitely lost in loss record 121 of 121
==00:01:01:29.871 10088==    at 0x4C29BE3: malloc (vg_replace_malloc.c:299)
==00:01:01:29.871 10088==    by 0x8FF4EAB: talloc_strdup (in /usr/lib64/libtalloc.so.2.1.9)
==00:01:01:29.871 10088==    by 0x52933B9: sss_output_name (usertools.c:808)
==00:01:01:29.871 10088==    by 0x5293550: sss_output_fqname (usertools.c:863)
==00:01:01:29.871 10088==    by 0x1211F9: sized_output_name (responder_common.c:1708)
==00:01:01:29.871 10088==    by 0x1137E6: memcache_delete_entry (nss_get_object.c:112)
==00:01:01:29.871 10088==    by 0x113BB6: nss_get_object_done (nss_get_object.c:245)
==00:01:01:29.871 10088==    by 0x8DE5291: _tevent_req_error (in /usr/lib64/libtevent.so.0.9.31)
==00:01:01:29.871 10088==    by 0x1276CE: cache_req_done (cache_req.c:1047)
==00:01:01:29.871 10088==    by 0x8DE5291: _tevent_req_error (in /usr/lib64/libtevent.so.0.9.31)
==00:01:01:29.871 10088==    by 0x126AF6: cache_req_search_domains_done (cache_req.c:607)
==00:01:01:29.871 10088==    by 0x8DE4AB9: tevent_common_loop_immediate (in /usr/lib64/libtevent.so.0.9.31)
==00:01:01:29.871 10088==    by 0x8DE9C9C: ??? (in /usr/lib64/libtevent.so.0.9.31)
==00:01:01:29.871 10088==    by 0x8DE82A6: ??? (in /usr/lib64/libtevent.so.0.9.31)
==00:01:01:29.871 10088==    by 0x8DE40CC: _tevent_loop_once (in /usr/lib64/libtevent.so.0.9.31)
==00:01:01:29.871 10088==    by 0x8DE42FA: tevent_common_loop_wait (in /usr/lib64/libtevent.so.0.9.31)
==00:01:01:29.871 10088==    by 0x8DE8246: ??? (in /usr/lib64/libtevent.so.0.9.31)
==00:01:01:29.871 10088==    by 0x5291B32: server_loop (server.c:718)
==00:01:01:29.871 10088==    by 0x11004C: main (nsssrv.c:560)

And I think that problem is here https://pagure.io/SSSD/sssd/blob/master/f/src/responder/common/responder_common.c#_1818

We should use "name" as a context there. But on the other hand such code is there since 1.15.3 (a012a71f21) and I am not sure whether it was backported to rhel. I'll need to check source code of sssd in 7.4

Comment 44 Lukas Slebodnik 2017-11-29 13:51:17 UTC
Upstream ticket:
https://pagure.io/SSSD/sssd/issue/3588

Comment 47 Lukas Slebodnik 2017-11-30 13:56:18 UTC
master:
* 051e0fc7cc86fb4e4b3a9323a61684ad3a6fa589
* 06e741a9bf23a18a998f366d9a8990b887a01638

Comment 48 Lukas Slebodnik 2017-11-30 14:03:27 UTC
I've never tried to reproduce issue with sssd_nss. I only vervied with unit test that leak is gone. But it should not be complicated to find a reproducer from backstrace:

934,414 bytes in 8,731 blocks are definitely lost in loss record 121 of 121 
   at 0x4C29BE3: malloc (vg_replace_malloc.c:299) 
   by 0x8FF4EAB: talloc_strdup (in /usr/lib64/libtalloc.so.2.1.9)
   by 0x52933B9: sss_output_name (usertools.c:808)
   by 0x5293550: sss_output_fqname (usertools.c:863)
   by 0x1211F9: sized_output_name (responder_common.c:1708) 
   by 0x1137E6: memcache_delete_entry (nss_get_object.c:112)
   by 0x113BB6: nss_get_object_done (nss_get_object.c:245)
   by 0x8DE5291: _tevent_req_error (in /usr/lib64/libtevent.so.0.9.31)
   by 0x1276CE: cache_req_done (cache_req.c:1047)
   by 0x8DE5291: _tevent_req_error (in /usr/lib64/libtevent.so.0.9.31)
   by 0x126AF6: cache_req_search_domains_done (cache_req.c:607)
   by 0x8DE4AB9: tevent_common_loop_immediate (in /usr/lib64/libtevent.so.0.9.31)
   by 0x8DE9C9C: ??? (in /usr/lib64/libtevent.so.0.9.31)
   by 0x8DE82A6: ??? (in /usr/lib64/libtevent.so.0.9.31)
   by 0x8DE40CC: _tevent_loop_once (in /usr/lib64/libtevent.so.0.9.31)
   by 0x8DE42FA: tevent_common_loop_wait (in /usr/lib64/libtevent.so.0.9.31)
   by 0x8DE8246: ??? (in /usr/lib64/libtevent.so.0.9.31)
   by 0x5291B32: server_loop (server.c:718)
   by 0x11004C: main (nsssrv.c:560)

Comment 51 Niranjan Mallapadi Raghavender 2017-12-11 05:36:37 UTC
Versions:
===========
Red Hat Enterprise Linux Server release 7.5 Beta (Maipo)
libsss_idmap-1.16.0-9.el7.x86_64
sssd-common-1.16.0-9.el7.x86_64
sssd-ipa-1.16.0-9.el7.x86_64
sssd-1.16.0-9.el7.x86_64
sssd-dbus-1.16.0-9.el7.x86_64
libsss_certmap-1.16.0-9.el7.x86_64
libsss_autofs-1.16.0-9.el7.x86_64
python-sssdconfig-1.16.0-9.el7.noarch
libsss_nss_idmap-1.16.0-9.el7.x86_64
sssd-krb5-common-1.16.0-9.el7.x86_64
sssd-ad-1.16.0-9.el7.x86_64
sssd-krb5-1.16.0-9.el7.x86_64
sssd-proxy-1.16.0-9.el7.x86_64
sssd-kcm-1.16.0-9.el7.x86_64
libsss_simpleifp-1.16.0-9.el7.x86_64
libsss_sudo-1.16.0-9.el7.x86_64
sssd-client-1.16.0-9.el7.x86_64
sssd-common-pac-1.16.0-9.el7.x86_64
sssd-ldap-1.16.0-9.el7.x86_64


Configuration:
==============
1. Configure sssd to authenticate against ldap (389-ds)

[sssd]
domains = EXAMPLE.TEST
config_file_version = 2
services = nss, pam, ifp

[domain/EXAMPLE.TEST]
enumerate = false
id_provider = ldap
ldap_uri = ldap://vm-idm-033.lab.eng.pnq.redhat.com
ldap_search_base = dc=example,dc=test
ldap_tls_cacert = /etc/openldap/cacerts/cacert.pem
auth_provider = ldap
krb5_server = vm-idm-033.lab.eng.pnq.redhat.com
krb5_kpasswd = vm-idm-033.lab.eng.pnq.redhat.com
krb5_realm = EXAMPLE.TEST
debug_level = 9
cache_credentials = True
krb5_store_password_if_offline = True

[nss]
command = valgrind --leak-check=full --trace-children=yes --show-reachable=yes --track-origins=yes --read-var-info=yes --tool=memcheck --num-callers=50 -v --time-stamp=yes --log-file=/tmp/valgrind.%p.log /usr/libexec/sssd/sssd_nss
debug_level = 9


1. Start sssd with the above options

2. Run the below commands

NAME_PREFIX=`for i in {1..245}; do printf "x"; done`
for i in {1..1000000}; do getent passwd -s sss "$NAME_PREFIX$i"; done

Below is the valgrind Leak summary:

==00:22:16:04.837 18780== 
==00:22:16:04.837 18780== LEAK SUMMARY:
==00:22:16:04.837 18780==    definitely lost: 0 bytes in 0 blocks
==00:22:16:04.837 18780==    indirectly lost: 0 bytes in 0 blocks
==00:22:16:04.837 18780==      possibly lost: 5,912 bytes in 26 blocks
==00:22:16:04.837 18780==    still reachable: 405,246,675 bytes in 461 blocks
==00:22:16:04.837 18780==         suppressed: 0 bytes in 0 blocks
==00:22:16:04.837 18780== 
==00:22:16:04.837 18780== ERROR SUMMARY: 14 errors from 14 contexts (suppressed: 0 from 0)
==00:22:16:04.837 18780== 
==00:22:16:04.837 18780== 1 errors in context 1 of 14:
==00:22:16:04.837 18780== Conditional jump or move depends on uninitialised value(s)
==00:22:16:04.837 18780==    at 0x529EB87: become_user (in /usr/lib64/sssd/libsss_util.so)
==00:22:16:04.837 18780==    by 0x5293B81: server_setup (in /usr/lib64/sssd/libsss_util.so)
==00:22:16:04.837 18780==    by 0x110385: ??? (in /usr/libexec/sssd/sssd_nss)
==00:22:16:04.837 18780==    by 0x9872376: (below main) (in /usr/lib64/libc-2.17.so)
==00:22:16:04.837 18780==  Uninitialised value was created by a stack allocation
==00:22:16:04.837 18780==    at 0x10FE18: ??? (in /usr/libexec/sssd/sssd_nss)
==00:22:16:04.837 18780== 
==00:22:16:04.837 18780== 
==00:22:16:04.837 18780== 1 errors in context 2 of 14:
==00:22:16:04.837 18780== Syscall param chown(group) contains uninitialised byte(s)
==00:22:16:04.837 18780==    at 0x9940A87: chown (in /usr/lib64/libc-2.17.so)
==00:22:16:04.837 18780==    by 0x8BF4D32: chown_debug_file (in /usr/lib64/sssd/libsss_debug.so)
==00:22:16:04.837 18780==    by 0x5293B62: server_setup (in /usr/lib64/sssd/libsss_util.so)
==00:22:16:04.837 18780==    by 0x110385: ??? (in /usr/libexec/sssd/sssd_nss)
==00:22:16:04.837 18780==    by 0x9872376: (below main) (in /usr/lib64/libc-2.17.so)
==00:22:16:04.837 18780==  Uninitialised value was created by a stack allocation
==00:22:16:04.837 18780==    at 0x10FE18: ??? (in /usr/libexec/sssd/sssd_nss)
==00:22:16:04.837 18780== 
==00:22:16:04.837 18780== 
==00:22:16:04.837 18780== 1 errors in context 3 of 14:
==00:22:16:04.837 18780== Syscall param chown(owner) contains uninitialised byte(s)
==00:22:16:04.837 18780==    at 0x9940A87: chown (in /usr/lib64/libc-2.17.so)
==00:22:16:04.837 18780==    by 0x8BF4D32: chown_debug_file (in /usr/lib64/sssd/libsss_debug.so)
==00:22:16:04.837 18780==    by 0x5293B62: server_setup (in /usr/lib64/sssd/libsss_util.so)
==00:22:16:04.837 18780==    by 0x110385: ??? (in /usr/libexec/sssd/sssd_nss)
==00:22:16:04.837 18780==    by 0x9872376: (below main) (in /usr/lib64/libc-2.17.so)
==00:22:16:04.837 18780==  Uninitialised value was created by a stack allocation
==00:22:16:04.837 18780==    at 0x10FE18: ??? (in /usr/libexec/sssd/sssd_nss)
==00:22:16:04.837 18780== 
==00:22:16:04.837 18780== ERROR SUMMARY: 14 errors from 14 contexts (suppressed: 0 from 0)

Comment 52 Niranjan Mallapadi Raghavender 2017-12-11 05:38:10 UTC
Created attachment 1365757 [details]
sssd-valgrind-logs-rhel75

Comment 53 Niranjan Mallapadi Raghavender 2017-12-11 05:44:25 UTC
In another run on RHEL7.5 , ran valgrind again with steps mentioned comment #49, 

Below valgrind is collected after running for 20 hrs. 


==00:20:22:45.039 28985==
==00:20:22:45.039 28985== LEAK SUMMARY:
==00:20:22:45.039 28985==    definitely lost: 385 bytes in 3 blocks
==00:20:22:45.039 28985==    indirectly lost: 0 bytes in 0 blocks
==00:20:22:45.039 28985==      possibly lost: 5,912 bytes in 26 blocks
==00:20:22:45.039 28985==    still reachable: 405,208,361 bytes in 169 blocks
==00:20:22:45.039 28985==         suppressed: 0 bytes in 0 blocks
==00:20:22:45.039 28985==
==00:20:22:45.039 28985== ERROR SUMMARY: 15 errors from 15 contexts (suppressed: 0 from 0)
==00:20:22:45.039 28985==
==00:20:22:45.039 28985== 1 errors in context 1 of 15:
==00:20:22:45.039 28985== Conditional jump or move depends on uninitialised value(s)
==00:20:22:45.039 28985==    at 0x529EB87: become_user (in /usr/lib64/sssd/libsss_util.so)
==00:20:22:45.039 28985==    by 0x5293B81: server_setup (in /usr/lib64/sssd/libsss_util.so)
==00:20:22:45.039 28985==    by 0x110385: ??? (in /usr/libexec/sssd/sssd_nss)
==00:20:22:45.039 28985==    by 0x9872376: (below main) (in /usr/lib64/libc-2.17.so)
==00:20:22:45.039 28985==  Uninitialised value was created by a stack allocation
==00:20:22:45.039 28985==    at 0x10FE18: ??? (in /usr/libexec/sssd/sssd_nss)
==00:20:22:45.039 28985==
==00:20:22:45.039 28985==
==00:20:22:45.039 28985== 1 errors in context 2 of 15:
==00:20:22:45.039 28985== Syscall param chown(group) contains uninitialised byte(s)
==00:20:22:45.039 28985==    at 0x9940A87: chown (in /usr/lib64/libc-2.17.so)
==00:20:22:45.039 28985==    by 0x8BF4D32: chown_debug_file (in /usr/lib64/sssd/libsss_debug.so)
==00:20:22:45.039 28985==    by 0x5293B62: server_setup (in /usr/lib64/sssd/libsss_util.so)
==00:20:22:45.039 28985==    by 0x110385: ??? (in /usr/libexec/sssd/sssd_nss)
==00:20:22:45.039 28985==    by 0x9872376: (below main) (in /usr/lib64/libc-2.17.so)
==00:20:22:45.039 28985==  Uninitialised value was created by a stack allocation
==00:20:22:45.039 28985==    at 0x10FE18: ??? (in /usr/libexec/sssd/sssd_nss)
==00:20:22:45.039 28985==
==00:20:22:45.039 28985==
==00:20:22:45.039 28985== 1 errors in context 3 of 15:
==00:20:22:45.039 28985== Syscall param chown(owner) contains uninitialised byte(s)
==00:20:22:45.039 28985==    at 0x9940A87: chown (in /usr/lib64/libc-2.17.so)
==00:20:22:45.039 28985==    by 0x8BF4D32: chown_debug_file (in /usr/lib64/sssd/libsss_debug.so)
==00:20:22:45.039 28985==    by 0x5293B62: server_setup (in /usr/lib64/sssd/libsss_util.so)
==00:20:22:45.039 28985==    by 0x110385: ??? (in /usr/libexec/sssd/sssd_nss)
==00:20:22:45.039 28985==    by 0x9872376: (below main) (in /usr/lib64/libc-2.17.so)
==00:20:22:45.039 28985==  Uninitialised value was created by a stack allocation
==00:20:22:45.039 28985==    at 0x10FE18: ??? (in /usr/libexec/sssd/sssd_nss)
==00:20:22:45.039 28985==
==00:20:22:45.039 28985== ERROR SUMMARY: 15 errors from 15 contexts (suppressed: 0 from 0)

Comment 54 Niranjan Mallapadi Raghavender 2017-12-11 12:28:43 UTC
Created attachment 1365955 [details]
sssd-valgrind-logs-rhel75 with debugsymbols

Comment 65 Niranjan Mallapadi Raghavender 2018-01-09 09:52:06 UTC
Versions:
========
sssd-client-1.16.0-14.el7.x86_64
sssd-krb5-common-1.16.0-14.el7.x86_64
sssd-krb5-1.16.0-14.el7.x86_64
sssd-kcm-1.16.0-14.el7.x86_64
python-sssdconfig-1.16.0-14.el7.noarch
sssd-common-1.16.0-14.el7.x86_64
sssd-common-pac-1.16.0-14.el7.x86_64
sssd-ad-1.16.0-14.el7.x86_64
sssd-ldap-1.16.0-14.el7.x86_64
sssd-1.16.0-14.el7.x86_64
sssd-debuginfo-1.16.0-14.el7.x86_64
sssd-ipa-1.16.0-14.el7.x86_64
sssd-proxy-1.16.0-14.el7.x86_64
libldb-debuginfo-1.2.2-1.el7.x86_64
libldb-1.2.2-1.el7.x86_64
libtdb-1.3.15-1.el7.x86_64
libtdb-debuginfo-1.3.15-1.el7.x86_64
dbus-libs-1.10.24-3.el7.x86_64
dbus-1.10.24-3.el7.x86_64
dbus-python-1.1.1-9.el7.x86_64
dbus-glib-0.100-7.el7.x86_64
python-slip-dbus-0.4.0-4.el7.noarch



sssd.conf
========
[sssd]
domains = EXAMPLE.TEST
config_file_version = 2
services = nss, pam

[domain/EXAMPLE.TEST]
enumerate = false
id_provider = ldap
ldap_uri = ldap://ibm-x3250m4-06.lab.eng.rdu2.redhat.com
ldap_search_base = dc=example,dc=test
ldap_tls_cacert = /etc/openldap/cacerts/cacert.pem
auth_provider = ldap
krb5_server = ibm-x3250m4-06.lab.eng.rdu2.redhat.com
krb5_kpasswd = ibm-x3250m4-06.lab.eng.rdu2.redhat.com
krb5_realm = EXAMPLE.TEST
debug_level = 9
cache_credentials = True
krb5_store_password_if_offline = True

[nss]
command = valgrind --leak-check=full --trace-children=yes --show-reachable=yes --track-origins=yes --read-var-info=yes --tool=memcheck --num-callers=50 -v --time-stamp=yes --log-file=/tmp/valgrind.%p.log /usr/libexec/sssd/sssd_nss
debug_level = 9


Ran the above script for more than a hour. 

Below is the leak summary:

==00:01:46:06.559 22435== LEAK SUMMARY:
==00:01:46:06.559 22435==    definitely lost: 385 bytes in 3 blocks
==00:01:46:06.559 22435==    indirectly lost: 0 bytes in 0 blocks
==00:01:46:06.559 22435==      possibly lost: 5,912 bytes in 26 blocks
==00:01:46:06.559 22435==    still reachable: 273,267 bytes in 171 blocks
==00:01:46:06.559 22435==         suppressed: 0 bytes in 0 blocks
==00:01:46:06.559 22435==
==00:01:46:06.559 22435== ERROR SUMMARY: 15 errors from 15 contexts (suppressed: 0 from 0)
==00:01:46:06.559 22435==
==00:01:46:06.559 22435== 1 errors in context 1 of 15:
==00:01:46:06.559 22435== Conditional jump or move depends on uninitialised value(s)
==00:01:46:06.559 22435==    at 0x529EBC7: become_user (become_user.c:38)
==00:01:46:06.559 22435==    by 0x5293BC1: server_setup (server.c:483)
==00:01:46:06.559 22435==    by 0x1104A5: main (nsssrv.c:465)
==00:01:46:06.559 22435==  Uninitialised value was created by a stack allocation
==00:01:46:06.559 22435==    at 0x10FF38: main (nsssrv.c:422)
==00:01:46:06.559 22435==
==00:01:46:06.559 22435==
==00:01:46:06.559 22435== 1 errors in context 2 of 15:
==00:01:46:06.559 22435== Syscall param chown(group) contains uninitialised byte(s)
==00:01:46:06.559 22435==    at 0x9940A87: chown (in /usr/lib64/libc-2.17.so)
==00:01:46:06.559 22435==    by 0x8BF4D32: chown_debug_file (debug.c:389)
==00:01:46:06.559 22435==    by 0x5293BA2: server_setup (server.c:477)
==00:01:46:06.559 22435==    by 0x1104A5: main (nsssrv.c:465)
==00:01:46:06.559 22435==  Uninitialised value was created by a stack allocation
==00:01:46:06.559 22435==    at 0x10FF38: main (nsssrv.c:422)
==00:01:46:06.559 22435==
==00:01:46:06.559 22435==
==00:01:46:06.559 22435== 1 errors in context 3 of 15:
==00:01:46:06.559 22435== Syscall param chown(owner) contains uninitialised byte(s)
==00:01:46:06.559 22435==    at 0x9940A87: chown (in /usr/lib64/libc-2.17.so)
==00:01:46:06.559 22435==    by 0x8BF4D32: chown_debug_file (debug.c:389)
==00:01:46:06.559 22435==    by 0x5293BA2: server_setup (server.c:477)
==00:01:46:06.559 22435==    by 0x1104A5: main (nsssrv.c:465)
==00:01:46:06.559 22435==  Uninitialised value was created by a stack allocation
==00:01:46:06.559 22435==    at 0x10FF38: main (nsssrv.c:422)
==00:01:46:06.559 22435==
==00:01:46:06.559 22435== ERROR SUMMARY: 15 errors from 15 contexts (suppressed: 0 from 0)

Comment 66 Niranjan Mallapadi Raghavender 2018-01-09 09:53:34 UTC
Created attachment 1378925 [details]
valgrind logs

Comment 68 Lukas Slebodnik 2018-01-09 11:26:28 UTC
(In reply to Niranjan Mallapadi Raghavender from comment #66)
> Created attachment 1378925 [details]
> valgrind logs
leak caused by https://pagure.io/SSSD/sssd/issue/3588 is gone

Thank you for verification.

Comment 69 joel 2018-01-18 16:54:13 UTC
Created attachment 1382975 [details]
valgrind logs 1-18-2018

Comment 70 joel 2018-01-18 16:55:27 UTC
Created attachment 1382976 [details]
sssd debug log 1-18-2018

Comment 71 joel 2018-01-18 17:05:14 UTC
My cu installed the test patch, pulled debug data sssd level 9 (attachment 1382976 [details]) and valgrind (attachment 1382975 [details])

Comment 72 Jakub Hrozek 2018-01-18 19:14:09 UTC
(In reply to joel from comment #71)
> My cu installed the test patch, pulled debug data sssd level 9 (attachment
> 1382976 [details]) and valgrind (attachment 1382975 [details])

I'm not sure I understand, the valgrind log shows some leaks but they are in the kilobytes range at worst. Are they still seeing memory usage go up even with the test version?

Comment 75 Andrew Gray 2018-01-25 01:39:40 UTC
I'm also seeing this. 22G resident for sssd_nss process before process restart.
sssd versions used are 1.14.0-43.el7_3.4, uptime is about 230 days.

I'm not really in a position to do any testing or updates though, so I can't help in this regard, other to confirm it's not just the 1.16 version.

Comment 88 Daniel Hammann 2018-03-19 08:23:40 UTC
We have the same issue in production. The memoryleak causes a kill of the process with the most memory allocation, in our case our mysql-database. At the moment the only way is to restart sssd_ns via cron after some time. 

Do you have sny timeline to fix this?
Is there any possibility to increase the priority?

Comment 89 Fabiano Fidêncio 2018-03-19 08:45:59 UTC
(In reply to Daniel Hammann from comment #88)
> We have the same issue in production. The memoryleak causes a kill of the
> process with the most memory allocation, in our case our mysql-database. At
> the moment the only way is to restart sssd_ns via cron after some time. 
> 
> Do you have sny timeline to fix this?
> Is there any possibility to increase the priority?

This fix will be released as part of RHEL-7.5 release and also in the next RHEL-7.4-z batch updated.

Please, if you need it sooner, raise a support request and we'll be able to provide you a build.

Comment 90 Daniel Hammann 2018-03-19 12:27:46 UTC
(In reply to Fabiano Fidêncio from comment #89)
> (In reply to Daniel Hammann from comment #88)
> > We have the same issue in production. The memoryleak causes a kill of the
> > process with the most memory allocation, in our case our mysql-database. At
> > the moment the only way is to restart sssd_ns via cron after some time. 
> > 
> > Do you have sny timeline to fix this?
> > Is there any possibility to increase the priority?
> 
> This fix will be released as part of RHEL-7.5 release and also in the next
> RHEL-7.4-z batch updated.
> 
> Please, if you need it sooner, raise a support request and we'll be able to
> provide you a build.

Thanks for the fast response!
What is the planned release date of RHEL-7.4-z?

Comment 91 Fabiano Fidêncio 2018-03-19 13:06:37 UTC
(In reply to Daniel Hammann from comment #90)
> (In reply to Fabiano Fidêncio from comment #89)
> > (In reply to Daniel Hammann from comment #88)
> > > We have the same issue in production. The memoryleak causes a kill of the
> > > process with the most memory allocation, in our case our mysql-database. At
> > > the moment the only way is to restart sssd_ns via cron after some time. 
> > > 
> > > Do you have sny timeline to fix this?
> > > Is there any possibility to increase the priority?
> > 
> > This fix will be released as part of RHEL-7.5 release and also in the next
> > RHEL-7.4-z batch updated.
> > 
> > Please, if you need it sooner, raise a support request and we'll be able to
> > provide you a build.
> 
> Thanks for the fast response!
> What is the planned release date of RHEL-7.4-z?

Daniel,

"Soon, as in a couple of weeks from now" is the answer that I can give you.

Please, if you need it sooner, raiser a support request and we'll be bale to provide you a build.

Comment 93 errata-xmlrpc 2018-04-10 17:13:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:0929