Bug 692251 - SSSD consumes GBs of RAM, possible memory leak
Summary: SSSD consumes GBs of RAM, possible memory leak
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: libtdb
Version: 6.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Stephen Gallagher
QA Contact: Chandrasekar Kannan
URL:
Whiteboard:
Depends On:
Blocks: 693785 694146 694149 694217
 
Reported: 2011-03-30 19:51 UTC by Trond H. Amundsen
Modified: 2015-01-04 23:47 UTC (History)
CC List: 9 users

Fixed In Version: libtdb-1.2.1-3.el6
Doc Type: Bug Fix
Doc Text:
When processing very large database updates, occasionally, tdb incorrectly allocated unnecessarily large amounts of memory. As a result, an OOM (Out Of Memory) situation occurred. This was caused by libtdb calling the tdb_expand() function to increase its size by creating space to store an extra hundred records of the size of the largest record it currently has. With this update, this behavior has been removed and tdb no longer allocates unnecessarily large amounts of memory.
Clone Of:
Clones: 693785, 694146
Environment:
Last Closed: 2011-05-19 14:30:54 UTC
Target Upstream Version:
Embargoed:


Attachments
Sanitized sssd.conf (486 bytes, application/octet-stream) - 2011-03-30 19:51 UTC, Trond H. Amundsen
Valgrind analysis of sssd when the issue occurs (23.53 KB, application/x-bzip2) - 2011-04-01 09:16 UTC, Trond H. Amundsen
Valgrind analysis of sssd when the issue occurs (with debuginfo) (39.70 KB, application/x-bzip2) - 2011-04-01 12:35 UTC, Trond H. Amundsen
perf.data captured with 'perf record -p <pid of sssd_be>' (729.17 KB, application/x-bzip2) - 2011-04-01 13:36 UTC, Trond H. Amundsen
perf report created with 'perf report --stdio -n -v' (68.83 KB, text/plain) - 2011-04-01 13:51 UTC, Trond H. Amundsen
test tool to reproduce this issue (2.19 KB, text/plain) - 2011-04-29 13:53 UTC, Namita Soman


Links
Red Hat Product Errata RHBA-2011:0808 (normal, SHIPPED_LIVE): libtdb bug fix update - last updated 2011-05-18 18:21:18 UTC

Description Trond H. Amundsen 2011-03-30 19:51:52 UTC
Created attachment 488890 [details]
Sanitized sssd.conf

We're having an issue with SSSD, where sssd_be suddenly goes insane
and allocates several GBs of RAM very quickly, until it is killed by
OOM. If I stop sssd, remove the cache files and start sssd again,
everything is OK for a while (typically a couple of hours).

The way I manage to trigger this is as follows:

  1. Stop sssd, remove cache files, start sssd. This way we have a
     clean system to begin with.

  2. Run 'id <username>' on a few users and/or 'ls -l' on files that
     have different owners/groups. sssd will build its cache.

  3. Wait a few hours.

  4. Repeat step 2, sssd_be goes crazy.

This doesn't always work, but usually I'm able to make sssd run
amok. After it has happened once, running 'id <someuser>' is usually
enough to trigger the issue again, until the cache files are removed
and sssd restarted manually.

From what I can tell, when this erratic behaviour happens sssd doesn't
even contact the LDAP server. There seems to be something in the cache
that triggers it.

Some details about the environment:

  * LDAP used as both id and auth provider
  * LDAP schema: rfc2307
  * ~70k users, ~9k groups
  * Some groups are very large, ~25k members

I have tested this on RHEL6.0 (sssd-1.2.1-28.el6_0.4.x86_64) and
RHEL6.1-beta (sssd-1.5.1-14.el6.x86_64) and the latest beta packages
(sssd-1.5.1-23.el6.x86_64). Seemingly same behaviour.

Comment 1 Trond H. Amundsen 2011-03-30 19:57:25 UTC
Created attachment 488891 [details]
Log file created with 'debug_level = 9'

This is a log from when this happened. This particular server only has 1GB of memory, so it ended with an OOM panic.

Comment 2 Trond H. Amundsen 2011-03-30 19:58:49 UTC
I'm marking this bug as security sensitive, since the attached log includes lots of user and group names.

Comment 4 Stephen Gallagher 2011-03-30 20:04:09 UTC
This bug should not be marked as security-sensitive. Security team, please clear the security-sensitive box. I have moved the log to private.

Comment 5 Stephen Gallagher 2011-03-30 20:18:31 UTC
The logs don't seem suspect (no obvious errors). I do however note that there are some ENORMOUS groups here. My best guess right now is that we're holding onto too much memory during the processing of the groups (I think we may not be freeing anything until the request is fully-complete).

We probably need to investigate what we can free earlier.

I'd guess that the user is getting into this state whenever multiple very large groups have an ongoing request simultaneously.

Comment 6 Trond H. Amundsen 2011-03-31 16:57:47 UTC
I did some additional testing just to see how much memory was allocated if the process was allowed to grow. Reproducing this issue on a server with 72GB memory I was able to bring sssd_be to >70GB memory usage, before I had to kill it off to avoid a panic. This happened when I did 'id user' on a user which is member of several large groups. I'm not familiar with the sssd code, but I'm thinking that this points toward some sort of race condition while processing the cache.

I have tried to mitigate the problem by using 'filter_groups' to filter out groups with many members (>2k). This did not seem to help, but I have to test it some more to be 100% sure.
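For reference, this kind of filtering goes in sssd.conf; a minimal sketch, assuming the filter_groups option in the [nss] section (the group names below are hypothetical placeholders, and the >2k-member threshold has to be applied by hand when choosing which groups to list):

  [nss]
  # hypothetical example: keep sssd from handling a few very large groups
  filter_groups = hugegroup1, hugegroup2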

Comment 7 Trond H. Amundsen 2011-04-01 09:16:54 UTC
Created attachment 489330 [details]
Valgrind analysis of sssd when the issue occurs

I managed to run sssd through valgrind and capture the output. In this particular case, I ran 'id user' on a user who is not a member of many large groups, so the memory usage stopped at 4.2GB before sssd_be finished. Since it wasn't killed by the oom-killer, valgrind was able to complete its analysis.

The command to start sssd through valgrind was:

  valgrind -v --log-file=/root/valgrind-sssd-%p.log --show-reachable=yes --trace-children=yes --leak-check=full /usr/sbin/sssd -f -D

I hope this helps.

One thing I have noticed is that when the memory is allocated and sssd_be is able to finish up what it is doing (i.e. not killed by oom-killer), the memory is not freed. In the case above, the memory usage stayed at 4.2GB even after sssd_be was finished processing.

-trond

Comment 8 Trond H. Amundsen 2011-04-01 12:35:46 UTC
Created attachment 489379 [details]
Valgrind analysis of sssd when the issue occurs (with debuginfo)

I have re-run sssd through valgrind with all relevant debuginfo packages installed, so symbols are included. Valgrind/sssd was started in the same way as described in my previous comment.

Comment 9 Trond H. Amundsen 2011-04-01 13:36:29 UTC
Created attachment 489396 [details]
perf.data captured with 'perf record -p <pid of sssd_be>'

Comment 10 Trond H. Amundsen 2011-04-01 13:51:35 UTC
Created attachment 489407 [details]
perf report created with 'perf report --stdio -n -v'

Comment 11 Trond H. Amundsen 2011-04-03 14:47:50 UTC
I have played around with this during the weekend and have made a few observations of what happens when the issue occurs.

Using 'perf top -z -p <pid of sssd_be>' I was able to determine what sssd_be does when memory usage grows through the roof. In every test I have made, when the issue occurs the function ldb_dn_explode() is heavily used.

Here is a snapshot of 'perf top' when the issue is in progress:

-------------------------------------------------------------------------------
   PerfTop:    4607 irqs/sec  kernel:14.1%  exact:  0.0% [1000Hz cpu-clock-msecs],  (target_pid: 26904)
-------------------------------------------------------------------------------

             samples  pcnt function                          DSO
             _______ _____ _________________________________ __________________

             1565.00 17.0% _int_malloc                       libc-2.12.so      
              789.00  8.6% _int_free                         libc-2.12.so      
              758.00  8.2% ldb_dn_explode                    libldb.so.0.9.10  
              681.00  7.4% _talloc_free_internal             libtalloc.so.2.0.1
              497.00  5.4% memcpy                            libc-2.12.so      
              492.00  5.3% __GI___libc_malloc                libc-2.12.so      
              387.00  4.2% talloc_strdup                     libtalloc.so.2.0.1
              357.00  3.9% native_set_pte_at                 [kernel.kallsyms] 
              304.00  3.3% clear_page_c                      [kernel.kallsyms] 
              291.00  3.2% __GI___strcasecmp_l_ssse3         libc-2.12.so      
              246.00  2.7% malloc_consolidate                libc-2.12.so      
              173.00  1.9% unmap_vmas                        [kernel.kallsyms] 
              172.00  1.9% __strlen_sse2                     libc-2.12.so      
              155.00  1.7% ldb_schema_attribute_by_name_inte libldb.so.0.9.10  
              136.00  1.5% ltdb_unpack_data                  libldb.so.0.9.10  

Using tcpdump I have found that during the memory growth, sssd does not communicate with the LDAP server. Iotop shows no disk activity (unless logging is turned on).

Using 'debug_level = 9', the only thing it writes to the log during the memory allocation is lots and lots of this:

(Sun Apr  3 16:35:37 2011) [sssd[be[default]]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x91069030
(Sun Apr  3 16:35:37 2011) [sssd[be[default]]] [ldb] (9): tevent: Destroying timer event 0x9106eb20 "ltdb_timeout"
(Sun Apr  3 16:35:37 2011) [sssd[be[default]]] [ldb] (9): tevent: Ending timer event 0x9106ed40 "ltdb_callback"

The last lines written to the log _before_ memory starts growing are:

(Sun Apr  3 16:35:01 2011) [sssd[be[default]]] [sdap_get_groups_done] (9): Groups remaining: 0
(Sun Apr  3 16:35:01 2011) [sssd[be[default]]] [sdap_get_groups_done] (9): All groups processed
(Sun Apr  3 16:35:01 2011) [sssd[be[default]]] [ldb] (9): start ldb transaction (nesting: 1)
(Sun Apr  3 16:35:01 2011) [sssd[be[default]]] [sdap_save_group] (7): Adding original DN [cn=campus,cn=filegroups,cn=system,dc=uio,dc=no] to attributes of [campus].
(Sun Apr  3 16:35:01 2011) [sssd[be[default]]] [sdap_save_group] (7): Original USN value is not available for [campus].
(Sun Apr  3 16:35:01 2011) [sssd[be[default]]] [sdap_save_group] (6): Storing info for group campus


One last interesting bit:

1. I trigger the issue by running 'id user'
2. sssd_be goes berserk and starts allocating memory
3. I kill off sssd_be with 'kill -9'
4. The command 'id user' then returns with correct information(!)

Hopefully these observations are helpful in arriving at a diagnosis.

PS. What I don't understand is that sssd works fine and performs well for an hour or two, if the cache is removed before sssd is started. If sssd is started with a pre-existing cache, I'm able to trigger the issue at once.

Comment 12 George B. Magklaras 2011-04-04 22:52:32 UTC
(In reply to comment #11)

> PS. What I don't understand is that sssd works fine and performs well for an
> hour or two, if the cache is removed before sssd is started. If sssd is started
> with a pre-existing cache, I'm able to trigger the issue at once.

I second that question. Same behavior observed exactly.

GM

Comment 13 Stephen Gallagher 2011-04-05 14:49:31 UTC
Many thanks to Simo Sorce, who was able to track down the source of this problem today.

The problem is actually in libtdb, not the SSSD itself. When libtdb calls tdb_expand() to increase its size, it is doing so by creating space to store an extra hundred records of the size of the largest record it currently has.

In SSSD's case, we're dealing with records that can reach dozens of megabytes, so a single tdb_expand() call can suddenly grow the database by gigabytes.
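To put rough numbers on that, here is a back-of-the-envelope sketch (not libtdb source; the 20 MB record size is a hypothetical figure in the "dozens of megabytes" range described above):

  #include <stdio.h>

  int main(void)
  {
      /* Hypothetical largest-record size, in the "dozens of megabytes" range. */
      const unsigned long long largest_record = 20ULL * 1024 * 1024; /* 20 MB */

      /* Old heuristic: reserve room for an extra hundred records of that size. */
      const unsigned long long growth = 100 * largest_record;

      printf("largest record: %llu MB\n", largest_record >> 20);
      printf("space reserved by a single expansion: %llu MB (about %.1f GB)\n",
             growth >> 20, (double)growth / (1024.0 * 1024.0 * 1024.0));
      return 0;
  }

A few such expansions in a row are enough to reach the multi-gigabyte allocations seen in the reports above.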

Comment 16 Stephen Gallagher 2011-04-06 15:07:36 UTC
There are two parts to this issue. One part involves some bad realloc() logic
in libtdb. This bug will track that fix.

I will clone this bug back to SSSD to resolve the other issue (which involves
the SSSD's memberOf plugin).

Comment 18 Martin Prpič 2011-04-15 14:26:17 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
When processing very large database updates, occasionally, tdb incorrectly allocated unnecessarily large amounts of memory. As a result, an OOM (Out Of Memory) situation occurred. This was caused by libtdb calling the tdb_expand() function to increase its size by creating space to store an extra hundred records of the size of the largest record it currently has. With this update, this behavior has been removed and tdb no longer allocates unnecessarily large amounts of memory.

Comment 19 Namita Soman 2011-04-29 13:52:47 UTC
Simo provided a tool (attached here) to reproduce the issue, along with directions and expected results:
=========
the attached program can be used as a reproducer.

The test is easy:
- run once with the old tdb package, note the size of the file test.tdb
- run again with the new tdb package: see that the size of test.tdb is an
order of magnitude smaller.

If you want to keep both files around as proof, you can also set the
TEST_DB_PATH variable to put the test file in different directories without
having to mv them after each test. (You can also run the binary in
different dirs, as by default it puts the file in the dir you are
running the binary from.)

Compile it with:
gcc -o tdb_expand_test -ltdb tdb_expand_test.c
===========

Compiled, then ran the program with old and new libtdb.

Using libtdb-1.2.1-2.el6.x86_64, size was:
test.tdb    143933440

Using libtdb-1.2.1-3.el6.x86_64 size was:
test.tdb  15482880
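For illustration, here is a minimal sketch of such an expansion reproducer, assuming the standard tdb API (tdb_open, tdb_store, tdb_close). This is not Simo's attached tdb_expand_test.c, only the general idea: store one large record followed by many small ones, and compare the size of the resulting test.tdb between libtdb versions.

  /* Illustration only, not the attached tool.  Stores one large record
   * followed by many small ones; the size of the resulting test.tdb
   * reflects how aggressively the library's growth heuristic
   * over-allocates. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <fcntl.h>
  #include <tdb.h>

  int main(void)
  {
      struct tdb_context *tdb;
      unsigned char *big;
      size_t big_len = 1024 * 1024;  /* hypothetical 1 MB "huge" record */
      TDB_DATA key, val;
      int i;

      tdb = tdb_open("test.tdb", 0, TDB_DEFAULT,
                     O_RDWR | O_CREAT | O_TRUNC, 0600);
      if (tdb == NULL) {
          perror("tdb_open");
          return 1;
      }

      /* One large record, mimicking SSSD's very large group entries. */
      big = calloc(1, big_len);
      key.dptr = (unsigned char *) "big";
      key.dsize = 3;
      val.dptr = big;
      val.dsize = big_len;
      tdb_store(tdb, key, val, TDB_REPLACE);

      /* Many small records; each file expansion shows how much extra
       * space the growth heuristic reserves. */
      for (i = 0; i < 10000; i++) {
          char kbuf[32], vbuf[64];
          snprintf(kbuf, sizeof(kbuf), "key-%d", i);
          snprintf(vbuf, sizeof(vbuf), "value-%d", i);
          key.dptr = (unsigned char *) kbuf;
          key.dsize = strlen(kbuf);
          val.dptr = (unsigned char *) vbuf;
          val.dsize = strlen(vbuf);
          tdb_store(tdb, key, val, TDB_REPLACE);
      }

      tdb_close(tdb);
      free(big);
      printf("done; compare the size of test.tdb across libtdb versions\n");
      return 0;
  }

It compiles the same way as the attached tool, e.g. 'gcc -o tdb_expand_sketch tdb_expand_sketch.c -ltdb' (the file name here is hypothetical).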

Comment 20 Namita Soman 2011-04-29 13:53:49 UTC
Created attachment 495785 [details]
test tool to reproduce this issue

Comment 21 errata-xmlrpc 2011-05-19 14:30:54 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0808.html

