Bug 918394

Summary: sssd eats 99% CPU and runs out of file descriptors when clearing cache
Product: Red Hat Enterprise Linux 6 Reporter: Harald Strack <hstrack>
Component: sssd Assignee: Jakub Hrozek <jhrozek>
Status: CLOSED ERRATA QA Contact: Kaushik Banerjee <kbanerje>
Severity: high Docs Contact:
Priority: unspecified    
Version: 6.4 CC: grajaiya, hstrack, jgalipea, lamar.folsom, lnovich, mkosek, nkarandi, pbrezina
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: sssd-1.9.2-112.el6 Doc Type: Bug Fix
Doc Text:
Cause: SSSD did not close the file descriptor to the memory cache when the memory cache was reset with the sss_cache tool.
Consequence: Running sss_cache resulted in a file descriptor leak.
Fix: SSSD was amended so that the file descriptor to the memory cache is closed correctly.
Result: Running sss_cache no longer results in a file descriptor leak.
Last Closed: 2013-11-21 22:15:25 UTC Type: Bug

Description Harald Strack 2013-03-06 07:24:12 UTC
Description of problem:
When we clear the SSSD cache with sss_cache -U, sss_cache -G, or sss_cache -u <login>, the sssd_nss process holds a few more fds open after each run. When the process reaches its fd_limit, sssd runs at 99% CPU and the system becomes unresponsive for every user-related task.


Version-Release number of selected component (if applicable):
rpm -qa | grep sssd
sssd-tools-1.9.2-82.el6.x86_64
sssd-client-1.9.2-82.el6.x86_64
sssd-1.9.2-82.el6.x86_64


How reproducible:
Every time we run

sss_cache -U or sss_cache -u <login>

the number of open files increases until it reaches the fd_limit. Then sssd runs at 99% CPU and NSS lookups no longer work.

Steps to Reproduce:
1. service sssd start #start service
2. watch "lsof -p `ps -ef | grep sssd_nss | grep -v grep |  perl -l -a -n -F"\s+" -e 'print $F[1]'` | wc -l" #watch fds
3. sss_cache -U #clear cache several times and watch the number of fds
  
Actual results:
Increasing number of fds for the sssd_nss process

Expected results:
Constant number of fds for the sssd_nss process
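
A lighter way to watch the descriptor count than the lsof pipeline in step 2 is to count the entries in /proc/<pid>/fd. This is only a sketch: it assumes Linux, root privileges (sssd_nss runs as root), and the helper names count_fds and count_deleted_mc are made up for illustration.

```shell
#!/bin/sh
# count_fds PID: print the number of open file descriptors of a process
# by counting the entries in /proc/PID/fd (Linux only).
count_fds() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# count_deleted_mc PID: print how many descriptors still point at
# deleted memory-cache files under /var/lib/sss/mc.
# grep -c exits non-zero when the count is 0, hence the "|| true".
count_deleted_mc() {
    ls -l "/proc/$1/fd" 2>/dev/null | grep -c '/var/lib/sss/mc/.*(deleted)' || true
}
```

For example, running count_fds "$(pgrep -x sssd_nss)" before and after each sss_cache -U should print the same number on a fixed package and a growing number on an affected one.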


Additional info:
The leaked fds all point to these files (lsof output):

sssd_nss 2090 root 8176u   REG                8,1   6806312    3424241 /var/lib/sss/mc/passwd (deleted)
sssd_nss 2090 root 8177u   REG                8,1   5206312    3424243 /var/lib/sss/mc/group (deleted)
sssd_nss 2090 root 8178u   REG                8,1   6806312    3424242 /var/lib/sss/mc/passwd (deleted)
sssd_nss 2090 root 8179u   REG                8,1   5206312    3424245 /var/lib/sss/mc/group (deleted)
sssd_nss 2090 root 8180u   REG                8,1   6806312    3424247 /var/lib/sss/mc/passwd (deleted)
sssd_nss 2090 root 8181u   REG                8,1   6806312    3424244 /var/lib/sss/mc/passwd (deleted)
sssd_nss 2090 root 8182u   REG                8,1   5206312    3424246 /var/lib/sss/mc/group (deleted)
sssd_nss 2090 root 8183u   REG                8,1   5206312    3424248 /var/lib/sss/mc/group (deleted)
sssd_nss 2090 root 8184u   REG                8,1   5206312    3424250 /var/lib/sss/mc/group (deleted)
sssd_nss 2090 root 8185u   REG                8,1   6806312    3424251 /var/lib/sss/mc/passwd (deleted)
sssd_nss 2090 root 8186u   REG                8,1   5206312    3424252 /var/lib/sss/mc/group (deleted)
sssd_nss 2090 root 8187u   REG                8,1   6806312    3424253 /var/lib/sss/mc/passwd (deleted)
sssd_nss 2090 root 8188u   REG                8,1   5206312    3424254 /var/lib/sss/mc/group (deleted)
sssd_nss 2090 root 8189u   REG                8,1   6806312   11493377 /var/lib/sss/mc/passwd (deleted)
sssd_nss 2090 root 8190u   REG                8,1   6806312    3424255 /var/lib/sss/mc/passwd (deleted)
sssd_nss 2090 root 8191u   REG                8,1   5206312    3424256 /var/lib/sss/mc/group (deleted)
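
The "(deleted)" marker means the file was unlinked while a descriptor to it was still open. The sketch below reproduces that state with a temporary file. It shows general Linux behaviour, not sssd's code; the function name demo_deleted_fd and the choice of descriptor 9 are arbitrary.

```shell
#!/bin/sh
# demo_deleted_fd: show a descriptor that outlives its unlinked file.
demo_deleted_fd() {
    tmp=$(mktemp)
    exec 9>"$tmp"               # open the file on descriptor 9
    rm -f "$tmp"                # unlink it; the inode survives while fd 9 is open
    readlink /proc/self/fd/9    # readlink inherits fd 9; target ends in " (deleted)"
    exec 9>&-                   # closing the descriptor finally releases the inode
}
```

Each leaked entry in the lsof output corresponds to a descriptor for which the closing step never happened.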


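The CPU usage comes from the error handling after epoll_wait(): accept() fails with EMFILE, the pending connection stays queued, and the next epoll_wait() returns immediately (strace output):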

epoll_wait(5, {{EPOLLIN, {u32=24633616, u64=24633616}}}, 1, 40403) = 1
accept(23, 0x149b38e0, [110])           = -1 EMFILE (Too many open files)
epoll_wait(5, {{EPOLLIN, {u32=24633616, u64=24633616}}}, 1, 40403) = 1
accept(23, 0x149b38e0, [110])           = -1 EMFILE (Too many open files)
epoll_wait(5, {{EPOLLIN, {u32=24633616, u64=24633616}}}, 1, 40403) = 1
accept(23, 0x149b38e0, [110])           = -1 EMFILE (Too many open files)
epoll_wait(5, {{EPOLLIN, {u32=24633616, u64=24633616}}}, 1, 40403) = 1
accept(23, 0x149b38e0, [110])           = -1 EMFILE (Too many open files)
epoll_wait(5, {{EPOLLIN, {u32=24633616, u64=24633616}}}, 1, 40403) = 1
accept(23, 0x149b38e0, [110])           = -1 EMFILE (Too many open files)
epoll_wait(5, {{EPOLLIN, {u32=24633616, u64=24633616}}}, 1, 40403) = 1
accept(23, 0x149b38e0, [110])           = -1 EMFILE (Too many open files)
epoll_wait(5, {{EPOLLIN, {u32=24633616, u64=24633616}}}, 1, 40376) = 1
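
To see how close a process is to the point where accept() starts failing with EMFILE, its descriptor count can be compared with the soft "Max open files" value in /proc/<pid>/limits. This is a sketch assuming Linux; soft_nofile and fd_headroom are hypothetical helper names.

```shell
#!/bin/sh
# soft_nofile PID: print the soft "Max open files" limit of a process.
# The line reads: "Max open files  <soft>  <hard>  files".
soft_nofile() {
    awk '/^Max open files/ { print $4 }' "/proc/$1/limits"
}

# fd_headroom PID: print how many more descriptors the process can open
# before it hits its soft limit.
fd_headroom() {
    limit=$(soft_nofile "$1")
    used=$(ls "/proc/$1/fd" 2>/dev/null | wc -l)
    echo $((limit - used))
}
```

A headroom of 0 for sssd_nss matches the state captured in the strace output above.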

Workaround:
We set fd_limit in the [nss] section of sssd.conf to a deliberately excessive value
and have our NMS restart sssd when it approaches the limit.

[nss]
entry_negative_timeout = 0
debug_level = 0x1310
fd_limit=200000
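
The restart check could look roughly like the sketch below. It is not the NMS check we actually run: the 90% threshold and the names near_limit and check_sssd_nss are assumptions, and it expects Linux /proc, pgrep, and the RHEL 6 service command.

```shell
#!/bin/sh
# near_limit USED LIMIT PERCENT: succeed when USED is at or above
# PERCENT percent of LIMIT (integer arithmetic only).
near_limit() {
    [ $(( $1 * 100 )) -ge $(( $2 * $3 )) ]
}

# check_sssd_nss: restart sssd when sssd_nss uses at least 90%
# (assumed threshold) of its soft open-files limit.
check_sssd_nss() {
    pid=$(pgrep -x sssd_nss | head -n 1)
    [ -n "$pid" ] || return 0          # nothing to do if sssd_nss is not running
    used=$(ls "/proc/$pid/fd" | wc -l)
    limit=$(awk '/^Max open files/ { print $4 }' "/proc/$pid/limits")
    if near_limit "$used" "$limit" 90; then
        service sssd restart
    fi
}
```

Calling check_sssd_nss as root from cron or a monitoring agent only postpones the problem; the leak itself stays until the package is fixed.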

This is not yet fixed in the packages from this repo:

[sssd-1.9-RHEL6.3]
name=SSSD 1.9.x built for latest stable RHEL
baseurl=http://repos.fedorapeople.org/repos/jhrozek/sssd/epel-6/$basearch/
enabled=1
skip_if_unavailable=1
gpgcheck=0

Comment 1 Jakub Hrozek 2013-03-06 10:19:48 UTC
I can reproduce. Thank you for the bug report.

Comment 2 Jakub Hrozek 2013-03-06 10:22:06 UTC
Upstream ticket:
https://fedorahosted.org/sssd/ticket/1826

Comment 3 Jakub Hrozek 2013-03-07 16:21:11 UTC
(In reply to comment #0)
> [sssd-1.9-RHEL6.3]
> name=SSSD 1.9.x built for latest stable RHEL
> baseurl=http://repos.fedorapeople.org/repos/jhrozek/sssd/epel-6/$basearch/
> enabled=1
> skip_if_unavailable=1
> gpgcheck=0

Harald, thank you for testing the packages from this repository. But the repository was intended just as a preview for testing purposes. Please do not rely on that repo for production systems.

Comment 4 Harald Strack 2013-03-07 16:33:15 UTC
We are not using these packages. I only wanted to point out that the problem is not yet fixed in your latest testing package.

Comment 5 Jakub Hrozek 2013-05-10 15:10:25 UTC
Fixed upstream.

Comment 11 Nirupama Karandikar 2013-10-25 12:14:21 UTC
Tested with sssd-1.9.2-128.el6.x86_64

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: [   LOG    ] :: sssd etas 99% CPU and runs out of file descriptors when clearing cache BZ 918394
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

:: [   PASS   ] :: Running 'getent passwd sssduser1| grep sssduser1' (Expected 0, got 0)
:: [   PASS   ] :: sssd_nss is not leaking FDs 
:: [   PASS   ] :: sssd_nss is not leaking FDs 
:: [   PASS   ] :: sssd_nss is not leaking FDs 
:: [   PASS   ] :: sssd_nss is not leaking FDs 
:: [   PASS   ] :: sssd_nss is not leaking FDs 
:: [   LOG    ] :: Duration: 5s
:: [   LOG    ] :: Assertions: 6 good, 0 bad
:: [   PASS   ] :: RESULT: sssd etas 99% CPU and runs out of file descriptors when clearing cache BZ 918394

Comment 12 errata-xmlrpc 2013-11-21 22:15:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1680.html