Red Hat Bugzilla – Bug 918394
sssd eats 99% CPU and runs out of file descriptors when clearing cache
Last modified: 2014-03-27 05:17:34 EDT
Description of problem:
Each time we clear the SSSD cache with sss_cache -U, sss_cache -G, or sss_cache -u <login>, the sssd_nss process ends up holding a few more file descriptors (fds). When the process reaches its fd_limit, sssd runs at 99% CPU and the system becomes unresponsive for every user-related task.

Version-Release number of selected component (if applicable):
rpm -qa | grep sssd
sssd-tools-1.9.2-82.el6.x86_64
sssd-client-1.9.2-82.el6.x86_64
sssd-1.9.2-82.el6.x86_64

How reproducible:
Every time we run sss_cache -U or sss_cache -u <login>, the number of open files increases, up to the fd_limit. Then sssd runs at 99% CPU and NSS lookups stop working.

Steps to Reproduce:
1. service sssd start #start service
2. watch "lsof -p `ps -ef | grep sssd_nss | grep -v grep | perl -l -a -n -F"\s+" -e 'print $F[1]'` | wc -l" #watch fds
3. sss_cache -U #clear cache several times and watch the number of fds

Actual results:
Increasing number of fds for the sssd_nss process

Expected results:
Constant number of fds for the sssd_nss process

Additional info:
The leaked fds all point to these files. lsof output:

sssd_nss 2090 root 8176u REG 8,1 6806312 3424241 /var/lib/sss/mc/passwd (deleted)
sssd_nss 2090 root 8177u REG 8,1 5206312 3424243 /var/lib/sss/mc/group (deleted)
sssd_nss 2090 root 8178u REG 8,1 6806312 3424242 /var/lib/sss/mc/passwd (deleted)
sssd_nss 2090 root 8179u REG 8,1 5206312 3424245 /var/lib/sss/mc/group (deleted)
sssd_nss 2090 root 8180u REG 8,1 6806312 3424247 /var/lib/sss/mc/passwd (deleted)
sssd_nss 2090 root 8181u REG 8,1 6806312 3424244 /var/lib/sss/mc/passwd (deleted)
sssd_nss 2090 root 8182u REG 8,1 5206312 3424246 /var/lib/sss/mc/group (deleted)
sssd_nss 2090 root 8183u REG 8,1 5206312 3424248 /var/lib/sss/mc/group (deleted)
sssd_nss 2090 root 8184u REG 8,1 5206312 3424250 /var/lib/sss/mc/group (deleted)
sssd_nss 2090 root 8185u REG 8,1 6806312 3424251 /var/lib/sss/mc/passwd (deleted)
sssd_nss 2090 root 8186u REG 8,1 5206312 3424252 /var/lib/sss/mc/group (deleted)
sssd_nss 2090 root 8187u REG 8,1 6806312 3424253 /var/lib/sss/mc/passwd (deleted)
sssd_nss 2090 root 8188u REG 8,1 5206312 3424254 /var/lib/sss/mc/group (deleted)
sssd_nss 2090 root 8189u REG 8,1 6806312 11493377 /var/lib/sss/mc/passwd (deleted)
sssd_nss 2090 root 8190u REG 8,1 6806312 3424255 /var/lib/sss/mc/passwd (deleted)
sssd_nss 2090 root 8191u REG 8,1 5206312 3424256 /var/lib/sss/mc/group (deleted)

The reason for the CPU usage is the error handling after epoll_wait(). strace output:

epoll_wait(5, {{EPOLLIN, {u32=24633616, u64=24633616}}}, 1, 40403) = 1
accept(23, 0x149b38e0, [110]) = -1 EMFILE (Too many open files)
epoll_wait(5, {{EPOLLIN, {u32=24633616, u64=24633616}}}, 1, 40403) = 1
accept(23, 0x149b38e0, [110]) = -1 EMFILE (Too many open files)
epoll_wait(5, {{EPOLLIN, {u32=24633616, u64=24633616}}}, 1, 40403) = 1
accept(23, 0x149b38e0, [110]) = -1 EMFILE (Too many open files)
epoll_wait(5, {{EPOLLIN, {u32=24633616, u64=24633616}}}, 1, 40403) = 1
accept(23, 0x149b38e0, [110]) = -1 EMFILE (Too many open files)
epoll_wait(5, {{EPOLLIN, {u32=24633616, u64=24633616}}}, 1, 40403) = 1
accept(23, 0x149b38e0, [110]) = -1 EMFILE (Too many open files)
epoll_wait(5, {{EPOLLIN, {u32=24633616, u64=24633616}}}, 1, 40403) = 1
accept(23, 0x149b38e0, [110]) = -1 EMFILE (Too many open files)
epoll_wait(5, {{EPOLLIN, {u32=24633616, u64=24633616}}}, 1, 40376) = 1

Workaround:
We set fd_limit in the [nss] section of sssd.conf to a far higher value than needed and have our NMS restart sssd when it approaches the limit.

[nss]
entry_negative_timeout = 0
debug_level = 0x1310
fd_limit=200000

This is not yet fixed in the packages in this repo:

[sssd-1.9-RHEL6.3]
name=SSSD 1.9.x built for latest stable RHEL
baseurl=http://repos.fedorapeople.org/repos/jhrozek/sssd/epel-6/$basearch/
enabled=1
skip_if_unavailable=1
gpgcheck=0
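The epoll_wait()/accept() pattern in the strace output can be reproduced outside sssd. The following is a minimal Python sketch, not sssd code: it assumes Linux, uses a throwaway Unix socket, and lowers the soft RLIMIT_NOFILE of the current process for the duration of the call. The function name count_emfile_wakeups and its parameters are made up for illustration. It shows that while accept() keeps failing with EMFILE, the pending connection stays queued, so a level-triggered epoll reports the listening socket as readable again right away instead of waiting for the timeout.

```python
import errno
import os
import resource
import select
import socket
import tempfile


def count_emfile_wakeups(rounds=5, soft_limit=64):
    """Count epoll wakeups that end in accept() failing with EMFILE."""
    old_soft, old_hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if old_soft != resource.RLIM_INFINITY:
        soft_limit = min(soft_limit, old_soft)
    wakeups = 0
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.sock")
        listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        listener.bind(path)
        listener.listen(8)
        listener.setblocking(False)
        client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        client.connect(path)  # one connection now waits in the accept queue
        poller = select.epoll()  # level-triggered unless EPOLLET is requested
        poller.register(listener.fileno(), select.EPOLLIN)
        hogs = []
        try:
            # Lower the soft limit, then use up every remaining descriptor.
            # This stands in for the leaked memcache fds in the report.
            resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, old_hard))
            try:
                while True:
                    hogs.append(os.dup(listener.fileno()))
            except OSError as exc:
                if exc.errno != errno.EMFILE:
                    raise
            for _ in range(rounds):
                if not poller.poll(1.0):
                    break  # no event within 1 s: the loop is not spinning
                try:
                    conn, _ = listener.accept()
                except OSError as exc:
                    if exc.errno != errno.EMFILE:
                        raise
                    wakeups += 1  # connection still queued, socket still readable
                else:
                    conn.close()
                    break
        finally:
            for fd in hogs:
                os.close(fd)
            resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, old_hard))
            poller.close()
            client.close()
            listener.close()
    return wakeups


if __name__ == "__main__":
    print(count_emfile_wakeups(rounds=5))
```

Every round that ends in EMFILE corresponds to one epoll_wait()/accept() pair in the strace output above. An event loop that simply retries in this state never sleeps, which matches the 99% CPU observation.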
I can reproduce. Thank you for the bug report.
Upstream ticket: https://fedorahosted.org/sssd/ticket/1826
(In reply to comment #0)
> [sssd-1.9-RHEL6.3]
> name=SSSD 1.9.x built for latest stable RHEL
> baseurl=http://repos.fedorapeople.org/repos/jhrozek/sssd/epel-6/$basearch/
> enabled=1
> skip_if_unavailable=1
> gpgcheck=0

Harald, thank you for testing the packages from this repository. But the repository was intended just as a preview for testing purposes. Please do not rely on that repo for production systems.
We are not using these packages. I only wanted to point out that the problem is not yet fixed in your latest testing package.
Fixed upstream.
Tested with sssd-1.9.2-128.el6.x86_64

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: [ LOG ] :: sssd etas 99% CPU and runs out of file descriptors when clearing cache BZ 918394
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: [ PASS ] :: Running 'getent passwd sssduser1| grep sssduser1' (Expected 0, got 0)
:: [ PASS ] :: sssd_nss is not leaking FDs
:: [ PASS ] :: sssd_nss is not leaking FDs
:: [ PASS ] :: sssd_nss is not leaking FDs
:: [ PASS ] :: sssd_nss is not leaking FDs
:: [ PASS ] :: sssd_nss is not leaking FDs
:: [ LOG ] :: Duration: 5s
:: [ LOG ] :: Assertions: 6 good, 0 bad
:: [ PASS ] :: RESULT: sssd etas 99% CPU and runs out of file descriptors when clearing cache BZ 918394
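The log does not show how the "sssd_nss is not leaking FDs" assertion is implemented. One possible way to check for the leak from comment #0 is sketched below in Python. This is an assumption about how such a check could look, not the actual test code: it reads the fd symlinks under /proc/<pid>/fd (Linux only) and counts those that point at deleted files under /var/lib/sss/mc/, which is the pattern seen in the lsof output. The helper names are hypothetical.

```python
import os

# Directory of the memcache files seen in the lsof output of comment #0.
MEMCACHE_DIR = "/var/lib/sss/mc/"


def count_deleted_memcache(targets):
    """Count fd targets that are deleted files under MEMCACHE_DIR."""
    return sum(
        1
        for target in targets
        if target.startswith(MEMCACHE_DIR) and target.endswith(" (deleted)")
    )


def fd_targets(pid):
    """Return the symlink targets of all open fds of a process (Linux /proc)."""
    fd_dir = "/proc/{}/fd".format(pid)
    targets = []
    for name in os.listdir(fd_dir):
        try:
            targets.append(os.readlink(os.path.join(fd_dir, name)))
        except OSError:
            continue  # the fd was closed between listdir() and readlink()
    return targets


if __name__ == "__main__":
    import sys

    # Usage: pass the PID of sssd_nss; run as root to read its /proc entry.
    print(count_deleted_memcache(fd_targets(int(sys.argv[1]))))
```

Running this before and after several sss_cache -U calls should give the same count on a fixed package, and a growing count on an affected one.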
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1680.html