863131 – sssd_nss process hangs, stuck in loop; "self restart" does recover, but old process hangs around using 100% CPU

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 863131 - sssd_nss process hangs, stuck in loop; "self restart" does recover, but old process hangs around using 100% CPU

Summary: sssd_nss process hangs, stuck in loop; "self restart" does recover, but old p...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	sssd
Sub Component:
Version:	6.4
Hardware:	i686
OS:	Linux
Priority:	unspecified
Severity:	unspecified
Target Milestone:	rc
Target Release:	---
Assignee:	Jakub Hrozek
QA Contact:	Kaushik Banerjee
Docs Contact:
URL:
Whiteboard:
Depends On:	814513
Blocks:
TreeView+	depends on / blocked

Reported:	2012-10-04 13:47 UTC by Dmitri Pal
Modified:	2020-05-04 10:33 UTC (History)
CC List:	7 users (show)
Fixed In Version:	sssd-1.9.2-1.el6
Doc Type:	Bug Fix
Doc Text:	No Documentation Needed
Clone Of:	814513
Environment:
Last Closed:	2013-02-21 09:37:36 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	SSSD sssd issues 2593	0	None	None	None	2020-05-04 10:33:33 UTC
Red Hat Product Errata	RHSA-2013:0508	0	normal	SHIPPED_LIVE	Low: sssd security, bug fix and enhancement update	2013-02-20 21:30:10 UTC

Description Dmitri Pal 2012-10-04 13:47:20 UTC

+++ This bug was initially created as a clone of Bug #814513 +++

Created attachment 578851 [details]
sssd gdb backtrace and step

Description of problem:

The sssd_nss process, shown in ps as:
/usr/libexec/sssd/sssd_nss -d 0 --debug-to-files

will occasionally hang, using 100% CPU. sssd's "self restart" system automatically starts a new process, as noted in /var/log/messages:
Apr 18 19:43:10 blog2 sssd[nss]: Starting up

so the system does continue to function normally. However, the old process will continue to run until killed, using 100% CPU.

Version-Release number of selected component (if applicable):

$ yum list sssd
sssd.i386 1.5.1-49.el5_8.1 installed

$ rpm -qi sssd
Name        : sssd                         Relocations: (not relocatable)
Version     : 1.5.1                             Vendor: CentOS
Release     : 49.el5_8.1                    Build Date: Mon 02 Apr 2012 02:00:34 PM EDT
Install Date: Fri 06 Apr 2012 11:43:52 PM EDT      Build Host: builder10.centos.org
Group       : Applications/System           Source RPM: sssd-1.5.1-49.el5_8.1.src.rpm
Size        : 3680458                          License: GPLv3+
Signature   : DSA/SHA1, Mon 02 Apr 2012 02:12:41 PM EDT, Key ID a8a447dce8562897

How reproducible:

Unknown.

This intermittently, but somewhat regularly, occurs on a particular class of servers I run. The servers are running several (3 separate) apache httpd instances and some custom perl scripts. I don't know what exactly the servers are doing that triggers this bug. The bug is triggered approximately once a week.

Actual results:

sssd process hangs, stuck in a loop.

Expected results:

sssd doesn't hang.


Additional info:

* nothing in /var/log/sssd log files
* no output from strace
sudo strace -p 19311
Process 19311 attached - interrupt to quit
# nothing more

* attached gdb
sudo yum install gdb
wget http://debuginfo.centos.org/5/i386/sssd-debuginfo-1.5.1-49.el5_8.1.i386.rpm
wget http://debuginfo.centos.org/5/i386/glibc-debuginfo-2.5-81.el5_8.1.i686.rpm
wget http://debuginfo.centos.org/5/i386/glibc-debuginfo-common-2.5-81.el5_8.1.i386.rpm
wget http://debuginfo.centos.org/5/i386/libtevent-debuginfo-0.9.8-10.el5.i386.rpm
sudo rpm -ivh glibc-debuginfo-2.5-81.el5_8.1.i686.rpm glibc-debuginfo-common-2.5-81.el5_8.1.i386.rpm libtevent-debuginfo-0.9.8-10.el5.i386.rpm sssd-debuginfo-1.5.1-49.el5_8.1.i386.rpm

* see attachment for gdb info

--- Additional comment from jhrozek on 2012-04-20 08:30:31 EDT ---

Thank you for the back trace, it is very useful.

It looks like the NSS responder is stuck in a loop because there are no results to return, but the current code should catch that case..

Is it possible to get SSSD logs from the time the bug has happened? Could you put "debug_level = 7" into the "nss" and "domain" sections of the sssd.conf, restart SSSD and then attach both /var/log/sssd/sssd_nss.log and /var/log/sssd/sssd_domain.log ?

The code in question is also only called when enumeration is set to True - you haven't attached your sssd.conf but I assume you are using enumerate=True? A short-term workaround might be to turn enumeration off and only use direct lookups, if that's a feasible option in your environment.

--- Additional comment from sgallagh on 2012-07-09 11:06:14 EDT ---

This ticket has been in NEEDSINFO for more than two months. Closing it as INSUFFICIENT_DATA. If the requested information is provided, please reopen this bug.

--- Additional comment from sterling.daniel on 2012-09-24 15:50:29 EDT ---

Created attachment 616710 [details]
patch with additional debugging statements and workaround

OK, this bug is still affecting me. Due to its intermittent nature, it's difficult to reproduce and debug, but I do finally have some additional debugging info that may be useful, and I also have a bit of a workaround.

I created a patch that enables debugging when the issue is hit, and also adds debugging lines to show some of the values and codepaths being exercised. The patch also adds an iterator that bails out of the loop after 100,000 iterations, which works around the bug. (See the attachment for the patch.)

With the patch in place, I get the following lines in the sssd_nss.log:

(Sun Sep 23 16:11:26 2012) [sssd[nss]] [nss_cmd_retpwent] (7): top of nss_cmd_retpwent loop, i: 1
(Sun Sep 23 16:11:26 2012) [sssd[nss]] [nss_cmd_retpwent] (7): nss_cmd_retpwent n: -1, pdom->res->count: 13, cctx->pwent_cur: 14
(Sun Sep 23 16:11:26 2012) [sssd[nss]] [nss_cmd_retpwent] (7): nss_cmd_retpwent n: -1, num: 256
(Sun Sep 23 16:11:26 2012) [sssd[nss]] [nss_cmd_retpwent] (7): bottom of nss_cmd_retpwent loop
(Sun Sep 23 16:11:26 2012) [sssd[nss]] [nss_cmd_retpwent] (7): top of nss_cmd_retpwent loop, i: 2
(Sun Sep 23 16:11:26 2012) [sssd[nss]] [nss_cmd_retpwent] (7): nss_cmd_retpwent n: -1, pdom->res->count: 13, cctx->pwent_cur: 14
(Sun Sep 23 16:11:26 2012) [sssd[nss]] [nss_cmd_retpwent] (7): nss_cmd_retpwent n: -1, num: 256
(Sun Sep 23 16:11:26 2012) [sssd[nss]] [nss_cmd_retpwent] (7): bottom of nss_cmd_retpwent loop
(Sun Sep 23 16:11:26 2012) [sssd[nss]] [nss_cmd_retpwent] (7): top of nss_cmd_retpwent loop, i: 3
(Sun Sep 23 16:11:26 2012) [sssd[nss]] [nss_cmd_retpwent] (7): nss_cmd_retpwent n: -1, pdom->res->count: 13, cctx->pwent_cur: 14
(Sun Sep 23 16:11:26 2012) [sssd[nss]] [nss_cmd_retpwent] (7): nss_cmd_retpwent n: -1, num: 256
(Sun Sep 23 16:11:26 2012) [sssd[nss]] [nss_cmd_retpwent] (7): bottom of nss_cmd_retpwent loop

... and so on until it bails out:

(Sun Sep 23 16:11:34 2012) [sssd[nss]] [nss_cmd_retpwent] (7): top of nss_cmd_retpwent loop, i: 99999
(Sun Sep 23 16:11:34 2012) [sssd[nss]] [nss_cmd_retpwent] (7): nss_cmd_retpwent n: -1, pdom->res->count: 13, cctx->pwent_cur: 14
(Sun Sep 23 16:11:34 2012) [sssd[nss]] [nss_cmd_retpwent] (7): nss_cmd_retpwent n: -1, num: 256
(Sun Sep 23 16:11:34 2012) [sssd[nss]] [nss_cmd_retpwent] (7): bottom of nss_cmd_retpwent loop
(Sun Sep 23 16:11:34 2012) [sssd[nss]] [nss_cmd_retpwent] (7): top of nss_cmd_retpwent loop, i: 100000
(Sun Sep 23 16:11:34 2012) [sssd[nss]] [nss_cmd_retpwent] (7): out of nss_cmd_retpwent loop: ret: 2, i: 100000
(Sun Sep 23 16:11:34 2012) [sssd[nss]] [nss_cmd_retpwent] (0): nss_cmd_retpwent returning EINVAL due to runaway loop check

Walking through the code in responder/nss/nsssrv_cmd.c, in the nss_cmd_retpwent function:

static int nss_cmd_retpwent(struct cli_ctx *cctx, int num)

Before the loop:
    nctx = talloc_get_type(cctx->rctx->pvt_ctx, struct nss_ctx);
    pctx = nctx->pctx;

The following loop runs once before the debugging kicks in (see the attached patched). In normal circumstances, this loop never runs more than once; I've only seen this loop actually loop when the bug is hit. So, the values in the logs are from the second time the loop is run through:
    while (ret == ENOENT) {

This conditional doesn't cause us to exit the loop:
        if (cctx->pwent_dom_idx >= pctx->num) break;

n is set:
        pdom = &pctx->doms[cctx->pwent_dom_idx];
        n = pdom->res->count - cctx->pwent_cur;

n is now -1, which is (I assume) unexpected! However, the following code doesn't run (based on the lack of debugging output), because (cctx->pwent_dom_idx+1 < pctx->num) is false:

        if (n <= 0 && (cctx->pwent_dom_idx+1 < pctx->num)) {
            cctx->pwent_dom_idx++;
            pdom = &pctx->doms[cctx->pwent_dom_idx];
            n = pdom->res->count;
            cctx->pwent_cur = 0;
            DEBUG(7, ("nss_cmd_retpwent n reset, now: %i\n", n));
        }

This conditional doesn't throw us out of the loop because n isn't 0, even though it's negative:
        if (!n) break;

This also doesn't reset n:
        if (n > num) n = num;

And the call to fill_pwent is a basically a noop because n is negative, however it does reset n to 0:
        msgs = &(pdom->res->msgs[cctx->pwent_cur]);
        ret = fill_pwent(cctx->creq->out, pdom->domain, nctx, true, msgs, &n);

cctx->pwent_cur is therefore unchanged here:
        cctx->pwent_cur += n;

End of while loop:
    }

So, the loop never stops because n starts out as -1 and then is reset to 0 by the fill_pwent call, which prevents the usual conditional checks from breaking out of the loop.

I don't pretend to understand all the interactions here, but I'm hoping these details provide some context to someone more knowledgeable with the codebase.

If you need me to provide additional debugging info, please let me know.

Thanks,
Dan

--- Additional comment from sterling.daniel on 2012-09-24 15:50:51 EDT ---

reopening per comment #2

--- Additional comment from jhrozek on 2012-09-27 12:22:51 EDT ---

Thank you for the very detailed debug info Daniel. At first glance I agree with your assessment about n being negative breaking the loop.

I'll clone the issue upstream.

Thanks again.

--- Additional comment from jhrozek on 2012-09-27 12:27:54 EDT ---

Upstream ticket:
https://fedorahosted.org/sssd/ticket/1551

Comment 3 Kaushik Banerjee 2013-01-29 12:09:28 UTC

Marking this bug as Verified Sanity Only as all tests pass with version 1.9.2-82

Comment 4 errata-xmlrpc 2013-02-21 09:37:36 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0508.html

Note You need to log in before you can comment on or make changes to this bug.