Bug 814513
Summary: | sssd_nss process hangs, stuck in loop; "self restart" does recover, but old process hangs around using 100% CPU | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Daniel S. Sterling <sterling.daniel> | ||||||||
Component: | sssd | Assignee: | Jakub Hrozek <jhrozek> | ||||||||
Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | IDM QE LIST <seceng-idm-qe-list> | ||||||||
Severity: | unspecified | Docs Contact: | |||||||||
Priority: | unspecified | ||||||||||
Version: | 5.8 | CC: | dpal, grajaiya, jgalipea, jhrozek, pbrezina, prc | ||||||||
Target Milestone: | rc | Keywords: | Reopened | ||||||||
Target Release: | --- | ||||||||||
Hardware: | i686 | ||||||||||
OS: | Linux | ||||||||||
Whiteboard: | |||||||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||||||
Doc Text: | Story Points: | --- | |||||||||
Clone Of: | |||||||||||
: | 863131 (view as bug list) | Environment: | |||||||||
Last Closed: | 2012-12-13 08:28:33 UTC | Type: | Bug | ||||||||
Regression: | --- | Mount Type: | --- | ||||||||
Documentation: | --- | CRM: | |||||||||
Verified Versions: | Category: | --- | |||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||
Embargoed: | |||||||||||
Bug Depends On: | |||||||||||
Bug Blocks: | 863131 | ||||||||||
Attachments: |
|
Description
Daniel S. Sterling
2012-04-20 03:35:01 UTC
Thank you for the back trace, it is very useful. It looks like the NSS responder is stuck in a loop because there are no results to return, but the current code should catch that case.. Is it possible to get SSSD logs from the time the bug has happened? Could you put "debug_level = 7" into the "nss" and "domain" sections of the sssd.conf, restart SSSD and then attach both /var/log/sssd/sssd_nss.log and /var/log/sssd/sssd_domain.log ? The code in question is also only called when enumeration is set to True - you haven't attached your sssd.conf but I assume you are using enumerate=True? A short-term workaround might be to turn enumeration off and only use direct lookups, if that's a feasible option in your environment. This ticket has been in NEEDSINFO for more than two months. Closing it as INSUFFICIENT_DATA. If the requested information is provided, please reopen this bug. Created attachment 616710 [details]
patch with additional debugging statements and workaround
OK, this bug is still affecting me. Due to its intermittent nature, it's difficult to reproduce and debug, but I do finally have some additional debugging info that may be useful, and I also have a bit of a workaround.
I created a patch that enables debugging when the issue is hit, and also adds debugging lines to show some of the values and codepaths being exercised. The patch also adds an iterator that bails out of the loop after 100,000 iterations, which works around the bug. (See the attachment for the patch.)
With the patch in place, I get the following lines in the sssd_nss.log:
(Sun Sep 23 16:11:26 2012) [sssd[nss]] [nss_cmd_retpwent] (7): top of nss_cmd_retpwent loop, i: 1
(Sun Sep 23 16:11:26 2012) [sssd[nss]] [nss_cmd_retpwent] (7): nss_cmd_retpwent n: -1, pdom->res->count: 13, cctx->pwent_cur: 14
(Sun Sep 23 16:11:26 2012) [sssd[nss]] [nss_cmd_retpwent] (7): nss_cmd_retpwent n: -1, num: 256
(Sun Sep 23 16:11:26 2012) [sssd[nss]] [nss_cmd_retpwent] (7): bottom of nss_cmd_retpwent loop
(Sun Sep 23 16:11:26 2012) [sssd[nss]] [nss_cmd_retpwent] (7): top of nss_cmd_retpwent loop, i: 2
(Sun Sep 23 16:11:26 2012) [sssd[nss]] [nss_cmd_retpwent] (7): nss_cmd_retpwent n: -1, pdom->res->count: 13, cctx->pwent_cur: 14
(Sun Sep 23 16:11:26 2012) [sssd[nss]] [nss_cmd_retpwent] (7): nss_cmd_retpwent n: -1, num: 256
(Sun Sep 23 16:11:26 2012) [sssd[nss]] [nss_cmd_retpwent] (7): bottom of nss_cmd_retpwent loop
(Sun Sep 23 16:11:26 2012) [sssd[nss]] [nss_cmd_retpwent] (7): top of nss_cmd_retpwent loop, i: 3
(Sun Sep 23 16:11:26 2012) [sssd[nss]] [nss_cmd_retpwent] (7): nss_cmd_retpwent n: -1, pdom->res->count: 13, cctx->pwent_cur: 14
(Sun Sep 23 16:11:26 2012) [sssd[nss]] [nss_cmd_retpwent] (7): nss_cmd_retpwent n: -1, num: 256
(Sun Sep 23 16:11:26 2012) [sssd[nss]] [nss_cmd_retpwent] (7): bottom of nss_cmd_retpwent loop
... and so on until it bails out:
(Sun Sep 23 16:11:34 2012) [sssd[nss]] [nss_cmd_retpwent] (7): top of nss_cmd_retpwent loop, i: 99999
(Sun Sep 23 16:11:34 2012) [sssd[nss]] [nss_cmd_retpwent] (7): nss_cmd_retpwent n: -1, pdom->res->count: 13, cctx->pwent_cur: 14
(Sun Sep 23 16:11:34 2012) [sssd[nss]] [nss_cmd_retpwent] (7): nss_cmd_retpwent n: -1, num: 256
(Sun Sep 23 16:11:34 2012) [sssd[nss]] [nss_cmd_retpwent] (7): bottom of nss_cmd_retpwent loop
(Sun Sep 23 16:11:34 2012) [sssd[nss]] [nss_cmd_retpwent] (7): top of nss_cmd_retpwent loop, i: 100000
(Sun Sep 23 16:11:34 2012) [sssd[nss]] [nss_cmd_retpwent] (7): out of nss_cmd_retpwent loop: ret: 2, i: 100000
(Sun Sep 23 16:11:34 2012) [sssd[nss]] [nss_cmd_retpwent] (0): nss_cmd_retpwent returning EINVAL due to runaway loop check
Walking through the code in responder/nss/nsssrv_cmd.c, in the nss_cmd_retpwent function:
static int nss_cmd_retpwent(struct cli_ctx *cctx, int num)
Before the loop:
nctx = talloc_get_type(cctx->rctx->pvt_ctx, struct nss_ctx);
pctx = nctx->pctx;
The following loop runs once before the debugging kicks in (see the attached patched). In normal circumstances, this loop never runs more than once; I've only seen this loop actually loop when the bug is hit. So, the values in the logs are from the second time the loop is run through:
while (ret == ENOENT) {
This conditional doesn't cause us to exit the loop:
if (cctx->pwent_dom_idx >= pctx->num) break;
n is set:
pdom = &pctx->doms[cctx->pwent_dom_idx];
n = pdom->res->count - cctx->pwent_cur;
n is now -1, which is (I assume) unexpected! However, the following code doesn't run (based on the lack of debugging output), because (cctx->pwent_dom_idx+1 < pctx->num) is false:
if (n <= 0 && (cctx->pwent_dom_idx+1 < pctx->num)) {
cctx->pwent_dom_idx++;
pdom = &pctx->doms[cctx->pwent_dom_idx];
n = pdom->res->count;
cctx->pwent_cur = 0;
DEBUG(7, ("nss_cmd_retpwent n reset, now: %i\n", n));
}
This conditional doesn't throw us out of the loop because n isn't 0, even though it's negative:
if (!n) break;
This also doesn't reset n:
if (n > num) n = num;
And the call to fill_pwent is a basically a noop because n is negative, however it does reset n to 0:
msgs = &(pdom->res->msgs[cctx->pwent_cur]);
ret = fill_pwent(cctx->creq->out, pdom->domain, nctx, true, msgs, &n);
cctx->pwent_cur is therefore unchanged here:
cctx->pwent_cur += n;
End of while loop:
}
So, the loop never stops because n starts out as -1 and then is reset to 0 by the fill_pwent call, which prevents the usual conditional checks from breaking out of the loop.
I don't pretend to understand all the interactions here, but I'm hoping these details provide some context to someone more knowledgeable with the codebase.
If you need me to provide additional debugging info, please let me know.
Thanks,
Dan
reopening per comment #2 Thank you for the very detailed debug info Daniel. At first glance I agree with your assessment about n being negative breaking the loop. I'll clone the issue upstream. Thanks again. Upstream ticket: https://fedorahosted.org/sssd/ticket/1551 This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux release for currently deployed products. This request is not yet committed for inclusion in a release. Hello, we have fixed this issue by simply breaking the loop when the variable n is negative. However, we were not able to find the origin of this bug. There were some scenarios, but all are very improbable (see the upstream ticket [1] for more information). I prepared a scratch build [2] that contains the fix and more debugging information that should help us find the cause. Would you be as kind as to test it? Can you please run this build with debug_level = 8 in nss and domain sections? If the error occurs again, it will print "ERROR n < 0. Breaking loop..." into logs and it will not go into infinite loop again. Do the perl scripts you mentioned call getpwent in some way? Thank you very much. [1] https://fedorahosted.org/sssd/ticket/1551 [2] http://pbrezina.fedorapeople.org/scratch/sssd-1.5.1-49.el5_8.4.i386/ (In reply to comment #8) > Do the perl scripts you mentioned call getpwent in some way? I don't know perl, but the command that was breaking would be invoked as "getent passwd" from shell. Hi, have you had a chance to try the scratch build provided in comment #8 ? Even though we don't plan another update in RHEL5, we might roll out an upstream 1.5.x version with the fix which might be useful for you. No response from the reporter for almost two months. Closing. (In reply to comment #12) > No response from the reporter for almost two months. Closing. Ah thanks, do you still happen to have the patch you applied to generate the debug package above? If it's the same patch that got committed to the fedorahosted ticket, I've already pushed that out to my boxes. I didn't enable the high level of debugging, though. But, without the high level of debugging enabled, I did receive the below log entries. Since this problem happens intermittently (sometimes not for months at a time), it may be problematic to leave a high level of debugging on constantly. If you can send me a quick patch that always logs any relevant debugging info, I can build that myself and deploy it. If you don't have time for that, I can just turn on debugging and wait, I suppose. Thanks again for having looked into this. Here are the messages I did recently see: -- No log messages in November; two messages just recently: (Mon Dec 10 10:21:38 2012) [sssd[nss]] [nss_cmd_retpwent] (0): BUG: Negative difference[13 - 14 = -1] (Mon Dec 10 10:21:38 2012) [sssd[nss]] [nss_cmd_retpwent] (0): Domain: 0 (total 1) tail: `sssd_nss.log' has been replaced; following end of new file (Wed Dec 12 14:25:49 2012) [sssd[nss]] [nss_cmd_retpwent] (0): BUG: Negative difference[13 - 14 = -1] (Wed Dec 12 14:25:49 2012) [sssd[nss]] [nss_cmd_retpwent] (0): Domain: 0 (total 1) Just to be clear, if you still have the patch file or source RPM from the build above, just stick it in the ticket or email it to me at sterling.daniel and I'll build it and push it out to my boxes. Thanks again, Dan Created attachment 663060 [details]
more debugging information
Hi, thank you. I'm attaching the patch that was present in the scratch build. All additional info is print with debug level set to 0, but may not be enough without other tracing information already present in the code. And as I'm unsure what we really need, it would be nice if you could rise the debug level to 8 as suggested. Unfortunately the log you provided doesn't say anything about why it happened. It just says that it tried to manipulate with more entries that are present. Do you know if a user was added/deleted in that time? (In reply to comment #16) > Hi, > thank you. I'm attaching the patch that was present in the scratch build. > All additional info is print with debug level set to 0, but may not be > enough without other tracing information already present in the code. And as > I'm unsure what we really need, it would be nice if you could rise the debug > level to 8 as suggested. Thank you very much, I'll build this and push it out with the higher debugging level. > Unfortunately the log you provided doesn't say anything about why it > happened. It just says that it tried to manipulate with more entries that > are present. Do you know if a user was added/deleted in that time? Hmm, I'm not sure entirely, but something like that may be happening. I run puppet and it manages some users and does a lot of lookups, so there may be a weird interaction happening there. I'll try running puppet a bunch to see if I can trigger the issue. If I get something useful I'll upload it and reopen the ticket. Thanks again, Dan |