1087833 – nscd-2.12-1.132.el6 enters busy loop on long netgroup entry via nss_ldap of nslcd

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	glibc
Sub Component:
Version:	6.5
Hardware:	All
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Siddhesh Poyarekar
QA Contact:	Arjun Shankar
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1173537

Reported:	2014-04-15 12:24 UTC by Michael Weiser
Modified:	2018-12-09 17:44 UTC
CC List:	7 users
Fixed In Version:	glibc-2.12-1.144.el6
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	1173537
Environment:
Last Closed:	2014-10-14 04:43:43 UTC
Target Upstream Version:
Embargoed:

Attachments
fix nscd tryagain busy loop (468 bytes, patch) 2014-04-15 12:25 UTC, Michael Weiser

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2014:1391	0	normal	SHIPPED_LIVE	Moderate: glibc security, bug fix, and enhancement update	2014-10-14 01:11:04 UTC
Sourceware	16878	0	None	None	None	Never

Description Michael Weiser 2014-04-15 12:24:52 UTC

Description of problem:

If a long (>1024 bytes) netgroup entry is retrieved via nslcd's nss_ldap, nscd-2.12-1.132.el6 with netgroup caching enabled enters a busy loop, hogging a CPU. Each repeated lookup causes another nscd thread to busy-loop, eventually using up all available CPU time.

This is caused by nss_ldap returning NSS_STATUS_TRYAGAIN if the buffer provided by nscd is not large enough to hold the result while nscd is expecting it to return NSS_STATUS_UNAVAIL. So, instead of resizing its buffer and calling nss_getnetgrent again, it loops over without doing anything.
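
To illustrate the mechanism, here is a minimal Python model of the two behaviours. It is a sketch, not the nscd C source: fake_getnetgrent, buggy_lookup, fixed_lookup and the max_spins/max_buflen limits are hypothetical names made up for this example, and only the status values are taken from glibc's enum nss_status.

```python
from enum import Enum


class NssStatus(Enum):
    # Values mirror glibc's enum nss_status.
    TRYAGAIN = -2
    UNAVAIL = -1
    NOTFOUND = 0
    SUCCESS = 1


def fake_getnetgrent(entry, buflen):
    """Hypothetical stand-in for the NSS module: asks for a retry if the buffer is too small."""
    return NssStatus.SUCCESS if len(entry) < buflen else NssStatus.TRYAGAIN


def buggy_lookup(entry, buflen=1024, max_spins=1000):
    """Grows the buffer only on UNAVAIL, so a TRYAGAIN reply repeats the same call."""
    for _ in range(max_spins):  # cap exists only so that this model terminates
        status = fake_getnetgrent(entry, buflen)
        if status is NssStatus.SUCCESS:
            return status, buflen
        if status is NssStatus.UNAVAIL:
            buflen *= 2
        # TRYAGAIN: nothing changes, so the next iteration is identical
    return None


def fixed_lookup(entry, buflen=1024, max_buflen=1 << 20):
    """Doubles the buffer on TRYAGAIN until the entry fits or a sanity limit is hit."""
    while True:
        status = fake_getnetgrent(entry, buflen)
        if status is NssStatus.TRYAGAIN and buflen < max_buflen:
            buflen *= 2
            continue
        return status, buflen


if __name__ == "__main__":
    long_entry = b"x" * 1500
    print(buggy_lookup(long_entry))   # None: the model gave up after max_spins
    print(fixed_lookup(long_entry))
```

In the model, the only difference between the two functions is which status triggers the buffer resize; the real daemon has no iteration cap, which is why the thread spins.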

The attached patch fixes nscd for me.

Alternatively, nslcd could be changed not to return NSS_STATUS_TRYAGAIN, but upstream seems to have put some thought into its behaviour: https://github.com/arthurdejong/nss-pam-ldapd/blob/master/nss/common.h#L59

Version-Release number of selected component (if applicable):
nscd-2.12-1.132.el6

How reproducible:
always

Steps to Reproduce:
1. add long netgroup entry to LDAP
2. enable netgroup caching in nscd
3. getent netgroup longnetgroupentry

Actual results:
An nscd thread enters a busy loop; getent times out waiting for nscd and fetches the netgroup data directly.

Expected results:
nscd returns netgroup data

Additional info:
This bug is still present in upstream glibc HEAD, so the fix should probably be pushed upstream as well.

Workaround:
Disable netgroup caching in /etc/nscd.conf.
Possibly (untested): Split netgroup into multiple, recursive groups, thus avoiding a single, very large netgroup entry.
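
A rough sketch of how the (untested) splitting workaround could be planned, in Python. Everything here is an assumption for illustration: the split_netgroup helper, the "_partN" naming scheme and the 512-byte budget are hypothetical, and the result would still have to be written to LDAP by hand, for example with the children listed as memberNisNetgroup values of the parent.

```python
def split_netgroup(name, triples, max_bytes=512):
    """Group triples into child netgroups whose rendered size stays within max_bytes."""
    children = {}
    current, size = [], 0
    for triple in triples:
        rendered = len("(%s,%s,%s) " % triple)
        if current and size + rendered > max_bytes:
            # Current child is full: store it under a hypothetical "<name>_partN" label.
            children["%s_part%d" % (name, len(children) + 1)] = current
            current, size = [], 0
        current.append(triple)
        size += rendered
    if current:
        children["%s_part%d" % (name, len(children) + 1)] = current
    # The parent would then reference the children instead of holding triples itself.
    return {"parent": name, "members": list(children), "children": children}


if __name__ == "__main__":
    triples = [("test%d" % i, "-", "test") for i in range(1, 72)]
    plan = split_netgroup("test", triples)
    for child, members in plan["children"].items():
        print(child, len(members))
```

Whether nested groups actually avoid the busy loop has not been verified here; the sketch only shows how the entries could be partitioned.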

Comment 1 Michael Weiser 2014-04-15 12:25:44 UTC

Created attachment 886459
fix nscd tryagain busy loop

Comment 3 Siddhesh Poyarekar 2014-04-30 04:34:11 UTC

I was able to reproduce the infinite loop by having a very long combination of user, host and domain in a single triplet, but that triplet does not give me a valid result without nscd.  This was expected because a valid triplet should fit in 1K given that the components of the triplet (i.e. the hostname, username and domain name) have defined limits well within 1K.

The getent command does not work without nscd because it uses getnetgrent(), which in turn assumes this static limit of 1K and fails.  Given that ldap supports such long entries, there could be a case for adding support for such long entries, but adding such support would mean enhancing getnetgrent as well.

Of course, I'd like to know if you're seeing the same scenario that I described, i.e. the netgroup coming up empty without nscd.  If it is not (which I assume it is since you mentioned the timeout and the direct query resulting in the correct output) then could you share a sample netgroup entry that we can use to try and figure out what is different?

Comment 4 Michael Weiser 2014-04-30 08:36:59 UTC

Hi Padesh,

> Of course, I'd like to know if you're seeing the same scenario that I
> described, i.e. the netgroup coming up empty without nscd.  If it is not

I've played around with very large triplets as well and see segfaults with that. See https://bugzilla.redhat.com/show_bug.cgi?id=1087838. But that's not what I see and do with this LDAP bug.

> (which I assume it is since you mentioned the timeout and the direct query
> resulting in the correct output) then could you share a sample netgroup
> entry that we can use to try and figure out what is different?

nscd.conf:

[root@test ~]# grep netgroup /etc/nscd.conf
        enable-cache            netgroup        yes
        positive-time-to-live   netgroup        28800
        negative-time-to-live   netgroup        20
        suggested-size          netgroup        211
        check-files             netgroup        yes
        persistent              netgroup        yes
        shared                  netgroup        yes
        max-db-size             netgroup        33554432

Two test netgroups:

[root@test ~]# ldapsearch -H ldaps://ldapserver.domain:636/ -b dc=domain -xLLL cn=test
dn: cn=test,ou=netgroup,dc=domain
cn: test
objectClass: top
objectClass: nisNetgroup
nisNetgroupTriple: (test1,-,test)
nisNetgroupTriple: (test10,-,test)
nisNetgroupTriple: (test11,-,test)
nisNetgroupTriple: (test12,-,test)
nisNetgroupTriple: (test13,-,test)
nisNetgroupTriple: (test14,-,test)
nisNetgroupTriple: (test15,-,test)
nisNetgroupTriple: (test16,-,test)
nisNetgroupTriple: (test17,-,test)
nisNetgroupTriple: (test18,-,test)
nisNetgroupTriple: (test19,-,test)
nisNetgroupTriple: (test2,-,test)
nisNetgroupTriple: (test20,-,test)
nisNetgroupTriple: (test21,-,test)
nisNetgroupTriple: (test22,-,test)
nisNetgroupTriple: (test23,-,test)
nisNetgroupTriple: (test24,-,test)
nisNetgroupTriple: (test25,-,test)
nisNetgroupTriple: (test26,-,test)
nisNetgroupTriple: (test27,-,test)
nisNetgroupTriple: (test28,-,test)
nisNetgroupTriple: (test29,-,test)
nisNetgroupTriple: (test3,-,test)
nisNetgroupTriple: (test30,-,test)
nisNetgroupTriple: (test4,-,test)
nisNetgroupTriple: (test5,-,test)
nisNetgroupTriple: (test6,-,test)
nisNetgroupTriple: (test7,-,test)
nisNetgroupTriple: (test8,-,test)
nisNetgroupTriple: (test9,-,test)
nisNetgroupTriple: (test31,-,test)
nisNetgroupTriple: (test32,-,test)
nisNetgroupTriple: (test33,-,test)
nisNetgroupTriple: (test34,-,test)
nisNetgroupTriple: (test35,-,test)
nisNetgroupTriple: (test36,-,test)
nisNetgroupTriple: (test37,-,test)
nisNetgroupTriple: (test38,-,test)
nisNetgroupTriple: (test39,-,test)
nisNetgroupTriple: (test40,-,test)
nisNetgroupTriple: (test41,-,test)
nisNetgroupTriple: (test42,-,test)
nisNetgroupTriple: (test43,-,test)
nisNetgroupTriple: (test44,-,test)
nisNetgroupTriple: (test45,-,test)
nisNetgroupTriple: (test46,-,test)
nisNetgroupTriple: (test47,-,test)
nisNetgroupTriple: (test48,-,test)
nisNetgroupTriple: (test49,-,test)
nisNetgroupTriple: (test50,-,test)
nisNetgroupTriple: (test51,-,test)
nisNetgroupTriple: (test52,-,test)
nisNetgroupTriple: (test53,-,test)
nisNetgroupTriple: (test54,-,test)
nisNetgroupTriple: (test55,-,test)
nisNetgroupTriple: (test56,-,test)
nisNetgroupTriple: (test57,-,test)
nisNetgroupTriple: (test58,-,test)
nisNetgroupTriple: (test59,-,test)
nisNetgroupTriple: (test60,-,test)
nisNetgroupTriple: (test61,-,test)
nisNetgroupTriple: (test62,-,test)
nisNetgroupTriple: (test63,-,test)
nisNetgroupTriple: (test64,-,test)
nisNetgroupTriple: (test65,-,test)
nisNetgroupTriple: (test66,-,test)
nisNetgroupTriple: (test67,-,test)
nisNetgroupTriple: (test68,-,test)
nisNetgroupTriple: (test69,-,test)
nisNetgroupTriple: (test70,-,test)
nisNetgroupTriple: (test71,-,test)

dn: cn=test2,ou=netgroup,dc=domain
cn: test2
objectClass: top
objectClass: nisNetgroup
nisNetgroupTriple: (test1,-,alongerdomaintoneedlessnetgroupentriestotriggerthe
 problem)
nisNetgroupTriple: (test10,-,alongerdomaintoneedlessnetgroupentriestotriggerth
 eproblem)
nisNetgroupTriple: (test11,-,alongerdomaintoneedlessnetgroupentriestotriggerth
 eproblem)
nisNetgroupTriple: (test12,-,alongerdomaintoneedlessnetgroupentriestotriggerth
 eproblem)
nisNetgroupTriple: (test13,-,alongerdomaintoneedlessnetgroupentriestotriggerth
 eproblem)
nisNetgroupTriple: (test14,-,alongerdomaintoneedlessnetgroupentriestotriggerth
 eproblem)
nisNetgroupTriple: (test2,-,alongerdomaintoneedlessnetgroupentriestotriggerthe
 problem)
nisNetgroupTriple: (test3,-,alongerdomaintoneedlessnetgroupentriestotriggerthe
 problem)
nisNetgroupTriple: (test4,-,alongerdomaintoneedlessnetgroupentriestotriggerthe
 problem)
nisNetgroupTriple: (test5,-,alongerdomaintoneedlessnetgroupentriestotriggerthe
 problem)
nisNetgroupTriple: (test6,-,alongerdomaintoneedlessnetgroupentriestotriggerthe
 problem)
nisNetgroupTriple: (test7,-,alongerdomaintoneedlessnetgroupentriestotriggerthe
 problem)
nisNetgroupTriple: (test8,-,alongerdomaintoneedlessnetgroupentriestotriggerthe
 problem)
nisNetgroupTriple: (test9,-,alongerdomaintoneedlessnetgroupentriestotriggerthe
 problem)
nisNetgroupTriple: (test15,-,alongerdomaintoneedlessnetgroupentriestotriggerth
 eproblem)

Restart nscd with clean cache and getent the groups while timing how long that takes:

[root@test ~]# killall nscd ; rm -f /var/db/nscd/* ; nscd
nscd: no process killed
[root@test ~]# time getent netgroup test
test                  (test1,-,test) (test10,-,test) (test11,-,test) (test12,-,test) (test13,-,test) (test14,-,test) (test15,-,test) (test16,-,test) (test17,-,test) (test18,-,test) (test19,-,test) (test2,-,test) (test20,-,test) (test21,-,test) (test22,-,test) (test23,-,test) (test24,-,test) (test25,-,test) (test26,-,test) (test27,-,test) (test28,-,test) (test29,-,test) (test3,-,test) (test30,-,test) (test4,-,test) (test5,-,test) (test6,-,test) (test7,-,test) (test8,-,test) (test9,-,test) (test31,-,test) (test32,-,test) (test33,-,test) (test34,-,test) (test35,-,test) (test36,-,test) (test37,-,test) (test38,-,test) (test39,-,test) (test40,-,test) (test41,-,test) (test42,-,test) (test43,-,test) (test44,-,test) (test45,-,test) (test46,-,test) (test47,-,test) (test48,-,test) (test49,-,test) (test50,-,test) (test51,-,test) (test52,-,test) (test53,-,test) (test54,-,test) (test55,-,test) (test56,-,test) (test57,-,test) (test58,-,test) (test59,-,test) (test60,-,test) (test61,-,test) (test62,-,test) (test63,-,test) (test64,-,test) (test65,-,test) (test66,-,test) (test67,-,test) (test68,-,test) (test69,-,test) (test70,-,test) (test71,-,test)

real    0m5.007s
user    0m0.000s
sys     0m0.002s
[root@test ~]# time getent netgroup test2
test2                 (test1,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test10,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test11,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test12,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test13,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test14,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test2,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test3,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test4,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test5,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test6,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test7,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test8,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test9,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test15,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem)

real    0m5.003s
user    0m0.002s
sys     0m0.000s

nscd is hogging two CPUs now:

[root@test ~]# top -n 1 -b | head -8
top - 10:13:37 up 19 days, 32 min,  2 users,  load average: 13.31, 12.52, 12.21
Tasks: 559 total,  13 running, 545 sleeping,   0 stopped,   1 zombie
Cpu(s): 11.6%us, 10.5%sy, 15.6%ni, 62.2%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  65922808k total, 15001704k used, 50921104k free,   291304k buffers
Swap: 33030136k total,        0k used, 33030136k free,  9209884k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
23029 nscd      20   0  619m 1268  892 S 198.7  0.0   0:33.44 nscd

Size of the entries returned:

[root@test ~]# getent netgroup test | wc
      1      72    1149
[root@test ~]# getent netgroup test2 | wc
      1      16    1048

Remove test71 from netgroup test and test15 from netgroup test2 and run the test again:

[root@test ~]# killall nscd ; rm -f /var/db/nscd/* ; nscd
nscd: no process killed
[root@test ~]# time getent netgroup test
test                  (test1,-,test) (test10,-,test) (test11,-,test) (test12,-,test) (test13,-,test) (test14,-,test) (test15,-,test) (test16,-,test) (test17,-,test) (test18,-,test) (test19,-,test) (test2,-,test) (test20,-,test) (test21,-,test) (test22,-,test) (test23,-,test) (test24,-,test) (test25,-,test) (test26,-,test) (test27,-,test) (test28,-,test) (test29,-,test) (test3,-,test) (test30,-,test) (test4,-,test) (test5,-,test) (test6,-,test) (test7,-,test) (test8,-,test) (test9,-,test) (test31,-,test) (test32,-,test) (test33,-,test) (test34,-,test) (test35,-,test) (test36,-,test) (test37,-,test) (test38,-,test) (test39,-,test) (test40,-,test) (test41,-,test) (test42,-,test) (test43,-,test) (test44,-,test) (test45,-,test) (test46,-,test) (test47,-,test) (test48,-,test) (test49,-,test) (test50,-,test) (test51,-,test) (test52,-,test) (test53,-,test) (test54,-,test) (test55,-,test) (test56,-,test) (test57,-,test) (test58,-,test) (test59,-,test) (test60,-,test) (test61,-,test) (test62,-,test) (test63,-,test) (test64,-,test) (test65,-,test) (test66,-,test) (test67,-,test) (test68,-,test) (test69,-,test) (test70,-,test)

real    0m0.003s
user    0m0.000s
sys     0m0.001s
[root@test ~]# time getent netgroup test2
test2                 (test1,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test10,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test11,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test12,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test13,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test14,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test2,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test3,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test4,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test5,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test6,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test7,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test8,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test9,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem)

real    0m0.002s
user    0m0.000s
sys     0m0.000s
[root@test ~]# top -n 1 -b | head -8
top - 10:26:48 up 19 days, 45 min,  2 users,  load average: 12.55, 13.12, 12.87
Tasks: 555 total,  13 running, 541 sleeping,   0 stopped,   1 zombie
Cpu(s): 11.6%us, 10.5%sy, 15.6%ni, 62.2%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  65922808k total, 15526096k used, 50396712k free,   322776k buffers
Swap: 33030136k total,        0k used, 33030136k free,  9732668k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
20084 user  39  19  475m 372m  21m R 99.0  0.6 340:26.48 solver

Length of entries:

[root@test ~]# getent netgroup test | wc
      1      71    1133
[root@test ~]# getent netgroup test2 | wc
      1      15     979

So, any sufficiently large netgroup seems to do it, although it doesn't happen exactly at exceeding 1024 bytes of length.
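
For reference, the wc byte counts above can be reproduced from the triples alone. The short Python sketch below assumes that getent pads the netgroup name to 21 columns and prints each triple as " (host,user,domain)"; that layout is inferred from the transcripts above, and the helper names are made up for this example.

```python
DOMAIN = "alongerdomaintoneedlessnetgroupentriestotriggertheproblem"


def make_triples(count, domain):
    """Build (testN,-,domain) triples for N = 1..count, as in the test netgroups."""
    return [("test%d" % i, "-", domain) for i in range(1, count + 1)]


def getent_line_size(name, triples):
    """Byte length of one getent output line under the assumed layout."""
    line = "%-21s" % name + "".join(" (%s,%s,%s)" % t for t in triples) + "\n"
    return len(line.encode("ascii"))


if __name__ == "__main__":
    print(getent_line_size("test", make_triples(71, "test")))   # 1149
    print(getent_line_size("test", make_triples(70, "test")))   # 1133
    print(getent_line_size("test2", make_triples(15, DOMAIN)))  # 1048
    print(getent_line_size("test2", make_triples(14, DOMAIN)))  # 979
```

Under that assumption, each count includes 22 bytes of name padding and newline, so the triples themselves take 1127 vs. 1111 bytes for test and 1026 vs. 957 bytes for test2.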

Hope that helps,
Michael

Comment 5 Siddhesh Poyarekar 2014-04-30 09:15:09 UTC

(In reply to Michael Weiser from comment #4)
> Hi Padesh,

That's not me :)

> Two test netgroups:

Thanks, that helped.  I have posted a patch upstream for review:

https://sourceware.org/ml/libc-alpha/2014-04/msg00661.html

You should be in cc as well.  Your analysis is correct and your fix should work too, but I went for a different approach in the fix because NSS_STATUS_TRYAGAIN is indeed the correct status in such cases.  The netgroups bits used NSS_STATUS_UNAVAIL incorrectly.

Comment 7 Michael Weiser 2014-04-30 09:38:44 UTC

Hello *Siddhesh*,

> > Hi Padesh,
> 
> That's not me :)

Sorry, momentary loss of all brain functions. Sincere apologies.

> > Two test netgroups:
> 
> Thanks, that helped.  I have posted a patch upstream for review:
> 
> https://sourceware.org/ml/libc-alpha/2014-04/msg00661.html
> 
> You should be in cc as well.  Your analysis is correct and your fix should
> work too, but I went for a different approach in the fix because
> NSS_STATUS_TRYAGAIN is indeed the correct status in such cases.  The
> netgroups bits used NSS_STATUS_UNAVAIL incorrectly.

Cool. Thanks! Does anything need doing to have this backported to RHEL6, i.e. have the customer open a Call with RedHat or somesuch?

Bye,
Michael

Comment 8 Siddhesh Poyarekar 2014-04-30 10:46:47 UTC

(In reply to Michael Weiser from comment #7)
> Sorry, momentary loss of all brain functions. Sincere apologies.

No worries :)

> Cool. Thanks! Does anything need doing to have this backported to RHEL6,
> i.e. have the customer open a Call with RedHat or somesuch?

Raising a ticket with Red Hat technical support would be beneficial because it helps prioritize the bug correctly.

Comment 9 Michael Weiser 2014-05-02 09:23:12 UTC

> Raising a ticket with Red Hat technical support would be beneficial because
> it helps prioritize the bug correctly.

Done. Case # 01084463.

Comment 12 errata-xmlrpc 2014-10-14 04:43:43 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-1391.html