Bug 1087833

Summary: nscd-2.12-1.132.el6 enters busy loop on long netgroup entry via nss_ldap of nslcd
Product: Red Hat Enterprise Linux 6 Reporter: Michael Weiser <m.weiser>
Component: glibc Assignee: Siddhesh Poyarekar <spoyarek>
Status: CLOSED ERRATA QA Contact: Arjun Shankar <ashankar>
Severity: high Docs Contact:
Priority: unspecified    
Version: 6.5 CC: ashankar, codonell, fweimer, mfranc, mnewsome, pfrankli, spoyarek
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: glibc-2.12-1.144.el6 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1173537 Environment:
Last Closed: 2014-10-14 04:43:43 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1173537    
Attachments:
Description: fix nscd tryagain busy loop
Flags: none

Description Michael Weiser 2014-04-15 12:24:52 UTC
Description of problem:

If a long (>1024 bytes) netgroup entry is retrieved via nslcd's nss_ldap, nscd-2.12-1.132.el6 with netgroup caching enabled enters a busy loop, hogging a CPU. Each repeated lookup sends another nscd thread into the same busy loop, eventually using up all available CPU time.

This is caused by nss_ldap returning NSS_STATUS_TRYAGAIN if the buffer provided by nscd is not large enough to hold the result while nscd is expecting it to return NSS_STATUS_UNAVAIL. So, instead of resizing its buffer and calling nss_getnetgrent again, it loops over without doing anything.
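
To make the failure mode concrete, here is a minimal Python model of the loop described above. It is only an illustration and not nscd's actual code: `make_backend`, `fetch`, the `grow_on` parameter, the buffer doubling and the iteration cap are all hypothetical. Only the status names and values mirror glibc's `enum nss_status`.

```python
from enum import Enum


class NssStatus(Enum):
    """Status codes with the same names and values as glibc's enum nss_status."""
    TRYAGAIN = -2
    UNAVAIL = -1
    NOTFOUND = 0
    SUCCESS = 1


def make_backend(needed):
    """Hypothetical NSS module: reports TRYAGAIN until the buffer is big enough."""
    def backend(buflen):
        return NssStatus.SUCCESS if buflen >= needed else NssStatus.TRYAGAIN
    return backend


def fetch(backend, grow_on, buflen=1024, max_iterations=64):
    """Call the backend until it succeeds or the iteration cap is hit.

    grow_on is the set of statuses the caller treats as "buffer too small".
    The cap exists only so that this model terminates.
    """
    for iteration in range(1, max_iterations + 1):
        status = backend(buflen)
        if status is NssStatus.SUCCESS:
            return buflen, iteration
        if status in grow_on:
            buflen *= 2  # enlarge the buffer and retry
        # Any other status leaves buflen unchanged, so the next call is
        # identical to this one: that is the busy loop.
    return None, max_iterations


backend = make_backend(needed=1500)
print(fetch(backend, grow_on={NssStatus.UNAVAIL}))   # caller only expects UNAVAIL
print(fetch(backend, grow_on={NssStatus.TRYAGAIN}))  # caller also handles TRYAGAIN
```

In this model the first call never makes progress and gives up at the cap, while the second succeeds after a single resize.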

The attached patch fixes nscd for me.

Alternatively, nslcd could be changed not to return NSS_STATUS_TRYAGAIN, but upstream seems to have put some thought into its behaviour: https://github.com/arthurdejong/nss-pam-ldapd/blob/master/nss/common.h#L59

Version-Release number of selected component (if applicable):
nscd-2.12-1.132.el6

How reproducible:
always

Steps to Reproduce:
1. add long netgroup entry to LDAP
2. enable netgroup caching in nscd
3. getent netgroup longnetgroupentry
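
For step 1, a small generator can produce the LDIF for a long netgroup. This is a sketch under assumptions: the function name `netgroup_ldif`, the base DN `ou=netgroup,dc=domain` and the host naming are illustrative only, and the output would still need to be loaded with a tool such as ldapadd using site-specific bind options.

```python
def netgroup_ldif(cn, base_dn, triples):
    """Return an LDIF entry for a nisNetgroup, one attribute value per triple."""
    lines = [
        f"dn: cn={cn},{base_dn}",
        f"cn: {cn}",
        "objectClass: top",
        "objectClass: nisNetgroup",
    ]
    lines += [f"nisNetgroupTriple: ({host},{user},{domain})"
              for host, user, domain in triples]
    return "\n".join(lines) + "\n"


# Hypothetical test data: 71 hosts, no user ("-"), domain "test".
triples = [(f"test{i}", "-", "test") for i in range(1, 72)]
print(netgroup_ldif("longnetgroupentry", "ou=netgroup,dc=domain", triples), end="")
```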

Actual results:
An nscd thread enters a busy loop; getent times out waiting for nscd and then fetches the netgroup data directly.

Expected results:
nscd returns netgroup data

Additional info:
This bug is still present in upstream glibc HEAD, so the fix should probably be pushed upstream as well.

Workaround:
Disable netgroup caching in /etc/nscd.conf.
Possibly (untested): Split netgroup into multiple, recursive groups, thus avoiding a single, very large netgroup entry.
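
The second workaround could be sketched as follows. Whether it actually avoids the busy loop is untested, as noted above. The helper names, the `-partN` naming scheme and the chunk size are hypothetical; `memberNisNetgroup` is the standard nisNetgroup attribute for referencing another netgroup.

```python
def split_netgroup(cn, triples, chunk_size):
    """Split triples into child netgroups of at most chunk_size triples.

    Returns a dict mapping netgroup name -> (member_netgroups, triples).
    The parent keeps the original name and only references its children.
    """
    groups = {}
    children = []
    for start in range(0, len(triples), chunk_size):
        child = f"{cn}-part{start // chunk_size + 1}"  # hypothetical naming
        children.append(child)
        groups[child] = ([], triples[start:start + chunk_size])
    groups[cn] = (children, [])
    return groups


def groups_ldif(groups, base_dn):
    """Render all netgroups as LDIF entries separated by blank lines."""
    entries = []
    for name, (members, triples) in groups.items():
        lines = [f"dn: cn={name},{base_dn}", f"cn: {name}",
                 "objectClass: top", "objectClass: nisNetgroup"]
        lines += [f"memberNisNetgroup: {member}" for member in members]
        lines += [f"nisNetgroupTriple: ({h},{u},{d})" for h, u, d in triples]
        entries.append("\n".join(lines) + "\n")
    return "\n".join(entries)


triples = [(f"test{i}", "-", "test") for i in range(1, 72)]
groups = split_netgroup("test", triples, chunk_size=25)
print(groups_ldif(groups, "ou=netgroup,dc=domain"), end="")
```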

Comment 1 Michael Weiser 2014-04-15 12:25:44 UTC
Created attachment 886459
fix nscd tryagain busy loop

Comment 3 Siddhesh Poyarekar 2014-04-30 04:34:11 UTC
I was able to reproduce the infinite loop by having a very long combination of user, host and domain in a single triplet, but that triplet does not give me a valid result without nscd.  This was expected because a valid triplet should fit in 1K given that the components of the triplet (i.e. the hostname, username and domain name) have defined limits well within 1K.

The getent command does not work without nscd because it uses getnetgrent(), which in turn assumes this static limit of 1K and fails.  Given that ldap supports such long entries, there could be a case for adding support for such long entries, but adding such support would mean enhancing getnetgrent as well.

Of course, I'd like to know if you're seeing the same scenario that I described, i.e. the netgroup coming up empty without nscd.  If it is not (which I assume it is since you mentioned the timeout and the direct query resulting in the correct output) then could you share a sample netgroup entry that we can use to try and figure out what is different?

Comment 4 Michael Weiser 2014-04-30 08:36:59 UTC
Hi Padesh,

> Of course, I'd like to know if you're seeing the same scenario that I
> described, i.e. the netgroup coming up empty without nscd.  If it is not

I've played around with very large triplets as well and see segfaults with that. See https://bugzilla.redhat.com/show_bug.cgi?id=1087838. But that's not what I see and do with this LDAP bug.

> (which I assume it is since you mentioned the timeout and the direct query
> resulting in the correct output) then could you share a sample netgroup
> entry that we can use to try and figure out what is different?

nscd.conf:

[root@test ~]# grep netgroup /etc/nscd.conf
        enable-cache            netgroup        yes
        positive-time-to-live   netgroup        28800
        negative-time-to-live   netgroup        20
        suggested-size          netgroup        211
        check-files             netgroup        yes
        persistent              netgroup        yes
        shared                  netgroup        yes
        max-db-size             netgroup        33554432

Two test netgroups:

[root@test ~]# ldapsearch -H ldaps://ldapserver.domain:636/ -b dc=domain -xLLL cn=test
dn: cn=test,ou=netgroup,dc=domain
cn: test
objectClass: top
objectClass: nisNetgroup
nisNetgroupTriple: (test1,-,test)
nisNetgroupTriple: (test10,-,test)
nisNetgroupTriple: (test11,-,test)
nisNetgroupTriple: (test12,-,test)
nisNetgroupTriple: (test13,-,test)
nisNetgroupTriple: (test14,-,test)
nisNetgroupTriple: (test15,-,test)
nisNetgroupTriple: (test16,-,test)
nisNetgroupTriple: (test17,-,test)
nisNetgroupTriple: (test18,-,test)
nisNetgroupTriple: (test19,-,test)
nisNetgroupTriple: (test2,-,test)
nisNetgroupTriple: (test20,-,test)
nisNetgroupTriple: (test21,-,test)
nisNetgroupTriple: (test22,-,test)
nisNetgroupTriple: (test23,-,test)
nisNetgroupTriple: (test24,-,test)
nisNetgroupTriple: (test25,-,test)
nisNetgroupTriple: (test26,-,test)
nisNetgroupTriple: (test27,-,test)
nisNetgroupTriple: (test28,-,test)
nisNetgroupTriple: (test29,-,test)
nisNetgroupTriple: (test3,-,test)
nisNetgroupTriple: (test30,-,test)
nisNetgroupTriple: (test4,-,test)
nisNetgroupTriple: (test5,-,test)
nisNetgroupTriple: (test6,-,test)
nisNetgroupTriple: (test7,-,test)
nisNetgroupTriple: (test8,-,test)
nisNetgroupTriple: (test9,-,test)
nisNetgroupTriple: (test31,-,test)
nisNetgroupTriple: (test32,-,test)
nisNetgroupTriple: (test33,-,test)
nisNetgroupTriple: (test34,-,test)
nisNetgroupTriple: (test35,-,test)
nisNetgroupTriple: (test36,-,test)
nisNetgroupTriple: (test37,-,test)
nisNetgroupTriple: (test38,-,test)
nisNetgroupTriple: (test39,-,test)
nisNetgroupTriple: (test40,-,test)
nisNetgroupTriple: (test41,-,test)
nisNetgroupTriple: (test42,-,test)
nisNetgroupTriple: (test43,-,test)
nisNetgroupTriple: (test44,-,test)
nisNetgroupTriple: (test45,-,test)
nisNetgroupTriple: (test46,-,test)
nisNetgroupTriple: (test47,-,test)
nisNetgroupTriple: (test48,-,test)
nisNetgroupTriple: (test49,-,test)
nisNetgroupTriple: (test50,-,test)
nisNetgroupTriple: (test51,-,test)
nisNetgroupTriple: (test52,-,test)
nisNetgroupTriple: (test53,-,test)
nisNetgroupTriple: (test54,-,test)
nisNetgroupTriple: (test55,-,test)
nisNetgroupTriple: (test56,-,test)
nisNetgroupTriple: (test57,-,test)
nisNetgroupTriple: (test58,-,test)
nisNetgroupTriple: (test59,-,test)
nisNetgroupTriple: (test60,-,test)
nisNetgroupTriple: (test61,-,test)
nisNetgroupTriple: (test62,-,test)
nisNetgroupTriple: (test63,-,test)
nisNetgroupTriple: (test64,-,test)
nisNetgroupTriple: (test65,-,test)
nisNetgroupTriple: (test66,-,test)
nisNetgroupTriple: (test67,-,test)
nisNetgroupTriple: (test68,-,test)
nisNetgroupTriple: (test69,-,test)
nisNetgroupTriple: (test70,-,test)
nisNetgroupTriple: (test71,-,test)

dn: cn=test2,ou=netgroup,dc=domain
cn: test2
objectClass: top
objectClass: nisNetgroup
nisNetgroupTriple: (test1,-,alongerdomaintoneedlessnetgroupentriestotriggerthe
 problem)
nisNetgroupTriple: (test10,-,alongerdomaintoneedlessnetgroupentriestotriggerth
 eproblem)
nisNetgroupTriple: (test11,-,alongerdomaintoneedlessnetgroupentriestotriggerth
 eproblem)
nisNetgroupTriple: (test12,-,alongerdomaintoneedlessnetgroupentriestotriggerth
 eproblem)
nisNetgroupTriple: (test13,-,alongerdomaintoneedlessnetgroupentriestotriggerth
 eproblem)
nisNetgroupTriple: (test14,-,alongerdomaintoneedlessnetgroupentriestotriggerth
 eproblem)
nisNetgroupTriple: (test2,-,alongerdomaintoneedlessnetgroupentriestotriggerthe
 problem)
nisNetgroupTriple: (test3,-,alongerdomaintoneedlessnetgroupentriestotriggerthe
 problem)
nisNetgroupTriple: (test4,-,alongerdomaintoneedlessnetgroupentriestotriggerthe
 problem)
nisNetgroupTriple: (test5,-,alongerdomaintoneedlessnetgroupentriestotriggerthe
 problem)
nisNetgroupTriple: (test6,-,alongerdomaintoneedlessnetgroupentriestotriggerthe
 problem)
nisNetgroupTriple: (test7,-,alongerdomaintoneedlessnetgroupentriestotriggerthe
 problem)
nisNetgroupTriple: (test8,-,alongerdomaintoneedlessnetgroupentriestotriggerthe
 problem)
nisNetgroupTriple: (test9,-,alongerdomaintoneedlessnetgroupentriestotriggerthe
 problem)
nisNetgroupTriple: (test15,-,alongerdomaintoneedlessnetgroupentriestotriggerth
 eproblem)

Restart nscd with clean cache and getent the groups while timing how long that takes:

[root@test ~]# killall nscd ; rm -f /var/db/nscd/* ; nscd
nscd: no process killed
[root@test ~]# time getent netgroup test
test                  (test1,-,test) (test10,-,test) (test11,-,test) (test12,-,test) (test13,-,test) (test14,-,test) (test15,-,test) (test16,-,test) (test17,-,test) (test18,-,test) (test19,-,test) (test2,-,test) (test20,-,test) (test21,-,test) (test22,-,test) (test23,-,test) (test24,-,test) (test25,-,test) (test26,-,test) (test27,-,test) (test28,-,test) (test29,-,test) (test3,-,test) (test30,-,test) (test4,-,test) (test5,-,test) (test6,-,test) (test7,-,test) (test8,-,test) (test9,-,test) (test31,-,test) (test32,-,test) (test33,-,test) (test34,-,test) (test35,-,test) (test36,-,test) (test37,-,test) (test38,-,test) (test39,-,test) (test40,-,test) (test41,-,test) (test42,-,test) (test43,-,test) (test44,-,test) (test45,-,test) (test46,-,test) (test47,-,test) (test48,-,test) (test49,-,test) (test50,-,test) (test51,-,test) (test52,-,test) (test53,-,test) (test54,-,test) (test55,-,test) (test56,-,test) (test57,-,test) (test58,-,test) (test59,-,test) (test60,-,test) (test61,-,test) (test62,-,test) (test63,-,test) (test64,-,test) (test65,-,test) (test66,-,test) (test67,-,test) (test68,-,test) (test69,-,test) (test70,-,test) (test71,-,test)

real    0m5.007s
user    0m0.000s
sys     0m0.002s
[root@test ~]# time getent netgroup test2
test2                 (test1,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test10,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test11,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test12,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test13,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test14,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test2,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test3,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test4,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test5,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test6,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test7,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test8,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test9,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test15,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem)

real    0m5.003s
user    0m0.002s
sys     0m0.000s

nscd is hogging two CPUs now:

[root@test ~]# top -n 1 -b | head -8
top - 10:13:37 up 19 days, 32 min,  2 users,  load average: 13.31, 12.52, 12.21
Tasks: 559 total,  13 running, 545 sleeping,   0 stopped,   1 zombie
Cpu(s): 11.6%us, 10.5%sy, 15.6%ni, 62.2%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  65922808k total, 15001704k used, 50921104k free,   291304k buffers
Swap: 33030136k total,        0k used, 33030136k free,  9209884k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
23029 nscd      20   0  619m 1268  892 S 198.7  0.0   0:33.44 nscd

Size of the entries returned:

[root@test ~]# getent netgroup test | wc
      1      72    1149
[root@test ~]# getent netgroup test2 | wc
      1      16    1048

Remove test71 from netgroup test and test15 from netgroup test2 and run the test again:

[root@test ~]# killall nscd ; rm -f /var/db/nscd/* ; nscd
nscd: no process killed
[root@test ~]# time getent netgroup test
test                  (test1,-,test) (test10,-,test) (test11,-,test) (test12,-,test) (test13,-,test) (test14,-,test) (test15,-,test) (test16,-,test) (test17,-,test) (test18,-,test) (test19,-,test) (test2,-,test) (test20,-,test) (test21,-,test) (test22,-,test) (test23,-,test) (test24,-,test) (test25,-,test) (test26,-,test) (test27,-,test) (test28,-,test) (test29,-,test) (test3,-,test) (test30,-,test) (test4,-,test) (test5,-,test) (test6,-,test) (test7,-,test) (test8,-,test) (test9,-,test) (test31,-,test) (test32,-,test) (test33,-,test) (test34,-,test) (test35,-,test) (test36,-,test) (test37,-,test) (test38,-,test) (test39,-,test) (test40,-,test) (test41,-,test) (test42,-,test) (test43,-,test) (test44,-,test) (test45,-,test) (test46,-,test) (test47,-,test) (test48,-,test) (test49,-,test) (test50,-,test) (test51,-,test) (test52,-,test) (test53,-,test) (test54,-,test) (test55,-,test) (test56,-,test) (test57,-,test) (test58,-,test) (test59,-,test) (test60,-,test) (test61,-,test) (test62,-,test) (test63,-,test) (test64,-,test) (test65,-,test) (test66,-,test) (test67,-,test) (test68,-,test) (test69,-,test) (test70,-,test)

real    0m0.003s
user    0m0.000s
sys     0m0.001s
[root@test ~]# time getent netgroup test2
test2                 (test1,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test10,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test11,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test12,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test13,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test14,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test2,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test3,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test4,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test5,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test6,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test7,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test8,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test9,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem)

real    0m0.002s
user    0m0.000s
sys     0m0.000s
[root@test ~]# top -n 1 -b | head -8
top - 10:26:48 up 19 days, 45 min,  2 users,  load average: 12.55, 13.12, 12.87
Tasks: 555 total,  13 running, 541 sleeping,   0 stopped,   1 zombie
Cpu(s): 11.6%us, 10.5%sy, 15.6%ni, 62.2%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  65922808k total, 15526096k used, 50396712k free,   322776k buffers
Swap: 33030136k total,        0k used, 33030136k free,  9732668k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
20084 user  39  19  475m 372m  21m R 99.0  0.6 340:26.48 solver

Length of entries:

[root@test ~]# getent netgroup test | wc
      1      71    1133
[root@test ~]# getent netgroup test2 | wc
      1      15     979

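So any sufficiently large netgroup seems to trigger it, although the threshold is not exactly 1024 bytes of getent output: test still works at 1133 bytes but fails at 1149, while test2 works at 979 bytes and fails at 1048.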
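
The byte counts from wc above can be reproduced offline, which may help when sizing further test cases. This sketch assumes the layout visible in the getent output (netgroup name padded to 21 columns, then one " (host,user,domain)" per triple, then a newline); the function name is hypothetical.

```python
def getent_netgroup_line(name, triples):
    """Build the line getent prints for a netgroup, assuming the layout above."""
    body = "".join(f" ({host},{user},{domain})" for host, user, domain in triples)
    return f"{name:<21}" + body + "\n"


long_domain = "alongerdomaintoneedlessnetgroupentriestotriggertheproblem"
test = [(f"test{i}", "-", "test") for i in range(1, 72)]          # test1..test71
test2 = [(f"test{i}", "-", long_domain) for i in range(1, 16)]    # test1..test15

for name, triples in (("test", test), ("test", test[:70]),
                      ("test2", test2), ("test2", test2[:14])):
    print(name, len(triples), len(getent_netgroup_line(name, triples)))
```

The ordering of the triples differs from the LDAP output, but that does not affect the length.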

Hope that helps,
Michael

Comment 5 Siddhesh Poyarekar 2014-04-30 09:15:09 UTC
(In reply to Michael Weiser from comment #4)
> Hi Padesh,

That's not me :)

> Two test netgroups:

Thanks, that helped.  I have posted a patch upstream for review:

https://sourceware.org/ml/libc-alpha/2014-04/msg00661.html

You should be in cc as well.  Your analysis is correct and your fix should work too, but I went for a different approach in the fix because NSS_STATUS_TRYAGAIN is indeed the correct status in such cases.  The netgroups bits used NSS_STATUS_UNAVAIL incorrectly.

Comment 7 Michael Weiser 2014-04-30 09:38:44 UTC
Hello *Siddhesh*,

> > Hi Padesh,
> 
> That's not me :)

Sorry, momentary loss of all brain functions. Sincere apologies.

> > Two test netgroups:
> 
> Thanks, that helped.  I have posted a patch upstream for review:
> 
> https://sourceware.org/ml/libc-alpha/2014-04/msg00661.html
> 
> You should be in cc as well.  Your analysis is correct and your fix should
> work too, but I went for a different approach in the fix because
> NSS_STATUS_TRYAGAIN is indeed the correct status in such cases.  The
> netgroups bits used NSS_STATUS_UNAVAIL incorrectly.

Cool. Thanks! Does anything need doing to have this backported to RHEL6, i.e. have the customer open a Call with RedHat or somesuch?

Bye,
Michael

Comment 8 Siddhesh Poyarekar 2014-04-30 10:46:47 UTC
(In reply to Michael Weiser from comment #7)
> Sorry, momentary loss of all brain functions. Sincere apologies.

No worries :)

> Cool. Thanks! Does anything need doing to have this backported to RHEL6,
> i.e. have the customer open a Call with RedHat or somesuch?

Raising a ticket with Red Hat technical support would be beneficial because it helps prioritize the bug correctly.

Comment 9 Michael Weiser 2014-05-02 09:23:12 UTC
> Raising a ticket with Red Hat technical support would be beneficial because
> it helps prioritize the bug correctly.

Done. Case # 01084463.

Comment 12 errata-xmlrpc 2014-10-14 04:43:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-1391.html