Bug 1087833
Summary: | nscd-2.12-1.132.el6 enters busy loop on long netgroup entry via nss_ldap of nslcd
---|---
Product: | Red Hat Enterprise Linux 6
Reporter: | Michael Weiser <m.weiser>
Component: | glibc
Assignee: | Siddhesh Poyarekar <spoyarek>
Status: | CLOSED ERRATA
QA Contact: | Arjun Shankar <ashankar>
Severity: | high
Priority: | unspecified
Version: | 6.5
CC: | ashankar, codonell, fweimer, mfranc, mnewsome, pfrankli, spoyarek
Target Milestone: | rc
Hardware: | All
OS: | Linux
Fixed In Version: | glibc-2.12-1.144.el6
Doc Type: | Bug Fix
: | 1173537
Last Closed: | 2014-10-14 04:43:43 UTC
Type: | Bug
Bug Blocks: | 1173537
Description
Michael Weiser
2014-04-15 12:24:52 UTC
Created attachment 886459
fix nscd tryagain busy loop
I was able to reproduce the infinite loop by having a very long combination of user, host and domain in a single triplet, but that triplet does not give me a valid result without nscd. This was expected because a valid triplet should fit in 1K given that the components of the triplet (i.e. the hostname, username and domain name) have defined limits well within 1K. The getent command does not work without nscd because it uses getnetgrent(), which in turn assumes this static limit of 1K and fails.

Given that ldap supports such long entries, there could be a case for adding support for such long entries, but adding such support would mean enhancing getnetgrent as well.

Of course, I'd like to know if you're seeing the same scenario that I described, i.e. the netgroup coming up empty without nscd. If it is not (which I assume it is since you mentioned the timeout and the direct query resulting in the correct output) then could you share a sample netgroup entry that we can use to try and figure out what is different?

Hi Padesh,

> Of course, I'd like to know if you're seeing the same scenario that I
> described, i.e. the netgroup coming up empty without nscd. If it is not

I've played around with very large triplets as well and see segfaults with that. See https://bugzilla.redhat.com/show_bug.cgi?id=1087838. But that's not what I see and do with this LDAP bug.

> (which I assume it is since you mentioned the timeout and the direct query
> resulting in the correct output) then could you share a sample netgroup
> entry that we can use to try and figure out what is different?
nscd.conf:

[root@test ~]# grep netgroup /etc/nscd.conf
enable-cache netgroup yes
positive-time-to-live netgroup 28800
negative-time-to-live netgroup 20
suggested-size netgroup 211
check-files netgroup yes
persistent netgroup yes
shared netgroup yes
max-db-size netgroup 33554432

Two test netgroups:

[root@test ~]# ldapsearch -H ldaps://ldapserver.domain:636/ -b dc=domain -xLLL cn=test
dn: cn=test,ou=netgroup,dc=domain
cn: test
objectClass: top
objectClass: nisNetgroup
nisNetgroupTriple: (test1,-,test)
nisNetgroupTriple: (test10,-,test)
nisNetgroupTriple: (test11,-,test)
nisNetgroupTriple: (test12,-,test)
nisNetgroupTriple: (test13,-,test)
nisNetgroupTriple: (test14,-,test)
nisNetgroupTriple: (test15,-,test)
nisNetgroupTriple: (test16,-,test)
nisNetgroupTriple: (test17,-,test)
nisNetgroupTriple: (test18,-,test)
nisNetgroupTriple: (test19,-,test)
nisNetgroupTriple: (test2,-,test)
nisNetgroupTriple: (test20,-,test)
nisNetgroupTriple: (test21,-,test)
nisNetgroupTriple: (test22,-,test)
nisNetgroupTriple: (test23,-,test)
nisNetgroupTriple: (test24,-,test)
nisNetgroupTriple: (test25,-,test)
nisNetgroupTriple: (test26,-,test)
nisNetgroupTriple: (test27,-,test)
nisNetgroupTriple: (test28,-,test)
nisNetgroupTriple: (test29,-,test)
nisNetgroupTriple: (test3,-,test)
nisNetgroupTriple: (test30,-,test)
nisNetgroupTriple: (test4,-,test)
nisNetgroupTriple: (test5,-,test)
nisNetgroupTriple: (test6,-,test)
nisNetgroupTriple: (test7,-,test)
nisNetgroupTriple: (test8,-,test)
nisNetgroupTriple: (test9,-,test)
nisNetgroupTriple: (test31,-,test)
nisNetgroupTriple: (test32,-,test)
nisNetgroupTriple: (test33,-,test)
nisNetgroupTriple: (test34,-,test)
nisNetgroupTriple: (test35,-,test)
nisNetgroupTriple: (test36,-,test)
nisNetgroupTriple: (test37,-,test)
nisNetgroupTriple: (test38,-,test)
nisNetgroupTriple: (test39,-,test)
nisNetgroupTriple: (test40,-,test)
nisNetgroupTriple: (test41,-,test)
nisNetgroupTriple: (test42,-,test)
nisNetgroupTriple: (test43,-,test)
nisNetgroupTriple: (test44,-,test)
nisNetgroupTriple: (test45,-,test)
nisNetgroupTriple: (test46,-,test)
nisNetgroupTriple: (test47,-,test)
nisNetgroupTriple: (test48,-,test)
nisNetgroupTriple: (test49,-,test)
nisNetgroupTriple: (test50,-,test)
nisNetgroupTriple: (test51,-,test)
nisNetgroupTriple: (test52,-,test)
nisNetgroupTriple: (test53,-,test)
nisNetgroupTriple: (test54,-,test)
nisNetgroupTriple: (test55,-,test)
nisNetgroupTriple: (test56,-,test)
nisNetgroupTriple: (test57,-,test)
nisNetgroupTriple: (test58,-,test)
nisNetgroupTriple: (test59,-,test)
nisNetgroupTriple: (test60,-,test)
nisNetgroupTriple: (test61,-,test)
nisNetgroupTriple: (test62,-,test)
nisNetgroupTriple: (test63,-,test)
nisNetgroupTriple: (test64,-,test)
nisNetgroupTriple: (test65,-,test)
nisNetgroupTriple: (test66,-,test)
nisNetgroupTriple: (test67,-,test)
nisNetgroupTriple: (test68,-,test)
nisNetgroupTriple: (test69,-,test)
nisNetgroupTriple: (test70,-,test)
nisNetgroupTriple: (test71,-,test)

dn: cn=test2,ou=netgroup,dc=domain
cn: test2
objectClass: top
objectClass: nisNetgroup
nisNetgroupTriple: (test1,-,alongerdomaintoneedlessnetgroupentriestotriggerthe
 problem)
nisNetgroupTriple: (test10,-,alongerdomaintoneedlessnetgroupentriestotriggerth
 eproblem)
nisNetgroupTriple: (test11,-,alongerdomaintoneedlessnetgroupentriestotriggerth
 eproblem)
nisNetgroupTriple: (test12,-,alongerdomaintoneedlessnetgroupentriestotriggerth
 eproblem)
nisNetgroupTriple: (test13,-,alongerdomaintoneedlessnetgroupentriestotriggerth
 eproblem)
nisNetgroupTriple: (test14,-,alongerdomaintoneedlessnetgroupentriestotriggerth
 eproblem)
nisNetgroupTriple: (test2,-,alongerdomaintoneedlessnetgroupentriestotriggerthe
 problem)
nisNetgroupTriple: (test3,-,alongerdomaintoneedlessnetgroupentriestotriggerthe
 problem)
nisNetgroupTriple: (test4,-,alongerdomaintoneedlessnetgroupentriestotriggerthe
 problem)
nisNetgroupTriple: (test5,-,alongerdomaintoneedlessnetgroupentriestotriggerthe
 problem)
nisNetgroupTriple: (test6,-,alongerdomaintoneedlessnetgroupentriestotriggerthe
 problem)
nisNetgroupTriple: (test7,-,alongerdomaintoneedlessnetgroupentriestotriggerthe
 problem)
nisNetgroupTriple: (test8,-,alongerdomaintoneedlessnetgroupentriestotriggerthe
 problem)
nisNetgroupTriple: (test9,-,alongerdomaintoneedlessnetgroupentriestotriggerthe
 problem)
nisNetgroupTriple: (test15,-,alongerdomaintoneedlessnetgroupentriestotriggerth
 eproblem)

Restart nscd with clean cache and getent the groups while timing how long that takes:

[root@test ~]# killall nscd ; rm -f /var/db/nscd/* ; nscd
nscd: no process killed
[root@test ~]# time getent netgroup test
test (test1,-,test) (test10,-,test) (test11,-,test) (test12,-,test) (test13,-,test) (test14,-,test) (test15,-,test) (test16,-,test) (test17,-,test) (test18,-,test) (test19,-,test) (test2,-,test) (test20,-,test) (test21,-,test) (test22,-,test) (test23,-,test) (test24,-,test) (test25,-,test) (test26,-,test) (test27,-,test) (test28,-,test) (test29,-,test) (test3,-,test) (test30,-,test) (test4,-,test) (test5,-,test) (test6,-,test) (test7,-,test) (test8,-,test) (test9,-,test) (test31,-,test) (test32,-,test) (test33,-,test) (test34,-,test) (test35,-,test) (test36,-,test) (test37,-,test) (test38,-,test) (test39,-,test) (test40,-,test) (test41,-,test) (test42,-,test) (test43,-,test) (test44,-,test) (test45,-,test) (test46,-,test) (test47,-,test) (test48,-,test) (test49,-,test) (test50,-,test) (test51,-,test) (test52,-,test) (test53,-,test) (test54,-,test) (test55,-,test) (test56,-,test) (test57,-,test) (test58,-,test) (test59,-,test) (test60,-,test) (test61,-,test) (test62,-,test) (test63,-,test) (test64,-,test) (test65,-,test) (test66,-,test) (test67,-,test) (test68,-,test) (test69,-,test) (test70,-,test) (test71,-,test)

real 0m5.007s
user 0m0.000s
sys 0m0.002s
[root@test ~]# time getent netgroup test2
test2 (test1,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test10,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test11,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test12,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test13,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test14,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test2,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test3,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test4,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test5,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test6,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test7,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test8,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test9,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test15,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem)

real 0m5.003s
user 0m0.002s
sys 0m0.000s

nscd is hogging two CPUs now:

[root@test ~]# top -n 1 -b | head -8
top - 10:13:37 up 19 days, 32 min, 2 users, load average: 13.31, 12.52, 12.21
Tasks: 559 total, 13 running, 545 sleeping, 0 stopped, 1 zombie
Cpu(s): 11.6%us, 10.5%sy, 15.6%ni, 62.2%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 65922808k total, 15001704k used, 50921104k free, 291304k buffers
Swap: 33030136k total, 0k used, 33030136k free, 9209884k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
23029 nscd 20 0 619m 1268 892 S 198.7 0.0 0:33.44 nscd

Size of the entries returned:

[root@test ~]# getent netgroup test | wc
1 72 1149
[root@test ~]# getent netgroup test2 | wc
1 16 1048

Remove test71 from netgroup test and test15 from netgroup test2 and run the test again:

[root@test ~]# killall nscd ; rm -f /var/db/nscd/* ; nscd
nscd: no process killed
[root@test ~]# time getent netgroup test
test (test1,-,test) (test10,-,test) (test11,-,test) (test12,-,test) (test13,-,test) (test14,-,test) (test15,-,test) (test16,-,test) (test17,-,test) (test18,-,test) (test19,-,test) (test2,-,test) (test20,-,test) (test21,-,test) (test22,-,test) (test23,-,test) (test24,-,test) (test25,-,test) (test26,-,test) (test27,-,test) (test28,-,test) (test29,-,test) (test3,-,test) (test30,-,test) (test4,-,test) (test5,-,test) (test6,-,test) (test7,-,test) (test8,-,test) (test9,-,test) (test31,-,test) (test32,-,test) (test33,-,test) (test34,-,test) (test35,-,test) (test36,-,test) (test37,-,test) (test38,-,test) (test39,-,test) (test40,-,test) (test41,-,test) (test42,-,test) (test43,-,test) (test44,-,test) (test45,-,test) (test46,-,test) (test47,-,test) (test48,-,test) (test49,-,test) (test50,-,test) (test51,-,test) (test52,-,test) (test53,-,test) (test54,-,test) (test55,-,test) (test56,-,test) (test57,-,test) (test58,-,test) (test59,-,test) (test60,-,test) (test61,-,test) (test62,-,test) (test63,-,test) (test64,-,test) (test65,-,test) (test66,-,test) (test67,-,test) (test68,-,test) (test69,-,test) (test70,-,test)

real 0m0.003s
user 0m0.000s
sys 0m0.001s
[root@test ~]# time getent netgroup test2
test2 (test1,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test10,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test11,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test12,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test13,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test14,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test2,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test3,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test4,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test5,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test6,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test7,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test8,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem) (test9,-,alongerdomaintoneedlessnetgroupentriestotriggertheproblem)

real 0m0.002s
user 0m0.000s
sys 0m0.000s
[root@test ~]# top -n 1 -b | head -8
top - 10:26:48 up 19 days, 45 min, 2 users, load average: 12.55, 13.12, 12.87
Tasks: 555 total, 13 running, 541 sleeping, 0 stopped, 1 zombie
Cpu(s): 11.6%us, 10.5%sy, 15.6%ni, 62.2%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 65922808k total, 15526096k used, 50396712k free, 322776k buffers
Swap: 33030136k total, 0k used, 33030136k free, 9732668k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20084 user 39 19 475m 372m 21m R 99.0 0.6 340:26.48 solver

Length of entries:

[root@test ~]# getent netgroup test | wc
1 71 1133
[root@test ~]# getent netgroup test2 | wc
1 15 979

So, any sufficiently large netgroup seems to do it, although it doesn't happen exactly at exceeding 1024 bytes of length.

Hope that helps,
Michael

(In reply to Michael Weiser from comment #4)
> Hi Padesh,

That's not me :)

> Two test netgroups:

Thanks, that helped. I have posted a patch upstream for review:

https://sourceware.org/ml/libc-alpha/2014-04/msg00661.html

You should be in cc as well. Your analysis is correct and your fix should work too, but I went for a different approach in the fix because NSS_STATUS_TRYAGAIN is indeed the correct status in such cases. The netgroups bits used NSS_STATUS_UNAVAIL incorrectly.

Hello *Siddhesh*,

> > Hi Padesh,
>
> That's not me :)

Sorry, momentary loss of all brain functions. Sincere apologies.

> > Two test netgroups:
>
> Thanks, that helped. I have posted a patch upstream for review:
>
> https://sourceware.org/ml/libc-alpha/2014-04/msg00661.html
>
> You should be in cc as well. Your analysis is correct and your fix should
> work too, but I went for a different approach in the fix because
> NSS_STATUS_TRYAGAIN is indeed the correct status in such cases. The
> netgroups bits used NSS_STATUS_UNAVAIL incorrectly.

Cool. Thanks! Does anything need doing to have this backported to RHEL6, i.e. have the customer open a Call with RedHat or somesuch?
Bye,
Michael

(In reply to Michael Weiser from comment #7)
> Sorry, momentary loss of all brain functions. Sincere apologies.

No worries :)

> Cool. Thanks! Does anything need doing to have this backported to RHEL6,
> i.e. have the customer open a Call with RedHat or somesuch?

Raising a ticket with Red Hat technical support would be beneficial because it helps prioritize the bug correctly.

> Raising a ticket with Red Hat technical support would be beneficial because
> it helps prioritize the bug correctly.
Done. Case # 01084463.
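
For readers following the NSS_STATUS_TRYAGAIN discussion in the comments above, here is a toy model of the buffer-retry convention, written in Python for brevity. It is not glibc or nscd code: `fake_backend`, `lookup`, the 1024-byte starting size, and the doubling policy are all assumptions made for illustration. It only shows why a caller that sees "try again, buffer too small" has to enlarge the buffer before retrying, and how retrying with an unchanged buffer never terminates unless something bounds the loop.

```python
from enum import Enum


class Status(Enum):
    # Names mirror the NSS status codes mentioned in the comments.
    SUCCESS = "NSS_STATUS_SUCCESS"
    TRYAGAIN = "NSS_STATUS_TRYAGAIN"


def fake_backend(entry: bytes, buflen: int):
    """Hypothetical backend: needs room for the entry plus a NUL byte."""
    if len(entry) + 1 > buflen:
        # Buffer too small: ask the caller to retry with more space.
        return Status.TRYAGAIN, None
    return Status.SUCCESS, entry


def lookup(entry: bytes, buflen: int = 1024, grow: bool = True,
           max_attempts: int = 16):
    """Retry the backend until it succeeds or max_attempts is reached.

    With grow=False the buffer never changes, so a too-long entry fails
    on every attempt; max_attempts is the only thing stopping the loop.
    """
    for attempt in range(1, max_attempts + 1):
        status, data = fake_backend(entry, buflen)
        if status is Status.SUCCESS:
            return data, buflen, attempt
        if grow:
            buflen *= 2  # assumed policy: double the buffer and retry
    raise RuntimeError(
        f"gave up after {max_attempts} attempts with buflen={buflen}")


if __name__ == "__main__":
    long_entry = b"x" * 1500
    print(lookup(long_entry)[1:])        # buffer size used, attempts needed
    try:
        lookup(long_entry, grow=False, max_attempts=5)
    except RuntimeError as exc:
        print("without growth:", exc)
```

In this model the `grow=False` branch stands in for the busy loop: the same request is repeated with the same buffer and the same result.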
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-1391.html
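
The `wc` byte counts reported in comment #4 can be cross-checked with a few lines of Python. This is a sketch under one assumption that is not visible in the whitespace-collapsed transcript: that `getent netgroup` prints the group name left-justified in a 21-column field, followed by one ` (host,user,domain)` item per triple and a trailing newline. The helper name `line_length` is made up for this example. Under that assumption the computed sizes match all four reported counts, and they show that the failing and working cases are not separated by the 1024-byte mark.

```python
# Domain string used by the "test2" netgroup in the report.
DOMAIN = "alongerdomaintoneedlessnetgroupentriestotriggertheproblem"


def line_length(name: str, count: int, domain: str) -> int:
    """Length in bytes of the assumed getent output line.

    Hosts are test1..test<count>; the user field is "-" as in the report.
    The order of the triples does not affect the total length.
    """
    triples = "".join(f" (test{i},-,{domain})" for i in range(1, count + 1))
    return len(f"{name:<21}" + triples + "\n")


sizes = {
    "test, 71 triples (loops)": line_length("test", 71, "test"),
    "test, 70 triples (works)": line_length("test", 70, "test"),
    "test2, 15 triples (loops)": line_length("test2", 15, DOMAIN),
    "test2, 14 triples (works)": line_length("test2", 14, DOMAIN),
}

if __name__ == "__main__":
    for label, size in sizes.items():
        print(f"{label}: {size} bytes")
```

The 70-triple `test` group works at 1133 bytes while the 15-triple `test2` group loops at 1048 bytes, which is consistent with the observation that the threshold is not simply the printed line exceeding 1024 bytes.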