From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; SunOS sun4u; en-US; rv:1.0rc1) Gecko/20020418

Description of problem:
After encountering a large group in group.db, getgrent() will spin after the last entry, consuming CPU and memory without bound.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Create a long group entry (~1000 characters) anywhere in group source file
2. Set group entry in nsswitch.conf to include "db" (mine says "files db")
3. Make db files in /var/db
4. Run a program that calls getgrent iteratively, like "id -Gn user" or "perl -e 'setgrent; while(@ent=getgrent) { print join(":", @ent), "\n"; } print "DONE\n"; endgrent;'"

Actual Results:
The perl one-liner listed above, for instance, gets past the large group, and all the groups, but never to the DONE statement -- it hangs on getgrent. Simpler things like 'id -Gn' seem to hang as well. In any case, the processes are not merely hanging, but looping and consuming CPU and memory, seemingly without bounds (I killed them after 1 GB). Presumably they would get ENOMEM after consuming all memory and swap, but the system isn't exactly happy at that point.

Expected Results:
The last getgrent() should return NULL, and processes should therefore know that they've reached the end of the group list.

Additional info:
I was able to produce similar behavior on RedHat 6.2, with spinning CPU consumption, but without the ever-growing memory. I was not able to reproduce the problem on a Debian machine. I therefore assume that this is a bug in either one of the patches RedHat applies or in db4, which RedHat seems to be using.

The problem occurs no matter the position in which the large group appears in the db, so long as it is there. Without it, the problem goes away. It appears to be tied to the length of the entry, rather than, for instance, the number of users in the entry.
It does not occur with a large entry in the group flat file (that's why I think it's nss_db). The spinning and memory consumption do not occur until the program calls getgrent() *after* the getgrent() which has returned the last group.

I'm no expert on these things, but here's my shaky theory of what is happening:

glibc has a wrapper function [__nss_getent() in nss/getnssent.c] that calls the nss_db internal version of getgrent() [lookup() in db-XXX.c]. It passes the internal function a buffer. If the group is too big to fit in the buffer provided, the internal function returns an error and sets errno to ERANGE. When the wrapper sees that the internal function has returned an error, it checks to see if errno is set to ERANGE; if so, then it reallocs the buffer and tries again.

Once errno is set to ERANGE, it does not get reset upon success. When lookup() tries to look up the record after the last record in the db, db->get returns 1 to indicate that the record does not exist, but, because of the nss_db-2.2-compat.patch, errno does not get reset. Therefore, the wrapper loops, and keeps realloc-ing the buffer.

What I can't figure out is why this would happen in nss_db without the compat patch, so maybe the theory is just a bunch of bunk.
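The theory above can be modeled in a few lines of C. This is only a simplified sketch, not the glibc or nss_db source: model_lookup, model_getent, count_retries_at_end, the 1000-byte record, the 256-byte starting buffer and the retry cap are all made up for illustration. It shows how a stale ERANGE would keep a retry-on-ERANGE wrapper growing its buffer once the database is exhausted.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model: a "database" holding one 1000-byte record.
   MAX_RETRIES exists only so the demonstration terminates. */
enum { RECORD_LEN = 1000, MAX_RETRIES = 16 };

/* Stand-in for the backend lookup. Returns 0 on success, -1 on failure.
   If set_enoent is 0, a miss leaves errno untouched (the suspected bug). */
static int model_lookup(int index, size_t buflen, int set_enoent)
{
    if (index == 0) {
        if (buflen < (size_t) RECORD_LEN) {
            errno = ERANGE;          /* buffer too small, caller should grow it */
            return -1;
        }
        return 0;                    /* success; errno is not reset */
    }
    if (set_enoent)
        errno = ENOENT;              /* the behavior a fix would add */
    return -1;                       /* past the last record */
}

/* Stand-in for the wrapper: retry with a doubled buffer while the
   backend fails and errno says ERANGE. Returns the number of retries. */
static int model_getent(int index, size_t *buflen, int set_enoent)
{
    int retries = 0;

    while (model_lookup(index, *buflen, set_enoent) != 0 && errno == ERANGE) {
        if (retries == MAX_RETRIES)
            break;                   /* artificial cap for the model only */
        *buflen *= 2;                /* models the realloc of the buffer */
        retries++;
    }
    return retries;
}

/* Fetch the large record, then ask for the record after the last one.
   Returns how many times the buffer was grown on that final call. */
int count_retries_at_end(int set_enoent)
{
    size_t buflen = 256;
    int first;

    errno = 0;
    first = model_getent(0, &buflen, set_enoent);
    assert(first == 2);              /* 256 -> 512 -> 1024 fits the record */
    return model_getent(1, &buflen, set_enoent);
}
```

In this model, count_retries_at_end(0) runs into the artificial cap, because errno still holds ERANGE from the large record, while count_retries_at_end(1) returns 0. Whether the real code paths behave exactly like this is the open question in the report.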
Anyway, that's my current theory, but I'm unable to verify it, because even "rpm --rebuild nss_db-2.2-14.src.rpm" is erroring out with:

    + popd
    /usr/src/redhat/BUILD/nss_db-2.2
    + CFLAGS=-O2 -march=i386 -mcpu=i686
    + export CFLAGS
    + CXXFLAGS=-O2 -march=i386 -mcpu=i686
    + export CXXFLAGS
    + FFLAGS=-O2 -march=i386 -mcpu=i686
    + export FFLAGS
    + ./configure i386-redhat-linux --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/usr/com --mandir=/usr/share/man --infodir=/usr/share/info --with-db=/usr/src/redhat/BUILD/nss_db-2.2/db-instroot
    [...]
    checking for db.h... yes
    checking for db_version in -ldb... no
    configure: error: *** Could not find Berkeley DB library.
    error: Bad exit status from /var/tmp/rpm-tmp.3344 (%build)

from config.log:

    configure: failed program was:
    #line 5345 "configure"
    #include "confdefs.h"
    /* Override any gcc2 internal prototype to avoid an error. */
    /* We use char because int might match the return type of a gcc2 builtin and then its argument prototype would still apply. */
    char db_version();
    int main() {
    db_version()
    ; return 0; }

    # nm /usr/src/redhat/BUILD/nss_db-2.2/db-instroot/lib/libdb.a | grep db_version
    00000004 T db_version_nssdb
             U db_version_nssdb

...so I believe that has something to do with the fact that db_version has been renamed db_version_nssdb in your included db4. Maybe whoever was working on this had another db library installed on the system, so configure found that one (like I said, I'm no expert, so who knows -- maybe it's all my fault).

At this point, however, it's probably time that I turn this over to you guys, before I head off on any more wild goose chases. I've rated this as severity high, because it seems to qualify as a serious memory leak, and it can rapidly take a system down into swapping hell.
Created attachment 59904: sloppy configure patch that gets configure to work for nss_db
Created attachment 59905: sloppy patch to db-XXX.c that seems to solve the problem
OK, so my initial hypothesis about this being a problem with RedHat's patch seems to have been wrong (due to a misunderstanding on my part about db->get's return values). It seems this bug may exist in the sources straight from GNU (which doesn't explain my success with Debian, but whatever, I was tired).

The attached patch [db.patch] makes lookup() in db-XXX.c set errno to ENOENT if a lookup fails. Without this, errno remains set to ERANGE, and __nss_getent() loops, realloc()-ing till the cows come home, as described in my initial report.

However, the patch is incomplete, since it only sets errno in the one case ("case DB_NOTFOUND:") that matters to me. Given __nss_getent()'s expectations, it should, IMHO, probably set errno to an appropriate value in the other cases as well, and someone with closer knowledge of this package should probably have it do that, lest someone else have similar problems. I'd take care of case default, too, but I'm unsure of what a good default errno would be, which is why I am deferring on the matter.
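The shape of the change described above can be sketched as follows. This is not the attached patch and not the real db-XXX.c code: the status enum, the MODEL_DB_* codes and model_map_db_result are hypothetical stand-ins, and the real DB_NOTFOUND value comes from Berkeley DB's db.h, which is not used here.

```c
#include <errno.h>

/* Hypothetical stand-ins for the NSS status values and the
   Berkeley DB return codes; the numbers are placeholders. */
enum model_status { MODEL_SUCCESS, MODEL_NOTFOUND, MODEL_UNAVAIL };
enum { MODEL_DB_OK = 0, MODEL_DB_NOTFOUND = 1 };

/* Translate a database return code into a status for the caller.
   On "not found", errno is overwritten so that a stale ERANGE from
   an earlier call cannot be mistaken for "buffer too small". */
enum model_status model_map_db_result(int db_err)
{
    switch (db_err) {
    case MODEL_DB_OK:
        return MODEL_SUCCESS;
    case MODEL_DB_NOTFOUND:
        errno = ENOENT;              /* the one case the report's patch covers */
        return MODEL_NOTFOUND;
    default:
        /* The report leaves the right errno for this case undecided,
           so this sketch does not touch errno here either. */
        return MODEL_UNAVAIL;
    }
}
```

The default branch is left alone on purpose, which mirrors the gap the comment above points out: any other failure would still pass a stale errno back to the wrapper.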
Red Hat apologizes that these issues have not been resolved yet. We do want to make sure that no important bugs slip through the cracks.

Red Hat Linux 7.3 and Red Hat Linux 9 are no longer supported by Red Hat, Inc. They are maintained by the Fedora Legacy project (http://www.fedoralegacy.org/) for security updates only. If this is a security issue, please reassign to the 'Fedora Legacy' product in bugzilla. Please note that Legacy security update support for these products will stop on December 31st, 2006.

If this is not a security issue, please check if this issue is still present in a current Fedora Core release. If so, please change the product and version to match, and check the box indicating that the requested information has been provided.

If you are currently still running Red Hat Linux 7.3 or 9, please note that Fedora Legacy security update support for these products will stop on December 31st, 2006. You are strongly advised to upgrade to a current Fedora Core release or Red Hat Enterprise Linux or comparable. Some information on which option may be right for you is available at http://www.redhat.com/rhel/migrate/redhatlinux/.

Any bug still open against Red Hat Linux 7.3 or 9 at the end of 2006 will be closed 'CANTFIX'. Again, if this bug still exists in a current release, or is a security issue, please change the product as necessary. We thank you for your help, and apologize again that we haven't handled these issues to this point.
Red Hat Linux is no longer supported by Red Hat, Inc. If you are still running Red Hat Linux, you are strongly advised to upgrade to a current Fedora Core release or Red Hat Enterprise Linux or comparable. Some information on which option may be right for you is available at http://www.redhat.com/rhel/migrate/redhatlinux/. Closing as CANTFIX.