We've spotted a problem in DS 6.21 SP3, though I believe the same thing should happen with DS 7.1 SP2 since it has the same fix: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=173687 Here's the problem scenario: 1) A new connection is made to the Directory Server. The main daemon loop calls PR_Accept(). 2) PR_Accept() returns an error causing an error log message to be printed. This may be more common on HP-UX, as every so often PR_Accept() error message is printed in the error log regarding insufficient resources. This can be caused during heavy load and an impatient and "when the transport stack received RST just after SYN was received .. [and] RST comes before accept()". This is unrelated, the point is that we're printing an error message from the main accept loop. 3) The time has come to rotate the error log; before the PR_Accept() error message can be printed, we must rotate the log. 4) With the 173687 fix, the error log rotation code calls getpwnam() to get the passwd structure of the run-time user of the Directory Server in order to fix the ownership before rotating. 5) nsswitch is configured to use ldap before files. 6) The entity handling the ldap backend for the name service (LDAP-UX in our case) attempts to contact Directory Server to see if the local user exists in the LDAP backend. This connection hangs since the daemon loop is hung and not accepting connections. 7) Any further connections to the Directory Server also hang since the daemon loop is blocked, waiting on the error log rotation to complete. There are workarounds: extend the error log rotation period to delay the problem, ensure the ldap backend is not used for the lookup of the nsslapd-localuser, and set a timeout on the entity handling the ldap name service lookup. But calling getpwnam() when we're up and running seems like an invitation to cyclical dependency problems since one common use of DS is as a name service. Any ideas for a permanent fix? Maybe we can stash the uid and gid of the nsslapd-localuser in some global when we use getpwnam() at startup?
Created attachment 161709 [details] cvs diff (ldapserver) Thanks, Ulf, for reporting the problem and the suggestion to fix the bug. Following your advice, I'm moving the getpwnam to the startup time and introducing slapdFrontendConfig->config_set_localuser to store the copy of "struct pw", which is reused unless localuser is reset. If you could review the diffs, I'd appreciate it. Fix description: protect_db.c -- instead of calling getpwnam, use slapdFrontendConfig->localuserinfo. main.c -- set slapdFrontendConfig->localuserinfo if it's NULL. If getpwnam(localuser) fails, we can capture it here. log.c -- instead of calling getpwnam, use slapdFrontendConfig->localuserinfo. libglobs.c -- set slapdFrontendConfig->localuserinfo in config_set_localuser, as well. slap.h -- added localuserinfo to slapdFrontendConfig_t
Ok, but please add the new member *localuserinfo as the last member of the structure, to minimize binary incompatability.
(In reply to comment #2) > Ok, but please add the new member *localuserinfo as the last member of the > structure, to minimize binary incompatability. Thanks, Rich. I've moved locauserinfo to the last of the structure. Index: slap.h =================================================================== RCS file: /cvs/dirsec/ldapserver/ldap/servers/slapd/slap.h,v retrieving revision 1.24 diff -t -w -U2 -r1.24 slap.h --- slap.h 3 Aug 2007 22:14:41 -0000 1.24 +++ slap.h 17 Aug 2007 02:03:43 -0000 @@ -1926,4 +1926,7 @@ char *ldapi_search_base_dn; /* base dn to search for mapped entries */ char *ldapi_auto_dn_suffix; /* suffix to be appended to auto gen DNs */ +#ifndef _WIN32 + struct passwd *localuserinfo; /* userinfo of localuser */ +#endif /* _WIN32 */ } slapdFrontendConfig_t; And there was some mistakes in my comment in comment #1. (sorry.) > I'm moving the getpwnam to the startup time and introducing slapdFrontendConfig->localuserinfo to store the copy of ^^^^^^^^^^^^^ "struct pw", which is reused unless localuser is reset.
Created attachment 161711 [details] cvs commit message Reviewed by Rich (Thank you!) Checked in into HEAD.
Wow, thanks for the fast patch!!! It looks good to me. I will apply to our 6.21 branch and spin a patch.
You are welcome, Ulf. I've tested the change with verbose printout to the errors log with 1MB error log size limit + rsearch. Verified log lotations were made without any problems. But, I don't have the test env including the DS in name service switch handy... If possible, could you run the test and update us? Thank you in advance!!
I have tested with ldap name service configured to point to the same server, and I caused the accept loop to print messages for the specific deadlock scenario. Looks good!
Thanks a lot, Ulf! *relieved...* :)
Ulf, would you consider this bug to be verified in RHDS 8.0?
Yes, looks fixed to me.