Bug 173687
Summary: | deadlock caused by error log rotation and logging | ||
---|---|---|---|
Product: | Red Hat Directory Server | Reporter: | Noriko Hosoi <nhosoi> |
Component: | Directory Server | Assignee: | Noriko Hosoi <nhosoi> |
Status: | CLOSED NEXTRELEASE | QA Contact: | Orla Hegarty <ohegarty> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 7.2 | CC: | nkinder, ohegarty, rmeggins, tngan |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2005-12-02 20:07:53 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 152373, 159328, 182367, 182630, 184342, 184343, 240316 | ||
Attachments: |
Ulf wrote: Further info on this, another customer ran into it and I think I think they did it via another bug. Here are simplified steps for both: Bug #1, log rotation during early startup can result in wrong ownership of logs 1) As root, install DS and set it to run as something non-root, for example user "ulf", group "users". 2) Set errors log rotation policy to 1 minute for faster testing. 3) Wait a minute and then run "bak2db /non/existant/path". This will fail with the regular "..being used by another slapd process" message since the server is running. However, since it's time to rotate the log it does that first, and the new log gets owned by root because it hasn't setuid'd yet I think. ...ns-slapd doesn't show any problem at this point until it's time to rotate again... Bug #2, the log proc deadlocks itself if log rotation fails 4) Wait a minute and add "description: test" to an entry and then try to add the same value again. This will cause ns-slapd to try to log "value already exists" but first rotate its log again, and it'll run into the root permissions problem. 5) Try another modify that adds another "description: test" to the same entry, the operation hangs when it tries to log the fact that it can't rotate. 6) Try more modifies that add "description: test" to the same entry, each of these operations will hang until there are no more worker threads available. Even without bug #1 there are other ways log rotation can fail, it's happened to two customers for different reasons now. I don't know what is the right thing to do if rotation fails. Continue running without error logging? Exit? S.O.S. smoke signals... David wrote: Do we know why we're deadlocking ? (I can see that we're waiting on the log file lock, but I'm not sure who left it locked and why). On reflection I think if we can't rotate the log then the server should exit. Logging seems to me to be a necessary part of server functionality and if we can't provide logging then we should give up executing. I think this is better than limping along providing service but not logging. Ming Lang wrote: > It's the same thread deadlocked itself. This is > what happened in short: That's strange. I thought we were using reentrant locks !?!? > > 1. acquires a write lock to the errors log and try to > write some error message > 2. the errors log needs rotate, it closes the old log > and opens a new log but the open fails > 3. tries to acquire another write lock to the errors log > and to write the open fail errors > David Boreham wrote: >> Ming Lang wrote: > >>>> It's the same thread deadlocked itself. This is >>>> what happened in short: >> >> That's strange. I thought we were using reentrant locks !?!? Crawling through NSPR mutex initialization, I don't think we set any attributes on this lock and the default behavior in that case is undefined for recursive locking according to the man page on HP-UX. I just tested it, seems a thread trying to aquire a default lock twice does deadlock. I also think exiting might be the right thing to do if we lose error logging, if we can record that somehow for the user. It would be an additional support burden if customers don't know if they crashed or exited for this new reason. Created attachment 121305 [details]
cvs diff servers/slapd/{og.c,main.c,util.c,proto-slap.h} (Directory621RtmBranch)
Modified to change the owner to the "localuser" if the error log file is not
owned by the user.
Test case:
Install the DS as root and set the DS user something else (e.g., nobody)
Set error log lotation time 1 minute:
nsslapd-errorlog-logrotationtime: 1
nsslapd-errorlog-logrotationtimeunit: minute
Repeat some db command which causes writing into the error log. (I ran db2bak
<path_that_does_not_exist> in the endless loop)
I ran this test case for 2.5 days and did not see the deadlock (nor errors file
owned by non-localuser):
-rw------- 1 nobody nobody 5330 Nov 21 09:11 logs/errors
-rw------- 1 nobody nobody 12712 Nov 21 09:11
logs/errors.20051121-091008
-rw------- 1 nobody nobody 13162 Nov 21 09:10
logs/errors.20051121-090903
-rw------- 1 nobody nobody 13162 Nov 21 09:08
logs/errors.20051121-090758
-rw------- 1 nobody nobody 12712 Nov 21 09:07
logs/errors.20051121-090655
-rw------- 1 nobody nobody 13053 Nov 21 09:06
logs/errors.20051121-090551
-rw------- 1 nobody nobody 12821 Nov 21 09:05
logs/errors.20051121-090448
-rw------- 1 nobody nobody 13162 Nov 21 09:04
logs/errors.20051121-090344
-rw------- 1 nobody nobody 13162 Nov 21 09:03
logs/errors.20051121-090239
-rw------- 1 nobody nobody 13162 Nov 21 09:02
logs/errors.20051121-090135
-rw------- 1 nobody nobody 13162 Nov 21 09:01
logs/errors.20051121-090030
-rw------- 1 nobody nobody 12712 Nov 21 09:00
logs/errors.20051121-085927
Created attachment 121379 [details]
cvs diff servers/slapd/{og.c,main.c,util.c,proto-slap.h} (Directory71RtmBranch)
Created attachment 121408 [details]
cvs diff servers/slapd/{og.c,main.c,util.c,proto-slap.h} (trunk)
Created attachment 121409 [details]
cvs commit message
Checked in into Directory621RtmBranch, Directory71RtmBranch, and trunk.
Created attachment 121424 [details]
cvs diff config.c
The diff is on the trunk, but I think this is common among the older branches.
In the fix made in this bug report, the onwership is changed in
log__open_errorlogfile, which is called via slapd_bootstrap_config in main. At
that time, localuser is not set yet. The configuration parameter is set in
setup_internal_backends, which is called after slapd_bootstrap_config.
To make the localuser info available, we have to force to set localuser in
slapd_bootstrap_config.
Created attachment 121433 [details]
cvs commit message
Reviewed by Nathan (Thank you!)
Checked in into Directory621RtmBranch, Directory71RtmBranch, and the trunk.
Verified the log file permissions and hang bugs as listed above are fixed against 20051125.1.qa on RHEL AS 2.1, HP-UX 11i and Solaris 9 Steps to reproduce: 1. Modify the log rotation configuration to rotate the logs every 1 minutes and restart the server to have this change take effect. 2. Constantly add the same attribute to the server which will fail as it already exists - this will be logged to the error log 3. Constantly attempt to do an offline backup of the server which causes a conflicting ns-slapd process to run for a short time and fail, logging to the error log 4. Every 3 minutes or so attempt to restart the server I have an automated script to do this, which I will attach. I will add this to a written test case in the Creative Destruction test plan. What I noticed from my test plan is that the hang and permissions problems with the error logs are fixed. The script can easily reproduce this with DS 6.21 and it is fixed on DS 6.21 SP 2. However I noticed a much rarer case where the permissions are ok, the client modification does not hang - clues that the bug occured but the server is hung. It's a slightly different case. The searches work but the server is still hung ..... a restart attempt won't shut the server down, threads are left behind. I saw this once. This bug goes into the DS7.1 SP2 as well as DS 6.21 SP3. (The fix was checked in on 11/23/2005.) Verified fixed in DS 7.21 SP 2 against all 5 supported platforms ( RHEL 3,4 and Solaris 9 - 32 bit. Solaris 9 and HP-UX 11i - 64 bit ) Test case has been backported to the 7.1 patch branch tet. *DOCS* When the Directory Server's effective user and the real user are different, there was a small window that creates log files owned by the real user, which should be owned by the effective user. The problem was fixed. Verified against DS 6.21 SP3, ran creativeD CDLV_7 on all platforms for 2 hours+. *** Bug 174852 has been marked as a duplicate of this bug. *** Bug already CLOSED. setting screened+ flag |
Description of problem: Reported by Ulf: Hello all. Has anyone had a chance to think about this one? To summarize, when log rotation happens and it fails to open the new log, it tries to log the fact that it was unable to open the log, and we go into a deadlock because the same thread tries to aquire another write lock. Ulf Ulf Weltman wrote: >> When a customer's connection table was full they were unable to rotate >> the errors log, and the server is in a hung state after. I was >> wondering if we were somehow allowing the errors log descriptor to get >> swiped while it's being rotated. This turned out to not be the case, >> and there were plenty of free descriptors in the reserved pool. >> >> The problem was caused by two performance related kernel patches for >> rapid file descriptors issuing (PHKL_25994 & PHKL_25995). The >> resolution was to apply a patch released in February this year: >> PHKL_32374 (The open(2) system call might not return the lowest >> numbered file descriptor available for a particular process.) >> >> Or, apply the workaround of lowering the connection tablesize. This >> provides for spare file descriptors at the tail end of the range and >> avoids the problem. >> >> However, one issue still remains. If the open() call fails when >> rotating error log, the server will be in a hung state. I think this >> is because we try to log the fact that we were unable to open the >> errors log. Perhaps we should simply shut down if we're unable to do >> errors logging? >> >> ... Or are there no real cases where the open() call would fail? You >> can duplicate the hang: >> - install a server as yourself >> - set errorlog rotation to 1 minute >> - set a breakpoint in log__open_errorlogfile() >> - perform LDAP op with schema violation like adding an entry with two >> "displayname" values >> - this gets logged to the errors log and if it's been a minute, it'll >> force rotation >> - while in breakpoint, chown the logs directory to root and continue >> in the debugger >> That's a hokey scenario but it's easier than finding and installing >> the obsolete kernel patches. :) >> >> thread 1 (system thread 4720): >> #0 0x7b00e8f8 in __ksleep+0x10 () from /usr/lib/libc.2 >> #1 0x7ab1af88 in pthread_mutex_lock+0x16c () from /usr/lib/libpthread.1 >> #2 0x7ab84d40 in PR_Lock (lock=0xa5c90) at >> /home/ulf/src/nspr3/mozilla/nsprpub/pr/src/pthreads/ptsynch.c:190 >> #3 0x7a7ef84c in __rwl_acquire_write_lock (rp=0xb5238) at >> /home/ulf/src/621/ldap/ns/netsite/ldap/servers/slapd/rwlock.c:77 >> #4 0x7a792bf0 in slapd_log_error_proc_internal (subsystem=0x0, >> fmt=0x7a729890 "WARNING: can't open file %s. errno %d (%s)\n", >> ap_err=0x7f7f1ad8, ap_file=0x7f7f1ad8) at >> /home/ulf/src/621/ldap/ns/netsite/ldap/servers/slapd/log.c:1690 >> #5 0x7a792b54 in slapd_log_error_proc (subsystem=0x0, fmt=0x7a729890 >> "WARNING: can't open file %s. errno %d (%s)\n") at >> /home/ulf/src/621/ldap/ns/netsite/ldap/servers/slapd/log.c:1674 >> #6 0x7a79707c in log__open_errorlogfile (logfile_state=0, locked=1) >> at /home/ulf/src/621/ldap/ns/netsite/ldap/servers/slapd/log.c:3245 >> #7 0x7a792c4c in slapd_log_error_proc_internal (subsystem=0x0, >> fmt=0x8db38 "Not listening for new connections - too many fds open\n", >> ap_err=0x7f7f1198, ap_file=0x7f7f1198) at >> /home/ulf/src/621/ldap/ns/netsite/ldap/servers/slapd/log.c:1693#8 >> 0x7a792b54 in slapd_log_error_proc (subsystem=0x0, fmt=0x8db38 "Not >> listening for new connections - too many fds open\n") at >> /home/ulf/src/621/ldap/ns/netsite/ldap/servers/slapd/log.c:1674 >> #9 0x1e274 in setup_pr_read_pds (ct=0x5ded58, n_tcps=0x39d2c8, >> s_tcps=0x0, num_to_read=0x7f7f0eac) at >> /home/ulf/src/621/ldap/ns/netsite/ldap/servers/slapd/daemon.c:926 >> #10 0x1d58c in slapd_daemon (ports=0x7f7f0be8) at >> /home/ulf/src/621/ldap/ns/netsite/ldap/servers/slapd/daemon.c:580 >> #11 0x41214 in main (argc=5, argv=0x7f7f09fc) at >> /home/ulf/src/621/ldap/ns/netsite/ldap/servers/slapd/main.c:1066 Version-Release number of selected component (if applicable): How reproducible: Steps to Reproduce: 1. 2. 3. Actual results: Expected results: Additional info: