Bug 173687

Summary: deadlock caused by error log rotation and logging
Product: Red Hat Directory Server Reporter: Noriko Hosoi <nhosoi>
Component: Directory ServerAssignee: Noriko Hosoi <nhosoi>
Status: CLOSED NEXTRELEASE QA Contact: Orla Hegarty <ohegarty>
Severity: medium Docs Contact:
Priority: medium    
Version: 7.2CC: nkinder, ohegarty, rmeggins, tngan
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2005-12-02 20:07:53 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 152373, 159328, 182367, 182630, 184342, 184343, 240316    
Attachments:
Description Flags
cvs diff servers/slapd/{og.c,main.c,util.c,proto-slap.h} (Directory621RtmBranch)
none
cvs diff servers/slapd/{og.c,main.c,util.c,proto-slap.h} (Directory71RtmBranch)
none
cvs diff servers/slapd/{og.c,main.c,util.c,proto-slap.h} (trunk)
none
cvs commit message
none
cvs diff config.c
none
cvs commit message none

Description Noriko Hosoi 2005-11-18 22:02:50 UTC
Description of problem:
Reported by Ulf:
Hello all.  Has anyone had a chance to think about this one?  To
summarize, when log rotation happens and it fails to open the new log,
it tries to log the fact that it was unable to open the log, and we go
into a deadlock because the same thread tries to aquire another write lock.

Ulf

Ulf Weltman wrote:


>> When a customer's connection table was full they were unable to rotate
>> the errors log, and the server is in a hung state after.  I was
>> wondering if we were somehow allowing the errors log descriptor to get
>> swiped while it's being rotated.  This turned out to not be the case,
>> and there were plenty of free descriptors in the reserved pool.
>>
>> The problem was caused by two performance related kernel patches for
>> rapid file descriptors issuing (PHKL_25994 & PHKL_25995).  The
>> resolution was to apply a patch released in February this year:
>>    PHKL_32374 (The open(2) system call might not return the lowest
>> numbered file descriptor available for a particular process.)
>>
>> Or, apply the workaround of lowering the connection tablesize.  This
>> provides for spare file descriptors at the tail end of the range and
>> avoids the problem.
>>
>> However, one issue still remains.  If the open() call fails when
>> rotating error log, the server will be in a hung state.  I think this
>> is because we try to log the fact that we were unable to open the
>> errors log.  Perhaps we should simply shut down if we're unable to do
>> errors logging?
>>
>> ... Or are there no real cases where the open() call would fail?  You
>> can duplicate the hang:
>> - install a server as yourself
>> - set errorlog rotation to 1 minute
>> - set a breakpoint in log__open_errorlogfile()
>> - perform LDAP op with schema violation like adding an entry with two
>> "displayname" values
>> - this gets logged to the errors log and if it's been a minute, it'll
>> force rotation
>> - while in breakpoint, chown the logs directory to root and continue
>> in the debugger
>> That's a hokey scenario but it's easier than finding and installing
>> the obsolete kernel patches.  :) 
>>
>> thread 1 (system thread 4720):
>> #0  0x7b00e8f8 in __ksleep+0x10 () from /usr/lib/libc.2
>> #1  0x7ab1af88 in pthread_mutex_lock+0x16c () from /usr/lib/libpthread.1
>> #2  0x7ab84d40 in PR_Lock (lock=0xa5c90) at
>> /home/ulf/src/nspr3/mozilla/nsprpub/pr/src/pthreads/ptsynch.c:190
>> #3  0x7a7ef84c in __rwl_acquire_write_lock (rp=0xb5238) at
>> /home/ulf/src/621/ldap/ns/netsite/ldap/servers/slapd/rwlock.c:77
>> #4  0x7a792bf0 in slapd_log_error_proc_internal (subsystem=0x0,
>> fmt=0x7a729890 "WARNING: can't open file %s. errno %d (%s)\n",
>> ap_err=0x7f7f1ad8, ap_file=0x7f7f1ad8) at
>> /home/ulf/src/621/ldap/ns/netsite/ldap/servers/slapd/log.c:1690
>> #5  0x7a792b54 in slapd_log_error_proc (subsystem=0x0, fmt=0x7a729890
>> "WARNING: can't open file %s. errno %d (%s)\n") at
>> /home/ulf/src/621/ldap/ns/netsite/ldap/servers/slapd/log.c:1674
>> #6  0x7a79707c in log__open_errorlogfile (logfile_state=0, locked=1)
>> at /home/ulf/src/621/ldap/ns/netsite/ldap/servers/slapd/log.c:3245
>> #7  0x7a792c4c in slapd_log_error_proc_internal (subsystem=0x0,
>> fmt=0x8db38 "Not listening for new connections - too many fds open\n",
>> ap_err=0x7f7f1198, ap_file=0x7f7f1198) at
>> /home/ulf/src/621/ldap/ns/netsite/ldap/servers/slapd/log.c:1693#8 
>> 0x7a792b54 in slapd_log_error_proc (subsystem=0x0, fmt=0x8db38 "Not
>> listening for new connections - too many fds open\n") at
>> /home/ulf/src/621/ldap/ns/netsite/ldap/servers/slapd/log.c:1674
>> #9  0x1e274 in setup_pr_read_pds (ct=0x5ded58, n_tcps=0x39d2c8,
>> s_tcps=0x0, num_to_read=0x7f7f0eac) at
>> /home/ulf/src/621/ldap/ns/netsite/ldap/servers/slapd/daemon.c:926
>> #10 0x1d58c in slapd_daemon (ports=0x7f7f0be8) at
>> /home/ulf/src/621/ldap/ns/netsite/ldap/servers/slapd/daemon.c:580
>> #11 0x41214 in main (argc=5, argv=0x7f7f09fc) at
>> /home/ulf/src/621/ldap/ns/netsite/ldap/servers/slapd/main.c:1066

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 Noriko Hosoi 2005-11-18 22:04:00 UTC
Ulf wrote:
Further info on this, another customer ran into it and I think I think
they did it via another bug.  Here are simplified steps for both:

Bug #1, log rotation during early startup can result in wrong ownership
of logs
1) As root, install DS and set it to run as something non-root, for
example user "ulf", group "users".
2) Set errors log rotation policy to 1 minute for faster testing.
3) Wait a minute and then run "bak2db /non/existant/path".  This will
fail with the regular "..being used by another slapd process" message
since the server is running.  However, since it's time to rotate the log
it does that first, and the new log gets owned by root because it hasn't
setuid'd yet I think.

...ns-slapd doesn't show any problem at this point until it's time to
rotate again...

Bug #2, the log proc deadlocks itself if log rotation fails
4) Wait a minute and add "description: test" to an entry and then try to
add the same value again.  This will cause ns-slapd to try to log "value
already exists" but first rotate its log again, and it'll run into the
root permissions problem.
5) Try another modify that adds another "description: test" to the same
entry, the operation hangs when it tries to log the fact that it can't
rotate.
6) Try more modifies that add "description: test" to the same entry,
each of these operations will hang until there are no more worker
threads available.

Even without bug #1 there are other ways log rotation can fail, it's
happened to two customers for different reasons now.  I don't know what
is the right thing to do if rotation fails.  Continue running without
error logging?  Exit?  S.O.S. smoke signals...

Comment 2 Noriko Hosoi 2005-11-18 22:04:34 UTC
David wrote:
Do we know why we're deadlocking ? (I can see that we're waiting
on the log file lock, but I'm not sure who left it locked and why).

On reflection I think if we can't rotate the log then the server should
exit. Logging seems to me to be a necessary part of server functionality
and if we can't provide logging then we should give up executing. I think
this is better than limping along providing service but not logging. 

Comment 3 Noriko Hosoi 2005-11-18 22:05:27 UTC
Ming Lang wrote:

> It's the same thread deadlocked itself. This is
> what happened in short:


That's strange. I thought we were using reentrant locks !?!?

>
> 1. acquires a write lock to the errors log and try to
>    write some error message
> 2. the errors log needs rotate, it closes the old log
>    and opens a new log but the open fails
> 3. tries to acquire another write lock to the errors log
>    and to write the open fail errors
>


Comment 4 Noriko Hosoi 2005-11-18 22:19:10 UTC
David Boreham wrote:
>> Ming Lang wrote:
>
>>>> It's the same thread deadlocked itself. This is
>>>> what happened in short:
>>
>> That's strange. I thought we were using reentrant locks !?!?

Crawling through NSPR mutex initialization, I don't think we set any
attributes on this lock and the default behavior in that case is
undefined for recursive locking according to the man page on HP-UX.  I
just tested it, seems a thread trying to aquire a default lock twice
does deadlock.

I also think exiting might be the right thing to do if we lose error
logging, if we can record that somehow for the user.  It would be an
additional support burden if customers don't know if they crashed or
exited for this new reason.



Comment 5 Noriko Hosoi 2005-11-21 17:33:14 UTC
Created attachment 121305 [details]
cvs diff servers/slapd/{og.c,main.c,util.c,proto-slap.h} (Directory621RtmBranch)

Modified to change the owner to the "localuser" if the error log file is not
owned by the user.

Test case:
Install the DS as root and set the DS user something else (e.g., nobody)
Set error log lotation time 1 minute:
    nsslapd-errorlog-logrotationtime: 1
    nsslapd-errorlog-logrotationtimeunit: minute
Repeat some db command which causes writing into the error log.  (I ran db2bak
<path_that_does_not_exist> in the endless loop)

I ran this test case for 2.5 days and did not see the deadlock (nor errors file
owned by non-localuser):
-rw-------   1 nobody	nobody	    5330 Nov 21 09:11 logs/errors
-rw-------   1 nobody	nobody	   12712 Nov 21 09:11
logs/errors.20051121-091008
-rw-------   1 nobody	nobody	   13162 Nov 21 09:10
logs/errors.20051121-090903
-rw-------   1 nobody	nobody	   13162 Nov 21 09:08
logs/errors.20051121-090758
-rw-------   1 nobody	nobody	   12712 Nov 21 09:07
logs/errors.20051121-090655
-rw-------   1 nobody	nobody	   13053 Nov 21 09:06
logs/errors.20051121-090551
-rw-------   1 nobody	nobody	   12821 Nov 21 09:05
logs/errors.20051121-090448
-rw-------   1 nobody	nobody	   13162 Nov 21 09:04
logs/errors.20051121-090344
-rw-------   1 nobody	nobody	   13162 Nov 21 09:03
logs/errors.20051121-090239
-rw-------   1 nobody	nobody	   13162 Nov 21 09:02
logs/errors.20051121-090135
-rw-------   1 nobody	nobody	   13162 Nov 21 09:01
logs/errors.20051121-090030
-rw-------   1 nobody	nobody	   12712 Nov 21 09:00
logs/errors.20051121-085927

Comment 6 Noriko Hosoi 2005-11-23 01:18:11 UTC
Created attachment 121379 [details]
cvs diff servers/slapd/{og.c,main.c,util.c,proto-slap.h} (Directory71RtmBranch)

Comment 7 Noriko Hosoi 2005-11-23 17:53:55 UTC
Created attachment 121408 [details]
cvs diff servers/slapd/{og.c,main.c,util.c,proto-slap.h} (trunk)

Comment 8 Noriko Hosoi 2005-11-23 17:56:42 UTC
Created attachment 121409 [details]
cvs commit message

Checked in into Directory621RtmBranch, Directory71RtmBranch, and trunk.

Comment 9 Noriko Hosoi 2005-11-23 23:58:27 UTC
Created attachment 121424 [details]
cvs diff config.c

The diff is on the trunk, but I think this is common among the older branches.
In the fix made in this bug report, the onwership is changed in
log__open_errorlogfile, which is called via slapd_bootstrap_config in main.  At
that time, localuser is not set yet.  The configuration parameter is set in
setup_internal_backends, which is called after slapd_bootstrap_config.

To make the localuser info available, we have to force to set localuser in
slapd_bootstrap_config.

Comment 10 Noriko Hosoi 2005-11-24 01:38:22 UTC
Created attachment 121433 [details]
cvs commit message

Reviewed by Nathan (Thank you!)

Checked in into Directory621RtmBranch, Directory71RtmBranch, and the trunk.

Comment 11 Orla Hegarty 2005-12-02 18:19:02 UTC
Verified the log file permissions and hang bugs as listed above are fixed
against 20051125.1.qa on RHEL AS 2.1, HP-UX 11i and Solaris 9

Steps to reproduce:
1. Modify the log rotation configuration to rotate the logs every 1 minutes and
restart the server to have this change take effect.
2. Constantly add the same attribute to the server which will fail as it already
exists - this will be logged to the error log
3. Constantly attempt to do an offline backup of the server which causes a
conflicting ns-slapd process to run for a short time and fail, logging to the
error log
4. Every 3 minutes or so attempt to restart the server

I have an automated script to do this, which I will attach. I will add this to a
written test case in the Creative Destruction test plan.

What I noticed from my test plan is that the hang and permissions problems with
the error logs are fixed. The script can easily reproduce this with DS 6.21 and
it is fixed on DS 6.21 SP 2.

However I noticed a much rarer case where the permissions are ok, the client
modification does not hang - clues that the bug occured but the server is hung.
It's a slightly different case. The searches work but the server is still hung
..... a restart attempt won't shut the server down, threads are left behind. 

I saw this once. 

Comment 14 Noriko Hosoi 2006-02-21 20:16:40 UTC
This bug goes into the DS7.1 SP2 as well as DS 6.21 SP3.
(The fix was checked in on 11/23/2005.)

Comment 15 Orla Hegarty 2006-02-25 02:03:57 UTC
Verified fixed in DS 7.21 SP 2 against all 5 supported platforms ( RHEL 3,4 and
Solaris 9 - 32 bit. Solaris 9 and HP-UX 11i - 64 bit )

Test case has been backported to the 7.1 patch branch tet.



Comment 16 Noriko Hosoi 2006-03-08 02:04:51 UTC
*DOCS*
When the Directory Server's effective user and the real user are different,
there was a small window that creates log files owned by the real user, which
should be owned by the effective user.  The problem was fixed.

Comment 17 To Ngan 2006-03-10 21:37:30 UTC
Verified against DS 6.21 SP3, ran creativeD CDLV_7 on all platforms for 2 hours+.

Comment 19 To Ngan 2006-12-13 00:11:21 UTC
*** Bug 174852 has been marked as a duplicate of this bug. ***

Comment 20 Chandrasekar Kannan 2008-08-11 23:42:41 UTC
Bug already CLOSED. setting screened+ flag