Bug 218076

Summary: Server autoconf build crashes when stopped/started very quickly
Product: [Retired] 389 Reporter: Rich Megginson <rmeggins>
Component: Database - General    Assignee: Rich Megginson <rmeggins>
Status: CLOSED CURRENTRELEASE QA Contact: Viktor Ashirov <vashirov>
Severity: medium Docs Contact:
Priority: medium    
Version: 1.0.4    CC: nhosoi
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-12-07 17:17:22 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 152373, 240316, 427409    
Attachments:
  diffs (flags: none)
  cvs commit log (flags: none)

Description Rich Megginson 2006-12-01 18:11:05 UTC
During testing, I noticed that under certain conditions, the server would crash
during shutdown.  I was able to narrow it down to the server being started and
stopped very quickly.  The log messages would show that all of the database
threads had stopped, but the final "slapd stopped." message was not printed. 
The core file showed that the server still had the main thread, the time
thread, and the checkpoint_threadmain thread, and that the crash had occurred
deep in the libdb code during the checkpoint operation because the dbenv and
its private structures were NULL.

It seems that dblayer_pre_close thinks all of the database threads have
stopped, but the checkpoint thread is still running.

This problem only seems to occur with the autoconf style build on RHEL4 x86_64,
which uses the system libdb-4.2 instead of the privately built one.

Comment 1 Rich Megginson 2006-12-01 18:32:22 UTC
Created attachment 142601 [details]
diffs

Comment 2 Noriko Hosoi 2006-12-01 19:47:21 UTC
Hi Rich,

Your problem description makes sense.  Please help me understand your fixes.

1) dblayer_pre_close first gets PR_Lock(thread_count_lock) to set
dblayer_stop_threads = 1, then goes into a while loop.
2) The other threads see the dblayer_stop_threads flag and get out from their
main loop.  And to decrease the dblayer_thread_count, they try to acquire
thread_count_lock (DECR_THREAD_COUNT).  But since dblayer_pre_close already has
it, they have to wait there.
3) PR_WaitCondVar in dblayer_pre_close times out and "break" from the while loop
and releases the thread_count_lock, then prints out the "Timeout" message and
returns.
4) The other threads acquire thread_count_lock one by one, decrease the
dblayer_thread_count, and notify.

This scenario looks wrong. :)  Could you please tell me how this is avoided?

Comment 3 Rich Megginson 2006-12-01 20:12:53 UTC
(In reply to comment #2)
> Hi Rich,
> 
> Your problem description makes sense.  Please help me understand your fixes.
> 
> 1) dblayer_pre_close first gets PR_Lock(thread_count_lock) to set
> dblayer_stop_threads = 1, then goes into a while loop.
> 2) The other threads see the dblayer_stop_threads flag and get out from their
> main loop.  And to decrease the dblayer_thread_count, they try to acquire
> thread_count_lock (DECR_THREAD_COUNT).  But since dblayer_pre_close already has
> it, they have to wait there.

Not exactly.  PR_WaitCondVar releases the lock while it waits and reacquires
it before it returns.  So the other threads are able to acquire the lock,
update the thread count, and release the lock while PR_WaitCondVar waits.
This is the usual semantics of waiting on a condition variable.
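
For illustration, here is a minimal sketch of the shutdown/wait side of that
pattern using the NSPR lock and condition-variable API; the identifiers are
illustrative, not the actual dblayer code:

  #include <nspr.h>

  static PRLock    *thread_count_lock;   /* PR_NewLock() at startup */
  static PRCondVar *thread_count_cv;     /* PR_NewCondVar(thread_count_lock) */
  static PRInt32    thread_count = 0;
  static int        stop_threads = 0;

  /* shutdown path, shaped roughly like dblayer_pre_close */
  static void wait_for_db_threads(void)
  {
      PR_Lock(thread_count_lock);
      stop_threads = 1;
      while (thread_count > 0) {
          /* PR_WaitCondVar releases thread_count_lock while this thread
           * sleeps and reacquires it before returning, so the worker
           * threads can take the lock and decrement the count meanwhile.
           * (The real code also bounds the wait and logs a timeout.) */
          PR_WaitCondVar(thread_count_cv, PR_SecondsToInterval(1));
      }
      PR_Unlock(thread_count_lock);
  }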

> 3) PR_WaitCondVar in dblayer_pre_close times out and "break" from the while loop
> and releases the thread_count_lock, then prints out the "Timeout" message and
> returns.
> 4) The other threads acquire thread_count_lock one by one, decrease the
> dblayer_thread_count, and notify.
> 
> This scenario looks wrong. :)  Could you please tell me how this is avoided?

Thanks to the way WaitCondVar works, this won't happen.  What will happen is
that each thread will acquire the lock, decrement the thread count, and release
the lock.  When the last thread finishes and drops the thread count to 0, it
will call NotifyCondVar, which will wake up the waiter in WaitCondVar as soon
as the lock is released.
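
And the matching exit path each worker thread would run (again an illustrative
sketch, not the actual DECR_THREAD_COUNT macro):

  /* each worker thread runs this just before its thread function returns */
  static void db_thread_exiting(void)
  {
      PR_Lock(thread_count_lock);
      thread_count--;
      if (thread_count == 0) {
          /* last one out wakes the waiter in PR_WaitCondVar; it actually
           * resumes only once we drop the lock below */
          PR_NotifyCondVar(thread_count_cv);
      }
      PR_Unlock(thread_count_lock);
  }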

Comment 4 Noriko Hosoi 2006-12-01 21:24:48 UTC
Thanks for the explanation.  Approved!

Comment 5 Rich Megginson 2006-12-02 00:20:55 UTC
Created attachment 142642 [details]
cvs commit log

Reviewed by: nhosoi (Thanks!)
Branch: HEAD
Files: see diffs
Fix Description: The problem was that 3 of the database threads were being
started and stopped before the 4th had a chance to start.  So the thread count
would start at 3 and drop to 0, and the dblayer_pre_close code would think
everything was fine.  The 4th thread is the checkpoint thread, which was doing
a db_checkpoint operation before incrementing the thread count.  For some
reason, on x86_64 with the system-provided libdb-4.2, the checkpoint operation
was taking longer than it usually does with our locally built libdb-4.2, which
allowed the other 3 threads to start and stop before the checkpoint thread had
a chance to increment the thread count.
The solution is to make sure the thread count is incremented as early as
possible in the thread function, before any executable code that might take
significant time.  This should ensure that all of the threads start up and
increment the thread count before the shutdown occurs.
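
For illustration, a sketch of the ordering the fix enforces in the thread
function; the body and helper names are illustrative (building on the sketches
in comment 3), not the actual checkpoint_threadmain code:

  static void checkpoint_threadmain(void *param)
  {
      (void)param;

      /* register this thread before doing anything that can take time,
       * in particular before the first (potentially slow) checkpoint, so
       * shutdown never sees a transient thread count of 0 while this
       * thread is still alive */
      PR_Lock(thread_count_lock);
      thread_count++;
      PR_Unlock(thread_count_lock);

      while (!stop_threads) {
          /* ... periodic db_checkpoint work ... */
          PR_Sleep(PR_SecondsToInterval(1));
      }

      db_thread_exiting();   /* decrement and notify, as sketched above */
  }
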
The second part of the solution is that, according to wtc, the NSPR maintainer,
the PR_Atomic functions should not be used as a semaphore like this.  So, the
code was rewritten to use locks and condition variables.  The code is not
performance critical, so adding locking should not have any impact on
performance.  In addition, the new code is much cleaner, more correct, and more
obvious about what it's doing.
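
For contrast, an illustrative sketch of what the atomics-based approach amounts
to: PR_AtomicIncrement/PR_AtomicDecrement keep the counter itself consistent,
but there is nothing to wait on, so the shutdown side can only poll (this is a
sketch, not the removed code):

  /* workers bump the counter with PR_AtomicIncrement(&old_count) on entry
   * and PR_AtomicDecrement(&old_count) on exit; with only an atomic
   * counter there is no condition variable to wait on, so shutdown polls */
  static PRInt32 old_count = 0;

  static void old_style_wait(void)
  {
      while (old_count > 0) {
          PR_Sleep(PR_MillisecondsToInterval(100));
      }
  }
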
Platforms tested: RHEL4 x86_64
Flag Day: no
Doc impact: no