Bug 218076 - Server autoconf build crashes when stopped/started very quickly
Summary: Server autoconf build crashes when stopped/started very quickly
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: 389
Classification: Retired
Component: Database - General
Version: 1.0.4
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Rich Megginson
QA Contact: Viktor Ashirov
URL:
Whiteboard:
Depends On:
Blocks: 152373 240316 FDS1.1.0
 
Reported: 2006-12-01 18:11 UTC by Rich Megginson
Modified: 2015-12-07 17:17 UTC
CC List: 1 user

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-12-07 17:17:22 UTC
Embargoed:


Attachments
diffs (18.53 KB, patch) - 2006-12-01 18:32 UTC, Rich Megginson
cvs commit log (382 bytes, text/plain) - 2006-12-02 00:20 UTC, Rich Megginson

Description Rich Megginson 2006-12-01 18:11:05 UTC
During testing, I noticed that under certain conditions, the server would crash
during shutdown.  I was able to narrow it down to the server being started and
stopped very quickly.  The log messages would show that all of the database
threads had stopped, but the final "slapd stopped." message was not printed. 
The core file showed that the server had the main thread, the time thread, and
the checkpoint_threadmain thread, and the core dump had occurred deep in the
libdb code during the checkpoint operation, due to the dbenv and the private
structures being NULL.

It seems that the dblayer_pre_close code thinks all of the database threads have
stopped, but the checkpoint thread is still running.

This problem only seems to occur with the autoconf style build on RHEL4 x86_64,
which uses the system libdb-4.2 instead of the privately built one.
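
To make the suspected race concrete, here is a minimal sketch of the
problematic ordering (not the shipped code; the names checkpoint_threadmain,
dblayer_thread_count, and dblayer_stop_threads come from this report and the
comments below, while do_db_checkpoint and the rest are illustrative
assumptions):

/*
 * Sketch only: the checkpoint thread does a potentially slow checkpoint
 * *before* advertising itself in the shared thread count, so a quick
 * stop/start can see the count go 3 -> 0 and let shutdown proceed while
 * this thread is still deep inside libdb.
 */
#include "prtypes.h"
#include "pratom.h"

static PRInt32 dblayer_thread_count;  /* shared counter (name from this bug) */
static PRInt32 dblayer_stop_threads;  /* shutdown flag (name from this bug) */

/* hypothetical stand-in for the libdb checkpoint operation */
static void do_db_checkpoint(void) { /* txn_checkpoint(...) in real life */ }

static void
checkpoint_threadmain(void *param)
{
    /* With the system libdb-4.2 on x86_64 this first checkpoint can take
     * long enough that the other three threads start and stop first. */
    do_db_checkpoint();

    PR_AtomicIncrement(&dblayer_thread_count);  /* counted in too late */

    while (!dblayer_stop_threads) {
        do_db_checkpoint();  /* periodic checkpointing */
    }

    PR_AtomicDecrement(&dblayer_thread_count);
}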

Comment 1 Rich Megginson 2006-12-01 18:32:22 UTC
Created attachment 142601 [details]
diffs

Comment 2 Noriko Hosoi 2006-12-01 19:47:21 UTC
Hi Rich,

Your problem description makes sense.  Please help me understand your fixes.

1) dblayer_pre_close first gets PR_Lock(thread_count_lock) to set
dblayer_stop_threads = 1, then goes into a while loop.
2) The other threads see the dblayer_stop_threads flag and get out of their
main loops.  To decrement dblayer_thread_count, they try to acquire
thread_count_lock (DECR_THREAD_COUNT).  But since dblayer_pre_close already
holds it, they have to wait there.
3) PR_WaitCondVar in dblayer_pre_close times out, dblayer_pre_close breaks out
of the while loop, releases thread_count_lock, prints the "Timeout" message,
and returns.
4) The other threads acquire thread_count_lock one by one, decrement
dblayer_thread_count, and notify.

This scenario looks wrong. :)  Could you please tell me how this is avoided?

Comment 3 Rich Megginson 2006-12-01 20:12:53 UTC
(In reply to comment #2)
> Hi Rich,
> 
> Your problem description makes sense.  Please help me understand your fixes.
> 
> 1) dblayer_pre_close first gets PR_Lock(thread_count_lock) to set
> dblayer_stop_threads = 1, then goes into a while loop.
> 2) The other threads see the dblayer_stop_threads flag and get out of their
> main loops.  To decrement dblayer_thread_count, they try to acquire
> thread_count_lock (DECR_THREAD_COUNT).  But since dblayer_pre_close already
> holds it, they have to wait there.

Not exactly.  PR_WaitCondVar releases the lock while it waits and reacquires it
before it returns.  So the other threads are able to acquire the lock, update
the thread count, and release the lock while PR_WaitCondVar waits.  This is the
usual semantics of waiting on a condition variable.

> 3) PR_WaitCondVar in dblayer_pre_close times out, dblayer_pre_close breaks out
> of the while loop, releases thread_count_lock, prints the "Timeout" message,
> and returns.
> 4) The other threads acquire thread_count_lock one by one, decrement
> dblayer_thread_count, and notify.
> 
> This scenario looks wrong. :)  Could you please tell me how this is avoided?

Thanks to the way PR_WaitCondVar works, this won't happen.  What will happen is
that each thread will acquire the lock, decrement the thread count, and release
the lock.  When the last thread finishes and sets the thread count to 0, it
will call PR_NotifyCondVar, which will wake up the thread in PR_WaitCondVar as
soon as the lock is released.
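
For anyone following along, the behavior described above can be reproduced with
a small standalone NSPR program.  This is only an illustration of the
PR_WaitCondVar/PR_NotifyCondVar semantics, not the directory server code; the
names and the build command are my own (flags and paths may differ per system):

/* Minimal demo: the waiter holds the lock, but PR_WaitCondVar releases it
 * while blocked, so workers can grab the lock, decrement the count, and
 * notify; the waiter wakes once the count reaches zero.
 *
 * Build (may vary): cc demo.c -o demo $(pkg-config --cflags --libs nspr)
 */
#include <stdio.h>
#include "nspr.h"

#define NTHREADS 4

static PRLock    *count_lock;
static PRCondVar *count_cv;
static int        thread_count;

static void
worker(void *arg)
{
    /* Simulate some shutdown work, then check out. */
    PR_Sleep(PR_MillisecondsToInterval(100));

    PR_Lock(count_lock);            /* succeeds while the waiter is in PR_WaitCondVar */
    thread_count--;
    if (thread_count == 0)
        PR_NotifyCondVar(count_cv); /* wake the waiter once everyone is done */
    PR_Unlock(count_lock);
}

int
main(void)
{
    PRThread *threads[NTHREADS];
    int i;

    count_lock = PR_NewLock();
    count_cv   = PR_NewCondVar(count_lock);
    thread_count = NTHREADS;

    for (i = 0; i < NTHREADS; i++) {
        threads[i] = PR_CreateThread(PR_USER_THREAD, worker, NULL,
                                     PR_PRIORITY_NORMAL, PR_GLOBAL_THREAD,
                                     PR_JOINABLE_THREAD, 0);
    }

    PR_Lock(count_lock);
    while (thread_count > 0) {
        /* PR_WaitCondVar releases count_lock while waiting and reacquires
         * it before returning, so the workers never block on it for long. */
        PR_WaitCondVar(count_cv, PR_SecondsToInterval(30));
    }
    PR_Unlock(count_lock);

    printf("all %d worker threads checked out\n", NTHREADS);

    for (i = 0; i < NTHREADS; i++)
        PR_JoinThread(threads[i]);

    PR_Cleanup();
    return 0;
}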

Comment 4 Noriko Hosoi 2006-12-01 21:24:48 UTC
Thanks for the explanation.  Approved!

Comment 5 Rich Megginson 2006-12-02 00:20:55 UTC
Created attachment 142642 [details]
cvs commit log

Reviewed by: nhosoi (Thanks!)
Branch: HEAD
Files: see diffs
Fix Description: The problem was that 3 of the database threads were being
started and stopped before the 4th had a chance to start.  So the thread count
would start at 3 and drop to 0, and the dblayer_pre_close code would think
everything was fine.  The 4th thread is the checkpoint thread which was doing a
db_checkpoint operation before incrementing the thread count.  For some reason,
on x86_64 with the system provided libdb-4.2, the checkpoint operation was
taking longer than it usually does with our locally built libdb-4.2, so this
allowed the other 3 threads to stop and start before the checkpoint thread had
a chance to increment the thread count.
The solution is to make sure the incrementing of the thread count occurs as
early as possible in the thread function, before any executable code that might
take any time.  This should ensure that all of the threads start up and
increment the thread count before the shutdown occurs.
The second part of the solution is that, according to wtc, the NSPR maintainer,
the PR_Atomic functions should not be used as a semaphore like this.  So, the
code was rewritten to use locks and condition variables.  The code is not
performance critical, so adding locking should not have any impact on
performance.  In addition, the new code is much cleaner, more correct, and more
obvious about what it's doing.
Platforms tested: RHEL4 x86_64
Flag Day: no
Doc impact: no
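
For reference, the shape of the fix, as a sketch under the same assumptions as
the earlier one.  This is not the committed diff (see the attached patch and
cvs commit log for the real changes); the macro names INCR_THREAD_COUNT and
DECR_THREAD_COUNT come from comment 2, everything else is illustrative:

#include "nspr.h"

static PRLock    *thread_count_lock;
static PRCondVar *thread_count_cv;
static int        dblayer_thread_count;
static int        dblayer_stop_threads;

/* count the calling thread in/out; the last one out notifies the waiter */
#define INCR_THREAD_COUNT() do {               \
        PR_Lock(thread_count_lock);            \
        dblayer_thread_count++;                \
        PR_Unlock(thread_count_lock);          \
    } while (0)

#define DECR_THREAD_COUNT() do {               \
        PR_Lock(thread_count_lock);            \
        if (--dblayer_thread_count == 0)       \
            PR_NotifyCondVar(thread_count_cv); \
        PR_Unlock(thread_count_lock);          \
    } while (0)

static void
dblayer_threads_init(void)
{
    thread_count_lock = PR_NewLock();
    thread_count_cv   = PR_NewCondVar(thread_count_lock);
}

/* hypothetical stand-in for the libdb checkpoint operation */
static void do_db_checkpoint(void) { }

static void
checkpoint_threadmain(void *param)
{
    /* Part 1 of the fix: count this thread in before any work that can
     * take time, so a fast shutdown always sees it. */
    INCR_THREAD_COUNT();

    while (!dblayer_stop_threads) {
        do_db_checkpoint();
    }

    DECR_THREAD_COUNT();
}

/* shutdown side: wait (bounded) for every counted thread to check out */
static void
dblayer_pre_close(void)
{
    int remaining;
    int tries = 0;

    PR_Lock(thread_count_lock);
    dblayer_stop_threads = 1;
    while (dblayer_thread_count > 0 && tries++ < 30) {
        /* the wait releases thread_count_lock while blocked, so the
         * threads above can still take it and run DECR_THREAD_COUNT */
        PR_WaitCondVar(thread_count_cv, PR_SecondsToInterval(1));
    }
    remaining = dblayer_thread_count;
    PR_Unlock(thread_count_lock);

    if (remaining > 0) {
        /* the "Timeout" case discussed in comments 2 and 3 */
    }
}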

