Bug 218076
Summary: | Server autoconf build crashes when stopped/started very quickly | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | [Retired] 389 | Reporter: | Rich Megginson <rmeggins> | ||||||
Component: | Database - General | Assignee: | Rich Megginson <rmeggins> | ||||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | Viktor Ashirov <vashirov> | ||||||
Severity: | medium | Docs Contact: | |||||||
Priority: | medium | ||||||||
Version: | 1.0.4 | CC: | nhosoi | ||||||
Target Milestone: | --- | ||||||||
Target Release: | --- | ||||||||
Hardware: | x86_64 | ||||||||
OS: | Linux | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | Environment: | ||||||||
Last Closed: | 2015-12-07 17:17:22 UTC | Type: | --- | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Bug Depends On: | |||||||||
Bug Blocks: | 152373, 240316, 427409 | ||||||||
Attachments: |
|
Description
Rich Megginson
2006-12-01 18:11:05 UTC
Created attachment 142601 [details]
diffs
Hi Rich, Your problem description makes sense. Please help me understanding your fixes. 1) dblayer_pre_close first gets PR_Lock(thread_count_lock) to set dblayer_stop_threads = 1, then goes into a while loop. 2) The other threads see the dblayer_stop_threads flag and get out from their main loop. And to decrease the dblayer_thread_count, they try to acquire thread_count_lock (DECR_THREAD_COUNT). But since dblayer_pre_close already has it, they have to wait there. 3) PR_WaitCondVar in dblayer_pre_close times out and "break" from the while loop and releases the thread_count_lock, then prints out the "Timeout" message and returns. 4) The other threads acquire thread_count_lock one by one and decreases the dblayer_thread_count and notifies. This scenario looks wrong. :) Could you please tell me how this is avoided? (In reply to comment #2) > Hi Rich, > > Your problem description makes sense. Please help me understanding your fixes. > > 1) dblayer_pre_close first gets PR_Lock(thread_count_lock) to set > dblayer_stop_threads = 1, then goes into a while loop. > 2) The other threads see the dblayer_stop_threads flag and get out from their > main loop. And to decrease the dblayer_thread_count, they try to acquire > thread_count_lock (DECR_THREAD_COUNT). But since dblayer_pre_close already has > it, they have to wait there. Not exactly. PR_WaitCondVar releases the lock and reacquires it when it returns. So the other threads are able to acquire the lock, update the thread count, and release the lock while PR_WaitCondVar waits. This is the usual semantics of waiting on a condition variable. > 3) PR_WaitCondVar in dblayer_pre_close times out and "break" from the while loop > and releases the thread_count_lock, then prints out the "Timeout" message and > returns. > 4) The other threads acquire thread_count_lock one by one and decreases the > dblayer_thread_count and notifies. > > This scenario looks wrong. :) Could you please tell me how this is avoided? Thanks to the way WaitCondVar works, this won't happen. What will happen is that each thread will acquire the lock, decrement the thread count, and release the lock. When the last thread is finished, and sets the thread count == 0, it will NotifyCondVar, which will wake up WaitCondVar as soon as the lock is released. Thanks for the explanation. Approved! Created attachment 142642 [details]
cvs commit log
Reviewed by: nhosoi (Thanks!)
Branch: HEAD
Files: see diffs
Fix Description: The problem was that 3 of the database threads were being
started and stopped before the 4th had a chance to start. So the thread count
would start at 3 and drop to 0, and the dblayer_pre_close code would think
everything was fine. The 4th thread is the checkpoint thread which was doing a
db_checkpoint operation before incrementing the thread count. For some reason,
on x86_64 with the system provided libdb-4.2, the checkpoint operation was
taking longer than it usually does with our locally built libdb-4.2, so this
allowed the other 3 threads to stop and start before the checkpoint thread had
a chance to increment the thread count.
The solution is to make sure the incrementing of the thread count occurs as
early as possible in the thread function, before any executable code that might
take any time. This should ensure that all of the threads start up and
increment the thread count before the shutdown occurs.
The second part of the solution is that, according to wtc, the NSPR maintainer,
the PR_Atomic functions should not be used as a semaphore like this. So, the
code was rewritten to use locks and condition variables. The code is not
performance critical, so adding locking should not have any impact on
performance. In addition, the new code is much cleaner, more correct, and more
obvious about what it's doing.
Platforms tested: RHEL4 x86_64
Flag Day: no
Doc impact: no
|