Bug 742582

Summary: ns-slapd hangs in db functions under load
Product: [Retired] 389
Component: Directory Server
Version: 1.2.9
Hardware: Unspecified
OS: Unspecified
Status: CLOSED WORKSFORME
Severity: unspecified
Priority: unspecified
Reporter: Nathan Kinder <nkinder>
Assignee: Nathan Kinder <nkinder>
QA Contact: Chandrasekar Kannan <ckannan>
CC: andrey.ivanov, benl, nhosoi, rmeggins
Doc Type: Bug Fix
Clones: 766321 (view as bug list)
Last Closed: 2012-03-28 18:07:37 UTC
Bug Blocks: 743970, 766321    
Attachments:
  Stack trace
  db_stat output
  Test Scripts
  db_dump output of memberOf.db4
  Updated test scripts

Description Nathan Kinder 2011-09-30 16:01:01 UTC
Created attachment 525795 [details]
Stack trace

The FreeIPA development team reported an issue with ns-slapd hanging under load from IPA (https://fedorahosted.org/freeipa/ticket/1885).  I am able to reproduce this issue on F15 x86_64.

The hang looks to be some sort of deadlock within libdb.  When the hang occurs, the ns-slapd process goes to 100% CPU and makes no progress.  I built debug versions of 389-ds-base and db4 and was able to get some interesting stack traces from three threads that seem to contribute to the problem.  One thread is writing to the memberOf index db, another is reading from the memberOf index db, and the checkpoint thread is attempting to perform a checkpoint.  All three threads appear to be yielding or waiting on a mutex.

The db_stat tool doesn't show anything waiting for locks, so it seems to be something more internal to libdb.

Comment 1 Nathan Kinder 2011-09-30 16:27:40 UTC
Created attachment 525797 [details]
db_stat output

Comment 2 Nathan Kinder 2011-09-30 17:37:45 UTC
In the debugger, I see the checkpoint thread looping over this section of code in __memp_sync_int():

--------------------------------------------------------------------------
    for (i = wrote_cnt = 0, remaining = ar_cnt; remaining > 0; ++i) {
        if (i >= ar_cnt) {
            i = 0;
            __os_yield(env, 1, 0);
        }

        if ((hp = bharray[i].track_hp) == NULL)
            continue;
--------------------------------------------------------------------------

The track_hp pointer is NULL for every slot in the array that is checked, so we go through the entire array and then end up calling __os_yield() to allow other threads to do some work.  The array contents never change, though, so this thread loops endlessly.
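
To make the livelock concrete, here is a reduced, self-contained sketch of the pattern (a simplification for illustration, not the actual libdb code): with every track_hp slot NULL, remaining is never decremented, so each pass over the array just ends in a yield and starts over.  The pass counter is added only so the demo terminates instead of spinning forever.

--------------------------------------------------------------------------
/* Reduced sketch of the __memp_sync_int() livelock described above. */
#include <sched.h>
#include <stddef.h>
#include <stdio.h>

struct bh_track { void *track_hp; };

static void sync_loop(struct bh_track *bharray, int ar_cnt, int remaining)
{
    int i, passes = 0;

    for (i = 0; remaining > 0; ++i) {
        if (i >= ar_cnt) {
            i = 0;
            sched_yield();          /* stands in for __os_yield() */
            if (++passes == 5) {    /* bail out so the demo terminates */
                printf("no progress after %d full passes; "
                       "the real loop would spin forever\n", passes);
                return;
            }
        }
        if (bharray[i].track_hp == NULL)
            continue;               /* remaining is never decremented */
        /* ... the real code would flush the buffer and --remaining ... */
    }
}

int main(void)
{
    struct bh_track arr[4] = { { NULL }, { NULL }, { NULL }, { NULL } };

    sync_loop(arr, 4, 4);
    return 0;
}
--------------------------------------------------------------------------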

The writer thread loops constantly over this section of code:

--------------------------------------------------------------------------
    for (nspins =
        mtxregion->stat.st_mutex_tas_spins; nspins > 0; --nspins) {
            dbenv = env->dbenv;

            /*
             * Avoid interlocked instructions until they're likely to
             * succeed by first checking whether it is held
             */
            if (MUTEXP_IS_BUSY(mutexp) || !MUTEXP_ACQUIRE(mutexp)) {
                    if (F_ISSET(dbenv, DB_ENV_FAILCHK) &&
                        ip == NULL && dbenv->is_alive(dbenv,
                        mutexp->pid, mutexp->tid, 0) == 0) {
                            ret = __env_set_state(env, &ip, THREAD_VERIFY);
                            if (ret != 0 ||
                                ip->dbth_state == THREAD_FAILCHK)
                                    return (DB_RUNRECOVERY);
                    }
                    if (nowait)
                            return (DB_LOCK_NOTGRANTED);
                    /*
                     * Some systems (notably those with newer Intel CPUs)
                     * need a small pause here. [#6975]
                     */
                    MUTEX_PAUSE
                    continue;
--------------------------------------------------------------------------

The MUTEXP_IS_BUSY() check is always true, which causes us to hit the continue statement, jumping back to the top of the for loop.  When nspins reaches 0, an attempt is made to acquire the mutex, which fails, causing us to go back into the for loop again.

All of this seems to point to a mutex that is leaked somewhere.
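
The same test-and-set spin can be shown in miniature with C11 atomics (a stand-in for the MUTEXP macros, not the real implementation): if the holder never releases the flag (a "leaked" mutex), every spin budget is exhausted and the acquirer never gets in.

--------------------------------------------------------------------------
/* Reduced sketch of a test-and-set spin against a never-released mutex. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_flag mutex = ATOMIC_FLAG_INIT;

/* Returns true on acquire, false once the spin budget is exhausted
 * (the real code then yields and retries, as seen in the traces). */
static bool tas_lock(int nspins)
{
    for (; nspins > 0; --nspins) {
        /* test_and_set plays the role of MUTEXP_ACQUIRE(); it returns
         * the previous value, so "true" means the mutex was busy. */
        if (!atomic_flag_test_and_set(&mutex))
            return true;
    }
    return false;
}

int main(void)
{
    atomic_flag_test_and_set(&mutex);   /* simulate the leaked holder */

    if (!tas_lock(100))
        printf("mutex still busy after spin budget; "
               "would yield and spin again\n");
    return 0;
}
--------------------------------------------------------------------------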

Comment 3 Noriko Hosoi 2011-09-30 20:13:49 UTC
(In reply to comment #1)
> Created attachment 525797 [details]
> db_stat output

A list of the status strings that can appear in the db_stat -C output:
		status = "ABORT";
		status = "EXPIRED";
		status = "FREE";
		status = "HELD";
		status = "PENDING";
		status = "WAIT";
		status = "UNKNOWN";

Interestingly, there's neither "WAIT" nor "PENDING" in the attached output...

Comment 4 Noriko Hosoi 2011-10-05 18:50:22 UTC
Next time this happens, could you please get the page number __memp_fget is trying to fget?

(gdb) thread 14 (of course, please replace the number with your latest...)
(gdb) frame 4 (ditto)
(gdb) p *pgnoaddr

Thread 14 (Thread 0x7f042d7f2700 (LWP 3013)):
#0  0x0000003dd00c7b57 in sched_yield () from /lib64/libc.so.6
#1  0x00007f044afea5d0 in __os_yield (env=0xec71e0, secs=0, usecs=0) at ../../os/os_yield.c:54
#2  0x00007f044ae6bf10 in __db_tas_mutex_lock_int (env=0xec71e0, mutex=2156, nowait=0)
    at ../../mutex/mut_tas.c:215
#3  0x00007f044ae6bfee in __db_tas_mutex_lock (env=0xec71e0, mutex=2156)
    at ../../mutex/mut_tas.c:248
#4  0x00007f044afcff2f in __memp_fget (dbmfp=0x7f041c011130, pgnoaddr=0x7f042d7e45b8, ip=0x0, 

Also, could you get the output from db_stat?  I'm curious which page the memberOf mod op is waiting for...

Comment 5 Nathan Kinder 2011-10-07 19:48:53 UTC
Created attachment 526951 [details]
Test Scripts

This is a set of test scripts that can be used to reproduce the issue with a standalone 389 DS instance.

Comment 6 Nathan Kinder 2011-10-07 20:22:08 UTC
Here is what the pages look like in the memberOf index db file:

1 (btree internal)
-->2 (btree leaf)
-->3 (btree leaf)
-->4 (btree leaf)
----->7 (btree internal)
-------->8 (duplicate)
-------->9 (duplicate)
-->5 (btree leaf)
----->6 (btree internal)
-------->10 (duplicate)
-------->11 (duplicate)

From what I can see in the debugger, the write thread has the following locks on pages:

  read locks = 4
  write locks = 4, 5, 7, 9

The read thread has the following lock:

  read lock = 11

The deadlock is happening because the read thread is requesting a read lock on page 5, which is blocked since the write thread already has a write lock on page 5.  The write thread is attempting to get a write lock on page 11, which is blocked since the read thread already holds a read lock on page 11.
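
This is the classic ABBA pattern.  A reduced pthreads sketch of the cycle (page numbers are just labels taken from the analysis above, not the actual DS/libdb locking code); it uses trylock with a bounded retry so the demo reports the cycle instead of hanging:

--------------------------------------------------------------------------
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t page5  = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t page11 = PTHREAD_MUTEX_INITIALIZER;

/* Hold locks[0], then try for locks[1] a bounded number of times. */
static void *worker(void *arg)
{
    pthread_mutex_t **locks = arg;
    int attempts;

    pthread_mutex_lock(locks[0]);
    usleep(100000);                 /* let the peer take its first lock */
    for (attempts = 0; attempts < 50; attempts++) {
        if (pthread_mutex_trylock(locks[1]) == 0) {
            pthread_mutex_unlock(locks[1]);
            pthread_mutex_unlock(locks[0]);
            return NULL;
        }
        usleep(10000);
    }
    printf("gave up: holding one page lock while the peer holds the other\n");
    pthread_mutex_unlock(locks[0]);
    return NULL;
}

int main(void)
{
    /* "write thread": holds page 5, wants page 11 */
    pthread_mutex_t *writer_order[2] = { &page5, &page11 };
    /* "read thread": holds page 11, wants page 5 */
    pthread_mutex_t *reader_order[2] = { &page11, &page5 };
    pthread_t writer, reader;

    pthread_create(&writer, NULL, worker, writer_order);
    pthread_create(&reader, NULL, worker, reader_order);
    pthread_join(writer, NULL);
    pthread_join(reader, NULL);
    return 0;
}
--------------------------------------------------------------------------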

What is interesting is that db_stat does not show the read lock that the read thread has on page 11.  It also does not list the write locks held by the write thread on pages 7 and 9 (though it seems to list this transaction as having 3 write locks on page 4).

It seems that no deadlock detection is occurring within Berkeley DB for duplicate pages.
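
For reference, an application normally arms Berkeley DB's deadlock detector through set_lk_detect when creating the environment; a minimal sketch, assuming a standard transactional environment (the /tmp/bdb-env path is hypothetical).  The point of the observation above is that the detector apparently never sees a cycle that involves the duplicate pages.

--------------------------------------------------------------------------
#include <db.h>
#include <stdio.h>

int main(void)
{
    DB_ENV *dbenv;
    int ret;

    if ((ret = db_env_create(&dbenv, 0)) != 0) {
        fprintf(stderr, "db_env_create: %s\n", db_strerror(ret));
        return 1;
    }

    /* Run the deadlock detector whenever a lock conflict occurs. */
    if ((ret = dbenv->set_lk_detect(dbenv, DB_LOCK_DEFAULT)) != 0) {
        fprintf(stderr, "set_lk_detect: %s\n", db_strerror(ret));
        return 1;
    }

    if ((ret = dbenv->open(dbenv, "/tmp/bdb-env",
        DB_CREATE | DB_INIT_LOCK | DB_INIT_LOG |
        DB_INIT_MPOOL | DB_INIT_TXN, 0)) != 0) {
        fprintf(stderr, "env open: %s\n", db_strerror(ret));
        return 1;
    }

    /* A detector pass can also be forced by hand: */
    if ((ret = dbenv->lock_detect(dbenv, 0, DB_LOCK_DEFAULT, NULL)) != 0)
        fprintf(stderr, "lock_detect: %s\n", db_strerror(ret));

    dbenv->close(dbenv, 0);
    return 0;
}
--------------------------------------------------------------------------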

Comment 7 Nathan Kinder 2011-10-07 20:25:58 UTC
Created attachment 526956 [details]
db_dump output of memberOf.db4

This is the db_dump output of the memberOf.db4 index file.  It shows the page layout.

Comment 8 Nathan Kinder 2011-10-07 20:49:31 UTC
I have posted a question about this issue on the Oracle Berkeley DB forums, as it looks more and more like a problem inside libdb:

https://forums.oracle.com/forums/thread.jspa?threadID=2292442

Comment 9 Martin Kosek 2012-01-04 13:20:38 UTC
Upstream ticket:
https://fedorahosted.org/389/ticket/25

Comment 10 Rich Megginson 2012-01-10 17:37:14 UTC
Marking as screened because it has been cloned upstream.

Comment 11 Nathan Kinder 2012-02-15 19:29:22 UTC
Created attachment 562317 [details]
Updated test scripts

This is an update to the previous test scripts that can be used to reproduce the issue.

Since the issue is timing related, it sometimes takes multiple test runs to reproduce.  This new set of scripts automatically runs the test multiple times to attempt to trigger the hang.  DS instance creation and cleanup are now done automatically by the scripts at the beginning and end of each test run.  See the README file included in the test script tarball for details.

Comment 12 Noriko Hosoi 2012-03-28 18:07:37 UTC
Changing the status to WORKSFORME.

See also:
https://fedorahosted.org/389/ticket/25