Bug 766321

Summary: ns-slapd hangs in db functions under load
Product: Red Hat Enterprise Linux 6
Reporter: Dmitri Pal <dpal>
Component: 389-ds-base
Assignee: Rich Megginson <rmeggins>
Status: CLOSED WORKSFORME
QA Contact: IDM QE LIST <seceng-idm-qe-list>
Severity: unspecified
Priority: unspecified
Version: 6.0
CC: andrey.ivanov, jgalipea, nhosoi, nkinder, rmeggins
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Clone Of: 742582
Last Closed: 2012-06-27 18:11:44 UTC
Bug Depends On: 742582

Description Dmitri Pal 2011-12-11 17:01:24 UTC
+++ This bug was initially created as a clone of Bug #742582 +++

Created attachment 525795 [details]
Stack trace

The FreeIPA development team reported an issue with ns-slapd hanging under load from IPA (https://fedorahosted.org/freeipa/ticket/1885).  I am able to reproduce this issue on F15 x86_64.

The hang looks to be some sort of deadlock within libdb.  When the hang occurs, the ns-slapd process goes to 100% CPU and makes no progress.  I built debug versions of 389-ds-base and db4, and was able to get some interesting stack traces from three threads that seem to contribute to this problem: one thread doing a write to the memberOf index db, another thread doing a read from the memberOf index db, and the checkpoint thread attempting to do a checkpoint.  These threads all seem to be yielding or waiting for a mutex.

The db_stat tool doesn't show anything waiting for locks, so it seems to be something more internal to libdb.
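
(For reference, the lock table can be dumped with something along the lines of "db_stat -CA -h <env home>"; the -CA option requests all lock-region information, and the environment home path is instance-specific.)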

--- Additional comment from nkinder on 2011-09-30 12:27:40 EDT ---

Created attachment 525797 [details]
db_stat output

--- Additional comment from nkinder on 2011-09-30 13:37:45 EDT ---

In the debugger, I see the checkpoint thread looping over this section of code in __memp_sync_int():

--------------------------------------------------------------------------
    for (i = wrote_cnt = 0, remaining = ar_cnt; remaining > 0; ++i) {
        if (i >= ar_cnt) {
            i = 0;
            __os_yield(env, 1, 0);
        }

        if ((hp = bharray[i].track_hp) == NULL)
            continue;
--------------------------------------------------------------------------

The track_hp pointer is NULL for every slot in the array that is checked, so we go through the entire array and then end up calling __os_yield() to allow other threads to do some work.  The array contents never change, though, so this thread will loop endlessly.
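
For illustration, here is a minimal standalone sketch (not the libdb source) of the livelock: remaining is only decremented when a slot is actually processed, so if every track_hp is NULL the loop just yields and retries against unchanged state:

--------------------------------------------------------------------------
#include <sched.h>

struct bh_track { void *track_hp; };

/* Simplified model of the __memp_sync_int() loop quoted above. */
static void
checkpoint_loop(struct bh_track *bharray, int ar_cnt)
{
    int i, remaining;

    for (i = 0, remaining = ar_cnt; remaining > 0; ++i) {
        if (i >= ar_cnt) {
            i = 0;
            sched_yield();  /* stand-in for __os_yield(env, 1, 0) */
        }

        /* With every track_hp NULL, we never reach the code that
         * flushes a buffer and decrements remaining, so the loop
         * condition stays true forever. */
        if (bharray[i].track_hp == NULL)
            continue;

        /* ... flush the buffer for this slot, then: */
        --remaining;
    }
}
--------------------------------------------------------------------------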

The writer thread loops constantly over this section of code:

--------------------------------------------------------------------------
    for (nspins =
        mtxregion->stat.st_mutex_tas_spins; nspins > 0; --nspins) {
            dbenv = env->dbenv;

            /*
             * Avoid interlocked instructions until they're likely to
             * succeed by first checking whether it is held
             */
            if (MUTEXP_IS_BUSY(mutexp) || !MUTEXP_ACQUIRE(mutexp)) {
                    if (F_ISSET(dbenv, DB_ENV_FAILCHK) &&
                        ip == NULL && dbenv->is_alive(dbenv,
                        mutexp->pid, mutexp->tid, 0) == 0) {
                            ret = __env_set_state(env, &ip, THREAD_VERIFY);
                            if (ret != 0 ||
                                ip->dbth_state == THREAD_FAILCHK)
                                    return (DB_RUNRECOVERY);
                    }
                    if (nowait)
                            return (DB_LOCK_NOTGRANTED);
                    /*
                     * Some systems (notably those with newer Intel CPUs)
                     * need a small pause here. [#6975]
                     */
                    MUTEX_PAUSE
                    continue;
--------------------------------------------------------------------------

MUTEXP_IS_BUSY() is always true, which causes us to hit the continue statement, jumping back to the top of the for loop.  When nspins reaches 0, an attempt is made to acquire the mutex, which fails, causing us to go back to the for loop again.

All of this seems to point to a mutex that is leaked somewhere.
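
(One way to confirm who owns the leaked mutex, using the pid/tid fields visible in the excerpt above, would be to print them from the spinning thread's __db_tas_mutex_lock_int frame:

(gdb) p mutexp->pid
(gdb) p mutexp->tid

If no live thread matches that pid/tid, the mutex really was leaked.)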

--- Additional comment from nhosoi on 2011-09-30 16:13:49 EDT ---

(In reply to comment #1)
> Created attachment 525797 [details]
> db_stat output

A list of the status strings that can appear in the db_stat -C output:
		status = "ABORT";
		status = "EXPIRED";
		status = "FREE";
		status = "HELD";
		status = "PENDING";
		status = "WAIT";
		status = "UNKNOWN";

Interestingly, there's neither "WAIT" nor "PENDING" in the attached output...

--- Additional comment from nhosoi on 2011-10-05 14:50:22 EDT ---

Next time this happens, could you please get the page number __memp_fget is trying to fget?

(gdb) thread 14 (of course, please replace the number with your latest...)
(gdb) frame 4 (ditto)
(gdb) p *pgnoaddr

Thread 14 (Thread 0x7f042d7f2700 (LWP 3013)):
#0  0x0000003dd00c7b57 in sched_yield () from /lib64/libc.so.6
#1  0x00007f044afea5d0 in __os_yield (env=0xec71e0, secs=0, usecs=0) at ../../os/os_yield.c:54
#2  0x00007f044ae6bf10 in __db_tas_mutex_lock_int (env=0xec71e0, mutex=2156, nowait=0)
    at ../../mutex/mut_tas.c:215
#3  0x00007f044ae6bfee in __db_tas_mutex_lock (env=0xec71e0, mutex=2156)
    at ../../mutex/mut_tas.c:248
#4  0x00007f044afcff2f in __memp_fget (dbmfp=0x7f041c011130, pgnoaddr=0x7f042d7e45b8, ip=0x0, 

Also, could you get the output from db_stat?  I'm curious which page the memberOf modop is waiting for...

--- Additional comment from nkinder on 2011-10-07 15:48:53 EDT ---

Created attachment 526951 [details]
Test Scripts

This is a set of test scripts that can be used to reproduce the issue with a standalone 389 DS instance.

--- Additional comment from nkinder on 2011-10-07 16:22:08 EDT ---

Here is what the pages look like in the memberOf index db file:

1 (btree internal)
-->2 (btree leaf)
-->3 (btree leaf)
-->4 (btree leaf)
----->7 (btree internal)
-------->8 (duplicate)
-------->9 (duplicate)
-->5 (btree leaf)
----->6 (btree internal)
-------->10 (duplicate)
-------->11 (duplicate)

From what I can see in the debugger, the write thread has the following locks on pages:

  read locks = 4
  write locks = 4, 5, 7, 9

The read thread has the following lock:

  read lock = 11

The deadlock is happening because the read thread is requesting a read lock on page 5, which is blocked since the write thread already has a write lock on page 5.  The write thread is attempting to get a write lock on page 11, which is blocked since the read thread already holds a read lock on page 11.

What is interesting is that db_stat does not show the read lock that the read thread has on page 11.  It also does not list the write locks held by the write thread on pages 7 and 9 (though it seems to list this transaction as having 3 write locks on page 4).

It seems that there is no deadlock detection occurring within Berkeley DB for duplicate pages.
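
For comparison, automatic deadlock detection is what would normally break exactly this kind of cycle.  Below is a minimal, purely illustrative sketch (389 DS sets up its own DB environment internally, so this is not its actual code) of how a BDB application enables the detector so that one locker in the cycle gets DB_LOCK_DEADLOCK instead of blocking forever:

--------------------------------------------------------------------------
#include <db.h>

/* Open a transactional environment with the deadlock detector set to
 * run on every lock conflict.  One locker in a page-5/page-11 style
 * cycle is then chosen as the victim and its lock request fails with
 * DB_LOCK_DEADLOCK rather than waiting forever. */
int
open_env_with_detection(DB_ENV **envp, const char *home)
{
    DB_ENV *env;
    int ret;

    if ((ret = db_env_create(&env, 0)) != 0)
        return (ret);

    /* DB_LOCK_DEFAULT selects the library's default victim policy. */
    if ((ret = env->set_lk_detect(env, DB_LOCK_DEFAULT)) != 0)
        goto err;

    if ((ret = env->open(env, home, DB_CREATE | DB_INIT_LOCK |
        DB_INIT_LOG | DB_INIT_MPOOL | DB_INIT_TXN | DB_THREAD, 0)) != 0)
        goto err;

    *envp = env;
    return (0);

err:    (void)env->close(env, 0);
    return (ret);
}
--------------------------------------------------------------------------

If the lock table never records the duplicate-page locks in the first place, though, the detector has no cycle to see, which would be consistent with what db_stat shows here.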

--- Additional comment from nkinder on 2011-10-07 16:25:58 EDT ---

Created attachment 526956 [details]
db_dump output of memberOf.db4

This is the db_dump output of the memberOf.db4 index file.  It shows the page layout.

--- Additional comment from nkinder on 2011-10-07 16:49:31 EDT ---

I have posted a question about this issue on the Oracle Berkeley DB forums, as it looks more and more like a problem inside libdb:

https://forums.oracle.com/forums/thread.jspa?threadID=2292442

Comment 3 Nathan Kinder 2012-06-27 18:11:44 UTC
We have been unable to reproduce this on RHEL, so closing this as WORKSFORME.