Created attachment 525795 [details]
Stack trace

The FreeIPA development team reported an issue with ns-slapd hanging when under load from IPA (https://fedorahosted.org/freeipa/ticket/1885). I am able to reproduce this issue on F15 x86_64.

The hang looks to be some sort of deadlock within libdb. When the hang occurs, the ns-slapd process goes to 100% CPU and makes no progress. I built debug versions of 389-ds-base and db4 and was able to get some interesting stack traces from three threads that seem to contribute to this problem. One thread is doing a write to the memberOf index db, another is doing a read from the memberOf index db, and the checkpoint thread is attempting to do a checkpoint. All three threads seem to be yielding or waiting for a mutex. The db_stat tool doesn't show anything waiting for locks, so it seems to be something more internal to libdb.
Created attachment 525797 [details]
db_stat output
In the debugger, I see the checkpoint thread looping over this section of code in __memp_sync_int():

--------------------------------------------------------------------------
	for (i = wrote_cnt = 0, remaining = ar_cnt; remaining > 0; ++i) {
		if (i >= ar_cnt) {
			i = 0;
			__os_yield(env, 1, 0);
		}
		if ((hp = bharray[i].track_hp) == NULL)
			continue;
--------------------------------------------------------------------------

The track_hp pointer is NULL for every slot in the array that is checked, so we go through the entire array and then end up calling __os_yield() to allow other threads to do some work. The array contents never change, though, so this thread loops endlessly.

The writer thread loops constantly over this section of code:

--------------------------------------------------------------------------
	for (nspins =
	    mtxregion->stat.st_mutex_tas_spins; nspins > 0; --nspins) {
		dbenv = env->dbenv;

		/*
		 * Avoid interlocked instructions until they're likely to
		 * succeed by first checking whether it is held
		 */
		if (MUTEXP_IS_BUSY(mutexp) || !MUTEXP_ACQUIRE(mutexp)) {
			if (F_ISSET(dbenv, DB_ENV_FAILCHK) &&
			    ip == NULL && dbenv->is_alive(dbenv,
			    mutexp->pid, mutexp->tid, 0) == 0) {
				ret = __env_set_state(env,
				    &ip, THREAD_VERIFY);
				if (ret != 0 ||
				    ip->dbth_state == THREAD_FAILCHK)
					return (DB_RUNRECOVERY);
			}
			if (nowait)
				return (DB_LOCK_NOTGRANTED);

			/*
			 * Some systems (notably those with newer Intel
			 * CPUs) need a small pause here. [#6975]
			 */
			MUTEX_PAUSE
			continue;
--------------------------------------------------------------------------

MUTEXP_IS_BUSY() is always true, which causes us to hit the continue statement, jumping back to the top of the for loop. When nspins reaches 0, an attempt is made to acquire the mutex, which fails, causing us to go back to the for loop again. All of this seems to point to a mutex that is leaked somewhere.
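For reference, here is a minimal sketch of the bounded spin-then-yield pattern the writer thread is stuck in. This is an illustration using C11 atomics and sched_yield(), not libdb's actual code; try_acquire(), cpu_pause(), and MAX_SPINS below are hypothetical stand-ins for MUTEXP_ACQUIRE, MUTEX_PAUSE, and st_mutex_tas_spins:

--------------------------------------------------------------------------
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_SPINS 200	/* hypothetical stand-in for st_mutex_tas_spins */

static atomic_flag lock_word = ATOMIC_FLAG_INIT;

/* Test-and-set acquire: returns true if we got the lock. */
static bool
try_acquire(void)
{
	return (!atomic_flag_test_and_set_explicit(&lock_word,
	    memory_order_acquire));
}

static void
cpu_pause(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_ia32_pause();		/* plays the MUTEX_PAUSE role */
#endif
}

static void
spin_lock(void)
{
	for (;;) {
		/* Spin a bounded number of times, as the libdb loop does. */
		for (int nspins = MAX_SPINS; nspins > 0; --nspins) {
			if (try_acquire())
				return;
			cpu_pause();
		}
		/*
		 * Out of spins: yield so the holder can run and release
		 * the lock. If the mutex was leaked (the holder will
		 * never release it), this outer loop never terminates,
		 * matching the 100% CPU hang seen here.
		 */
		sched_yield();
	}
}
--------------------------------------------------------------------------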
(In reply to comment #1)
> Created attachment 525797 [details]
> db_stat output

Here are the status strings that can appear in the db_stat -C output:

	status = "ABORT";
	status = "EXPIRED";
	status = "FREE";
	status = "HELD";
	status = "PENDING";
	status = "WAIT";
	status = "UNKNOWN";

Interestingly, there is no "WAIT" or "PENDING" in the attached output...
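These strings are printed by db_stat's lock printer from each lock's status field. Roughly, the mapping looks like the sketch below; the DB_LSTAT_* values are libdb's real db_status_t enum from db.h, but the function itself is a simplified reconstruction, not the verbatim source:

--------------------------------------------------------------------------
#include <db.h>

/* Simplified reconstruction of db_stat's lock-status mapping. */
static const char *
lock_status_str(db_status_t s)
{
	switch (s) {
	case DB_LSTAT_ABORTED:	return ("ABORT");	/* aborted */
	case DB_LSTAT_EXPIRED:	return ("EXPIRED");	/* timed out */
	case DB_LSTAT_FREE:	return ("FREE");	/* not in use */
	case DB_LSTAT_HELD:	return ("HELD");	/* granted */
	case DB_LSTAT_PENDING:	return ("PENDING");	/* about to be granted */
	case DB_LSTAT_WAITING:	return ("WAIT");	/* on a wait queue */
	default:		return ("UNKNOWN");
	}
}
--------------------------------------------------------------------------

So if the read and write threads really are blocked on page locks, we would expect WAIT (or at least PENDING) entries in db_stat -C; their absence suggests the lock table does not know anything is blocked.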
Next time this happens, could you please get the page number __memp_fget is trying to fget?

(gdb) thread 14    (of course, please replace the number with your latest...)
(gdb) frame 4      (ditto)
(gdb) p *pgnoaddr

Thread 14 (Thread 0x7f042d7f2700 (LWP 3013)):
#0  0x0000003dd00c7b57 in sched_yield () from /lib64/libc.so.6
#1  0x00007f044afea5d0 in __os_yield (env=0xec71e0, secs=0, usecs=0)
    at ../../os/os_yield.c:54
#2  0x00007f044ae6bf10 in __db_tas_mutex_lock_int (env=0xec71e0, mutex=2156, nowait=0)
    at ../../mutex/mut_tas.c:215
#3  0x00007f044ae6bfee in __db_tas_mutex_lock (env=0xec71e0, mutex=2156)
    at ../../mutex/mut_tas.c:248
#4  0x00007f044afcff2f in __memp_fget (dbmfp=0x7f041c011130, pgnoaddr=0x7f042d7e45b8, ip=0x0,

Also, could you get the output from db_stat? I'm curious which page the memberOf modop is waiting for...
Created attachment 526951 [details]
Test Scripts

This is a set of test scripts that can be used to reproduce the issue with a standalone 389 DS instance.
Here is what the pages look like in the memberOf index db file:

1 (btree internal)
-->2 (btree leaf)
-->3 (btree leaf)
-->4 (btree leaf)
----->7 (btree internal)
-------->8 (duplicate)
-------->9 (duplicate)
-->5 (btree leaf)
----->6 (btree internal)
-------->10 (duplicate)
-------->11 (duplicate)

From what I can see in the debugger, the write thread holds the following page locks:

	read locks  = 4
	write locks = 4, 5, 7, 9

The read thread holds the following lock:

	read lock = 11

The deadlock happens because the read thread is requesting a read lock on page 5, which is blocked since the write thread already holds a write lock on page 5. Meanwhile, the write thread is attempting to get a write lock on page 11, which is blocked since the read thread already holds a read lock on page 11.

What is interesting is that db_stat does not show the read lock that the read thread holds on page 11. It also does not list the write locks held by the write thread on pages 7 and 9 (though it does list this transaction as holding 3 write locks on page 4). It seems that no deadlock detection is occurring within Berkeley DB for duplicate pages.
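The cycle described above is a classic two-resource (AB-BA) deadlock. Here is a minimal toy illustration, with pthread mutexes standing in for the page locks on pages 5 and 11 (this is not libdb code, just the shape of the lock cycle):

--------------------------------------------------------------------------
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* page5/page11 stand in for BDB's page locks in the cycle above. */
static pthread_mutex_t page5  = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t page11 = PTHREAD_MUTEX_INITIALIZER;

static void *
writer(void *arg)			/* holds page 5, wants page 11 */
{
	pthread_mutex_lock(&page5);
	sleep(1);			/* let the reader take page 11 */
	pthread_mutex_lock(&page11);	/* blocks forever */
	pthread_mutex_unlock(&page11);
	pthread_mutex_unlock(&page5);
	return (arg);
}

static void *
reader(void *arg)			/* holds page 11, wants page 5 */
{
	pthread_mutex_lock(&page11);
	sleep(1);			/* let the writer take page 5 */
	pthread_mutex_lock(&page5);	/* blocks forever */
	pthread_mutex_unlock(&page5);
	pthread_mutex_unlock(&page11);
	return (arg);
}

int
main(void)
{
	pthread_t w, r;

	pthread_create(&w, NULL, writer, NULL);
	pthread_create(&r, NULL, reader, NULL);
	pthread_join(w, NULL);		/* never returns: deadlock */
	pthread_join(r, NULL);
	printf("unreachable\n");
	return (0);
}
--------------------------------------------------------------------------

The difference is that Berkeley DB's lock manager is supposed to detect exactly this kind of cycle and abort one side with DB_LOCK_DEADLOCK; since the duplicate-page locks never show up in the lock table, the detector apparently never sees the cycle.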
Created attachment 526956 [details]
db_dump output of memberOf.db4

This is the db_dump output of the memberOf.db4 index file. It shows the page layout.
I have posted a question about this issue on the Oracle Berkeley DB forums, as it increasingly looks like a problem inside libdb:

https://forums.oracle.com/forums/thread.jspa?threadID=2292442
Upstream ticket: https://fedorahosted.org/389/ticket/25
Marking as screened because it has been cloned upstream.
Created attachment 562317 [details]
Updated test scripts

This is an update to the previous test scripts that can be used to reproduce the issue. Since the issue is timing related, it sometimes takes multiple test runs to reproduce; this new set of scripts automatically runs the test multiple times. DS instance creation and cleanup are now handled automatically by the scripts at the beginning and end of each test run. See the README file included in the test script tarball for details.
Changing the status to WORKSFORME. See also: https://fedorahosted.org/389/ticket/25