Bug 1349779 - checkpoint thread hangs
Summary: checkpoint thread hangs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libdb
Version: 7.2
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 7.3
Assignee: Petr Kubat
QA Contact: Vaclav Danek
Docs Contact: Lenka Špačková
URL:
Whiteboard:
Duplicates: 1347800
Depends On:
Blocks: 1298243 1346768 1393868 1420851 1465891 1466363 1526929
 
Reported: 2016-06-24 09:03 UTC by Ludwig
Modified: 2021-09-09 11:52 UTC
CC List: 20 users

Fixed In Version: libdb-5.3.21-22.el7
Doc Type: Release Note
Doc Text:
Internal buffer locks no longer cause deadlocks in *libdb* Previously, the *libdb* database did not lock its internal buffers in the correct order when it accessed pages located in an off-page duplicate (OPD) tree while processing operations on a cursor. A writer process accessed first the primary tree and then the OPD tree while a reader process did the same in the reverse order. When a writer process accessed a page from the primary tree while a reader process accessed a page from the OPD tree, the processes were unable to access the page from the other tree because the pages were simultaneously locked by the other process. This consequently caused a deadlock in *libdb* because neither of the processes released their locks. With this update, the `btree` version of the `cursor->get` method has been modified to lock the tree's pages in the same order as the writing methods, that is, the primary tree first and the OPD tree second. As a result, deadlocks in *libdb* no longer occur in the described scenario.
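
For illustration, the ordering rule described in the doc text can be sketched with plain pthreads. This is not libdb code; the two mutexes merely stand in for the latches on a primary-tree page and an OPD-tree page:

#include <pthread.h>

static pthread_mutex_t primary_page = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t opd_page     = PTHREAD_MUTEX_INITIALIZER;

/* writer path: primary tree first, OPD tree second */
void writer_update(void)
{
    pthread_mutex_lock(&primary_page);
    pthread_mutex_lock(&opd_page);
    /* ... modify both pages ... */
    pthread_mutex_unlock(&opd_page);
    pthread_mutex_unlock(&primary_page);
}

/* before the fix the btree cursor->get path effectively took the locks in the
 * opposite order (OPD first, primary second), which can deadlock against
 * writer_update(); the fix makes it follow the same order, as below */
void reader_get(void)
{
    pthread_mutex_lock(&primary_page);
    pthread_mutex_lock(&opd_page);
    /* ... read both pages ... */
    pthread_mutex_unlock(&opd_page);
    pthread_mutex_unlock(&primary_page);
}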
Clone Of:
Clones: 1526929 1527825
Environment:
Last Closed: 2018-04-10 17:44:41 UTC
Target Upstream Version:


Attachments
db_stat -CA (92.89 KB, text/plain) - 2016-06-24 09:31 UTC, Ludwig
db_stat -MA (5.73 MB, text/plain) - 2016-06-24 09:33 UTC, Ludwig
db_stat -XA (8.47 MB, text/plain) - 2016-06-24 09:37 UTC, Ludwig
full backtrace (759.71 KB, text/plain) - 2016-07-06 11:34 UTC, Ludwig
New full st from bv. (578.20 KB, text/plain) - 2016-09-30 13:41 UTC, German Parente
db_stat -M (sf01678120) (9.84 MB, text/plain) - 2016-10-26 10:07 UTC, German Parente
db_stat -X sf01678120 (15.57 MB, text/plain) - 2016-10-26 10:08 UTC, German Parente
stacktrace sf01678120 (516.24 KB, text/plain) - 2016-10-26 10:09 UTC, German Parente
seems another ocurrence of this bug. (30.98 KB, text/plain) - 2017-03-06 11:08 UTC, German Parente
another case showing this issue. (70.86 KB, text/plain) - 2017-03-07 13:04 UTC, German Parente
Collection of rpms for rawhide that can 100% trigger the deadlock. (16.89 MB, application/x-gzip) - 2017-07-11 00:22 UTC, wibrown@redhat.com
Patch that disables slapi global txn during compaction. (2.62 KB, application/mbox) - 2017-07-11 00:37 UTC, wibrown@redhat.com


Links
Red Hat Product Errata RHBA-2018:0952 - 2018-04-10 17:45:24 UTC

Description Ludwig 2016-06-24 09:03:22 UTC
Description of problem:

The 389-ds process hangs because a database update is blocked by the checkpoint thread; the checkpoint thread is looping and not completing.

Version-Release number of selected component (if applicable):

5.3.21

How reproducible:

Happens frequently in a customer environment; so far not reproducible in the lab.

Steps to Reproduce:
No steps to reproduce, but maybe there is a pattern contributing to the situation: there is one database with many duplicate keys, this database is frequently updated and searched, and a txn_checkpoint is run every 60 seconds, which sometimes triggers this hang.

Actual results:

server is hanging

Expected results:

update and checkpoint should succeed

Additional info:

Analysis from gcore in next update

Comment 1 Ludwig 2016-06-24 09:11:17 UTC
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f556a2f72f3 in __db_pthread_mutex_condwait (env=0x7f5573b239b0, mutex=140002973410896, timespec=0x0, mutexp=<optimized out>) at ../../src/mutex/mut_pthread.c:321
#2  __db_hybrid_mutex_suspend (env=env@entry=0x7f5573b239b0, mutex=mutex@entry=68561, timespec=timespec@entry=0x0, exclusive=exclusive@entry=1) at ../../src/mutex/mut_pthread.c:577
#3  0x00007f556a2f6640 in __db_tas_mutex_lock_int (nowait=0, timeout=0, mutex=<optimized out>, env=0x7f5573b239b0) at ../../src/mutex/mut_tas.c:271
#4  __db_tas_mutex_lock (env=env@entry=0x7f5573b239b0, mutex=68561, timeout=timeout@entry=0) at ../../src/mutex/mut_tas.c:302
#5  0x00007f556a410dfb in __memp_fget (dbmfp=dbmfp@entry=0x7f54cc00a060, pgnoaddr=pgnoaddr@entry=0x7f54fb7f696c, ip=0x0, txn=0x7f54d04d2280, flags=0, flags@entry=2, addrp=addrp@entry=0x7f54fb7f6978)
    at ../../src/mp/mp_fget.c:311
#6  0x00007f556a3141f1 in __bam_search (dbc=dbc@entry=0x7f54cc014e00, root_pgno=root_pgno@entry=22, key=key@entry=0x7f54fb7f6d40, flags=12802, slevel=slevel@entry=1, recnop=recnop@entry=0x0, 
    exactp=exactp@entry=0x7f54fb7f6b2c) at ../../src/btree/bt_search.c:806
#7  0x00007f556a2ff256 in __bamc_search (dbc=dbc@entry=0x7f54cc014e00, root_pgno=22, key=key@entry=0x7f54fb7f6d40, flags=flags@entry=19, exactp=exactp@entry=0x7f54fb7f6b2c) at ../../src/btree/bt_cursor.c:2804
#8  0x00007f556a3036c4 in __bamc_put (dbc=0x7f54cc014e00, key=0x7f54fb7f6e20, data=0x7f54fb7f6d40, flags=19, pgnop=0x0) at ../../src/btree/bt_cursor.c:2114
#9  0x00007f556a3ba6a5 in __dbc_iput (dbc=0x7f54cc00bae0, key=0x7f54fb7f6e20, data=0x7f54fb7f6d40, flags=19) at ../../src/db/db_cam.c:2155
#10 0x00007f556a3bc69d in __dbc_put (dbc=<optimized out>, key=key@entry=0x7f54fb7f6e20, data=data@entry=0x7f54fb7f6d40, flags=<optimized out>, flags@entry=19) at ../../src/db/db_cam.c:2049
#11 0x00007f556a3b59de in __db_put (dbp=dbp@entry=0x7f54cc014560, ip=<optimized out>, txn=<optimized out>, key=key@entry=0x7f54fb7f6e20, data=data@entry=0x7f54fb7f6d40, flags=flags@entry=19)
    at ../../src/db/db_am.c:583
#12 0x00007f556a3caf94 in __db_put_pp (dbp=0x7f54cc014560, txn=0x7f54d04d2280, key=0x7f54fb7f6e20, data=0x7f54fb7f6d40, flags=<optimized out>) at ../../src/db/db_iface.c:1661
#13 0x00007f5567af34cf in idl_new_insert_key (be=<optimized out>, db=<optimized out>, key=<optimized out>, id=6968, txn=<optimized out>, a=<optimized out>, disposition=0x0)
    at ldap/servers/slapd/back-ldbm/idl_new.c:801
#14 0x00007f5567af2355 in idl_insert_key (be=be@entry=0x7f55739176c0, db=db@entry=0x7f54cc014560, key=key@entry=0x7f54fb7f6e20, id=id@entry=6968, txn=txn@entry=0x7f54d04d2280, a=a@entry=0x7f5573b3c230, 
    disposition=disposition@entry=0x0) at ldap/servers/slapd/back-ldbm/idl_shim.c:115
#15 0x00007f5567b025b5 in addordel_values_sv (be=be@entry=0x7f55739176c0, db=0x7f54cc014560, indextype=<optimized out>, vals=<optimized out>, id=id@entry=6968, flags=flags@entry=1, 
    txn=txn@entry=0x7f54fb7f9670, a=0x7f5573b3c230, idl_disposition=idl_disposition@entry=0x0, buffer_handle=buffer_handle@entry=0x0, type=<optimized out>) at ldap/servers/slapd/back-ldbm/index.c:1922
#16 0x00007f5567b030a2 in index_addordel_values_ext_sv (be=be@entry=0x7f55739176c0, type=<optimized out>, vals=0x7f54d015d990, evals=evals@entry=0x0, id=id@entry=6968, flags=flags@entry=1, 
    txn=txn@entry=0x7f54fb7f9670, idl_disposition=idl_disposition@entry=0x0, buffer_handle=buffer_handle@entry=0x0) at ldap/servers/slapd/back-ldbm/index.c:2116
#17 0x00007f5567b03694 in index_addordel_values_sv (be=be@entry=0x7f55739176c0, type=<optimized out>, vals=<optimized out>, evals=evals@entry=0x0, id=id@entry=6968, flags=flags@entry=1, 
    txn=txn@entry=0x7f54fb7f9670) at ldap/servers/slapd/back-ldbm/index.c:1981
#18 0x00007f5567b03ebb in index_add_mods (be=0x7f55739176c0, mods=<optimized out>, olde=0x7f5489d10170, newe=0x7f54d01604d0, txn=txn@entry=0x7f54fb7f9670) at ldap/servers/slapd/back-ldbm/index.c:641
#19 0x00007f5567b20e99 in ldbm_back_modify (pb=<optimized out>) at ldap/servers/slapd/back-ldbm/ldbm_modify.c:705
#20 0x00007f557229c4e1 in op_shared_modify (pb=pb@entry=0x7f54fb7fdad0, pw_change=pw_change@entry=0, old_pw=0x0) at ldap/servers/slapd/modify.c:1054
#21 0x00007f557229d82f in do_modify (pb=pb@entry=0x7f54fb7fdad0) at ldap/servers/slapd/modify.c:387
#22 0x00007f5572788320 in connection_dispatch_operation (pb=0x7f54fb7fdad0, op=0x7f5573e5c7e0, conn=0x7f553c20ed30) at ldap/servers/slapd/connection.c:619
#23 connection_threadmain () at ldap/servers/slapd/connection.c:1751
#24 0x00007f557045d7bb in _pt_root (arg=0x7f5573e37460) at ../../../nspr/pr/src/pthreads/ptthread.c:212
#25 0x00007f556fdf4dc5 in start_thread (arg=0x7f54fb7fe700) at pthread_create.c:308
#26 0x00007f556fb1b28d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

This thread is hanging in an update and blocking all other readers.

Examining the mutex it waits on shows:

#3  0x00007f556a2f6640 in __db_tas_mutex_lock_int (nowait=0, timeout=0, mutex=<optimized out>, env=0x7f5573b239b0) at ../../src/mutex/mut_tas.c:271
        mtxmgr = <optimized out>
        mtxregion = 0x7f555e0d5138
        ip = 0x0
        timespec = {tv_sec = 0, tv_nsec = 140003048947713}
        ret = <optimized out>
        dbenv = 0x7f5573c7c6c0
        mutexp = 0x7f555eac55f0
        now = {tv_sec = 140004635196376, tv_nsec = 140004830440083}
        nspins = 0

 p *mutexp
$34 = {u = {m = {mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 1, __kind = 128, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
        __size = '\000' <repeats 12 times>, "\001\000\000\000\200", '\000' <repeats 22 times>, __align = 0}, cond = {__data = {__lock = 0, __futex = 1, __total_seq = 1, __wakeup_seq = 0, __woken_seq = 0,
          __mutex = 0xffffffffffffffff, __nwaiters = 2, __broadcast_seq = 0},
        __size = "\000\000\000\000\001\000\000\000\001", '\000' <repeats 23 times>, "\377\377\377\377\377\377\377\377\002\000\000\000\000\000\000", __align = 4294967296}}, rwlock = {__data = {__lock = 0,
        __nr_readers = 0, __readers_wakeup = 0, __writer_wakeup = 1, __nr_readers_queued = 128, __nr_writers_queued = 0, __writer = 0, __shared = 0, __pad1 = 0, __pad2 = 4294967296, __flags = 1},
      __size = '\000' <repeats 12 times>, "\001\000\000\000\200", '\000' <repeats 27 times>, "\001\000\000\000\001\000\000\000\000\000\000", __align = 0}}, tas = 0 '\000', sharecount = {value = 1}, wait = 1,
  pid = 6669, tid = 140004094195456, mutex_next_link = 0, alloc_id = 15, mutex_set_wait = 0, mutex_set_nowait = 2847, mutex_set_rd_wait = 0, mutex_set_rd_nowait = 6010, hybrid_wait = 0, hybrid_wakeup = 0,
  flags = 49}

so the thread owning this mutex is: 140004094195456 == 0x7f553e4d3700
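
(A trivial sanity check of that conversion, not from the original session:)

#include <stdio.h>

int main(void)
{
    /* tid taken from the mutex dump above */
    printf("0x%lx\n", 140004094195456UL);   /* prints 0x7f553e4d3700 */
    return 0;
}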

Switching to this thread:
[Switching to thread 2 (Thread 0x7f553e4d3700 (LWP 6672))]

#0  0x00007f556fb129b3 in select () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f556a423b3d in __os_sleep (usecs=<optimized out>, secs=<optimized out>, env=0x7f5573b239b0) at ../../src/os/os_yield.c:90
#2  __os_yield (env=env@entry=0x7f5573b239b0, secs=<optimized out>, secs@entry=1, usecs=<optimized out>, usecs@entry=0) at ../../src/os/os_yield.c:48
#3  0x00007f556a41f293 in __memp_sync_int (env=env@entry=0x7f5573b239b0, dbmfp=dbmfp@entry=0x0, trickle_max=trickle_max@entry=0, flags=flags@entry=4, wrote_totalp=wrote_totalp@entry=0x0, 
    interruptedp=interruptedp@entry=0x0) at ../../src/mp/mp_sync.c:483
#4  0x00007f556a42f6c2 in __txn_checkpoint (env=env@entry=0x7f5573b239b0, kbytes=kbytes@entry=0, minutes=minutes@entry=0, flags=flags@entry=0) at ../../src/txn/txn_chkpt.c:242
#5  0x00007f556a42fae4 in __txn_checkpoint_pp (dbenv=<optimized out>, kbytes=0, minutes=0, flags=0) at ../../src/txn/txn_chkpt.c:81
#6  0x00007f5567ae7a87 in checkpoint_threadmain (param=<optimized out>) at ldap/servers/slapd/back-ldbm/dblayer.c:4695
#7  0x00007f557045d7bb in _pt_root (arg=0x7f5573b23490) at ../../../nspr/pr/src/pthreads/ptthread.c:212
#8  0x00007f556fdf4dc5 in start_thread (arg=0x7f553e4d3700) at pthread_create.c:308
#9  0x00007f556fb1b28d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

It is the checkpointing thread, looping in __memp_sync_int(); looking at the available data shows:
#3  0x00007f556a41f293 in __memp_sync_int (env=env@entry=0x7f5573b239b0, dbmfp=dbmfp@entry=0x0, trickle_max=trickle_max@entry=0, flags=flags@entry=4, wrote_totalp=wrote_totalp@entry=0x0, 
    interruptedp=interruptedp@entry=0x0) at ../../src/mp/mp_sync.c:483
        bhp = <optimized out>
        bharray = 0x7f55380008c0
        dbmp = 0x7f5573b14420
        hp = <optimized out>
        c_mp = <optimized out>
        mp = 0x7f553ecd5138
        mfp = <optimized out>
        mutex = <optimized out>
        last_mf_offset = 0
        ar_cnt = 867
        ar_max = <optimized out>
        i = 0
        n_cache = <optimized out>
        remaining = 2
        wrote_total = 865
        wrote_cnt = 865
        dirty = <optimized out>
        filecnt = 0
        maxopenfd = 0
        required_write = <optimized out>
        ret = 0
        t_ret = <optimized out>

So there are two buffers remaining; we will check why they are not written.

Comment 3 Ludwig 2016-06-24 09:25:35 UTC
we are here: 
480             for (i = wrote_cnt = 0, remaining = ar_cnt; remaining > 0; ++i) {
481                     if (i >= ar_cnt) {
482                             i = 0;
483                             __os_yield(env, 1, 0);
484                     }
485                     if ((hp = bharray[i].track_hp) == NULL)
486                             continue;

Looking for track_hp != NULL, I did find two:

(gdb) p bharray[748]
$37 = {track_hp = 0x7f553ed19760, track_off = 23738584, track_pgno = 22}
(gdb) p bharray[739]
$38 = {track_hp = 0x7f553ece1ae0, track_off = 23738584, track_pgno = 6}
(gdb)

(gdb) p * bharray[739].track_hp
$2 = {mtx_hash = 917, hash_bucket = {stqh_first = 23795448, stqh_last = 23795472}, hash_page_dirty = {value = 1}, hash_io_wait = 0, hash_frozen = 0, hash_thawed = 0, hash_frozen_freed = 0, old_reader = {
    file = 0, offset = 0}, flags = 0}
p *bharray[748].track_hp
$36 = {mtx_hash = 4997, hash_bucket = {stqh_first = 405269416, stqh_last = 23575320}, hash_page_dirty = {value = 1}, hash_io_wait = 0, hash_frozen = 0, hash_thawed = 0, hash_frozen_freed = 0, old_reader = {
    file = 0, offset = 0}, flags = 0}


these buffers are found here: 
                MUTEX_READLOCK(env, mutex);
                SH_TAILQ_FOREACH(bhp, &hp->hash_bucket, hq, __bh)
                        if (bhp->pgno == bharray[i].track_pgno &&
                            bhp->mf_offset == bharray[i].track_off)
                                break;

I managed to find them by manually applying the TAILQ macros. The first is:

(gdb) p *(struct __bh *) 0x7f55403931e0
$15 = {mtx_buf = 68551, ref = {value = 2}, flags = 18, priority = 14753390, hq = {stqe_next = -1, stqe_prev = -23795448}, pgno = 6, mf_offset = 23738584, bucket = 894, region = 0, td_off = 0, vc = {
    sce_next = -1, sce_prev = -1}, buf = ""}

and the second is:
 p *(struct __bh *) 0x7f5540395268
$33 = {mtx_buf = 68552, ref = {value = 1}, flags = 18, priority = 14753313, hq = {stqe_next = -1, stqe_prev = 381694144}, pgno = 22, mf_offset = 23738584, bucket = 4974, region = 0, td_off = 0, vc = {
    sce_next = -1, sce_prev = -1}, buf = <incomplete sequence \322>}

both have flags = 18, which translates to: BH_EXCLUSIVE | BH_DIRTY
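
(For reference, a small decode of that value; the constants here assume the usual BDB 5.3 buffer-header flag values BH_DIRTY = 0x002 and BH_EXCLUSIVE = 0x010 from src/dbinc/mp.h:)

#include <stdio.h>

#define BH_DIRTY      0x002   /* page was modified */
#define BH_EXCLUSIVE  0x010   /* exclusive latch held */

int main(void)
{
    unsigned flags = 18;      /* 0x12, as seen in both buffer headers */
    printf("dirty=%d exclusive=%d\n",
           !!(flags & BH_DIRTY), !!(flags & BH_EXCLUSIVE));
    return 0;
}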

Since BH_EXCLUSIVE is set, we end up here:
                if (F_ISSET(bhp, BH_EXCLUSIVE)) {
                        MUTEX_UNLOCK(env, mutex);
                        if (!required_write) {
                                --remaining;
                                bharray[i].track_hp = NULL;
                        }
                        continue;
                }

And as we are doing a checkpoint, required_write is set, so we continue without writing and the loop starts again.

I cannot see how this thread would hold the mutex the update thread is waiting on.

Comment 4 Ludwig 2016-06-24 09:31:43 UTC
Created attachment 1171892 [details]
db_stat -CA

Comment 5 Ludwig 2016-06-24 09:33:09 UTC
Created attachment 1171894 [details]
db_stat -MA

Comment 6 Ludwig 2016-06-24 09:37:51 UTC
Created attachment 1171897 [details]
db_stat -XA

Comment 8 Marc Sauton 2016-06-29 15:38:31 UTC
Would it make sense to suggest a test running with a slightly higher value of the RHDS parameter nsslapd-db-checkpoint-interval, e.g. 70 instead of the default of 60? (Changing the timing might work around this lock situation.)

Comment 11 Ludwig 2016-07-06 11:34:08 UTC
Created attachment 1176839 [details]
full backtrace

Comment 18 Petr Kubat 2016-07-08 14:06:45 UTC
After looking through libdb's code I see no way the checkpoint thread should be able to hold the mutex needed by the update thread.
If we take a look at the checkpoint thread's bucket tracking array and look for the blocking entry, we can see that the thread has indeed already processed it, as track_hp == NULL.

(gdb) p *(BH_TRACK *) 0x7f55380052a0
$198 = {track_hp = 0x0, track_off = 23738584, track_pgno = 465}

If we look at where the change to NULL happens:

582         /* we disposed of this buffer. */
583         --remaining;
584         bharray[i].track_hp = NULL;
585 
586         /* Discard our buffer reference. */
587         DB_ASSERT(env, atomic_read(&bhp->ref) > 0);
588         atomic_dec(env, &bhp->ref);
589         MUTEX_UNLOCK(env, bhp->mtx_buf);

This means that directly after setting track_hp = NULL the blocking mutex should have been released. Not only was it not released, but the buffer's reference count was not decreased either: it should hold the value 1 (only one thread is waiting on this mutex).

(gdb) p *bhp
$165 = {mtx_buf = 68561, ref = {value = 2}, flags = 0, priority = 14753391, hq = {stqe_next = -1, stqe_prev = 333158376}, pgno = 465, mf_offset = 23738584, bucket = 54697, region = 0, td_off = 0, vc = {
    sce_next = -1, sce_prev = -1}, buf = <incomplete sequence \322>}

As it looks like the changes made to the buffer header did not persist, the fault may lie with the mmapped memory region files.

Comment 23 Gary Lamb 2016-07-20 22:21:19 UTC
Hi, I also received this from Oracle, which I am passing along to the customer:

>> 1. Ask the user to follow the deadlock debugging procedure that's documented at: https://docs.oracle.com/cd/E17076_02/html/programmer_reference/lock_deaddbg.html
> Did you perform the deadlock debugging procedure on the statistics? Did you find any unresolved deadlocks or self-deadlocks?

a. Do you call abort for all the abnormal termination of the transactions?
b. Can you provide the transactional stats as well?
c. Would it be possible to isolate this to a small test case so we can reproduce it locally?

Comment 27 German Parente 2016-09-30 13:41:38 UTC
Created attachment 1206276 [details]
New full st from bv.

Comment 34 German Parente 2016-10-26 10:07:36 UTC
Created attachment 1214239 [details]
db_stat -M (sf01678120)

Comment 35 German Parente 2016-10-26 10:08:58 UTC
Created attachment 1214240 [details]
db_stat -X sf01678120

Comment 36 German Parente 2016-10-26 10:09:53 UTC
Created attachment 1214242 [details]
stacktrace sf01678120

Comment 37 Ludwig 2016-10-26 10:43:41 UTC
the thread is creating a substring index for nsroledn and it is used in searches like:
nsroledn=cn=asb_ivs_admin,ou=asb_ivs_asbestos,ou=applications,*

This is very inefficient; it will do index lookups for all the triples ^cn, cn=, n=a, =as, and so on.

I am wondering if we could handle this better by using a range search on the equality index for this type of search.

If the substring index were disabled, that should already happen. It would probably have no performance impact, but that would have to be tested.

Do we have a test database simulating the customer database, or can we get one?

Comment 44 Noriko Hosoi 2016-11-04 23:12:26 UTC
*** Bug 1347800 has been marked as a duplicate of this bug. ***

Comment 51 German Parente 2017-03-06 11:08:21 UTC
Created attachment 1260364 [details]
seems another ocurrence of this bug.

Comment 54 German Parente 2017-03-07 13:04:03 UTC
Created attachment 1260796 [details]
another case showing this issue.

Comment 69 Petr Kubat 2017-03-15 08:43:30 UTC
I have found an open issue [1] on 389-ds's upstream tracker that seems similar (deadlock between a modrdn operation and the checkpointing thread). They are hitting this issue while running an upstream stress test. Trying to run the test to reproduce.

[1] https://pagure.io/389-ds-base/issue/48166

Comment 71 Petr Kubat 2017-03-20 07:42:33 UTC
Created attachment 1264706 [details]
Reproduced freeze

Comment 72 Adam Tkac 2017-04-27 09:18:00 UTC
Since this issue bites us a lot (around 5 hangs every day), I developed a series of patches for 389-ds which basically protect the checkpoint thread (the dbenv->txn_checkpoint call) with a pthread rwlock - txn_checkpoint can only be called when no other thread is inside libdb. A very interesting fact is that our testing freeipa server hung this night even with this patchset. So the root cause isn't in the checkpoint logic, because when 389-ds hung, the checkpoint thread wasn't running at all. The stacktrace shows only hung cursor->get() and cursor->put() calls, with no dbenv->txn_checkpoint() running. With this fact in mind, txn_checkpoint() isn't the root cause of the hang; it's only a result of some other deadlock which already happened before.

I will modify the patchset to ensure that only one thread is inside libdb at the same time and check if it helps.

For reference, I put our patched 389-ds at https://atkac.fedorapeople.org/389-ds-base-1.3.5.10-20.gdc1.debug.src.rpm; feel free to let me know if you need more info or more debugging logs.
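
(A minimal sketch of the approach described above - not the actual patchset - guarding txn_checkpoint with a pthread rwlock so it only runs while no other thread is inside libdb:)

#include <db.h>
#include <pthread.h>

static pthread_rwlock_t db_gate = PTHREAD_RWLOCK_INITIALIZER;

/* ordinary libdb callers take the read side, so many of them may run at once */
int guarded_put(DB *db, DB_TXN *txn, DBT *key, DBT *data, u_int32_t flags)
{
    int ret;

    pthread_rwlock_rdlock(&db_gate);
    ret = db->put(db, txn, key, data, flags);
    pthread_rwlock_unlock(&db_gate);
    return ret;
}

/* the checkpoint thread takes the write side, i.e. it waits until no other
 * thread is inside libdb before calling txn_checkpoint */
int guarded_checkpoint(DB_ENV *env)
{
    int ret;

    pthread_rwlock_wrlock(&db_gate);
    ret = env->txn_checkpoint(env, 0, 0, 0);
    pthread_rwlock_unlock(&db_gate);
    return ret;
}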

Comment 73 Petr Kubat 2017-04-27 10:15:31 UTC
Hi Adam, thank you for the information!

>With this fact in mind, txn_checkpoint() isn't root cause of hang, it's only result of some other deadlock which already happened before.

Indeed, I have already had the suspicion for a while that this is most likely what is happening. The checkpoint thread just gets blamed for it as the shared latch the deadlock happens on only saves information about the process that last accessed it.

Comment 74 Petr Kubat 2017-05-02 11:59:17 UTC
After Adam reached the same conclusion regarding the checkpoint thread's involvement in the issue as I did (comment 72), I tried running the reproducer with checkpointing turned off, and sure enough the exact same freeze (same underlying file, same page types in the buffers, a thread performing a modrdn operation) was reproduced. The difference is that the shared latch "deadlock" is an actual deadlock this time around, between two non-checkpointing threads (41 and 39), which seems quite similar to the occurrence from comments 34-36.

Comment 81 aheverle 2017-05-16 18:42:54 UTC
Created attachment 1279382 [details]
logconv

Comment 98 Ludwig 2017-07-10 09:22:21 UTC
No, they have had a reproducer for some time; see comment #70.

Comment 100 wibrown@redhat.com 2017-07-11 00:22:42 UTC
Created attachment 1295997 [details]
Collection of rpms for rawhide that can 100% trigger the deadlock.

This tarball contains rpms from my system with a patch attached that can trigger the deadlock 100% of the time.

To execute this you need to install these on a fedora-rawhide system (or rebuild the srpms for RHEL).

You can then trigger the issue with our basic test case as:

sudo py.test -s /usr/lib/python2.7/site-packages/dirsrvtests/tests/suites/basic

This should result in a dirsrv process spinning at 200% CPU, as in:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
dirsrv   20188  188  6.3 21476401496 764728 ?  Sl   15:50   0:32 /opt/dirsrv/sbin/ns-slapd -D /opt/dirsrv/etc/dirsrv/slapd-standalone_1 -i /opt/dirsrv/var/run/dirsrv/slapd-standalone_1.pid


*explanation*

The provided RPMs contain an in-progress patch to correct time handling in DS. In the process we changed the timers of the checkpoint thread to use the system clock rather than the current_time() function.

During an ldif2db the checkpoint thread would be started, but the time thread that updated current_time() would not be started. As a result, the checkpoint and compaction would never run during an ldif2db. However, with this patch, they now run as expected during normal server operation.

One of the first tests of the basic suite is to export and import an ldif - this now sets the stage for what goes wrong.

We have the import task running here:


Thread 61 (Thread 0x7fbaf551a700 (LWP 20293)):
#0  0x00007fbb490698eb in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fbb40619ca3 in __db_hybrid_mutex_suspend () from /lib64/libdb-5.3.so
#2  0x00007fbb40619117 in __db_tas_mutex_lock () from /lib64/libdb-5.3.so
#3  0x00007fbb40731e7c in __memp_fget () from /lib64/libdb-5.3.so
#4  0x00007fbb4063601d in __bam_search () from /lib64/libdb-5.3.so
#5  0x00007fbb40639564 in __bam_split () from /lib64/libdb-5.3.so
#6  0x00007fbb40625db6 in __bamc_put () from /lib64/libdb-5.3.so
#7  0x00007fbb406dc021 in __dbc_iput () from /lib64/libdb-5.3.so
#8  0x00007fbb406d7240 in __db_put () from /lib64/libdb-5.3.so
#9  0x00007fbb406ece7b in __db_put_pp () from /lib64/libdb-5.3.so
#10 0x00007fbb3c4c0113 in id2entry_add_ext (be=0x6110001f74c0, e=0x60800067ed20, txn=0x0, encrypt=0, cache_res=0x0) at /home/william/development/389ds/ds/ldap/servers/slapd/back-ldbm/id2entry.c:109
#11 0x00007fbb3c4fc3ec in import_foreman (param=0x607000119280) at /home/william/development/389ds/ds/ldap/servers/slapd/back-ldbm/import-threads.c:2673
#12 0x00007fbb492a40bb in _pt_root () from /lib64/libnspr4.so
#13 0x00007fbb490633a9 in start_thread () from /lib64/libpthread.so.0
#14 0x00007fbb4892f7ff in clone () from /lib64/libc.so.6

This is adding a large volume of entries, AKA high write load.

At the same time, the checkpoint thread fires, and then triggers a db compaction:

#0  0x00007fbb490698eb in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fbb40619ca3 in __db_hybrid_mutex_suspend () from /lib64/libdb-5.3.so
#2  0x00007fbb40619117 in __db_tas_mutex_lock () from /lib64/libdb-5.3.so
#3  0x00007fbb40731e7c in __memp_fget () from /lib64/libdb-5.3.so
#4  0x00007fbb40739786 in __memp_ftruncate () from /lib64/libdb-5.3.so
#5  0x00007fbb406f028c in __db_free () from /lib64/libdb-5.3.so
#6  0x00007fbb406e17c3 in __db_exchange_page () from /lib64/libdb-5.3.so
#7  0x00007fbb4063dfe9 in __bam_merge () from /lib64/libdb-5.3.so
#8  0x00007fbb40640036 in __bam_compact_int () from /lib64/libdb-5.3.so
#9  0x00007fbb406dfe15 in __db_compact_int () from /lib64/libdb-5.3.so
#10 0x00007fbb406e94b5 in __db_compact_pp () from /lib64/libdb-5.3.so
#11 0x00007fbb3c4a352e in checkpoint_threadmain (param=0x6120000412c0) at /home/william/development/389ds/ds/ldap/servers/slapd/back-ldbm/dblayer.c:4497
#12 0x00007fbb492a40bb in _pt_root () from /lib64/libnspr4.so
#13 0x00007fbb490633a9 in start_thread () from /lib64/libpthread.so.0
#14 0x00007fbb4892f7ff in clone () from /lib64/libc.so.6

Here is the code listing for that region. 

                rc = dblayer_get_id2entry(inst->inst_be, &db);
                if (!db || rc ) {
                    continue;
                }
                slapi_log_err(SLAPI_LOG_BACKLDBM, "checkpoint_threadmain", "Compacting DB start: %s\n",
                              inst->inst_name);
                rc = dblayer_txn_begin(inst->inst_be, NULL, &txn);
                if (rc) {
                    slapi_log_err(SLAPI_LOG_ERR, "checkpoint_threadmain", "compactdb: transaction begin failed: %d\n", rc);
                    break;
                }
                rc = db->compact(db, txn.back_txn_txn, NULL/*start*/, NULL/*stop*/, 
                                 &c_data, DB_FREE_SPACE, NULL/*end*/);
                if (rc) {
                    slapi_log_err(SLAPI_LOG_ERR,
                              "compactdb: failed to compact %s; db error - %d %s\n",
                              inst->inst_name, rc, db_strerror(rc));
                    if((rc = dblayer_txn_abort(inst->inst_be, &txn))){
                        slapi_log_err(SLAPI_LOG_ERR, "checkpoint_threadmain", "compactdb: failed to abort txn (%s) db error - %d %s\n",
                                  inst->inst_name, rc, db_strerror(rc));
                        break;
                    }


Suddenly deadlock.

I hope this helps you,

Comment 101 wibrown@redhat.com 2017-07-11 00:33:33 UTC
UPDATE: This can be resolved by running compact *without* a global TXN as per https://docs.oracle.com/cd/E17275_01/html/api_reference/C/BDB-C_APIReference.pdf page 16.

+#ifdef COMPACT_IN_TXN
                 rc = dblayer_txn_begin(inst->inst_be, NULL, &txn);
                 if (rc) {
                     slapi_log_err(SLAPI_LOG_ERR, "checkpoint_threadmain", "compactdb: transaction begin failed: %d\n", rc);
@@ -4515,6 +4516,23 @@ static int checkpoint_threadmain(void *param)
                         break;
                     }
                 }
+#else
+                /*
+                 * try compact using multiple smaller internal txns rather than
+                 * a large global txn
+                 */
+                rc = db->compact(db, NULL/* txn */, NULL/*start*/, NULL/*stop*/, 
+                                 &c_data, DB_FREE_SPACE, NULL/*end*/);
+                if (rc) {
+                    slapi_log_err(SLAPI_LOG_ERR,
+                              "compactdb: failed to compact %s; db error - %d %s\n",
+                              inst->inst_name, rc, db_strerror(rc));
+                } else {
+                    slapi_log_err(SLAPI_LOG_BACKLDBM,
+                                   "compactdb: compact %s - %d pages freed\n",
+                                   inst->inst_name, c_data.compact_pages_free);
+                }
+#endif

Comment 102 wibrown@redhat.com 2017-07-11 00:37:00 UTC
Created attachment 1296005 [details]
Patch that disables slapi global txn during compaction.

Comment 103 wibrown@redhat.com 2017-07-11 05:47:59 UTC
Further investigation showed that the above patch helped prevent the issue but didn't solve it. Removing "DB_FREE_SPACE;" seems to have helped to correct this for me in the interim.

Comment 104 Ludwig 2017-07-11 06:51:14 UTC
(In reply to wibrown from comment #103)
> Further investigation showed that the above patch helped prevent the issue
> but didn't solve it. Removing "DB_FREE_SPACE;" seems to have helped to
> correct this for me in the interim.

I am not sure this is the same deadlock - did you investigate the locks with db_stat? It could very well be that we now have a "real" deadlock caused by the application (ds).

Comment 105 Petr Kubat 2017-07-11 10:39:39 UTC
After first looking at the case posted by William it looked similar to the original issue, with the difference that the checkpointing thread would be directly taking the place of one of the deadlocked threads. However, after reproducing the issue locally with the help of the reproducer provided and looking at the state of the process when the hang occurs, I have to agree that this might be a different issue entirely.

The key differences are:
The mutexes the threads are waiting on are both exclusively locked for writing (whereas in the original case, one of them was locked for reading) and both seem to be locked by the checkpointing thread. This might not be correct but, as the checkpointing thread is one of the deadlocked threads and is not spinning while touching multiple mutexes, it should be.

The types of pages being accessed are different. During the other hangs one thread is always trying to access a Btree leaf page, while the other is accessing an off-page duplicate leaf page (basically secondary indexes).
This time around the importing thread tries to access a Btree metadata page (pgno 0, so actually the root page) while the checkpointing thread is trying to access a page of type 0 - invalid page type. I am unfortunately not sure what this page type means, but my best guess would be that it is a free (or not yet used?) page.

Comment 106 Ludwig 2017-07-11 10:44:14 UTC
Hi, looks like you are confirming my suspicion that this is another case, and the scenario is quite different from all the cases we have seen so far.

About the page type 0, I also don't know what it is, but the thread is doing compaction and tries to "empty" pages to be freed later, so it is most likely one of the pages to be freed.

Comment 107 Petr Kubat 2017-07-11 10:57:22 UTC
Well, one of my suspicions for the culprit of the original hangs was some kind of bug during mutex/page manipulations that would cause the mutex of a page to stay in an exclusively locked state even though it is actually unlocked. If the page is supposed to be free (and I think that is quite possible, given that not using DB_FREE_SPACE works around the issue) I would expect the page's mutex to also be freed/unlocked and available for the checkpointing thread to pick up.

So the reproducer provided by William might be for a different scenario but with the same underlying issue causing it. I will look into this further.

Comment 108 wibrown@redhat.com 2017-07-11 23:42:24 UTC
Hey,

I was doing more testing (and thinking) about this and I think the answer is in compact. I have my server set up to run compact every 1/4 of a second, and I see a progression to:

[11/Jul/2017:17:44:57.700271711 +1000] - ERR - libdb - BDB2055 Lock table is out of available lock entries
[11/Jul/2017:17:44:57.714355409 +1000] - ERR - checkpoint_threadmain - compactdb: failed to compact userRoot; db error - 12 Cannot allocate memory

So perhaps the issue lies in the lock handling with compact. This affects both the external-transaction call rc = db->compact(db, txn.back_txn_txn, ...) and the internal one, rc = db->compact(db, NULL, ...).

Perhaps this is a place to start? You could write a test case with two threads, one that inserts a bunch of data and one that just calls compact repeatedly.
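
A rough sketch of such a test case (environment path, record count and flags are illustrative assumptions, not taken from the DS code; error checking omitted):

#include <db.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

static DB_ENV *env;
static DB *db;

static void *writer(void *arg)
{
    char kbuf[32], vbuf[32];
    int i;

    for (i = 0; i < 100000; i++) {
        DBT key, val;

        memset(&key, 0, sizeof(key));
        memset(&val, 0, sizeof(val));
        snprintf(kbuf, sizeof(kbuf), "key%d", i);
        snprintf(vbuf, sizeof(vbuf), "val%d", i);
        key.data = kbuf;  key.size = strlen(kbuf) + 1;
        val.data = vbuf;  val.size = strlen(vbuf) + 1;
        db->put(db, NULL, &key, &val, 0);    /* autocommitted insert */
    }
    return NULL;
}

static void *compactor(void *arg)
{
    for (;;) {
        DB_COMPACT c_data;

        memset(&c_data, 0, sizeof(c_data));
        db->compact(db, NULL, NULL, NULL, &c_data, DB_FREE_SPACE, NULL);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    /* ./testenv must already exist */
    db_env_create(&env, 0);
    env->open(env, "./testenv",
              DB_CREATE | DB_INIT_MPOOL | DB_INIT_TXN |
              DB_INIT_LOCK | DB_INIT_LOG | DB_THREAD, 0);
    db_create(&db, env, 0);
    db->open(db, NULL, "test.db", NULL, DB_BTREE,
             DB_CREATE | DB_AUTO_COMMIT | DB_THREAD, 0);

    pthread_create(&t1, NULL, writer, NULL);
    pthread_create(&t2, NULL, compactor, NULL);
    pthread_join(t1, NULL);
    /* the compactor loops forever; a real test would bound it and clean up */
    return 0;
}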

Comment 109 wibrown@redhat.com 2017-07-11 23:43:35 UTC
I think this means that, for now, we can advise customers to set the compactdb interval to 0, which will prevent the compact from running and could work around the issue.

Comment 110 Ludwig 2017-07-12 06:41:50 UTC
(In reply to wibrown from comment #109)
> I think this means for now to customers, we can advise they set compactdb
> interval to 0, which will prevent the compact running and could work around
> the issue.

I don't think that the hangs we have seen at customer installations are related to compact - the default compact interval is 30 days, and the stacks are different.

But I think compact has a lot of problems on its own, e.g. the huge number of locks needed, so I would also recommend disabling compact.

Comment 111 Petr Kubat 2017-07-12 13:05:23 UTC
>I don't think that the hangs we have seen at customers installations are related to compact, - the default compact interval is 30 days, - the stacks are different

Indeed, they are not. Using the previous reproducer the stacks and internals at the moment of the hang are quite similar to what our customers are seeing and that hang is being reproduced even when the checkpointing thread is not running at all.

Comment 112 Ludwig 2017-07-12 13:09:46 UTC
but in DS, db->compact is only executed in the checkpoint thread

Comment 113 Petr Kubat 2017-07-26 10:27:10 UTC
After playing around with the new reproducer provided by William a bit more, I think I have figured out what is causing the hangs to occur.

libdb stores the numbers of pages that have been freed by the __db_free internal function in a sorted list (pointed to from the MPOOLFILE data structure) so that they can later be reused or accessed by other internals. This works fine until some action introduces duplicate entries into the list, which cause the process to hang during some operations (e.g. returning the duplicate page back into the freelist and trying to sort it after itself -> self-deadlock).

I am not sure what that something is yet. To me it looks like multiple threads are accessing (and modifying) the freelist at the same time, but that should not happen as the access is controlled via a lock on the meta page. Also, I have no idea yet whether this is in the end connected to the original issue the customers are facing, but it is definitely similar.
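
(A toy illustration - not libdb code - of how a duplicate entry can poison a sorted free-page list: if the same node is inserted a second time, it can end up linked after itself, and any later traversal spins forever:)

#include <stdio.h>

struct free_page {
    unsigned pgno;
    struct free_page *next;
};

/* insert `node` into the list kept sorted by page number */
static void freelist_insert_sorted(struct free_page **head, struct free_page *node)
{
    struct free_page **pp = head;

    while (*pp != NULL && (*pp)->pgno < node->pgno)
        pp = &(*pp)->next;
    node->next = *pp;   /* if this page was already freed, *pp may be `node`  */
    *pp = node;         /* itself - node->next == node and the list now loops */
}

int main(void)
{
    struct free_page *head = NULL;
    struct free_page a = { 3, NULL }, b = { 7, NULL };
    struct free_page *p;
    int steps = 0;

    freelist_insert_sorted(&head, &a);
    freelist_insert_sorted(&head, &b);
    freelist_insert_sorted(&head, &b);          /* the "double free" */

    /* a traversal that would normally terminate now spins on page 7 forever;
     * cap it so the demo finishes */
    for (p = head; p != NULL && steps < 10; p = p->next, steps++)
        printf("pgno %u\n", p->pgno);
    return 0;
}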

Comment 114 wibrown@redhat.com 2017-07-27 00:02:02 UTC
That's great news that you have made progress on this. I know it's not my area of expertise, but I've seen issues like this where, even though the freelist is protected by a lock, the *caller* of the freelist may or may not be fully synced at the CPU cache level. So the caller who is inserting or retrieving the pages may still have a reference to the page, and may still be updating it. Perhaps this is the place to look? What if it's some kind of unlocked optimisation?

Anyway, glad to know you are making progress :)

Comment 115 Petr Kubat 2017-08-01 07:06:43 UTC
>Perhaps this is the place to look? What if it's some kind of unlocked optimisation? 

I have tried compiling both libdb and the DS without any optimisations back when I got the first reproducer for the issue and neither helped in any way. I have also tried to do the same now with your srpm of the DS. Unfortunately no luck there either.

It seems though that the hangs are not confined to a specific part of the code. After introducing some more debugging output that was supposed to help me with the problem I described in comment 113, I now see a completely different hang that happens while calling `plugin_startall`. This time it seems a single thread is looping while trying to run `get_ids_from_disk`. I will have to look through this further to provide more information.

Comment 120 Petr Kubat 2017-08-17 11:03:09 UTC
>To me it looks like multiple threads are accessing (and modifying) the freelist at the same time but that should not happen as the access is controlled via a lock on the meta page.

Actually, this would be the case IF the locking subsystem was enabled. Looking at env->open_flags at the time of the hang, one can see that the libdb environment was not opened with DB_INIT_LOCK, hence the critical section is not protected at all, leaving the threads to modify the freelist as they choose and resulting in duplicated entries when timed right...

I guess this will not be connected to the original issue after all (as I think I have seen the locking subsystem enabled in earlier reports?), but can someone point me to where I can configure the DS to use locking, just to make sure?
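
(For reference, whether locking is enabled can also be checked at runtime through the public API; a minimal sketch, assuming a handle to the already-opened environment is available:)

#include <db.h>
#include <stdio.h>

void report_locking(DB_ENV *env)
{
    u_int32_t flags;

    env->get_open_flags(env, &flags);   /* flags passed to DB_ENV->open() */
    printf("DB_INIT_LOCK is %s\n",
           (flags & DB_INIT_LOCK) ? "enabled" : "disabled");
}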

Comment 121 Ludwig 2017-08-17 11:38:50 UTC
ds opens the env with locking if txns are enabled:

    if (priv->dblayer_enable_transactions) {
        open_flags |= (DB_INIT_TXN | DB_INIT_LOG | DB_INIT_LOCK);

and later 

 return_value = (pEnv->dblayer_DB_ENV->open)(
            pEnv->dblayer_DB_ENV,
            region_dir,
            open_flags,
            priv->dblayer_file_mode
            );

So I don't know how you ran into a state without locking enabled.

Comment 122 Petr Kubat 2017-08-18 07:24:30 UTC
>so I don't know how you did run into a state without locking enabled

All I am doing is running the basic testsuite with the srpm provided in comment 100.

A quick grep of the 389-ds code revealed a few lines like this one:

  ldbm_config_internal_set(li, CONFIG_DB_TRANSACTION_LOGGING, "off");

inside `ldap/servers/slapd/back-ldbm/ldif2ldbm.c`. Given that the hang occurs during the `test_basic_import_export` test case I guess the transactions and locking are getting disabled during the import.

Comment 123 Ludwig 2017-08-21 07:08:44 UTC
Yes, that is an import of a database which wipes out all the previous content; it fails or succeeds as a whole, and we don't use txns there.

But this is only for the import mode, and I think all the hangs we have seen were in "normal" mode.

Comment 124 Petr Kubat 2017-08-21 13:49:55 UTC
>But this is only for the import mode, and I think all the hangs we have seen was in "normal" mode

Yes, that is my understanding as well. Which means the issue reported by William has nothing in common with the original bug, as it is easily fixed just by using locking during the import.

Unfortunate, back to square one.

Comment 125 thierry bordaz 2017-09-04 11:51:00 UTC
Just a heads up, I wonder if a customer is not hitting this issue (https://pagure.io/389-ds-base/issue/49360). The backtrace matches https://bugzilla.redhat.com/show_bug.cgi?id=1349779#c1 and https://bugzilla.redhat.com/show_bug.cgi?id=1349779#c32

Is there an "easy" way to confirm that #49360 hit this bug?

Comment 182 errata-xmlrpc 2018-04-10 17:44:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0952

