Bug 1347800

Summary: possible lock in libdb while reading/writing indexes.
Product: Red Hat Enterprise Linux 7 Reporter: German Parente <gparente>
Component: 389-ds-baseAssignee: Noriko Hosoi <nhosoi>
Status: CLOSED DUPLICATE QA Contact: Viktor Ashirov <vashirov>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 7.2CC: afarley, clasohm, fjayalat, glamb, gparente, hmiura, jnansi, lkrispen, nkinder, rmeggins, tmihinto
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-11-04 23:12:26 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
case 01649493 1
none
case 01649493 2
none
case 01550343 none

Description German Parente 2016-06-17 16:44:50 UTC
Description of problem:

I apologize for logging this bug with no much information than the pstack.

But this week and former week, I have seen three different cases, no use case related sharing the same issue and more or less same pstacks.

Once case is ipa and two others pure RHDS. 

The cases are: 

01649493: [PROACTIVE] Production Upgrade/Cutover Case
=> experiencing hangs (pure rhds case with high load)

01550343: Kerberos stops working
=> ipa with high load.

01632462: Improper replication from single Supplier to all connected Consumers in Muti-Master and Multiple-Instance environment. Second master reported "db error - -30993 BDB0068 DB_LOCK_DEADLOCK" and no errors in consumers
=> pure RHDS ( high load, lots of issues, related to bz#1339693 DB_LOCK_DEADLOCK error prevents the replication of a few entries.

The pattern is always the same:

- several threads doing a search and locked at reading indexes
- one thread doing a modify and updating indexes
- at least another thread doing another modify and locking the backend.

We don't have the db_stat -C A  yet but we have asked in the three cases.


Version-Release number of selected component (if applicable): 

389-ds-base-1.3.4.0-30.el7_2.x86_64, 389-ds-base-1.3.4.0-26.el7_2.x86_64, 389-ds-base-1.3.4.0-30.el7.bz1310848.x86_64 (fix for skipped repl. changes)

I don't really know if this is a bug but the fact of having three customers claiming a hang and showing the same pattern makes it suspicious and interesting to check.

How reproducible: very difficult to reproduce.

Showing here one of the three pstacks with the mentioned pattern (case 01649493)
==========================================================================

replication doing an internal modify and locking backend.

Thread 27 (Thread 0x7fab0bfff700 (LWP 931)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007fabd93892c3 in PR_EnterMonitor (mon=0x7fabdd391de0) at ../../../nspr/pr/src/pthreads/ptsynch.c:592
#2  0x00007fabd0a124c2 in dblayer_lock_backend (be=<optimized out>) at ldap/servers/slapd/back-ldbm/dblayer.c:3904
#3  0x00007fabd0a173e6 in dblayer_txn_begin (be=0x7fabdd39ba00, parent_txn=0x0, txn=txn@entry=0x7fab0bffc8a0) at ldap/servers/slapd/back-ldbm/dblayer.c:3623
#4  0x00007fabd0a53007 in ldbm_back_modify (pb=<optimized out>) at ldap/servers/slapd/back-ldbm/ldbm_modify.c:531
#5  0x00007fabdb1cd4e1 in op_shared_modify (pb=pb@entry=0x7fab0000ddf0, pw_change=pw_change@entry=0, old_pw=0x0) at ldap/servers/slapd/modify.c:1054
#6  0x00007fabdb1cdfa4 in modify_internal_pb (pb=pb@entry=0x7fab0000ddf0) at ldap/servers/slapd/modify.c:599
#7  0x00007fabdb1ceaa3 in slapi_modify_internal_pb (pb=pb@entry=0x7fab0000ddf0) at ldap/servers/slapd/modify.c:454
#8  0x00007fabd0793314 in replica_write_ruv (r=r@entry=0x7fabdd606100) at ldap/servers/plugins/replication/repl5_replica.c:2894
#9  0x00007fabd079359d in replica_update_state (when=<optimized out>, arg=<optimized out>) at ldap/servers/plugins/replication/repl5_replica.c:2827
#10 0x00007fabdb1a826a in eq_call_all () at ldap/servers/slapd/eventq.c:283
#11 eq_loop (arg=<optimized out>) at ldap/servers/slapd/eventq.c:330
#12 0x00007fabd938e7bb in _pt_root (arg=0x7fabdd5def70) at ../../../nspr/pr/src/pthreads/ptthread.c:212
#13 0x00007fabd8d25dc5 in start_thread (arg=0x7fab0bfff700) at pthread_create.c:308
#14 0x00007fabd8a4c28d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113


several threads reading indexes like this (23 threads in this case):

Thread 26 (Thread 0x7fab0b7fe700 (LWP 932)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007fabd32282f3 in __db_pthread_mutex_condwait (env=0x7fabdd38fb60, mutex=7477538062336, timespec=0x0, mutexp=<optimized out>) at ../../src/mutex/mut_pthread.c:321
#2  __db_hybrid_mutex_suspend (env=env@entry=0x7fabdd38fb60, mutex=mutex@entry=470113, timespec=timespec@entry=0x0, exclusive=exclusive@entry=1) at ../../src/mutex/mut_pthread.c:577
#3  0x00007fabd3227640 in __db_tas_mutex_lock_int (nowait=0, timeout=0, mutex=<optimized out>, env=0x7fabdd38fb60) at ../../src/mutex/mut_tas.c:271
#4  __db_tas_mutex_lock (env=env@entry=0x7fabdd38fb60, mutex=470113, timeout=timeout@entry=0) at ../../src/mutex/mut_tas.c:302
#5  0x00007fabd32d1d0a in __lock_get_internal (lt=lt@entry=0x7fabdd36e920, sh_locker=sh_locker@entry=0x7fabc8f26868, flags=flags@entry=0, obj=<optimized out>, lock_mode=<optimized out>, timeout=timeout@entry=0, lock=0x7fab0b7efe40) at ../../src/lock/lock.c:983
#6  0x00007fabd32d27f0 in __lock_get (env=env@entry=0x7fabdd38fb60, locker=0x7fabc8f26868, flags=0, obj=obj@entry=0x7faa635b1c90, lock_mode=<optimized out>, lock=lock@entry=0x7fab0b7efe40) at ../../src/lock/lock.c:463
#7  0x00007fabd32fe132 in __db_lget (dbc=dbc@entry=0x7faa635b1ba0, action=action@entry=0, pgno=17, mode=<optimized out>, mode@entry=DB_LOCK_READ, lkflags=lkflags@entry=0, lockp=lockp@entry=0x7fab0b7efe40) at ../../src/db/db_meta.c:1255
#8  0x00007fabd32455e5 in __bam_search (dbc=dbc@entry=0x7faa635b1ba0, root_pgno=1, root_pgno@entry=0, key=key@entry=0x7fab0b7f0120, flags=1409, slevel=slevel@entry=1, recnop=recnop@entry=0x0, exactp=exactp@entry=0x7fab0b7eff64) at ../../src/btree/bt_search.c:723
#9  0x00007fabd3230256 in __bamc_search (dbc=dbc@entry=0x7faa635b1ba0, root_pgno=root_pgno@entry=0, key=0x7fab0b7f0120, flags=<optimized out>, exactp=0x7fab0b7eff64) at ../../src/btree/bt_cursor.c:2804
#10 0x00007fabd3231d0f in __bamc_get (dbc=0x7faa635b1ba0, key=<optimized out>, data=<optimized out>, flags=<optimized out>, pgnop=0x7fab0b7f0004) at ../../src/btree/bt_cursor.c:1099
#11 0x00007fabd32eac76 in __dbc_iget (dbc=0x7faa62a6cf50, key=0x7fab0b7f0120, data=0x7fab0b7f0150, flags=26) at ../../src/db/db_cam.c:952
#12 0x00007fabd32eb44d in __dbc_get (dbc=dbc@entry=0x7faa62a6cf50, key=key@entry=0x7fab0b7f0120, data=data@entry=0x7fab0b7f0150, flags=flags@entry=2074) at ../../src/db/db_cam.c:770
#13 0x00007fabd32f9af2 in __dbc_get_pp (dbc=0x7faa62a6cf50, key=0x7fab0b7f0120, data=0x7fab0b7f0150, flags=2074) at ../../src/db/db_iface.c:2359
#14 0x00007fabd0a23680 in idl_new_fetch (be=0x7fabdd39ba00, db=<optimized out>, inkey=0x7fab0b7f22a0, txn=0x0, a=0x7fabdd38c270, flag_err=0x7fab0b7f6ecc, allidslimit=20000) at ldap/servers/slapd/back-ldbm/idl_new.c:202
#15 0x00007fabd0a23305 in idl_fetch_ext (be=be@entry=0x7fabdd39ba00, db=<optimized out>, key=key@entry=0x7fab0b7f22a0, txn=txn@entry=0x0, a=<optimized out>, err=err@entry=0x7fab0b7f6ecc, allidslimit=allidslimit@entry=20000) at ldap/servers/slapd/back-ldbm/idl_shim.c:101
#16 0x00007fabd0a31b26 in index_read_ext_allids (pb=pb@entry=0x7fab0b7fdad0, be=be@entry=0x7fabdd39ba00, type=type@entry=0x7faaf1dad7a0 "ptcwebaccountstatus", indextype=indextype@entry=0x7fabd0a78954 "sub", val=<optimized out>, txn=txn@entry=0x7fab0b7f65d0, err=err@entry=0x7fab0b7f6ecc, unindexed=unindexed@entry=0x7fab0b7f65b0, allidslimit=allidslimit@entry=20000) at ldap/servers/slapd/back-ldbm/index.c:1028
#17 0x00007fabd0a1c284 in keys2idl (pb=pb@entry=0x7fab0b7fdad0, be=be@entry=0x7fabdd39ba00, type=0x7faaf1dad7a0 "ptcwebaccountstatus", indextype=0x7fabd0a78954 "sub", ivals=<optimized out>, err=err@entry=0x7fab0b7f6ecc, unindexed=unindexed@entry=0x7fab0b7f65b0, txn=txn@entry=0x7fab0b7f65d0, allidslimit=allidslimit@entry=20000) at ldap/servers/slapd/back-ldbm/filterindex.c:977
#18 0x00007fabd0a1d855 in substring_candidates (allidslimit=20000, err=0x7fab0b7f6ecc, f=0x7fabdd39ba00, be=0x7fabdd39ba00, pb=0x7fab0b7fdad0) at ldap/servers/slapd/back-ldbm/filterindex.c:943
#19 filter_candidates_ext (pb=pb@entry=0x7fab0b7fdad0, be=be@entry=0x7fabdd39ba00, base=base@entry=0x7faa60137f00 "ou=PTCRoot,dc=ptc,dc=com", f=f@entry=0x7faa6242dd50, nextf=nextf@entry=0x0, range=range@entry=0, err=err@entry=0x7fab0b7f6ecc, allidslimit=allidslimit@entry=20000) at ldap/servers/slapd/back-ldbm/filterindex.c:106
#20 0x00007fabd0a1e096 in list_candidates (pb=pb@entry=0x7fab0b7fdad0, be=be@entry=0x7fabdd39ba00, base=base@entry=0x7faa60137f00 "ou=PTCRoot,dc=ptc,dc=com", flist=flist@entry=0x7faa635b1e70, ftype=<optimized out>, err=0x7fab0b7f6ecc, allidslimit=20000) at ldap/servers/slapd/back-ldbm/filterindex.c:808
#21 0x00007fabd0a1cf40 in filter_candidates_ext (pb=pb@entry=0x7fab0b7fdad0, be=be@entry=0x7fabdd39ba00, base=base@entry=0x7faa60137f00 "ou=PTCRoot,dc=ptc,dc=com", f=f@entry=0x7faa635b1e70, nextf=nextf@entry=0x0, range=range@entry=0, err=err@entry=0x7fab0b7f6ecc, allidslimit=allidslimit@entry=20000) at ldap/servers/slapd/back-ldbm/filterindex.c:144
#22 0x00007fabd0a1e096 in list_candidates (pb=pb@entry=0x7fab0b7fdad0, be=be@entry=0x7fabdd39ba00, base=base@entry=0x7faa60137f00 "ou=PTCRoot,dc=ptc,dc=com", flist=flist@entry=0x7faaf1bc61d0, ftype=<optimized out>, err=0x7fab0b7f6ecc, allidslimit=20000) at ldap/servers/slapd/back-ldbm/filterindex.c:808
#23 0x00007fabd0a1cf40 in filter_candidates_ext (pb=pb@entry=0x7fab0b7fdad0, be=be@entry=0x7fabdd39ba00, base=base@entry=0x7faa60137f00 "ou=PTCRoot,dc=ptc,dc=com", f=0x7faaf1bc61d0, nextf=nextf@entry=0x0, range=range@entry=0, err=err@entry=0x7fab0b7f6ecc, allidslimit=allidslimit@entry=20000) at ldap/servers/slapd/back-ldbm/filterindex.c:144
#24 0x00007fabd0a5880c in subtree_candidates (pb=pb@entry=0x7fab0b7fdad0, be=0x7fabdd39ba00, base=base@entry=0x7faa60137f00 "ou=PTCRoot,dc=ptc,dc=com", e=e@entry=0x7fabdd15f260, filter=0x7faaf1bc61d0, managedsait=1, allids_before_scopingp=allids_before_scopingp@entry=0x7fab0b7f6ebc, err=err@entry=0x7fab0b7f6ecc) at ldap/servers/slapd/back-ldbm/ldbm_search.c:1208
#25 0x00007fabd0a59e7f in build_candidate_list (candidates=0x7fab0b7f6ef8, lookup_returned_allidsp=0x7fab0b7f6ebc, scope=<optimized out>, base=0x7faa60137f00 "ou=PTCRoot,dc=ptc,dc=com", e=<optimized out>, be=<optimized out>, pb=0x7fab0b7fdad0) at ldap/servers/slapd/back-ldbm/ldbm_search.c:1015
#26 ldbm_back_search (pb=0x7fab0b7fdad0) at ldap/servers/slapd/back-ldbm/ldbm_search.c:657
#27 0x00007fabdb1d4176 in op_shared_search (pb=pb@entry=0x7fab0b7fdad0, send_result=send_result@entry=1) at ldap/servers/slapd/opshared.c:802
#28 0x00007fabdb6cb74e in do_search (pb=pb@entry=0x7fab0b7fdad0) at ldap/servers/slapd/search.c:349
#29 0x00007fabdb6b9567 in connection_dispatch_operation (pb=0x7fab0b7fdad0, op=0x7fabddb04f30, conn=0x7fabdd652bd0) at ldap/servers/slapd/connection.c:643
#30 connection_threadmain () at ldap/servers/slapd/connection.c:1751
#31 0x00007fabd938e7bb in _pt_root (arg=0x7fabdd94ea20) at ../../../nspr/pr/src/pthreads/ptthread.c:212
#32 0x00007fabd8d25dc5 in start_thread (arg=0x7fab0b7fe700) at pthread_create.c:308
#33 0x00007fabd8a4c28d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113


one mod/add updating the indexes:

Thread 19 (Thread 0x7faaff7fe700 (LWP 939)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007fabd32282f3 in __db_pthread_mutex_condwait (env=0x7fabdd38fb60, mutex=140372407707216, timespec=0x0, mutexp=<optimized out>) at ../../src/mutex/mut_pthread.c:321
#2  __db_hybrid_mutex_suspend (env=env@entry=0x7fabdd38fb60, mutex=mutex@entry=512626, timespec=timespec@entry=0x0, exclusive=exclusive@entry=1) at ../../src/mutex/mut_pthread.c:577
#3  0x00007fabd3227640 in __db_tas_mutex_lock_int (nowait=0, timeout=0, mutex=<optimized out>, env=0x7fabdd38fb60) at ../../src/mutex/mut_tas.c:271
#4  __db_tas_mutex_lock (env=env@entry=0x7fabdd38fb60, mutex=512626, timeout=timeout@entry=0) at ../../src/mutex/mut_tas.c:302
#5  0x00007fabd3341dfb in __memp_fget (dbmfp=dbmfp@entry=0x7faab4042150, pgnoaddr=pgnoaddr@entry=0x7faaff7f696c, ip=0x0, txn=0x7faa8ea33820, flags=0, flags@entry=2, addrp=addrp@entry=0x7faaff7f6978) at ../../src/mp/mp_fget.c:311
#6  0x00007fabd32451f1 in __bam_search (dbc=dbc@entry=0x7faa7e1563c0, root_pgno=root_pgno@entry=24, key=key@entry=0x7faaff7f6d40, flags=12802, slevel=slevel@entry=1, recnop=recnop@entry=0x0, exactp=exactp@entry=0x7faaff7f6b2c) at ../../src/btree/bt_search.c:806
#7  0x00007fabd3230256 in __bamc_search (dbc=dbc@entry=0x7faa7e1563c0, root_pgno=24, key=key@entry=0x7faaff7f6d40, flags=flags@entry=19, exactp=exactp@entry=0x7faaff7f6b2c) at ../../src/btree/bt_cursor.c:2804
#8  0x00007fabd32346c4 in __bamc_put (dbc=0x7faa7e1563c0, key=0x7faaff7f6e20, data=0x7faaff7f6d40, flags=19, pgnop=0x0) at ../../src/btree/bt_cursor.c:2114
#9  0x00007fabd32eb6a5 in __dbc_iput (dbc=0x7faab4042600, key=0x7faaff7f6e20, data=0x7faaff7f6d40, flags=19) at ../../src/db/db_cam.c:2155
#10 0x00007fabd32ed69d in __dbc_put (dbc=<optimized out>, key=key@entry=0x7faaff7f6e20, data=data@entry=0x7faaff7f6d40, flags=<optimized out>, flags@entry=19) at ../../src/db/db_cam.c:2049
#11 0x00007fabd32e69de in __db_put (dbp=dbp@entry=0x7faab4041a40, ip=<optimized out>, txn=<optimized out>, key=key@entry=0x7faaff7f6e20, data=data@entry=0x7faaff7f6d40, flags=flags@entry=19) at ../../src/db/db_am.c:583
#12 0x00007fabd32fbf94 in __db_put_pp (dbp=0x7faab4041a40, txn=0x7faa8ea33820, key=0x7faaff7f6e20, data=0x7faaff7f6d40, flags=<optimized out>) at ../../src/db/db_iface.c:1661
#13 0x00007fabd0a244cf in idl_new_insert_key (be=<optimized out>, db=<optimized out>, key=<optimized out>, id=7102, txn=<optimized out>, a=<optimized out>, disposition=0x0) at ldap/servers/slapd/back-ldbm/idl_new.c:801
#14 0x00007fabd0a23355 in idl_insert_key (be=be@entry=0x7fabdd39ba00, db=db@entry=0x7faab4041a40, key=key@entry=0x7faaff7f6e20, id=id@entry=7102, txn=txn@entry=0x7faa8ea33820, a=a@entry=0x7fabdd38c270, disposition=disposition@entry=0x0) at ldap/servers/slapd/back-ldbm/idl_shim.c:115
#15 0x00007fabd0a335b5 in addordel_values_sv (be=be@entry=0x7fabdd39ba00, db=0x7faab4041a40, indextype=<optimized out>, vals=<optimized out>, id=id@entry=7102, flags=flags@entry=1, txn=txn@entry=0x7faaff7f9670, a=0x7fabdd38c270, idl_disposition=idl_disposition@entry=0x0, buffer_handle=buffer_handle@entry=0x0, type=<optimized out>) at ldap/servers/slapd/back-ldbm/index.c:1922
#16 0x00007fabd0a340a2 in index_addordel_values_ext_sv (be=be@entry=0x7fabdd39ba00, type=<optimized out>, vals=0x7faa8f0a27a0, evals=evals@entry=0x0, id=id@entry=7102, flags=flags@entry=1, txn=txn@entry=0x7faaff7f9670, idl_disposition=idl_disposition@entry=0x0, buffer_handle=buffer_handle@entry=0x0) at ldap/servers/slapd/back-ldbm/index.c:2116
#17 0x00007fabd0a34694 in index_addordel_values_sv (be=be@entry=0x7fabdd39ba00, type=<optimized out>, vals=<optimized out>, evals=evals@entry=0x0, id=id@entry=7102, flags=flags@entry=1, txn=txn@entry=0x7faaff7f9670) at ldap/servers/slapd/back-ldbm/index.c:1981
#18 0x00007fabd0a34ebb in index_add_mods (be=0x7fabdd39ba00, mods=<optimized out>, olde=0x7faae0eddef0, newe=0x7faa8f0a2450, txn=txn@entry=0x7faaff7f9670) at ldap/servers/slapd/back-ldbm/index.c:641
#19 0x00007fabd0a51e99 in ldbm_back_modify (pb=<optimized out>) at ldap/servers/slapd/back-ldbm/ldbm_modify.c:705
#20 0x00007fabdb1cd4e1 in op_shared_modify (pb=pb@entry=0x7faaff7fdad0, pw_change=pw_change@entry=0, old_pw=0x0) at ldap/servers/slapd/modify.c:1054
#21 0x00007fabdb1ce82f in do_modify (pb=pb@entry=0x7faaff7fdad0) at ldap/servers/slapd/modify.c:387
#22 0x00007fabdb6b9320 in connection_dispatch_operation (pb=0x7faaff7fdad0, op=0x7fabddb21bf0, conn=0x7fabdd650fb0) at ldap/servers/slapd/connection.c:619
#23 connection_threadmain () at ldap/servers/slapd/connection.c:1751
#24 0x00007fabd938e7bb in _pt_root (arg=0x7fabdd9505b0) at ../../../nspr/pr/src/pthreads/ptthread.c:212
#25 0x00007fabd8d25dc5 in start_thread (arg=0x7faaff7fe700) at pthread_create.c:308
#26 0x00007fabd8a4c28d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

=================================================================

Additional info:

Ludwig has taken a look at the pstack in case 01649493

Comment 2 German Parente 2016-06-17 16:57:43 UTC
Created attachment 1169181 [details]
case 01649493 1

Comment 3 German Parente 2016-06-17 16:58:15 UTC
Created attachment 1169182 [details]
case 01649493 2

Comment 4 German Parente 2016-06-17 16:58:44 UTC
Created attachment 1169183 [details]
case 01550343

Comment 19 Ludwig 2016-06-24 09:30:41 UTC
I have opened a bug for libdb for this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1349779

we now need to wait for feedback from bdb team or oracle

Comment 28 Ludwig 2016-08-25 11:46:13 UTC
I don't see the patch, bu I think there are two things to clarify:
- is the customer willing to apply the patch and reproduce ? This would mean to recreate the index causing problems and backing out other patche they have (connection hang)
- ask the libdb maintiners to get a testbuild with the patch.

As far as I noticed, customer had no longer seen the hangs and was the only deployment where the problem consistently showed. Without the we have no environment to test the patch

Comment 29 Noriko Hosoi 2016-11-04 18:26:58 UTC
Hello, German and Gary.

Ludwig noted this bug 1347800 is the consequence of the bug in libdb:
https://bugzilla.redhat.com/show_bug.cgi?id=1349779

May we close this bug as a duplicate of bz1349779?

Thanks.

Comment 30 German Parente 2016-11-04 22:47:50 UTC
Hi,

Yes. This should be closed since it's a consequence of libdb bug. 

Thanks. German.