Bug 1751295

Summary: When sync-repl is enabled, slapi-nis can deadlock during retrochanglog trimming
Product: Red Hat Enterprise Linux 8 Reporter: thierry bordaz <tbordaz>
Component: slapi-nisAssignee: thierry bordaz <tbordaz>
Status: CLOSED ERRATA QA Contact: ipa-qe <ipa-qe>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: ---CC: abokovoy, afarley, aheverle, brokenjacobs, ddas, gregor.jocher, jaredl, ksiddiqu, ldelouw, msauton, myusuf, pcech, rickatnight11, tmihinto
Target Milestone: rcKeywords: Triaged
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: slapi-nis-0.56.5 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1820124 (view as bug list) Environment:
Last Closed: 2020-11-04 02:50:13 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1820124    

Description thierry bordaz 2019-09-11 16:15:46 UTC
Description of problem:
The problem occurs because of two threads trying to acquire locks in the opposite order

Version-Release number of selected component (if applicable):
Since 0.56.0-13 

How reproducible:
A high update rate with an high number of updates to trim in the retroCL

Steps to Reproduce:
1. disable retroCL trim (maxage=-1)
2. fill a high number of dummy updates (to have lot of records in retroCL)
3. enable retroCL triming let's say with 1h interval and a maxage=5s to give time to apply load (step 4) and have a lot of entries to trim
4. apply dummy updates load

Actual results:
sometime it deadlocks


Expected results:
Should not deadlock

Additional info:

Comment 2 thierry bordaz 2019-09-11 16:22:18 UTC
This bug is a side effect 1435663. 1435663 acquires the slapi-nis lock from bepreop --> bepostop, so it is held during possible updates triggered by betxn plugins.

Comment 3 thierry bordaz 2019-09-11 16:23:01 UTC
    Slapi-nis deadlock can happen between two threads when sync-repl is enabled.
    It happens when a retroCL trimming thread is running at the same time that
    an update is registered in the retroCL.
    An example is
Thread 19 (Thread 0x7ff6e88e7700 (LWP 27945)):
#0  0x00007ff775e98a35 in pthread_cond_wait@@GLIBC_2.3.2 (..) at /lib64/libpthread.so.0
#1  0x00007ff7764ef483 in PR_EnterMonitor (..) at /lib64/libnspr4.so
#2  0x00007ff76a346f10 in dblayer_lock_backend (..) at ldap/servers/slapd/back-ldbm/dblayer.c:3537
#3  0x00007ff76a34c046 in dblayer_txn_begin (..) at ldap/servers/slapd/back-ldbm/dblayer.c:3255
#4  0x00007ff76a36cc71 in ldbm_back_add (..) at ldap/servers/slapd/back-ldbm/ldbm_add.c:268
#5  0x00007ff7786e47c1 in op_shared_add (..) at ldap/servers/slapd/add.c:679
#6  0x00007ff7786e4d39 in add_internal_pb (..) at ldap/servers/slapd/add.c:407
#7  0x00007ff7786e5a71 in slapi_add_internal_pb (..) at ldap/servers/slapd/add.c:332
#8  0x00007ff7689e1b13 in write_replog_db (..) at ldap/servers/plugins/retrocl/retrocl_po.c:364
#9  0x00007ff7689e1b13 in retrocl_postob (..) at ldap/servers/plugins/retrocl/retrocl_po.c:682
#10 0x00007ff778741028 in plugin_call_func (..) at ldap/servers/slapd/plugin.c:2028
#11 0x00007ff7787412e3 in plugin_call_list (..) at ldap/servers/slapd/plugin.c:1972
#12 0x00007ff7787412e3 in plugin_call_plugins (..) at ldap/servers/slapd/plugin.c:442
#13 0x00007ff76a388ac5 in ldbm_back_modify (..) at ldap/servers/slapd/back-ldbm/ldbm_modify.c:861
#14 0x00007ff77872d5a6 in op_shared_modify (..) at ldap/servers/slapd/modify.c:1021                                        
#15 0x00007ff77872eacb in do_modify (..) at ldap/servers/slapd/modify.c:391
#16 0x0000561a8a0a2684 in connection_dispatch_operation (..) at ldap/servers/slapd/connection.c:625
#17 0x0000561a8a0a2684 in connection_threadmain (..) at ldap/servers/slapd/connection.c:1788
#18 0x00007ff7764f4bfb in _pt_root (..) at /lib64/libnspr4.so
#19 0x00007ff775e94ea5 in start_thread (..) at /lib64/libpthread.so.0
#20 0x00007ff7755408cd in clone  at /lib64/libc.so.6

Thread 4 (Thread 0x7ff778dbb700 (LWP 28477)):
#0  0x00007ff775e9839e in pthread_rwlock_wrlock (..) at /lib64/libpthread.so.0
#1  0x00007ff7683a7639 in backend_be_pre_write_cb (..) at /usr/lib64/dirsrv/plugins/schemacompat-plugin.so
#2  0x00007ff778741028 in plugin_call_func (..) at ldap/servers/slapd/plugin.c:2028
#3  0x00007ff7787412e3 in plugin_call_list (..) at ldap/servers/slapd/plugin.c:1972
#4  0x00007ff7787412e3 in plugin_call_plugins (..) at ldap/servers/slapd/plugin.c:442
#5  0x00007ff76a37bc28 in ldbm_back_delete (..) at ldap/servers/slapd/back-ldbm/ldbm_delete.c:373
#6  0x00007ff7786f1feb in op_shared_delete (..) at ldap/servers/slapd/delete.c:324
#7  0x00007ff7786f21c8 in delete_internal_pb (..) at ldap/servers/slapd/delete.c:208
#8  0x00007ff7786f2443 in slapi_delete_internal_pb (..) at ldap/servers/slapd/delete.c:151
#9  0x00007ff7689e231e in delete_changerecord (..) at ldap/servers/plugins/retrocl/retrocl_trim.c:89
#10 0x00007ff7689e2605 in trim_changelog (..) at ldap/servers/plugins/retrocl/retrocl_trim.c:290
#11 0x00007ff7689e2605 in changelog_trim_thread_fn (..) at ldap/servers/plugins/retrocl/retrocl_trim.c:333
#12 0x00007ff7764f4bfb in _pt_root (..) at /lib64/libnspr4.so
#13 0x00007ff775e94ea5 in start_thread (..) at /lib64/libpthread.so.0
#14 0x00007ff7755408cd in clone (..) at /lib64/libc.so.6

    RetroCL being a BETXN_POSTOP, thread 19 holds the slapi-nis map rwlock (write) and then waits for retroCL backend lock.
    Thread 4 holds retroCL backend lock and then waits for slapi-nis map rwlock (write)
    The two threads are trying to acquire the locks in the opposite order

Comment 4 thierry bordaz 2019-09-11 16:24:12 UTC
As a side effect of 1435663, this bug exists since 0.56.0-12

Comment 5 thierry bordaz 2019-09-12 07:40:45 UTC
A valid workaround is to disable retroCL trimming

dn: cn=Retro Changelog Plugin,cn=plugins,cn=config
changetype: modify
replace: nsslapd-changelogmaxage
nsslapd-changelogmaxage: -1

Comment 10 Florence Blanc-Renaud 2020-02-14 15:42:08 UTC
Thank you taking your time and submitting this request for Red Hat Enterprise Linux 7. Unfortunately, this bug cannot be kept even as a stretch goal and was postponed to RHEL8.

Comment 11 thierry bordaz 2020-02-21 16:39:43 UTC
patch being pushed upstream (0.56.4) -> POST

Comment 14 thierry bordaz 2020-03-13 10:45:00 UTC
The patch is incomplete as like describe below => ASSIGNED


Thread 18 (Thread 0x7f9378fff700 (LWP 774)):
#0  0x00007f939141dd45 in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libpthread.so.0
#1  0x00007f939147cfcb in PR_EnterMonitor () at /lib64/libnspr4.so
#2  0x00007f938cfa65fe in dblayer_txn_begin () at /usr/lib64/dirsrv/plugins/libback-ldbm.so
#3  0x00007f938cfb7a03 in ldbm_back_add () at /usr/lib64/dirsrv/plugins/libback-ldbm.so
#4  0x00007f9391709fb5 in op_shared_add () at /usr/lib64/dirsrv/libslapd.so.0
#5  0x00007f939170a42f in add_internal_pb () at /usr/lib64/dirsrv/libslapd.so.0
#6  0x00007f939170b1c7 in slapi_add_internal_pb () at /usr/lib64/dirsrv/libslapd.so.0
#7  0x00007f938ce1efc5 in retrocl_postob () at /usr/lib64/dirsrv/plugins/libretrocl-plugin.so
#8  0x00007f939176cb09 in plugin_call_func () at /usr/lib64/dirsrv/libslapd.so.0
#9  0x00007f939176cd4b in plugin_call_plugins () at /usr/lib64/dirsrv/libslapd.so.0
#10 0x00007f938cfd43d0 in ldbm_back_modify () at /usr/lib64/dirsrv/plugins/libback-ldbm.so
#11 0x00007f9391758ac7 in op_shared_modify () at /usr/lib64/dirsrv/libslapd.so.0
#12 0x00007f9391759b0f in do_modify () at /usr/lib64/dirsrv/libslapd.so.0
#13 0x0000558aeaf2d4ed in connection_threadmain ()
#14 0x00007f9391482b34 in _pt_root () at /lib64/libnspr4.so
#15 0x00007f93914174e2 in start_thread () at /lib64/libpthread.so.0
#16 0x00007f9391291693 in clone () at /lib64/libc.so.6 

Thread 43 (Thread 0x7f9380e4f700 (LWP 43762)):
#0  0x00007f939141c1ec in pthread_rwlock_rdlock () at /lib64/libpthread.so.0
#1  0x00007f938cdf2bd1 in map_rdlock () at /usr/lib64/dirsrv/plugins/schemacompat-plugin.so
#2  0x00007f938cde0337 in backend_write_cb.isra.0.part () at /usr/lib64/dirsrv/plugins/schemacompat-plugin.so
#3  0x00007f938cde041c in backend_betxn_pre_write_cb () at /usr/lib64/dirsrv/plugins/schemacompat-plugin.so
#4  0x00007f939176cb09 in plugin_call_func () at /usr/lib64/dirsrv/libslapd.so.0
#5  0x00007f939176cd4b in plugin_call_plugins () at /usr/lib64/dirsrv/libslapd.so.0
#6  0x00007f938cfc6686 in ldbm_back_delete () at /usr/lib64/dirsrv/plugins/libback-ldbm.so
#7  0x00007f939171960c in op_shared_delete () at /usr/lib64/dirsrv/libslapd.so.0
#8  0x00007f939171981e in delete_internal_pb () at /usr/lib64/dirsrv/libslapd.so.0
#9  0x00007f938ce1f83f in delete_changerecord () at /usr/lib64/dirsrv/plugins/libretrocl-plugin.so
#10 0x00007f938ce1fa05 in changelog_trim_thread_fn () at /usr/lib64/dirsrv/plugins/libretrocl-plugin.so
#11 0x00007f9391482b34 in _pt_root () at /lib64/libnspr4.so
#12 0x00007f93914174e2 in start_thread () at /lib64/libpthread.so.0
#13 0x00007f9391291693 in clone () at /lib64/libc.so.6 

    RetroCL being a BETXN_POSTOP, thread 18 holds the slapi-nis map rwlock (write) and then waits for retroCL backend lock.
    Thread 43 holds retroCL backend lock and then waits for slapi-nis map rwlock (write).
    The two threads are trying to acquire the locks in the opposite order

   Thread 43 waits for slapi-nis map rwlock in BETXN callback (backend_betxn_pre_write_cb).
   The original deadlock occurred with the same pattern but with BE callback (backend_be_pre_write_cb) and the fix that applied to BE also applies to BETXN :(

Comment 24 Mohammad Rizwan 2020-08-10 07:39:23 UTC
Sanity check with the following 2 tests:

version:
slapi-nis-0.56.5-2.module+el8.3.0+6580+328a3362.x86_64
ipa-server-4.8.7-8.module+el8.3.0+7513+a375844a.x86_64

1.
2020-08-08T02:26:49 [ci-vm-10-0-106-77.ho]  +-----------------------------------------------------------------------------------------+
2020-08-08T02:26:49 [ci-vm-10-0-106-77.ho]      Test:[/ipa-server/rhel80/ipa-nis-integration/root]: [ Pass(164/164): 100% ] 
2020-08-08T02:26:49 [ci-vm-10-0-106-77.ho]  +-----------------------------------------------------------------------------------------+

2.
2020-08-08T02:09:26 [ci-vm-10-0-107-225.h]  +-----------------------------------------------------------------------------------------+
2020-08-08T02:09:26 [ci-vm-10-0-107-225.h]      Test:[/ipa-server/rhel80/ds-migration-functional/root]: [ Pass(1/1): 100% ] 
2020-08-08T02:09:26 [ci-vm-10-0-107-225.h]  +-----------------------------------------------------------------------------------------+

3.
2020-08-08T02:15:40+0000 [ci-vm-10-0-107-79.ho]  +-----------------------------------------------------------------------------------------+
2020-08-08T02:15:40+0000 [ci-vm-10-0-107-79.ho]      Test:[/ipa-server/rhel80/ds-migration/root]: [ Pass(1/1): 100% ] 
2020-08-08T02:15:40+0000 [ci-vm-10-0-107-79.ho]  +-----------------------------------------------------------------------------------------+

Comment 28 errata-xmlrpc 2020-11-04 02:50:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: idm:DL1 and idm:client security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:4670