Created attachment 1337536 [details]
reproducer including sample LDIF and test scripts

Description of problem:
In a replication environment where
- users and groups are stored in different backends,
- the memberOf plugin is enabled with memberofallbackends=on, and
- MMR fractional replication excludes the memberOf attribute,

continuously deleting many users who belong to multiple groups causes a deadlock between the user deletion and the removal of that user from group members (which is triggered by the memberOf plugin).

Here are the stacks of the two threads that deadlock:

Thread 18 (Thread 0x7f96e1136700 (LWP 11750)):
#0  0x00007f97547086d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f9754d5f463 in PR_EnterMonitor () from /lib64/libnspr4.so
#2  0x00007f974af80116 in dblayer_txn_begin () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
#3  0x00007f974afbb4e8 in ldbm_back_modify () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
#4  0x00007f97569d69cb in op_shared_modify () from /usr/lib64/dirsrv/libslapd.so.0
#5  0x00007f97569d7544 in modify_internal_pb () from /usr/lib64/dirsrv/libslapd.so.0
#6  0x00007f974a68b525 in memberof_del_dn_type_callback () from /usr/lib64/dirsrv/plugins/libmemberof-plugin.so
#7  0x00007f97569febad in send_ldap_search_entry_ext () from /usr/lib64/dirsrv/libslapd.so.0
#8  0x00007f97569ff3ac in send_ldap_search_entry () from /usr/lib64/dirsrv/libslapd.so.0
#9  0x00007f97569dc091 in iterate.isra.0.constprop.3 () from /usr/lib64/dirsrv/libslapd.so.0
#10 0x00007f97569dc1da in send_results_ext.constprop.2 () from /usr/lib64/dirsrv/libslapd.so.0
#11 0x00007f97569ddc11 in op_shared_search () from /usr/lib64/dirsrv/libslapd.so.0
#12 0x00007f97569edc2e in search_internal_callback_pb () from /usr/lib64/dirsrv/libslapd.so.0
#13 0x00007f974a68a5fb in memberof_call_foreach_dn.isra.9 () from /usr/lib64/dirsrv/plugins/libmemberof-plugin.so
#14 0x00007f974a68b1b2 in memberof_del_dn_from_groups.isra.11 () from /usr/lib64/dirsrv/plugins/libmemberof-plugin.so
#15 0x00007f974a68e68d in memberof_postop_del () from /usr/lib64/dirsrv/plugins/libmemberof-plugin.so
#16 0x00007f97569e8c7b in plugin_call_func () from /usr/lib64/dirsrv/libslapd.so.0
#17 0x00007f97569e8f13 in plugin_call_plugins () from /usr/lib64/dirsrv/libslapd.so.0
#18 0x00007f974afaceab in ldbm_back_delete () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
#19 0x00007f975699bff0 in op_shared_delete () from /usr/lib64/dirsrv/libslapd.so.0
#20 0x00007f975699c372 in do_delete () from /usr/lib64/dirsrv/libslapd.so.0
#21 0x00007f97572d2972 in connection_threadmain ()
#22 0x00007f9754d6496b in _pt_root () from /lib64/libnspr4.so
#23 0x00007f9754704dc5 in start_thread () from /lib64/libpthread.so.0
#24 0x00007f9753fe773d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f96da929700 (LWP 11763)):
#0  0x00007f97547086d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f9754d5f463 in PR_EnterMonitor () from /lib64/libnspr4.so
#2  0x00007f974a68d16c in memberof_lock () from /usr/lib64/dirsrv/plugins/libmemberof-plugin.so
#3  0x00007f974a68da49 in memberof_postop_modify () from /usr/lib64/dirsrv/plugins/libmemberof-plugin.so
#4  0x00007f97569e8c7b in plugin_call_func () from /usr/lib64/dirsrv/libslapd.so.0
#5  0x00007f97569e8f13 in plugin_call_plugins () from /usr/lib64/dirsrv/libslapd.so.0
#6  0x00007f974afbb36c in ldbm_back_modify () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
#7  0x00007f97569d69cb in op_shared_modify () from /usr/lib64/dirsrv/libslapd.so.0
#8  0x00007f97569d7dfb in do_modify () from /usr/lib64/dirsrv/libslapd.so.0
#9  0x00007f97572d2955 in connection_threadmain ()
#10 0x00007f9754d6496b in _pt_root () from /lib64/libnspr4.so
#11 0x00007f9754704dc5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f9753fe773d in clone () from /lib64/libc.so.6

Version-Release number of selected component (if applicable):
RHDS-10 389-ds-base-1.3.6.1-19
RHEL-7.4

How reproducible:
Delete many users with ldapmodify.

Steps to Reproduce (test data and scripts are attached as deadlock-reproducer.zip):

1. Prepare two DS instances (say M1 and M2) with the following suffixes:

   dc=example,dc=com (parent suffix)
   +-- ou=People,dc=example,dc=com (sub-suffix)
   +-- ou=Groups,dc=example,dc=com (sub-suffix)

2. Enable the memberOf plugin with memberofallbackends=on, i.e.:

   dn: cn=MemberOf Plugin,cn=plugins,cn=config
   objectClass: top
   objectClass: nsSlapdPlugin
   objectClass: extensibleObject
   cn: MemberOf Plugin
   nsslapd-pluginPath: libmemberof-plugin
   nsslapd-pluginInitfunc: memberof_postop_init
   nsslapd-pluginType: betxnpostoperation
   nsslapd-pluginEnabled: on                 <<
   nsslapd-plugin-depends-on-type: database
   memberofgroupattr: member
   memberofattr: memberOf
   nsslapd-pluginId: memberof
   nsslapd-pluginVersion: 1.3.6.1
   nsslapd-pluginVendor: 389 Project
   nsslapd-pluginDescription: memberof plugin
   memberofallbackends: on                   <<

3. Increase the DB locks (x10), i.e.:

   dn: cn=config,cn=ldbm database,cn=plugins,cn=config
   ...
   nsslapd-db-locks: 100000
   ...

4. Configure two-master MMR fractional replication, per suffix, that excludes the memberOf attribute: create fractional replication agreements per suffix in both masters (6 in total), e.g.:

   dn: cn=p_to_m2,cn=replica,cn=ou\3DPeople\2Cdc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
   objectClass: top
   objectClass: nsDS5ReplicationAgreement
   description: p_to_m2
   cn: p_to_m2
   nsDS5ReplicaRoot: ou=People,dc=example,dc=com
   nsDS5ReplicaHost: rhel71ds.example.com
   nsDS5ReplicaPort: 2222
   nsDS5ReplicaBindDN: cn=Replication Manager,cn=replication,cn=config
   nsDS5ReplicaTransportInfo: LDAP
   nsDS5ReplicaBindMethod: SIMPLE
   nsDS5ReplicatedAttributeList: (objectclass=*) $ EXCLUDE memberOf   <<
   ...

5. Initialize the DB and replication: import example.ldif, people.ldif, and groups.ldif to M1, then initialize M2.

6. Run the test scripts against M1. "deadlockTest.sh" is the main test script, which calls the other scripts. Please adjust the following parameters at the beginning of the script to match your M1 instance before running:

   1 LOOP=$1
   2 HOST="localhost"               <<<
   3 PORT=1111                      <<<
   4 ROOTDN="cn=Directory Manager"  <<<
   5 PASSWORD="dirmanager"          <<<

   You can run the test by specifying the number of test users, e.g.:

   $ ./deadlockTest.sh 50

Test scenario:
1. Add 50 users.
2. Add each of these 50 users to 10 groups.
3. Delete the 50 users.
   => The script may get stuck during user deletion (a deadlock on M1).
   If the script completes without problems, check whether all of the user deletions were replicated to M2 (if not, the deadlock happened on M2).

Since this is a timing issue, you may need to run the test scripts several times to reproduce it.

Actual results:
A deadlock occurs on either M1 or M2.

Expected results:
No deadlock.

Additional info:
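For reference, the bulk LDIF used in steps 1 and 2 of the test scenario can be generated with a short script like the one below. This is only a sketch: the attached people.ldif/groups.ldif and shell scripts are the authoritative reproducer, and the DNs and attributes here (uid=testuserN, cn=testgroupN) are assumptions, not the exact attached data.

```python
# Sketch: generate LDIF for N test users under ou=People, plus modify
# operations adding every user to each of the test groups under ou=Groups.
# Entry names and attributes are illustrative assumptions.

def user_ldif(n_users):
    """LDIF add-entries for n_users test users under ou=People."""
    entries = []
    for i in range(1, n_users + 1):
        entries.append(
            "dn: uid=testuser{0},ou=People,dc=example,dc=com\n"
            "objectClass: inetOrgPerson\n"
            "uid: testuser{0}\n"
            "cn: Test User {0}\n"
            "sn: User{0}\n".format(i)
        )
    return "\n".join(entries)

def group_modify_ldif(n_users, n_groups):
    """LDIF modify operations adding every user to every group."""
    mods = []
    for g in range(1, n_groups + 1):
        lines = [
            "dn: cn=testgroup{0},ou=Groups,dc=example,dc=com\n"
            "changetype: modify\n"
            "add: member".format(g)
        ]
        for i in range(1, n_users + 1):
            lines.append(
                "member: uid=testuser{0},ou=People,dc=example,dc=com".format(i)
            )
        mods.append("\n".join(lines) + "\n")
    return "\n".join(mods)

if __name__ == "__main__":
    # Feed the output to ldapadd/ldapmodify against M1.
    print(user_ldif(50))
```

The generated output would then be piped to ldapadd/ldapmodify against M1, as the attached scripts do.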
If we implement the ticket below, it should resolve this particular deadlock:
https://pagure.io/389-ds-base/issue/48235
However, cross-accessed backends like this are more likely to cause these kinds of deadlocks with plugins. I'm just worried about other plugins, like Referential Integrity. Does the customer have a testing environment to try a potential hotfix?
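The stacks in the description look like a classic lock-order inversion: the DEL thread (Thread 18) presumably holds the memberOf plugin monitor and waits in dblayer_txn_begin for the other backend's transaction monitor, while the MOD thread (Thread 5) holds that transaction and waits in memberof_lock for the plugin monitor. Below is a minimal, illustrative Python sketch of this pattern; the lock names are stand-ins, not the server's actual code, and the timeouts exist only so the sketch detects rather than reproduces the hang.

```python
# Illustrative lock-order inversion: two threads acquire the same two
# locks in opposite orders. Lock names are stand-ins for the server's
# memberOf plugin monitor and backend txn monitor; not real server code.
import threading

memberof_lock = threading.Lock()  # stands in for the memberOf plugin monitor
txn_lock = threading.Lock()       # stands in for the backend txn monitor
barrier = threading.Barrier(2)    # ensure both threads hold their first lock
result = {}

def delete_side():
    # Like Thread 18: memberof_postop_del holds the plugin lock, then its
    # internal modify on another backend needs that backend's txn monitor.
    with memberof_lock:
        barrier.wait()
        result["delete"] = txn_lock.acquire(timeout=2.0)
        if result["delete"]:
            txn_lock.release()

def modify_side():
    # Like Thread 5: ldbm_back_modify holds the txn, then the
    # betxnpostoperation memberOf callback asks for the plugin lock.
    with txn_lock:
        barrier.wait()
        result["modify"] = memberof_lock.acquire(timeout=0.5)
        if result["modify"]:
            memberof_lock.release()

t1 = threading.Thread(target=delete_side)
t2 = threading.Thread(target=modify_side)
t1.start(); t2.start()
t1.join(); t2.join()

# With opposite acquisition orders, the modify side's acquire times out,
# because each thread holds the lock the other needs.
deadlock_detected = not (result["delete"] and result["modify"])
print("lock-order inversion detected:", deadlock_detected)  # prints: True
```

The usual fix for this pattern is to make every code path take the two locks in the same order (or drop one of them), which is essentially what moving the plugin work inside the backend transaction scope achieves.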
I think the customer can test the test patch in their environment, but let me confirm. I also have a reproduction environment with 389-ds-base-1.3.6.1-19 and can test it as well.
Build tested: 389-ds-base-1.3.7.5-18.el7.x86_64

Using the reproducer from the description, I can no longer reproduce the problem. Marking as VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0811