Created attachment 397954 [details]
script to reproduce the problem

Description of problem:
Set up 2-way MMR (slapd-m0: port 10390, slapd-m1: port 10391).
Run the attached script against the 2 masters, which makes the 2 masters deadlock for a while. Once the deadlock times out, the backend db is wiped out.
Deadlocked threads on slapd-m0 (slapd-m1 has the symmetrical deadlocked threads):

Thread 36: supplier thread, which is trying to acquire a replica on slapd-m1 in order to initialize the replica.
Thread 16: consumer thread, which is about to disable the backend by stopping the protocol so that it can run the bulk import.

Stopping the protocol hangs because supplier thread 36 is waiting for the response from the consumer slapd-m1. Stopping the protocol eventually times out (600 sec). The bulk import code on the consumer then deletes the replicated backend. The supplier side also resumes and tries to send the data to the consumer, but the backend has already been removed, so the total update fails with the db wiped out.

Thread 36 (Thread 0x7f22bc8c7910 (LWP 387)):
#0  0x00000038068d6fc2 in select () from /lib64/libc.so.6
#1  0x00007f22c605dbf7 in DS_Sleep (ticks=1024) at ldap/servers/slapd/util.c:726
#2  0x00007f22c1823f8e in conn_read_result_ex (conn=0x2616850, retoidp=0x7f22bc8c6ea0, retdatap=0x7f22bc8c6ea8, returned_controls=0x0, message_id=0x0, block=1) at ldap/servers/plugins/replication/repl5_connection.c:345
#3  0x00007f22c182ed34 in acquire_replica (prp=0x260eb50, prot_oid=0x7f22c1866b00 "2.16.840.1.113730.3.6.2", ruv=0x0) at ldap/servers/plugins/replication/repl5_protocol_util.c:237
#4  0x00007f22c183c9ec in repl5_tot_run (prp=0x260eb50) at ldap/servers/plugins/replication/repl5_tot_protocol.c:351
#5  0x00007f22c182e2ce in prot_thread_main (arg=0x25dc4b0) at ldap/servers/plugins/replication/repl5_protocol.c:364
#6  0x0000003816029263 in _pt_root (arg=<value optimized out>) at ../../../mozilla/nsprpub/pr/src/pthreads/ptthread.c:228
#7  0x000000380740685a in start_thread (arg=<value optimized out>) at pthread_create.c:297
#8  0x00000038068de22d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#9  0x0000000000000000 in ?? ()
Current language: auto; currently asm

Thread 16 (Thread 0x7f22a6ffb910 (LWP 408)):
#0  0x00000038068d6fc2 in select () from /lib64/libc.so.6
#1  0x00007f22c605dbf7 in DS_Sleep (ticks=1000) at ldap/servers/slapd/util.c:726
#2  0x00007f22c183cfa0 in repl5_tot_stop (prp=0x260eb50) at ldap/servers/plugins/replication/repl5_tot_protocol.c:509
#3  0x00007f22c182e48c in prot_stop (rp=0x25dc4b0) at ldap/servers/plugins/replication/repl5_protocol.c:442
#4  0x00007f22c181eeae in agmt_stop (ra=0x26135a0) at ldap/servers/plugins/replication/repl5_agmt.c:651
#5  0x00007f22c1835363 in start_agreements_for_replica (r=0x242a860, start=0) at ldap/servers/plugins/replication/repl5_replica.c:3310
#6  0x00007f22c1835182 in replica_disable_replication (r=0x242a860, r_obj=0x25dee70) at ldap/servers/plugins/replication/repl5_replica.c:3253
#7  0x00007f22c182dd5f in multimaster_be_state_change (handle=0x7f22c182dc74, be_name=0x7f224c000db0 "userRoot", old_be_state=1, new_be_state=2) at ldap/servers/plugins/replication/repl5_plugins.c:1411
#8  0x00007f22c6017da0 in mtn_be_state_change (be_name=0x7f224c000db0 "userRoot", old_state=1, new_state=2) at ldap/servers/slapd/mapping_tree.c:233
#9  0x00007f22c601dc42 in mtn_internal_be_set_state (be=0x242a5e0, state=2) at ldap/servers/slapd/mapping_tree.c:3334
#10 0x00007f22c601dc8c in slapi_mtn_be_disable (be=0x242a5e0) at ldap/servers/slapd/mapping_tree.c:3372
#11 0x00007f22c1ab8830 in bulk_import_start (pb=0x23c1490) at ldap/servers/slapd/back-ldbm/import-threads.c:1744
#12 0x00007f22c1ab91f3 in ldbm_back_wire_import (pb=0x23c1490) at ldap/servers/slapd/back-ldbm/import-threads.c:2000
#13 0x00007f22c5fe2f64 in process_bulk_import_op (pb=0x23c1490, state=1, e=0x0) at ldap/servers/slapd/bulk_import.c:176
#14 0x00007f22c5fe2cbe in slapi_start_bulk_import (pb=0x23c1490) at ldap/servers/slapd/bulk_import.c:76
#15 0x00007f22c181be9e in multimaster_extop_StartNSDS50ReplicationRequest (pb=0x23c1490) at ldap/servers/plugins/replication/repl_extop.c:835
#16 0x00007f22c602d879 in plugin_call_exop_plugins (pb=0x23c1490, oid=0x7f224c000900 "2.16.840.1.113730.3.5.3") at ldap/servers/slapd/plugin.c:441
#17 0x000000000041b1b7 in do_extended (pb=0x23c1490) at ldap/servers/slapd/extendop.c:349
#18 0x0000000000413340 in connection_dispatch_operation (conn=0x7f22bc07fb38, op=0x2683fe0, pb=0x23c1490) at ldap/servers/slapd/connection.c:617
#19 0x00000000004148ca in connection_threadmain () at ldap/servers/slapd/connection.c:2267
#20 0x0000003816029263 in _pt_root (arg=<value optimized out>) at ../../../mozilla/nsprpub/pr/src/pthreads/ptthread.c:228
#21 0x000000380740685a in start_thread (arg=<value optimized out>) at pthread_create.c:297
#22 0x00000038068de22d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#23 0x0000000000000000 in ?? ()
Created attachment 397962 [details]
git patch file

Description: In an MMR topology, if a master receives a total update request to initialize the other master while it is itself being initialized by that other master, the 2 replication threads hang and the replicated backend instance can be wiped out.

To prevent the server from running as total update supplier and consumer at the same time, a REPLICA_TOTAL_EXCLUSIVE bit has been introduced. Either operation sets the bit in the replica state flag. If an operation finds the bit already set, it fails with an error.

Files:
ldap/servers/plugins/replication/repl5.h
ldap/servers/plugins/replication/repl5_protocol.c
ldap/servers/plugins/replication/repl_extop.c
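For readers without the patch at hand, the exclusive-bit idea described above can be illustrated with a minimal sketch. This is not the code from the patch: the struct, the function names, the bit value, and the use of C11 atomics are all assumptions made for illustration (the server most likely guards the replica state with its own lock). Only the name REPLICA_TOTAL_EXCLUSIVE comes from the description.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit value; the real definition is in repl5.h. */
#define REPLICA_TOTAL_EXCLUSIVE 0x1u

/* Hypothetical stand-in for the replica object's state flag. */
typedef struct {
    atomic_uint state_flags;
} replica_sketch;

/*
 * Try to claim the exclusive bit before starting a total update
 * (as supplier or as consumer). Returns true if this caller set the
 * bit, false if another total update operation already holds it.
 */
static bool replica_sketch_claim_total(replica_sketch *r)
{
    /* Atomically set the bit and fetch the flags as they were before. */
    unsigned prev = atomic_fetch_or(&r->state_flags, REPLICA_TOTAL_EXCLUSIVE);
    return (prev & REPLICA_TOTAL_EXCLUSIVE) == 0;
}

/* Clear the bit once the total update has finished or failed. */
static void replica_sketch_release_total(replica_sketch *r)
{
    atomic_fetch_and(&r->state_flags, ~REPLICA_TOTAL_EXCLUSIVE);
}
```

In this sketch, a caller whose claim returns false would return an error to the peer instead of blocking, which is what avoids the symmetrical wait shown in the backtraces.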
Created attachment 398089 [details]
git patch file

I revised the previous patch to allow sending simultaneous total updates to other replicas. I don't think there is any need to disallow that.
Reviewed by Rich (Thank you!!) Pushed to master as well as to Directory_Server_8_2_Branch.