570667 – MMR: simultaneous total updates on the masters cause deadlock and data loss

Summary: MMR: simultaneous total updates on the masters cause deadlock and data loss

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	389
Classification:	Retired
Component:	Replication - General
Sub Component:
Version:	1.2.6
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	---
Assignee:	Noriko Hosoi
QA Contact:	Viktor Ashirov
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	434914

Reported:	2010-03-05 00:00 UTC by Noriko Hosoi
Modified:	2015-12-07 17:13 UTC
CC List:	1 user
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2015-12-07 17:13:20 UTC
Embargoed:

Attachments
script to reproduce the problem (1.56 KB, text/plain) 2010-03-05 00:00 UTC, Noriko Hosoi	no flags
git patch file (4.93 KB, patch) 2010-03-05 01:57 UTC, Noriko Hosoi	rmeggins: review+
git patch file (5.65 KB, patch) 2010-03-05 18:18 UTC, Noriko Hosoi	nhosoi: review? rmeggins: review+

Description Noriko Hosoi 2010-03-05 00:00:31 UTC

Created attachment 397954
script to reproduce the problem

Description of problem:
Set up 2-way MMR (slapd-m0: port 10390, slapd-m1: port 10391).
Run the attached script against the 2 masters. The 2 masters deadlock for a while; once the deadlock times out, the backend db is wiped out.

Comment 1 Noriko Hosoi 2010-03-05 00:19:11 UTC

Deadlocked threads on slapd-m0.
(slapd-m1 has the symmetrical deadlocked threads.)

Thread 36: supplier thread, which is trying to acquire the replica on slapd-m1 in order to initialize it.
Thread 16: consumer thread, which is about to disable the backend in order to run the bulk import; as part of that it stops the replication protocol. Stopping the protocol hangs because supplier thread 36 is waiting for the response from the consumer slapd-m1.

Stopping the protocol eventually times out (600 sec). The bulk import code on the consumer then deletes the replicated backend. The supplier side also resumes and tries to send the data to the consumer, but the backend has already been removed, so the total update fails with the db wiped out.
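
The stop-with-timeout behavior described above can be sketched roughly as follows. This is only an illustration, not the code in repl5_tot_protocol.c: the type protocol_sketch, the function stop_with_timeout, and its parameters are hypothetical names made up for this example. The point it shows is that a timed-out stop returns control to the caller even though the protocol thread is still running, which is what lets the bulk import go on to delete the backend.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for the protocol object; not the real structure. */
typedef struct {
    atomic_bool thread_running; /* true while the protocol thread is alive */
} protocol_sketch;

/*
 * Poll until the protocol thread has exited or max_polls polls have elapsed.
 * Returns true if the thread stopped in time, false on timeout.
 * On timeout the caller continues anyway, so the thread may still be active.
 */
bool stop_with_timeout(protocol_sketch *p, int max_polls, long poll_ms)
{
    assert(p != NULL);
    for (int i = 0; i < max_polls; i++) {
        if (!atomic_load(&p->thread_running)) {
            return true;
        }
        struct timespec ts = { poll_ms / 1000, (poll_ms % 1000) * 1000000L };
        nanosleep(&ts, NULL); /* sleep for one polling interval */
    }
    return !atomic_load(&p->thread_running);
}
```

In the scenario in this comment, each master's stop call waits on a supplier thread that is itself blocked on the other master, so neither call can return true before the timeout.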

Thread 36 (Thread 0x7f22bc8c7910 (LWP 387)):
#0  0x00000038068d6fc2 in select () from /lib64/libc.so.6
#1  0x00007f22c605dbf7 in DS_Sleep (ticks=1024)
    at ldap/servers/slapd/util.c:726
#2  0x00007f22c1823f8e in conn_read_result_ex (conn=0x2616850,
    retoidp=0x7f22bc8c6ea0, retdatap=0x7f22bc8c6ea8, returned_controls=0x0,
    message_id=0x0, block=1)
    at ldap/servers/plugins/replication/repl5_connection.c:345
#3  0x00007f22c182ed34 in acquire_replica (prp=0x260eb50,
    prot_oid=0x7f22c1866b00 "2.16.840.1.113730.3.6.2", ruv=0x0)
    at ldap/servers/plugins/replication/repl5_protocol_util.c:237
#4  0x00007f22c183c9ec in repl5_tot_run (prp=0x260eb50)
    at ldap/servers/plugins/replication/repl5_tot_protocol.c:351
#5  0x00007f22c182e2ce in prot_thread_main (arg=0x25dc4b0)
    at ldap/servers/plugins/replication/repl5_protocol.c:364
#6  0x0000003816029263 in _pt_root (arg=<value optimized out>)
    at ../../../mozilla/nsprpub/pr/src/pthreads/ptthread.c:228
#7  0x000000380740685a in start_thread (arg=<value optimized out>)
    at pthread_create.c:297
#8  0x00000038068de22d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#9  0x0000000000000000 in ?? ()
Current language:  auto; currently asm

Thread 16 (Thread 0x7f22a6ffb910 (LWP 408)):
#0  0x00000038068d6fc2 in select () from /lib64/libc.so.6
#1  0x00007f22c605dbf7 in DS_Sleep (ticks=1000)
    at ldap/servers/slapd/util.c:726
#2  0x00007f22c183cfa0 in repl5_tot_stop (prp=0x260eb50)
    at ldap/servers/plugins/replication/repl5_tot_protocol.c:509
#3  0x00007f22c182e48c in prot_stop (rp=0x25dc4b0)
    at ldap/servers/plugins/replication/repl5_protocol.c:442
#4  0x00007f22c181eeae in agmt_stop (ra=0x26135a0)
    at ldap/servers/plugins/replication/repl5_agmt.c:651
#5  0x00007f22c1835363 in start_agreements_for_replica (r=0x242a860, start=0)
    at ldap/servers/plugins/replication/repl5_replica.c:3310
#6  0x00007f22c1835182 in replica_disable_replication (r=0x242a860,
    r_obj=0x25dee70) at ldap/servers/plugins/replication/repl5_replica.c:3253
#7  0x00007f22c182dd5f in multimaster_be_state_change (handle=0x7f22c182dc74,
    be_name=0x7f224c000db0 "userRoot", old_be_state=1, new_be_state=2)
    at ldap/servers/plugins/replication/repl5_plugins.c:1411
#8  0x00007f22c6017da0 in mtn_be_state_change (
    be_name=0x7f224c000db0 "userRoot", old_state=1, new_state=2)
    at ldap/servers/slapd/mapping_tree.c:233
#9  0x00007f22c601dc42 in mtn_internal_be_set_state (be=0x242a5e0, state=2)
    at ldap/servers/slapd/mapping_tree.c:3334
#10 0x00007f22c601dc8c in slapi_mtn_be_disable (be=0x242a5e0)
    at ldap/servers/slapd/mapping_tree.c:3372
#11 0x00007f22c1ab8830 in bulk_import_start (pb=0x23c1490)
    at ldap/servers/slapd/back-ldbm/import-threads.c:1744
#12 0x00007f22c1ab91f3 in ldbm_back_wire_import (pb=0x23c1490)
    at ldap/servers/slapd/back-ldbm/import-threads.c:2000
#13 0x00007f22c5fe2f64 in process_bulk_import_op (pb=0x23c1490, state=1, e=0x0)
    at ldap/servers/slapd/bulk_import.c:176
#14 0x00007f22c5fe2cbe in slapi_start_bulk_import (pb=0x23c1490)
    at ldap/servers/slapd/bulk_import.c:76
#15 0x00007f22c181be9e in multimaster_extop_StartNSDS50ReplicationRequest (
    pb=0x23c1490) at ldap/servers/plugins/replication/repl_extop.c:835
#16 0x00007f22c602d879 in plugin_call_exop_plugins (pb=0x23c1490,
    oid=0x7f224c000900 "2.16.840.1.113730.3.5.3")
    at ldap/servers/slapd/plugin.c:441
#17 0x000000000041b1b7 in do_extended (pb=0x23c1490)
    at ldap/servers/slapd/extendop.c:349
#18 0x0000000000413340 in connection_dispatch_operation (conn=0x7f22bc07fb38,
    op=0x2683fe0, pb=0x23c1490) at ldap/servers/slapd/connection.c:617
#19 0x00000000004148ca in connection_threadmain ()
    at ldap/servers/slapd/connection.c:2267
#20 0x0000003816029263 in _pt_root (arg=<value optimized out>)
    at ../../../mozilla/nsprpub/pr/src/pthreads/ptthread.c:228
#21 0x000000380740685a in start_thread (arg=<value optimized out>)
    at pthread_create.c:297
#22 0x00000038068de22d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#23 0x0000000000000000 in ?? ()

Comment 2 Noriko Hosoi 2010-03-05 01:57:36 UTC

Created attachment 397962
git patch file

Description: In an MMR topology, if a master receives a total
update request to initialize the other master while it is itself being
initialized by the other master, the 2 replication threads hang
and the replicated backend instance could be wiped out.

To prevent a server from running the total update supplier and the
consumer at the same time, the REPLICA_TOTAL_EXCLUSIVE bit has
been introduced. Either operation sets it in the replica state flag;
once the bit is detected, the other operation fails with an error.
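
A minimal sketch of the exclusive-bit idea, under stated assumptions: only the name REPLICA_TOTAL_EXCLUSIVE comes from the patch description. The bit value, the replica_sketch type, and the two helper functions are hypothetical and are not taken from repl5.h or the patch itself.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit value for illustration; the real one is defined in repl5.h. */
#define REPLICA_TOTAL_EXCLUSIVE 0x4u

/* Hypothetical, simplified stand-in for the replica object. */
typedef struct {
    pthread_mutex_t lock;  /* protects state_flags */
    uint32_t state_flags;  /* replica state flag bits */
} replica_sketch;

#define REPLICA_SKETCH_INIT { PTHREAD_MUTEX_INITIALIZER, 0 }

/*
 * Try to claim the exclusive bit before starting a total update
 * (as supplier or as consumer). Returns true if the bit was claimed,
 * false if the other operation already holds it.
 */
bool total_update_try_begin(replica_sketch *r)
{
    bool acquired = false;
    assert(r != NULL);
    pthread_mutex_lock(&r->lock);
    if (!(r->state_flags & REPLICA_TOTAL_EXCLUSIVE)) {
        r->state_flags |= REPLICA_TOTAL_EXCLUSIVE;
        acquired = true;
    }
    pthread_mutex_unlock(&r->lock);
    return acquired;
}

/* Clear the exclusive bit once the total update has finished or failed. */
void total_update_end(replica_sketch *r)
{
    assert(r != NULL);
    pthread_mutex_lock(&r->lock);
    r->state_flags &= ~REPLICA_TOTAL_EXCLUSIVE;
    pthread_mutex_unlock(&r->lock);
}
```

The check and the set happen under one lock so that the two operations cannot both see the bit as clear. The losing side fails immediately with an error instead of waiting, which is what removes the circular wait.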

Files:
 ldap/servers/plugins/replication/repl5.h
 ldap/servers/plugins/replication/repl5_protocol.c
 ldap/servers/plugins/replication/repl_extop.c

Comment 3 Noriko Hosoi 2010-03-05 18:18:39 UTC

Created attachment 398089
git patch file

I revised the previous patch to allow sending simultaneous total updates to other replicas.  I think there is no need to disallow that.

Comment 4 Noriko Hosoi 2010-03-08 23:15:53 UTC

Reviewed by Rich (Thank you!!)

Pushed to master as well as to Directory_Server_8_2_Branch.
