Bug 539618 - Replication bulk import reports Invalid read/write
Summary: Replication bulk import reports Invalid read/write
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: 389
Classification: Retired
Component: Replication - General
Version: 1.3.0
Hardware: All
OS: Linux
low
medium
Target Milestone: ---
Assignee: Noriko Hosoi
QA Contact: Viktor Ashirov
URL:
Whiteboard:
Depends On:
Blocks: 434914 389_1.2.6
TreeView+ depends on / blocked
 
Reported: 2009-11-20 17:50 UTC by Noriko Hosoi
Modified: 2015-12-07 16:38 UTC (History)
2 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-12-07 16:38:12 UTC
Embargoed:


Attachments (Terms of Use)
git patch file (5.58 KB, patch)
2010-02-22 17:01 UTC, Noriko Hosoi
rmeggins: review+
Details | Diff
git patch file (6.06 KB, patch)
2010-03-03 18:47 UTC, Noriko Hosoi
nhosoi: review?
rmeggins: review+
Details | Diff

Description Noriko Hosoi 2009-11-20 17:50:23 UTC
Description of problem:

This is one of the outputs from valgrind run with TET.  It looks if bulk import starts while prot_thread_main is running, bulk import disables replication which frees Repl_Protocol although prot_thread_main is still referring it.

==21973== Thread 45:
==21973== Invalid read of size 4
==21973==    at 0xB4D7F6B: prot_thread_main (repl5_protocol.c:338)
==21973==    by 0x32B7429262: (within /lib64/libnspr4.so)
==21973==    by 0x37A6406869: start_thread (in /lib64/libpthread-2.10.1.so)
==21973==    by 0x37A58DE3BC: clone (in /lib64/libc-2.10.1.so)
==21973==  Address 0xce75594 is 52 bytes inside a block of size 64 free'd
==21973==    at 0x4A0633D: free (vg_replace_malloc.c:323)
==21973==    by 0x4C50845: slapi_ch_free (ch_malloc.c:363)
==21973==    by 0xB4D7D49: prot_free (repl5_protocol.c:205)
==21973==    by 0xB4C8B6E: agmt_stop (repl5_agmt.c:648)
==21973==    by 0xB4DEFE2: start_agreements_for_replica (repl5_replica.c:3310)
==21973==    by 0xB4DEE01: replica_disable_replication (repl5_replica.c:3253)
==21973==    by 0xB4D79EE: multimaster_be_state_change (repl5_plugins.c:1411)
==21973==    by 0x4C841C7: mtn_be_state_change (mapping_tree.c:233)
==21973==    by 0x4C89EF9: mtn_internal_be_set_state (mapping_tree.c:3291)
==21973==    by 0x4C89F43: slapi_mtn_be_disable (mapping_tree.c:3329)
==21973==    by 0xB234228: bulk_import_start (import-threads.c:1745)
==21973==    by 0xB234BEB: ldbm_back_wire_import (import-threads.c:2001)
==21973==    by 0x4C4F55B: process_bulk_import_op (bulk_import.c:176)
==21973==    by 0x4C4F2B5: slapi_start_bulk_import (bulk_import.c:76)
==21973==    by 0xB4C5B2D: multimaster_extop_StartNSDS50ReplicationRequest (repl_extop.c:835)
==21973==    by 0x4C99558: plugin_call_exop_plugins (plugin.c:424)
==21973==    by 0x41B1A2: do_extended (extendop.c:349)
==21973==    by 0x413329: connection_dispatch_operation (connection.c:616)
==21973==    by 0x4148B3: connection_threadmain (connection.c:2266)
==21973==    by 0x32B7429262: (within /lib64/libnspr4.so)
==21973==    by 0x37A6406869: start_thread (in /lib64/libpthread-2.10.1.so)
==21973==    by 0x37A58DE3BC: clone (in /lib64/libc-2.10.1.so)

==21973== Invalid write of size 4
==21973==    at 0xB4D7F72: prot_thread_main (repl5_protocol.c:338)
==21973==    by 0x32B7429262: (within /lib64/libnspr4.so)
==21973==    by 0x37A6406869: start_thread (in /lib64/libpthread-2.10.1.so)
==21973==    by 0x37A58DE3BC: clone (in /lib64/libc-2.10.1.so)
==21973==  Address 0xce75590 is 48 bytes inside a block of size 64 free'd
==21973==    at 0x4A0633D: free (vg_replace_malloc.c:323)
==21973==    by 0x4C50845: slapi_ch_free (ch_malloc.c:363)
==21973==    by 0xB4D7D49: prot_free (repl5_protocol.c:205)
==21973==    by 0xB4C8B6E: agmt_stop (repl5_agmt.c:648)
==21973==    by 0xB4DEFE2: start_agreements_for_replica (repl5_replica.c:3310)
==21973==    by 0xB4DEE01: replica_disable_replication (repl5_replica.c:3253)
==21973==    by 0xB4D79EE: multimaster_be_state_change (repl5_plugins.c:1411)
==21973==    by 0x4C841C7: mtn_be_state_change (mapping_tree.c:233)
==21973==    by 0x4C89EF9: mtn_internal_be_set_state (mapping_tree.c:3291)
==21973==    by 0x4C89F43: slapi_mtn_be_disable (mapping_tree.c:3329)
==21973==    by 0xB234228: bulk_import_start (import-threads.c:1745)
==21973==    by 0xB234BEB: ldbm_back_wire_import (import-threads.c:2001)
==21973==    by 0x4C4F55B: process_bulk_import_op (bulk_import.c:176)
==21973==    by 0x4C4F2B5: slapi_start_bulk_import (bulk_import.c:76)
==21973==    by 0xB4C5B2D: multimaster_extop_StartNSDS50ReplicationRequest (repl_extop.c:835)
==21973==    by 0x4C99558: plugin_call_exop_plugins (plugin.c:424)
==21973==    by 0x41B1A2: do_extended (extendop.c:349)
==21973==    by 0x413329: connection_dispatch_operation (connection.c:616)
==21973==    by 0x4148B3: connection_threadmain (connection.c:2266)
==21973==    by 0x32B7429262: (within /lib64/libnspr4.so)
==21973==    by 0x37A6406869: start_thread (in /lib64/libpthread-2.10.1.so)
==21973==    by 0x37A58DE3BC: clone (in /lib64/libc-2.10.1.so)
==21973==

==21973== Invalid read of size 4
==21973==    at 0xB4D7E91: prot_thread_main (repl5_protocol.c:305)
==21973==    by 0x32B7429262: (within /lib64/libnspr4.so)
==21973==    by 0x37A6406869: start_thread (in /lib64/libpthread-2.10.1.so)
==21973==    by 0x37A58DE3BC: clone (in /lib64/libc-2.10.1.so)
==21973==  Address 0xce75590 is 48 bytes inside a block of size 64 free'd
==21973==    at 0x4A0633D: free (vg_replace_malloc.c:323)
==21973==    by 0x4C50845: slapi_ch_free (ch_malloc.c:363)
==21973==    by 0xB4D7D49: prot_free (repl5_protocol.c:205)
==21973==    by 0xB4C8B6E: agmt_stop (repl5_agmt.c:648)
==21973==    by 0xB4DEFE2: start_agreements_for_replica (repl5_replica.c:3310)
==21973==    by 0xB4DEE01: replica_disable_replication (repl5_replica.c:3253)
==21973==    by 0xB4D79EE: multimaster_be_state_change (repl5_plugins.c:1411)
==21973==    by 0x4C841C7: mtn_be_state_change (mapping_tree.c:233)
==21973==    by 0x4C89EF9: mtn_internal_be_set_state (mapping_tree.c:3291)
==21973==    by 0x4C89F43: slapi_mtn_be_disable (mapping_tree.c:3329)
==21973==    by 0xB234228: bulk_import_start (import-threads.c:1745)
==21973==    by 0xB234BEB: ldbm_back_wire_import (import-threads.c:2001)
==21973==    by 0x4C4F55B: process_bulk_import_op (bulk_import.c:176)
==21973==    by 0x4C4F2B5: slapi_start_bulk_import (bulk_import.c:76)
==21973==    by 0xB4C5B2D: multimaster_extop_StartNSDS50ReplicationRequest (repl_extop.c:835)
==21973==    by 0x4C99558: plugin_call_exop_plugins (plugin.c:424)
==21973==    by 0x41B1A2: do_extended (extendop.c:349)
==21973==    by 0x413329: connection_dispatch_operation (connection.c:616)
==21973==    by 0x4148B3: connection_threadmain (connection.c:2266)
==21973==    by 0x32B7429262: (within /lib64/libnspr4.so)
==21973==    by 0x37A6406869: start_thread (in /lib64/libpthread-2.10.1.so)
==21973==    by 0x37A58DE3BC: clone (in /lib64/libc-2.10.1.so)

Comment 2 Noriko Hosoi 2010-02-19 21:18:25 UTC
Another Invalid read from the same test suite:
==9077== Thread 16:
==9077== Invalid read of size 4
==9077==    at 0xA17D021: ldbm_back_modrdn (ldbm_modrdn.c:938)
==9077==    by 0x4C8E645: op_shared_rename (modrdn.c:500)
==9077==    by 0x4C8DA6A: do_modrdn (modrdn.c:152)
==9077==    by 0x413265: connection_dispatch_operation (connection.c:554)
==9077==    by 0x4149A1: connection_threadmain (connection.c:2267)
==9077==    by 0x3816029262: (within /lib64/libnspr4.so)
==9077==    by 0x3807406859: start_thread (in /lib64/libpthread-2.10.2.so)
==9077==    by 0x38068DE22C: clone (in /lib64/libc-2.10.2.so)
==9077==  Address 0x18 is not stack'd, malloc'd or (recently) free'd

Comment 3 Noriko Hosoi 2010-02-22 17:01:00 UTC
Created attachment 395511 [details]
git patch file

Files:
 ldap/servers/plugins/replication/repl5.h
 ldap/servers/plugins/replication/repl5_agmt.c
 ldap/servers/plugins/replication/repl5_protocol.c
 ldap/servers/slapd/back-ldbm/ldbm_modrdn.c

Descriptions: When a protocol is freed by prot_free, prot_close
is supposed to have been called to stop the main thread
prot_thread_main.  But, there was no mechanism for the freeing
thread whether the prot_thread_main has already quitted or not,
it could have released the Repl_Protocol even though it was
still being in use.  This fix is adding a checking method.

The same test revealed ldbm_back_modrdn had a chance to access
a field of NULL entry structure.

Comment 4 Rich Megginson 2010-02-22 17:21:12 UTC
Comment on attachment 395511 [details]
git patch file

Is there a chance the state could be set to some state other than PROT_DONE and the loop will go on forever?

Comment 5 Noriko Hosoi 2010-02-22 17:45:35 UTC
(In reply to comment #4)
> (From update of attachment 395511 [details])
> Is there a chance the state could be set to some state other than PROT_DONE and
> the loop will go on forever?    

A good news is the protocol state is changed only at the end of the while loop in prot_thread_main.  When the while loop is done and it returns from prot_thread_main, STATE_DONE is set to the state.  So, unless the thread prot_thread_main is "killed" in the middle of the loop and the rest of the process remains, it looks there is no chance to have other state unchanged forever...  But do you think it's too strict?  I can add a max loop count to escape from the while loop introduced to prot_free.

Comment 6 Rich Megginson 2010-02-22 17:58:33 UTC
(In reply to comment #5)
> (In reply to comment #4)
> > (From update of attachment 395511 [details] [details])
> > Is there a chance the state could be set to some state other than PROT_DONE and
> > the loop will go on forever?    
> 
> A good news is the protocol state is changed only at the end of the while loop
> in prot_thread_main.  When the while loop is done and it returns from
> prot_thread_main, STATE_DONE is set to the state.  So, unless the thread
> prot_thread_main is "killed" in the middle of the loop and the rest of the
> process remains, it looks there is no chance to have other state unchanged
> forever...  But do you think it's too strict?  I can add a max loop count to
> escape from the while loop introduced to prot_free.    

No.  I think it's ok to commit.  But we'll have to do some repl stress testing (like repeatedly doing a repl init or doing something to force incremental update) to make sure we don't get stuck in a loop.

Comment 7 Noriko Hosoi 2010-02-22 18:04:59 UTC
(In reply to comment #6)
> No.  I think it's ok to commit.  But we'll have to do some repl stress testing
> (like repeatedly doing a repl init or doing something to force incremental
> update) to make sure we don't get stuck in a loop.    

I agree with you.  I'm going to add such a test.
Thanks!

Comment 8 Noriko Hosoi 2010-02-22 18:27:52 UTC
Reviewed by Rich (Thank you!!)

Pushed to master.

$ git merge replication
Updating 79e9309..4205086
Fast forward
 ldap/servers/plugins/replication/repl5.h          |    2 +-
 ldap/servers/plugins/replication/repl5_agmt.c     |    8 ++--
 ldap/servers/plugins/replication/repl5_protocol.c |   34 +++++++++++++++++++--
 ldap/servers/slapd/back-ldbm/ldbm_modrdn.c        |    2 +-
 4 files changed, 37 insertions(+), 9 deletions(-)
$ git push
Counting objects: 23, done.
Delta compression using 4 threads.
Compressing objects: 100% (12/12), done.
Writing objects: 100% (12/12), 1.89 KiB, done.
Total 12 (delta 10), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   79e9309..4205086  master -> master

Pushed to Directory_Server_8_2_Branch (except ldbm_modrdn.c)

$ git push origin ds82-local:Directory_Server_8_2_Branch
Counting objects: 17, done.
Delta compression using 4 threads.
Compressing objects: 100% (8/8), done.
Writing objects: 100% (9/9), 8.92 KiB, done.
Total 9 (delta 6), reused 2 (delta 1)
To ssh://git.fedorahosted.org/git/389/ds.git
   be07459..bea4eb5  ds82-local -> Directory_Server_8_2_Branch

Comment 9 Noriko Hosoi 2010-03-03 18:34:59 UTC
Unfortunately, the fix I introduced could cause this deadlock.

Thread 10 (Thread 0xaf2fcb70 (LWP 29174)):
#0  0x002cb416 in __kernel_vsyscall ()
#1  0x001e33b1 in select () from /lib/libc.so.6
#2  0x0089e3d1 in DS_Sleep (ticks=1000) at ldap/servers/slapd/util.c:726
#3  0x0046a737 in prot_free (rpp=0xb3d04f4c, wait_for_done=1)
    at ldap/servers/plugins/replication/repl5_protocol.c:210
#4  0x0045af81 in agmt_stop (ra=0xb3d04f18)
    at ldap/servers/plugins/replication/repl5_agmt.c:648
#5  0x00471c81 in start_agreements_for_replica (r=0xb11025a8, start=0)
    at ldap/servers/plugins/replication/repl5_replica.c:3310
#6  0x00471a7d in replica_disable_replication (r=0xb11025a8, r_obj=0xb11026b8)
    at ldap/servers/plugins/replication/repl5_replica.c:3253
#7  0x0046a473 in multimaster_be_state_change (handle=0x46a375,
    be_name=0xb3d02008 "userRoot", old_be_state=1, new_be_state=2)
    at ldap/servers/plugins/replication/repl5_plugins.c:1411
#8  0x0085bd88 in mtn_be_state_change (be_name=0xb3d02008 "userRoot",
    old_state=1, new_state=2) at ldap/servers/slapd/mapping_tree.c:233
#9  0x008613cb in mtn_internal_be_set_state (be=0x95ec1a0, state=2)
    at ldap/servers/slapd/mapping_tree.c:3334
#10 0x0086141c in slapi_mtn_be_disable (be=0x95ec1a0)
    at ldap/servers/slapd/mapping_tree.c:3372
#11 0x0096f62f in bulk_import_start (pb=0x9731e68)
    at ldap/servers/slapd/back-ldbm/import-threads.c:1744
#12 0x0096ff52 in ldbm_back_wire_import (pb=0x9731e68)
    at ldap/servers/slapd/back-ldbm/import-threads.c:2000
#13 0x00828b08 in process_bulk_import_op (pb=0x9731e68, state=1, e=0x0)
    at ldap/servers/slapd/bulk_import.c:176
#14 0x00828869 in slapi_start_bulk_import (pb=0x9731e68)
    at ldap/servers/slapd/bulk_import.c:76
#15 0x00457aa7 in multimaster_extop_StartNSDS50ReplicationRequest (
    pb=0x9731e68) at ldap/servers/plugins/replication/repl_extop.c:835
#16 0x00870a97 in plugin_call_exop_plugins (pb=0x9731e68,
    oid=0xb3d08308 "2.16.840.1.113730.3.5.3")
    at ldap/servers/slapd/plugin.c:441
#17 0x0805ebb1 in do_extended (pb=0x9731e68)
    at ldap/servers/slapd/extendop.c:349
#18 0x08057699 in connection_dispatch_operation (conn=0xb3e47308,
    op=0x9734168, pb=0x9731e68) at ldap/servers/slapd/connection.c:617
#19 0x08058b48 in connection_threadmain ()
    at ldap/servers/slapd/connection.c:2267
#20 0x06ccdb42 in ?? () from /usr/lib/libnspr4.so
#21 0x005b6ab5 in start_thread () from /lib/libpthread.so.0
#22 0x001eadce in clone () from /lib/libc.so.6

Thread 9 (Thread 0xa5cedb70 (LWP 29189)):
#0  0x002cb416 in __kernel_vsyscall ()
#1  0x005bd899 in __lll_lock_wait () from /lib/libpthread.so.0
#2  0x005b8d6d in _L_lock_979 () from /lib/libpthread.so.0
#3  0x005b8d23 in pthread_mutex_lock () from /lib/libpthread.so.0
#4  0x06cc7763 in PR_Lock () from /usr/lib/libnspr4.so
#5  0x0045b302 in agmt_get_replarea (ra=0xb3d04f18)
    at ldap/servers/plugins/replication/repl5_agmt.c:760
#6  0x0045dea9 in get_agmt_status (pb=0x976b8b0, e=0x9772b88, entryAfter=0x0,
    returncode=0xa5ce84c8, returntext=0xa5ce82c8 "", arg=0xb3d04f18)
    at ldap/servers/plugins/replication/repl5_agmt.c:2116
#7  0x00835c55 in dse_call_callback (pdse=0x9555b00, pb=0x976b8b0,
    operation=4, flags=1, entryBefore=0x9772b88, entryAfter=0x0,
    returncode=0xa5ce84c8, returntext=0xa5ce82c8 "")
    at ldap/servers/slapd/dse.c:2232
#8  0x008344d9 in dse_search (pb=0x976b8b0) at ldap/servers/slapd/dse.c:1710
#9  0x008695a4 in op_shared_search (pb=0x976b8b0, send_result=1)
    at ldap/servers/slapd/opshared.c:638
#10 0x0806dcc3 in do_search (pb=0x976b8b0) at ldap/servers/slapd/search.c:350
#11 0x080575bd in connection_dispatch_operation (conn=0xb3e47008,
    op=0x97a1010, pb=0x976b8b0) at ldap/servers/slapd/connection.c:587
#12 0x08058b48 in connection_threadmain ()
    at ldap/servers/slapd/connection.c:2267
#13 0x06ccdb42 in ?? () from /usr/lib/libnspr4.so
#14 0x005b6ab5 in start_thread () from /lib/libpthread.so.0
#15 0x001eadce in clone () from /lib/libc.so.6

Thread 3 (Thread 0x97c35b70 (LWP 29556)):
#0  0x002cb416 in __kernel_vsyscall ()
#1  0x005bd899 in __lll_lock_wait () from /lib/libpthread.so.0
#2  0x005b8d6d in _L_lock_979 () from /lib/libpthread.so.0
#3  0x005b8d23 in pthread_mutex_lock () from /lib/libpthread.so.0
#4  0x06cc7763 in PR_Lock () from /usr/lib/libnspr4.so
#5  0x0045cad6 in agmt_notify_change (agmt=0xb3d04f18, pb=0xb2700dc0)
    at ldap/servers/plugins/replication/repl5_agmt.c:1575
#6  0x0045fa06 in agmtlist_notify_all (pb=0xb2700dc0)
    at ldap/servers/plugins/replication/repl5_agmtlist.c:664
#7  0x00469cf1 in process_postop (pb=0xb2700dc0)
    at ldap/servers/plugins/replication/repl5_plugins.c:1105
#8  0x004695c4 in multimaster_postop_modify (pb=0xb2700dc0)
    at ldap/servers/plugins/replication/repl5_plugins.c:828
#9  0x00872270 in plugin_call_func (list=0x958d228, operation=521,
    pb=0xb2700dc0, call_one=0) at ldap/servers/slapd/plugin.c:1417
#10 0x0087214b in plugin_call_list (list=0x958fb60, operation=521,
    pb=0xb2700dc0) at ldap/servers/slapd/plugin.c:1379
#11 0x008708e4 in plugin_call_plugins (pb=0xb2700dc0, whichfunction=521)
    at ldap/servers/slapd/plugin.c:372
#12 0x00863a71 in op_shared_modify (pb=0xb2700dc0, pw_change=0, old_pw=0x0)
    at ldap/servers/slapd/modify.c:863
#13 0x00862de9 in modify_internal_pb (pb=0xb2700dc0)
    at ldap/servers/slapd/modify.c:526
#14 0x00862a5d in slapi_modify_internal_pb (pb=0xb2700dc0)
    at ldap/servers/slapd/modify.c:416
#15 0x0045cdba in agmt_replica_init_done (agmt=0xb3d04f18)
    at ldap/servers/plugins/replication/repl5_agmt.c:1699
#16 0x0046aa90 in prot_thread_main (arg=0xb27eda18)
    at ldap/servers/plugins/replication/repl5_protocol.c:358
#17 0x06ccdb42 in ?? () from /usr/lib/libnspr4.so
#18 0x005b6ab5 in start_thread () from /lib/libpthread.so.0
#19 0x001eadce in clone () from /lib/libc.so.6

Comment 10 Noriko Hosoi 2010-03-03 18:47:05 UTC
Created attachment 397633 [details]
git patch file

Description:
Back off this commit:
  commit 4205086e4f237a52eb9113cd95f9cf87b39e9ed4
  Date:   Mon Feb 22 08:49:49 2010 -0800
since this change could cause the deadlock between the thread
eventually calling prot_free, which acquired the agreement lock,
and other threads waiting for the agreement lock, which prevents
the protocol stop.

Instead of waiting for prot_thread_main done in prot_free, let
prot_thread_main check the existence of the protocol field in
the agreement.  If it's not available, prot_thread_main quits.

Files:
 ldap/servers/plugins/replication/repl5.h
 ldap/servers/plugins/replication/repl5_agmt.c
 ldap/servers/plugins/replication/repl5_protocol.c

Comment 11 Noriko Hosoi 2010-03-03 22:36:40 UTC
Reviewed by Rich (Thank you!!)

Pushed to master and Directory_Server_8_2_Branch.

$ git merge 539618
Updating e8f5064..be57c97
Fast forward
 ldap/servers/plugins/replication/repl5.h          |    2 +-
 ldap/servers/plugins/replication/repl5_agmt.c     |   14 +++++-
 ldap/servers/plugins/replication/repl5_protocol.c |   49 ++++++--------------
 3 files changed, 27 insertions(+), 38 deletions(-)
$ git push
Counting objects: 17, done.
Delta compression using 4 threads.
Compressing objects: 100% (9/9), done.
Writing objects: 100% (9/9), 1.26 KiB, done.
Total 9 (delta 7), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   e8f5064..be57c97  master -> master

$ git push origin ds82-local:Directory_Server_8_2_Branch
Counting objects: 17, done.
Delta compression using 4 threads.
Compressing objects: 100% (9/9), done.
Writing objects: 100% (9/9), 14.78 KiB, done.
Total 9 (delta 6), reused 1 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   8e141ee..ca55f32  ds82-local -> Directory_Server_8_2_Branch


Note You need to log in before you can comment on or make changes to this bug.