528567 – Deadlock can be caused by online backend maintenance operations

Bug 528567 - Deadlock can be caused by online backend maintenance operations

Summary: Deadlock can be caused by online backend maintenance operations

Keywords:
Status:	CLOSED DUPLICATE of bug 730387
Alias:	None
Product:	389
Classification:	Retired
Component:	Directory Server
Sub Component:
Version:	1.2.1
Hardware:	All
OS:	Linux
Priority:	high
Severity:	medium
Target Milestone:	---
Assignee:	Nathan Kinder
QA Contact:	Chandrasekar Kannan
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	580781
Depends On:
Blocks:	389_1.3.0 512820 690319

Reported:	2009-10-12 20:40 UTC by Nathan Kinder
Modified:	2015-01-04 23:40 UTC
CC List:	7 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2012-01-06 22:20:11 UTC
Embargoed:

Description Nathan Kinder 2009-10-12 20:40:20 UTC

This issue was encountered by the freeIPA project as described in bug 528209.

Because of the way we use PR_RWLock to protect the backend structure, a deadlock can occur during an online backend maintenance operation (such as reindexing).

A waiting writer on a PR_RWLock blocks any thread attempting to get a new read lock.  If a thread uses read locks in a re-entrant manner, a request for a write lock from another thread between the two read lock calls will cause a deadlock (the writer is waiting for the reader to release its lock, and the reader can't get the re-entrant lock since the writer is waiting).
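
The sequence above can be replayed with a small single-threaded model. This is only a sketch of the admission policy as described in this report, not NSPR code: the toy_* names are hypothetical, and the "try" functions return false at the point where a real thread would block.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy, single-threaded model of a writer-preferring read/write lock.
 * It only tracks who would be admitted; it is not NSPR code. */
struct toy_rwlock {
    int  active_readers;   /* read locks currently held */
    int  waiting_writers;  /* writers queued for the lock */
    bool writer_active;    /* a writer currently holds the lock */
};

/* A new read request is admitted only if no writer holds or awaits the lock. */
static bool toy_try_rlock(struct toy_rwlock *l)
{
    if (l->writer_active || l->waiting_writers > 0)
        return false;              /* a real caller would block here */
    l->active_readers++;
    return true;
}

/* A write request is admitted only when the lock is completely free;
 * otherwise the writer joins the wait queue. */
static bool toy_try_wlock(struct toy_rwlock *l)
{
    if (l->writer_active || l->active_readers > 0) {
        l->waiting_writers++;      /* a real caller would block here */
        return false;
    }
    l->writer_active = true;
    return true;
}

/* Replays the sequence from the description.  Returns true if both
 * threads end up blocked on each other. */
static bool toy_replay_deadlock(void)
{
    struct toy_rwlock be_lock = {0, 0, false};

    bool first_read  = toy_try_rlock(&be_lock);  /* operation thread: outer read lock */
    bool write       = toy_try_wlock(&be_lock);  /* task thread: wants the write lock */
    bool second_read = toy_try_rlock(&be_lock);  /* operation thread: nested read lock */

    assert(first_read);  /* the outer read lock is granted on a free lock */

    /* The writer waits for the reader; the reader's nested request
     * waits for the writer. */
    return !write && !second_read;
}
```

Calling toy_replay_deadlock() returns true in this model: the write request is queued behind the outer read lock, and the nested read request is refused because a writer is queued.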

This deadlock can happen when a post-op plug-in is called since a read lock on the backend is already held before the plug-ins are called and not released until after the plug-ins finish.  If any of the plug-ins do some sort of internal operation on the same backend, the backend will end up being locked again by the same thread.  This typically isn't a problem, but if any other thread attempts to get a write lock on the same backend (such as the db2index task), a deadlock can occur.

Comment 1 Nathan Kinder 2009-10-12 20:46:33 UTC

Here are the backtraces of the two threads involved in this deadlock:

Writers have priority on a PR_RWLock, so this thread (thread 5) is blocked by thread 2, which is trying to get the write lock.  Thread 5 is already holding a read lock on the backend from the initial operation that triggered the memberOf plug-in, which prevents thread 2 from getting the write lock.

Thread 5 (Thread 1484876096 (LWP 21349)):
#0  0x00000031bfa0a496 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00000031d1c2274d in PR_WaitCondVar (cvar=0x1b897e70, timeout=4294967295)
    at ../mozilla/nsprpub/pr/src/pthreads/ptsynch.c:405
#2  0x00000031d1c13f02 in PR_RWLock_Rlock (rwlock=0x1b778a30)
    at ../mozilla/nsprpub/pr/src/threads/prrwlock.c:246
#3  0x00002b348299ab77 in mtn_get_be (target_node=0x1b719830, pb=0x1b824670,
    be=0x5880ad68, index=0x5880ad8c, referral=0x5880ad60,
    errorbuf=0x5880b510 "") at ldap/servers/slapd/mapping_tree.c:2530
#4  0x00002b348299d099 in slapi_mapping_tree_select_all (pb=0x1b824670,
    be_list=0x5880b150, referral_list=0x5880ae30, errorbuf=0x5880b510 "")
    at ldap/servers/slapd/mapping_tree.c:2118
#5  0x00002b34829a3ec3 in op_shared_search (pb=0x1b824670, send_result=1)
    at ldap/servers/slapd/opshared.c:363
#6  0x00002b34829acf54 in search_internal_callback_pb (pb=0x1b824670,
    callback_data=0x5880fa10, prc=<value optimized out>,
    psec=<value optimized out>,
    prec=0x2b34829ad300 <internal_plugin_search_referral_callback>)
    at ldap/servers/slapd/plugin_internal_op.c:761
#7  0x00002b34829ad14d in search_internal_pb (pb=0x1b824670)
    at ldap/servers/slapd/plugin_internal_op.c:611
#8  0x00002b34829adc31 in slapi_search_internal_get_entry (dn=0x1c6de410,
    attrs=0x5880fb90, ret_entry=0x5880fbf0, component_identity=0x1b77b870)
    at ldap/servers/slapd/plugin_internal_op.c:891
#9  0x00002b34880888d0 in memberof_modop_one_replace_r (pb=0x1c6dde90,
    config=0x5880fd20, mod_op=0,
    group_dn=0x1c6e1b50 "cn=certificate_status,cn=taskgroups,cn=accounts,dc=example,dc=com",
    op_this=0x1c6e1b50 "cn=certificate_status,cn=taskgroups,cn=accounts,dc=example,dc=com",
    replace_with=0x0,
    op_to=0x1c6ef8c0 "cn=certadmin,cn=rolegroups,cn=accounts,dc=example,dc=com",
    stack=0x0) at ldap/servers/plugins/memberof/memberof.c:909
#10 0x00002b34880891ea in memberof_modop_one_r (pb=0x1b897e7c, config=0x0,
    mod_op=1,
    group_dn=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>,
    op_this=0x1b72f9c0 "", op_to=<value optimized out>, stack=0x0)
    at ldap/servers/plugins/memberof/memberof.c:880
#11 0x00002b348808930e in memberof_mod_attr_list_r (pb=0x1c6dde90,
    config=0x5880fd20, mod=0,
    group_dn=0x1c6e1b50 "cn=certificate_status,cn=taskgroups,cn=accounts,dc=example,dc=com",
    op_this=0x1c6e1b50 "cn=certificate_status,cn=taskgroups,cn=accounts,dc=example,dc=com",
    attr=0x1c6e3fd0, stack=0x0)
    at ldap/servers/plugins/memberof/memberof.c:1315
#12 0x00002b34880893c7 in memberof_mod_attr_list (pb=0x1b897e7c, config=0x0,
    mod=1,
    group_dn=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>,
    attr=0x0) at ldap/servers/plugins/memberof/memberof.c:1260
#13 0x00002b348808969a in memberof_postop_add (pb=0x1c6dde90)
    at ldap/servers/plugins/memberof/memberof.c:1339
#14 0x00002b34829a9f3d in plugin_call_func (list=0x1b781950, operation=507,
    pb=0x1c6dde90, call_one=0) at ldap/servers/slapd/plugin.c:1369
#15 0x00002b34829aa0ae in plugin_call_plugins (pb=0x1c6dde90,
    whichfunction=507) at ldap/servers/slapd/plugin.c:1331
#16 0x00002b348296baeb in op_shared_add (pb=0x1c6dde90)
    at ldap/servers/slapd/add.c:669
#17 0x00002b348296c9a7 in do_add (pb=0x1c6dde90)
    at ldap/servers/slapd/add.c:225
#18 0x0000000000412841 in connection_threadmain ()
    at ldap/servers/slapd/connection.c:487
#19 0x00000031d1c27d7d in _pt_root (arg=<value optimized out>)
    at ../mozilla/nsprpub/pr/src/pthreads/ptthread.c:221
#20 0x00000031bfa062f7 in start_thread () from /lib64/libpthread.so.0
#21 0x00000031beed1b6d in clone () from /lib64/libc.so.6

This thread (thread 2) is attempting to get the write lock to perform a reindexing task.  This blocks thread 5 from getting the second read lock, which in turn blocks this thread since thread 5 won't release its first read lock.

Thread 2 (Thread 1516345664 (LWP 21592)):
#0  0x00000031bfa0a496 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00000031d1c2274d in PR_WaitCondVar (cvar=0x1b899030, timeout=4294967295)
    at ../mozilla/nsprpub/pr/src/pthreads/ptsynch.c:405
#2  0x00000031d1c13eb6 in PR_RWLock_Wlock (rwlock=0x1b778a30)
    at ../mozilla/nsprpub/pr/src/threads/prrwlock.c:298
#3  0x00002b3487bee27d in instance_set_busy_and_readonly (inst=0x1b8975d0)
    at ldap/servers/slapd/back-ldbm/misc.c:171
#4  0x00002b3487bea90f in ldbm_back_ldbm2index (pb=0x1c6eb370)
    at ldap/servers/slapd/back-ldbm/ldif2ldbm.c:1423
#5  0x00002b34829c82b0 in task_index_thread (arg=<value optimized out>)
    at ldap/servers/slapd/task.c:1531
#6  0x00000031d1c27d7d in _pt_root (arg=<value optimized out>)
    at ../mozilla/nsprpub/pr/src/pthreads/ptthread.c:221
#7  0x00000031bfa062f7 in start_thread () from /lib64/libpthread.so.0
#8  0x00000031beed1b6d in clone () from /lib64/libc.so.6

Comment 2 Nathan Kinder 2009-10-12 20:57:42 UTC

I'm going to make a proposal to the NSPR maintainers that we modify PR_RWLock to behave differently when a re-entrant read lock request is made.  If a thread already holds a read lock and tries to get another read lock, this should be allowed, even if a writer is waiting on the write lock.  Any other threads attempting to get a read lock will have to wait on the writer since it is given priority.

This approach would prevent the writer from being starved due to many active readers, yet it would also allow for safe re-entrant use of read locks without chance of a deadlock.
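
A minimal sketch of the proposed policy, again as a single-threaded model rather than NSPR code. The toy_rr_* names, the fixed-size per-thread hold table, and the integer thread ids are all assumptions made for illustration; a real implementation would need its own way to track which threads hold read locks.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_THREADS 4

/* Toy model of the proposed policy: a thread that already holds a read
 * lock may take another one even while a writer is waiting. */
struct toy_rrwlock {
    int  read_holds[TOY_MAX_THREADS]; /* read locks held, per thread id */
    int  active_readers;              /* sum of read_holds */
    int  waiting_writers;             /* writers queued for the lock */
    bool writer_active;               /* a writer currently holds the lock */
};

static bool toy_rr_try_rlock(struct toy_rrwlock *l, int tid)
{
    assert(tid >= 0 && tid < TOY_MAX_THREADS);
    if (l->writer_active)
        return false;
    /* Waiting writers only hold back threads with no read lock yet. */
    if (l->waiting_writers > 0 && l->read_holds[tid] == 0)
        return false;
    l->read_holds[tid]++;
    l->active_readers++;
    return true;
}

static void toy_rr_runlock(struct toy_rrwlock *l, int tid)
{
    assert(l->read_holds[tid] > 0);
    l->read_holds[tid]--;
    l->active_readers--;
}

/* Writer asks for the lock; while readers are active it stays queued. */
static bool toy_rr_try_wlock(struct toy_rrwlock *l, bool already_queued)
{
    if (l->writer_active || l->active_readers > 0) {
        if (!already_queued)
            l->waiting_writers++;
        return false;
    }
    if (already_queued)
        l->waiting_writers--;
    l->writer_active = true;
    return true;
}

/* Replays the deadlock sequence under the proposed policy.  Returns
 * true if every step behaves as the proposal describes. */
static bool toy_rr_replay(void)
{
    struct toy_rrwlock be_lock = {{0}, 0, 0, false};
    const int op_thread = 0, other_thread = 1;

    bool outer    = toy_rr_try_rlock(&be_lock, op_thread);    /* outer read lock */
    bool write    = toy_rr_try_wlock(&be_lock, false);        /* writer is queued */
    bool nested   = toy_rr_try_rlock(&be_lock, op_thread);    /* re-entrant read */
    bool new_read = toy_rr_try_rlock(&be_lock, other_thread); /* new reader waits */

    toy_rr_runlock(&be_lock, op_thread);
    toy_rr_runlock(&be_lock, op_thread);
    bool write_retry = toy_rr_try_wlock(&be_lock, true);      /* writer proceeds */

    return outer && !write && nested && !new_read && write_retry;
}
```

In this model the nested read lock is granted, the unrelated reader is still held back behind the waiting writer, and the writer gets the lock once the operation thread releases both of its read locks.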

I am hoping that this proposal is accepted since I have been told that it is safe to make re-entrant read locks when using PR_RWLock by one of the NSPR maintainers (Wan-Teh Chang).  I believe this deadlock to be a corner case that was simply missed by the NSPR developers.

Comment 3 Nathan Kinder 2009-11-05 18:41:01 UTC

I have filed a bug against NSPR for this issue:

https://bugzilla.mozilla.org/show_bug.cgi?id=526805

Comment 4 Rich Megginson 2010-04-09 15:18:24 UTC

*** Bug 580781 has been marked as a duplicate of this bug. ***

Comment 6 Martin Kosek 2012-01-04 13:41:47 UTC

Upstream ticket:
https://fedorahosted.org/389/ticket/100

Comment 7 Rich Megginson 2012-01-06 22:20:11 UTC


*** This bug has been marked as a duplicate of bug 730387 ***
