Bug 863576 - Dirsrv deadlock locking up IPA
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: 389-ds-base
6.4
x86_64 Linux
Priority: unspecified Severity: unspecified
: rc
: 6.4
Assigned To: Noriko Hosoi
Sankar Ramalingam
Depends On:
Blocks:
Reported: 2012-10-05 16:49 EDT by baiesi
Modified: 2013-02-21 03:21 EST (History)
4 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When a simple paged results request is abandoned, the abandoning code tries to acquire the connection lock twice. Consequence: Since the connection lock is not re-entrant, the thread waits for the lock forever and the server stops responding. Fix: This patch eliminates the second lock acquisition. Result: Abandoning a simple paged results request no longer deadlocks the Directory Server.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-02-21 03:21:01 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
git patch file (389-ds-base-1.2.11 branch) (12.98 KB, patch)
2012-10-05 21:10 EDT, Noriko Hosoi
rmeggins: review+
Details | Diff

Description baiesi 2012-10-05 16:49:21 EDT
Description of problem:
Dirsrv deadlock in abandon

Version-Release number of selected component (if applicable):
ipa-server-2.2.0-16.el6.x86_64
389-ds-base-1.2.11.13-1.el6.x86_64

How reproducible: Have not reproduced yet...

Test Env:
2 IPA Master Servers
2 IPA Clients
10K users
10 Groups 1k per group
Sudo command setup

Activities:
-Running Manually triggered admin through the UI (del sudo rules)
-Hourly running scheduled ssh sudo runtime client load (10 threads)

The system was running fine under load but began failing around 5:00 AM and became inoperative: kinit no longer worked and the service could not be restarted. The cause was determined to be a DirSrv deadlock.

Additional info:
Dev Noriko Hosoi indicates:

This is a self-deadlock in abandon... I see 3 threads waiting for the
same mutex. The first one, thread 23, already acquired the mutex in
do_abandon, but it tries to grab it again in
pagedresults_free_one_msgid.

Thread 23 (Thread 0x7f446b458700 (LWP 22023)):
#0 0x0000003ffec0e054 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003ffec093be in _L_lock_995 () from /lib64/libpthread.so.0
#2 0x0000003ffec09326 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x0000003001c23f79 in PR_Lock () from /lib64/libnspr4.so
#4 0x00000030014889f9 in pagedresults_free_one_msgid (conn=0x7f4470e68670,
msgid=31) at ldap/servers/slapd/pagedresults.c:272
#5 0x000000000040d4ad in do_abandon (pb=0xa56ee70)
at ldap/servers/slapd/abandon.c:156
#6 0x0000000000414104 in connection_dispatch_operation ()
at ldap/servers/slapd/connection.c:646
#7 connection_threadmain () at ldap/servers/slapd/connection.c:2338
#8 0x0000003001c299e3 in ?? () from /lib64/libnspr4.so
#9 0x0000003ffec07851 in start_thread () from /lib64/libpthread.so.0
#10 0x0000003ffe8e767d in clone () from /lib64/libc.so.6
(gdb) p conn
$6 = (Connection *) 0x7f4470e68670
(gdb) p *conn
$7 = {c_sb = 0x88b1610, c_sd = 84, c_ldapversion = 3,
c_dn = 0xa760d90
"fqdn=sti-high-3.testrelm.com,cn=computers,cn=accounts,dc=testrelm,dc=com",
c_isroot = 0, c_isreplication_session = 0,
c_authtype = 0x995b480 "SASL GSSAPI", c_external_dn = 0x0,
c_external_authtype = 0x30014cf896 "none", cin_addr = 0xae24a80,
cin_destaddr = 0xa607410, c_domain = 0x0, c_ops = 0xae9e640,
c_gettingber = 0, c_currentber = 0x0, c_starttime = 1349254852,
c_connid = 140400, c_opsinitiated = 59, c_opscompleted = 57,
c_threadnumber = 2, c_refcnt = 3, c_mutex = 0x824f3e0,
c_pdumutex = 0x89d5060, c_idlesince = 1349254862, c_private = 0x890d590,
c_flags = 34, c_needpw = 0, c_client_cert = 0x0, c_prfd = 0x85bac50,
c_ci = 84, c_fdi = -1, c_next = 0x7f4470e682c8, c_prev = 0x7f4470e69ed0,
c_bi_backend = 0x0, c_extension = 0xb145440, c_sasl_conn = 0xb5e9990,
c_local_ssf = 0, c_sasl_ssf = 56, c_ssl_ssf = 0, c_unix_local = 0,
c_local_valid = 0, c_local_uid = 0, c_local_gid = 0, c_pagedresults = {
prl_maxlen = 64, prl_count = 6, prl_list = 0x94f5c30},
c_push_io_layer_cb = 0, c_pop_io_layer_cb = 0, c_io_layer_cb_data = 0x0}

Thread 19 (Thread 0x7f4468c54700 (LWP 22027)):
#0 0x0000003ffec0e054 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003ffec093be in _L_lock_995 () from /lib64/libpthread.so.0
#2 0x0000003ffec09326 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x0000003001c23f79 in PR_Lock () from /lib64/libnspr4.so
#4 0x00000030014883fd in pagedresults_set_search_result
(conn=0x7f4470e68670,
sr=0x0, locked=0, index=0) at ldap/servers/slapd/pagedresults.c:360
#5 0x00007f447b1819fc in ldbm_back_next_search_entry_ext (pb=0xb02fca0,
use_extension=0) at ldap/servers/slapd/back-ldbm/ldbm_search.c:1426
#6 0x00000030014854c1 in iterate (pb=0xb02fca0, be=<value optimized out>,
pnentries=0x7f4468c51394, pagesize=1000, pr_statp=0x7f4468c51388,
send_result=1) at ldap/servers/slapd/opshared.c:1250
#7 0x00000030014859d7 in send_results_ext (pb=0xb02fca0,
nentries=0x7f4468c51394, pagesize=1000, pr_stat=0x7f4468c51388,
send_result=1) at ldap/servers/slapd/opshared.c:1651
#8 0x00000030014866cb in op_shared_search (pb=0xb02fca0, send_result=1)
at ldap/servers/slapd/opshared.c:828
#9 0x00000000004263a4 in do_search (pb=<value optimized out>)
at ldap/servers/slapd/search.c:400
#10 0x000000000041426a in connection_dispatch_operation ()
at ldap/servers/slapd/connection.c:621
#11 connection_threadmain () at ldap/servers/slapd/connection.c:2338
#12 0x0000003001c299e3 in ?? () from /lib64/libnspr4.so
#13 0x0000003ffec07851 in start_thread () from /lib64/libpthread.so.0
#14 0x0000003ffe8e767d in clone () from /lib64/libc.so.6
(gdb) p conn
$4 = (Connection *) 0x7f4470e68670
(gdb) p *conn
$5 = {c_sb = 0x88b1610, c_sd = 84, c_ldapversion = 3,
c_dn = 0xa760d90
"fqdn=sti-high-3.testrelm.com,cn=computers,cn=accounts,dc=testrelm,dc=com",
c_isroot = 0, c_isreplication_session = 0,
c_authtype = 0x995b480 "SASL GSSAPI", c_external_dn = 0x0,
c_external_authtype = 0x30014cf896 "none", cin_addr = 0xae24a80,
cin_destaddr = 0xa607410, c_domain = 0x0, c_ops = 0xae9e640,
c_gettingber = 0, c_currentber = 0x0, c_starttime = 1349254852,
c_connid = 140400, c_opsinitiated = 59, c_opscompleted = 57,
c_threadnumber = 2, c_refcnt = 3, c_mutex = 0x824f3e0,
c_pdumutex = 0x89d5060, c_idlesince = 1349254862, c_private = 0x890d590,
c_flags = 34, c_needpw = 0, c_client_cert = 0x0, c_prfd = 0x85bac50,
c_ci = 84, c_fdi = -1, c_next = 0x7f4470e682c8, c_prev = 0x7f4470e69ed0,
c_bi_backend = 0x0, c_extension = 0xb145440, c_sasl_conn = 0xb5e9990,
c_local_ssf = 0, c_sasl_ssf = 56, c_ssl_ssf = 0, c_unix_local = 0,
c_local_valid = 0, c_local_uid = 0, c_local_gid = 0, c_pagedresults = {
prl_maxlen = 64, prl_count = 6, prl_list = 0x94f5c30},
c_push_io_layer_cb = 0, c_pop_io_layer_cb = 0, c_io_layer_cb_data = 0x0}

Thread 1 (Thread 0x7f44818d87c0 (LWP 21797)):
#0 0x0000003ffec0e054 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003ffec093be in _L_lock_995 () from /lib64/libpthread.so.0
#2 0x0000003ffec09326 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x0000003001c23f79 in PR_Lock () from /lib64/libnspr4.so
#4 0x0000000000417ef6 in setup_pr_read_pds (ports=0x7fff13810910)
at ldap/servers/slapd/daemon.c:1675
#5 slapd_daemon (ports=0x7fff13810910) at ldap/servers/slapd/daemon.c:1144
#6 0x000000000041f22f in main (argc=7, argv=0x7fff13810ca8)
at ldap/servers/slapd/main.c:1253
(gdb) p c
$1 = (Connection *) 0x7f4470e68670
(gdb) p *c
$2 = {c_sb = 0x88b1610, c_sd = 84, c_ldapversion = 3,
c_dn = 0xa760d90
"fqdn=sti-high-3.testrelm.com,cn=computers,cn=accounts,dc=testrelm,dc=com",
c_isroot = 0, c_isreplication_session = 0,
c_authtype = 0x995b480 "SASL GSSAPI", c_external_dn = 0x0,
c_external_authtype = 0x30014cf896 "none", cin_addr = 0xae24a80,
cin_destaddr = 0xa607410, c_domain = 0x0, c_ops = 0xae9e640,
c_gettingber = 0, c_currentber = 0x0, c_starttime = 1349254852,
c_connid = 140400, c_opsinitiated = 59, c_opscompleted = 57,
c_threadnumber = 2, c_refcnt = 3, c_mutex = 0x824f3e0,
c_pdumutex = 0x89d5060, c_idlesince = 1349254862, c_private = 0x890d590,
c_flags = 34, c_needpw = 0, c_client_cert = 0x0, c_prfd = 0x85bac50,
c_ci = 84, c_fdi = -1, c_next = 0x7f4470e682c8, c_prev = 0x7f4470e69ed0,
c_bi_backend = 0x0, c_extension = 0xb145440, c_sasl_conn = 0xb5e9990,
c_local_ssf = 0, c_sasl_ssf = 56, c_ssl_ssf = 0, c_unix_local = 0,
c_local_valid = 0, c_local_uid = 0, c_local_gid = 0, c_pagedresults = {
prl_maxlen = 64, prl_count = 6, prl_list = 0x94f5c30},
c_push_io_layer_cb = 0, c_pop_io_layer_cb = 0, c_io_layer_cb_data = 0x0}
Comment 1 Noriko Hosoi 2012-10-05 21:10:57 EDT
Created attachment 622544 [details]
git patch file (389-ds-base-1.2.11 branch)

Bug Description: Abandoning a Simple Paged Results request causes
a self-deadlock.  When abandoning a simple paged results request,
the connection object's mutex c_mutex is locked in do_abandon.
But to free a paged results message id, pagedresults_free_one_msgid,
called from do_abandon, tries to acquire c_mutex again.
The NSPR lock function PR_Lock is not re-entrant, so the
server hangs there in a self-deadlock.

Fix Description: This patch removes the PR_Lock(c_mutex) call
from pagedresults_free_one_msgid and renames the function to
pagedresults_free_one_msgid_nolock.  For consistency, the "_nolock"
suffix is added to the other pagedresults APIs that do not take the
lock themselves.  Also, stricter locking on c_mutex is added to
pagedresults_parse_control_value to protect the pagedresults-related
fields in the connection object.
Comment 2 Rich Megginson 2012-10-08 10:21:19 EDT
Upstream ticket:
https://fedorahosted.org/389/ticket/485
Comment 3 Sankar Ramalingam 2012-10-08 10:41:01 EDT
QA Acked.
Comment 4 Sankar Ramalingam 2012-10-08 10:45:54 EDT
Hi Noriko,
   Would it be possible to reproduce this issue with Directory Server?
Thanks,
-Sankar R.
Comment 5 Noriko Hosoi 2012-10-08 12:48:19 EDT
(In reply to comment #4)
> Hi Noriko,
>    Would it be possible to reproduce this issue with Directory Server?
> Thanks,
> -Sankar R.

It requires IPA testing.  I think Bruce can help you.
Comment 6 Noriko Hosoi 2012-10-08 12:58:25 EDT
Reviewed by Rich (Thank you!!)

Pushed to master.

$ git push
Counting objects: 19, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (10/10), done.
Writing objects: 100% (10/10), 1.78 KiB, done.
Total 10 (delta 8), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   cd48bbd..c19bb9d  master -> master

Cherry-picked and pushed to external 389-ds-base-1.2.11.

$ git cherry-pick -x -e c19bb9dd1e95ee98a53a06f3d7eefb4dce5bc0ef

$ git push origin 389-ds-base-1.2.11:389-ds-base-1.2.11
Counting objects: 19, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (10/10), done.
Writing objects: 100% (10/10), 1.83 KiB, done.
Total 10 (delta 8), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   009fd8c..4d82507  389-ds-base-1.2.11 -> 389-ds-base-1.2.11
Comment 7 Noriko Hosoi 2012-10-08 14:35:06 EDT
This fix is going to be in 1.2.11.16.
Comment 8 baiesi 2012-10-09 09:29:04 EDT
(In reply to comment #4)
Hi Sankar R. I've only seen this issue once, where the system locks up and becomes unusable.  Once the fix is in I will install, configure and run the same runtime client load to see if I can reproduce the issue once more, hopefully not...

Thanks Rich Megginson and Noriko Hosoi for being right on top of this issue...
Comment 10 baiesi 2012-10-16 16:53:28 EDT
With IPA server version 3.0.0.rc2. API version 2.44 on Rhel6.3 installed in my test environment,  I'm currently running system tests watching for this specific defect.  Currently my plan is to run the test environment for a minimum of a 7 day period. The test env consists of 2 Ipa servers and 2 Ipa clients.
Comment 11 Noriko Hosoi 2012-10-16 17:06:53 EDT
(In reply to comment #10)
> With IPA server version 3.0.0.rc2. API version 2.44 on Rhel6.3 installed in
> my test environment,  I'm currently running system tests watching for this
> specific defect.  Currently my plan is to run the test environment for a
> minimum of a 7 day period. The test env consists of 2 Ipa servers and 2 Ipa
> clients.

Thank you soooo much, Bruce!
Comment 12 baiesi 2012-10-26 15:49:16 EDT
Test Type:
-System Test Defect Verification

Test Run Duration:
-10+ days

Test Env /Config:
-Installed 2012-10-12 13:28:23 -04:00 Job=315555
-IPA Server version 3.0.0.rc2. API version 2.44 on Red Hat Enterprise Linux Server release 6.3 (Santiago)
-Installed 2 Ipa Servers and 2 Ipa clients
-10K Users divided into 10 user groups
-11 sudo rules and approx 1500 sudo commands

Installed Packages:
ipa-server-3.0.0-104.20121011T1837zgit9a305c5.el6.x86_64
ipa-server-selinux-3.0.0-104.20121011T1837zgit9a305c5.el6.x86_64
ipa-client-3.0.0-104.20121011T1837zgit9a305c5.el6.x86_64
389-ds-base-1.2.11.15-2.el6.x86_64
389-ds-base-libs-1.2.11.15-2.el6.x86_64

Load Applied:
-The test environment has been subjected to the tests that previously triggered the defect.  The test load runs a multi-threaded ssh/sudo runtime load simulating 10 concurrent users, while light administration work is applied via the CLI and UI.  Over 10+ days, approx 800k runtime attempts consisting of pos/neg ssh/sudo use cases were applied against the test environment.  The UI admin load use case was to add/delete sudo rules from each IPA server simultaneously each day.

Observations:
-I'm not seeing the reported symptoms/results described in this defect. Not reproduced.

Comments:
-Requests have been made to also run this load again on the latest IPA code running under RHEL 6.4
Comment 13 baiesi 2012-11-13 10:11:05 EST
Test Type:
-System Test Defect Verification

Test Run Duration:
-6+ days

Test Env /Config:
-Red Hat Enterprise Linux Server release 6.4 Beta (Santiago)
-IPA server version 3.0.1.pre1. API version 2.44
-Installed 2012-10-29 10:49:25 -04:00 Job=324310
-Kernel Linux version 2.6.32-330.el6.x86_64
-Installed 2 Ipa Servers and 2 Ipa clients
-10K Users divided into 10 user groups
-11 sudo rules and approx 1500 sudo commands

Installed Packages:
ipa-server-3.0.0-106.20121101T0117zgit5ff4025.el6.x86_64
ipa-client-3.0.0-106.20121101T0117zgit5ff4025.el6.x86_64
libipa_hbac-python-1.9.90-0.20121101T0057zgitf46bf56.el6.x86_64
ipa-pki-ca-theme-9.0.3-7.20121101T0026z.el6.noarch
libipa_hbac-1.9.90-0.20121101T0057zgitf46bf56.el6.x86_64
ipa-pki-common-theme-9.0.3-7.20121101T0026z.el6.noarch
ipa-python-3.0.0-106.20121101T0117zgit5ff4025.el6.x86_64
ipa-admintools-3.0.0-106.20121101T0117zgit5ff4025.el6.x86_64
ipa-server-selinux-3.0.0-106.20121101T0117zgit5ff4025.el6.x86_64
sssd-client-1.9.90-0.20121101T0057zgitf46bf56.el6.x86_64
sssd-1.9.90-0.20121101T0057zgitf46bf56.el6.x86_64
389-ds-base-libs-1.2.11.15-2.el6.x86_64
389-ds-base-debuginfo-1.2.11.15-2.el6.x86_64
389-ds-base-1.2.11.15-2.el6.x86_64

Load Applied:
-The test environment has been subjected to the tests that previously triggered the defect.  The test load runs a multi-threaded ssh/sudo runtime load, cycling through 10k IPA users and simulating 10 concurrent users 24/7, while light administration work is applied via the CLI and UI.  Over 6+ days, pos/neg ssh/sudo use cases were applied against the test environment.  The UI admin load use case was to add/delete sudo rules from each IPA server simultaneously each day.

Observations:
-I'm not seeing the reported symptoms/results described in this defect.  Not able to reproduce it, which is a good thing.

Closing the defect as Verified.
Comment 14 errata-xmlrpc 2013-02-21 03:21:01 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0503.html
