Bug 863576 - Dirsrv deadlock locking up IPA
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: 389-ds-base
6.4
x86_64 Linux
Priority: unspecified Severity: unspecified
: rc
: 6.4
Assigned To: Noriko Hosoi
Sankar Ramalingam
Depends On:
Blocks:
Reported: 2012-10-05 16:49 EDT by baiesi
Modified: 2013-02-21 03:21 EST (History)
4 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When a simple paged results request is abandoned, the abandoning code tries to acquire the connection lock twice. Consequence: Since the connection lock is not re-entrant, the thread waits for the lock forever and the server stops responding. Fix: This patch eliminates the second lock acquisition. Result: Abandoning a simple paged results request no longer deadlocks the Directory Server.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-02-21 03:21:01 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
git patch file (389-ds-base-1.2.11 branch) (12.98 KB, patch)
2012-10-05 21:10 EDT, Noriko Hosoi
rmeggins: review+
Details | Diff

Description baiesi 2012-10-05 16:49:21 EDT
Description of problem:
Dirsrv deadlock in abandon

Version-Release number of selected component (if applicable):
ipa-server-2.2.0-16.el6.x86_64
389-ds-base-1.2.11.13-1.el6.x86_64

How reproducible: Have not reproduced yet...

Test Env:
2 IPA Master Servers
2 IPA Clients
10K users
10 Groups 1k per group
Sudo command setup

Activities:
-Running Manually triggered admin through the UI (del sudo rules)
-Hourly running scheduled ssh sudo runtime client load (10 threads)

The system was running fine under load but began failing around 5:00 AM and became inoperative: kinit no longer worked and the service could not be restarted. The cause was determined to be a DirSrv deadlock.

Additional info:
Dev Noriko Hosoi indicates:

This is a self-deadlock in abandon... I see 3 threads waiting for the
same mutex. The first one, thread 23, already acquired the mutex in
do_abandon, but it tries to grab it again in
pagedresults_free_one_msgid.

Thread 23 (Thread 0x7f446b458700 (LWP 22023)):
#0 0x0000003ffec0e054 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003ffec093be in _L_lock_995 () from /lib64/libpthread.so.0
#2 0x0000003ffec09326 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x0000003001c23f79 in PR_Lock () from /lib64/libnspr4.so
#4 0x00000030014889f9 in pagedresults_free_one_msgid (conn=0x7f4470e68670,
msgid=31) at ldap/servers/slapd/pagedresults.c:272
#5 0x000000000040d4ad in do_abandon (pb=0xa56ee70)
at ldap/servers/slapd/abandon.c:156
#6 0x0000000000414104 in connection_dispatch_operation ()
at ldap/servers/slapd/connection.c:646
#7 connection_threadmain () at ldap/servers/slapd/connection.c:2338
#8 0x0000003001c299e3 in ?? () from /lib64/libnspr4.so
#9 0x0000003ffec07851 in start_thread () from /lib64/libpthread.so.0
#10 0x0000003ffe8e767d in clone () from /lib64/libc.so.6
(gdb) p conn
$6 = (Connection *) 0x7f4470e68670
(gdb) p *conn
$7 = {c_sb = 0x88b1610, c_sd = 84, c_ldapversion = 3,
c_dn = 0xa760d90
"fqdn=sti-high-3.testrelm.com,cn=computers,cn=accounts,dc=testrelm,dc=com",
c_isroot = 0, c_isreplication_session = 0,
c_authtype = 0x995b480 "SASL GSSAPI", c_external_dn = 0x0,
c_external_authtype = 0x30014cf896 "none", cin_addr = 0xae24a80,
cin_destaddr = 0xa607410, c_domain = 0x0, c_ops = 0xae9e640,
c_gettingber = 0, c_currentber = 0x0, c_starttime = 1349254852,
c_connid = 140400, c_opsinitiated = 59, c_opscompleted = 57,
c_threadnumber = 2, c_refcnt = 3, c_mutex = 0x824f3e0,
c_pdumutex = 0x89d5060, c_idlesince = 1349254862, c_private = 0x890d590,
c_flags = 34, c_needpw = 0, c_client_cert = 0x0, c_prfd = 0x85bac50,
c_ci = 84, c_fdi = -1, c_next = 0x7f4470e682c8, c_prev = 0x7f4470e69ed0,
c_bi_backend = 0x0, c_extension = 0xb145440, c_sasl_conn = 0xb5e9990,
c_local_ssf = 0, c_sasl_ssf = 56, c_ssl_ssf = 0, c_unix_local = 0,
c_local_valid = 0, c_local_uid = 0, c_local_gid = 0, c_pagedresults = {
prl_maxlen = 64, prl_count = 6, prl_list = 0x94f5c30},
c_push_io_layer_cb = 0, c_pop_io_layer_cb = 0, c_io_layer_cb_data = 0x0}

Thread 19 (Thread 0x7f4468c54700 (LWP 22027)):
#0 0x0000003ffec0e054 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003ffec093be in _L_lock_995 () from /lib64/libpthread.so.0
#2 0x0000003ffec09326 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x0000003001c23f79 in PR_Lock () from /lib64/libnspr4.so
#4 0x00000030014883fd in pagedresults_set_search_result
(conn=0x7f4470e68670,
sr=0x0, locked=0, index=0) at ldap/servers/slapd/pagedresults.c:360
#5 0x00007f447b1819fc in ldbm_back_next_search_entry_ext (pb=0xb02fca0,
use_extension=0) at ldap/servers/slapd/back-ldbm/ldbm_search.c:1426
#6 0x00000030014854c1 in iterate (pb=0xb02fca0, be=<value optimized out>,
pnentries=0x7f4468c51394, pagesize=1000, pr_statp=0x7f4468c51388,
send_result=1) at ldap/servers/slapd/opshared.c:1250
#7 0x00000030014859d7 in send_results_ext (pb=0xb02fca0,
nentries=0x7f4468c51394, pagesize=1000, pr_stat=0x7f4468c51388,
send_result=1) at ldap/servers/slapd/opshared.c:1651
#8 0x00000030014866cb in op_shared_search (pb=0xb02fca0, send_result=1)
at ldap/servers/slapd/opshared.c:828
#9 0x00000000004263a4 in do_search (pb=<value optimized out>)
at ldap/servers/slapd/search.c:400
#10 0x000000000041426a in connection_dispatch_operation ()
at ldap/servers/slapd/connection.c:621
#11 connection_threadmain () at ldap/servers/slapd/connection.c:2338
#12 0x0000003001c299e3 in ?? () from /lib64/libnspr4.so
#13 0x0000003ffec07851 in start_thread () from /lib64/libpthread.so.0
#14 0x0000003ffe8e767d in clone () from /lib64/libc.so.6
(gdb) p conn
$4 = (Connection *) 0x7f4470e68670
(gdb) p *conn
$5 = {c_sb = 0x88b1610, c_sd = 84, c_ldapversion = 3,
c_dn = 0xa760d90
"fqdn=sti-high-3.testrelm.com,cn=computers,cn=accounts,dc=testrelm,dc=com",
c_isroot = 0, c_isreplication_session = 0,
c_authtype = 0x995b480 "SASL GSSAPI", c_external_dn = 0x0,
c_external_authtype = 0x30014cf896 "none", cin_addr = 0xae24a80,
cin_destaddr = 0xa607410, c_domain = 0x0, c_ops = 0xae9e640,
c_gettingber = 0, c_currentber = 0x0, c_starttime = 1349254852,
c_connid = 140400, c_opsinitiated = 59, c_opscompleted = 57,
c_threadnumber = 2, c_refcnt = 3, c_mutex = 0x824f3e0,
c_pdumutex = 0x89d5060, c_idlesince = 1349254862, c_private = 0x890d590,
c_flags = 34, c_needpw = 0, c_client_cert = 0x0, c_prfd = 0x85bac50,
c_ci = 84, c_fdi = -1, c_next = 0x7f4470e682c8, c_prev = 0x7f4470e69ed0,
c_bi_backend = 0x0, c_extension = 0xb145440, c_sasl_conn = 0xb5e9990,
c_local_ssf = 0, c_sasl_ssf = 56, c_ssl_ssf = 0, c_unix_local = 0,
c_local_valid = 0, c_local_uid = 0, c_local_gid = 0, c_pagedresults = {
prl_maxlen = 64, prl_count = 6, prl_list = 0x94f5c30},
c_push_io_layer_cb = 0, c_pop_io_layer_cb = 0, c_io_layer_cb_data = 0x0}

Thread 1 (Thread 0x7f44818d87c0 (LWP 21797)):
#0 0x0000003ffec0e054 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003ffec093be in _L_lock_995 () from /lib64/libpthread.so.0
#2 0x0000003ffec09326 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x0000003001c23f79 in PR_Lock () from /lib64/libnspr4.so
#4 0x0000000000417ef6 in setup_pr_read_pds (ports=0x7fff13810910)
at ldap/servers/slapd/daemon.c:1675
#5 slapd_daemon (ports=0x7fff13810910) at ldap/servers/slapd/daemon.c:1144
#6 0x000000000041f22f in main (argc=7, argv=0x7fff13810ca8)
at ldap/servers/slapd/main.c:1253
(gdb) p c
$1 = (Connection *) 0x7f4470e68670
(gdb) p *c
$2 = {c_sb = 0x88b1610, c_sd = 84, c_ldapversion = 3,
c_dn = 0xa760d90
"fqdn=sti-high-3.testrelm.com,cn=computers,cn=accounts,dc=testrelm,dc=com",
c_isroot = 0, c_isreplication_session = 0,
c_authtype = 0x995b480 "SASL GSSAPI", c_external_dn = 0x0,
c_external_authtype = 0x30014cf896 "none", cin_addr = 0xae24a80,
cin_destaddr = 0xa607410, c_domain = 0x0, c_ops = 0xae9e640,
c_gettingber = 0, c_currentber = 0x0, c_starttime = 1349254852,
c_connid = 140400, c_opsinitiated = 59, c_opscompleted = 57,
c_threadnumber = 2, c_refcnt = 3, c_mutex = 0x824f3e0,
c_pdumutex = 0x89d5060, c_idlesince = 1349254862, c_private = 0x890d590,
c_flags = 34, c_needpw = 0, c_client_cert = 0x0, c_prfd = 0x85bac50,
c_ci = 84, c_fdi = -1, c_next = 0x7f4470e682c8, c_prev = 0x7f4470e69ed0,
c_bi_backend = 0x0, c_extension = 0xb145440, c_sasl_conn = 0xb5e9990,
c_local_ssf = 0, c_sasl_ssf = 56, c_ssl_ssf = 0, c_unix_local = 0,
c_local_valid = 0, c_local_uid = 0, c_local_gid = 0, c_pagedresults = {
prl_maxlen = 64, prl_count = 6, prl_list = 0x94f5c30},
c_push_io_layer_cb = 0, c_pop_io_layer_cb = 0, c_io_layer_cb_data = 0x0}
Comment 1 Noriko Hosoi 2012-10-05 21:10:57 EDT
Created attachment 622544 [details]
git patch file (389-ds-base-1.2.11 branch)

Bug Description: Abandoning a Simple Paged Results request causes
a self-deadlock.  When abandoning a simple paged results request,
the connection object's mutex c_mutex is locked in do_abandon.
But to free a paged results message id, pagedresults_free_one_msgid,
called from do_abandon, tries to acquire c_mutex again.
The NSPR lock function PR_Lock is not re-entrant, so the
server hangs there in a self-deadlock.

Fix Description: This patch removes the PR_Lock(c_mutex) call
from pagedresults_free_one_msgid and renames the function to
pagedresults_free_one_msgid_nolock.  For consistency, the "_nolock"
suffix is added to the other pagedresults APIs that do not take the
lock themselves.  Also, stricter locking on c_mutex is added to
pagedresults_parse_control_value to protect the pagedresults-related
fields in the connection object.
Comment 2 Rich Megginson 2012-10-08 10:21:19 EDT
Upstream ticket:
https://fedorahosted.org/389/ticket/485
Comment 3 Sankar Ramalingam 2012-10-08 10:41:01 EDT
QA Acked.
Comment 4 Sankar Ramalingam 2012-10-08 10:45:54 EDT
Hi Noriko,
   Would it be possible to reproduce this issue with Directory Server?
Thanks,
-Sankar R.
Comment 5 Noriko Hosoi 2012-10-08 12:48:19 EDT
(In reply to comment #4)
> Hi Noriko,
>    Would it be possible to reproduce this issue with Directory Server?
> Thanks,
> -Sankar R.

It requires IPA testing.  I think Bruce can help you.
Comment 6 Noriko Hosoi 2012-10-08 12:58:25 EDT
Reviewed by Rich (Thank you!!)

Pushed to master.

$ git push
Counting objects: 19, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (10/10), done.
Writing objects: 100% (10/10), 1.78 KiB, done.
Total 10 (delta 8), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   cd48bbd..c19bb9d  master -> master

Cherry-picked and pushed to external 389-ds-base-1.2.11.

$ git cherry-pick -x -e c19bb9dd1e95ee98a53a06f3d7eefb4dce5bc0ef

$ git push origin 389-ds-base-1.2.11:389-ds-base-1.2.11
Counting objects: 19, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (10/10), done.
Writing objects: 100% (10/10), 1.83 KiB, done.
Total 10 (delta 8), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   009fd8c..4d82507  389-ds-base-1.2.11 -> 389-ds-base-1.2.11
Comment 7 Noriko Hosoi 2012-10-08 14:35:06 EDT
This fix is going to be in 1.2.11.16.
Comment 8 baiesi 2012-10-09 09:29:04 EDT
(In reply to comment #4)
Hi Sankar R. I've only seen this issue once, where the system locks up and becomes unusable.  Once the fix is in I will install, configure and run the same runtime client load to see if I can reproduce the issue once more, hopefully not...

Thanks Rich Megginson and Noriko Hosoi for being right on top of this issue...
Comment 10 baiesi 2012-10-16 16:53:28 EDT
With IPA server version 3.0.0.rc2. API version 2.44 on Rhel6.3 installed in my test environment,  I'm currently running system tests watching for this specific defect.  Currently my plan is to run the test environment for a minimum of a 7 day period. The test env consists of 2 Ipa servers and 2 Ipa clients.
Comment 11 Noriko Hosoi 2012-10-16 17:06:53 EDT
(In reply to comment #10)
> With IPA server version 3.0.0.rc2. API version 2.44 on Rhel6.3 installed in
> my test environment,  I'm currently running system tests watching for this
> specific defect.  Currently my plan is to run the test environment for a
> minimum of a 7 day period. The test env consists of 2 Ipa servers and 2 Ipa
> clients.

Thank you soooo much, Bruce!
Comment 12 baiesi 2012-10-26 15:49:16 EDT
Test Type:
-System Test Defect Verification

Test Run Duration:
-10+ days

Test Env /Config:
-Installed 2012-10-12 13:28:23 -04:00 Job=315555
-IPA Server version 3.0.0.rc2. API version 2.44 on Red Hat Enterprise Linux Server release 6.3 (Santiago)
-Installed 2 Ipa Servers and 2 Ipa clients
-10K Users divided into 10 user groups
-11 sudo rules and approx 1500 sudo commands

Installed Packages:
ipa-server-3.0.0-104.20121011T1837zgit9a305c5.el6.x86_64
ipa-server-selinux-3.0.0-104.20121011T1837zgit9a305c5.el6.x86_64
ipa-client-3.0.0-104.20121011T1837zgit9a305c5.el6.x86_64
389-ds-base-1.2.11.15-2.el6.x86_64
389-ds-base-libs-1.2.11.15-2.el6.x86_64

Load Applied:
-The test environment has been subjected to the tests that previously triggered the defect.  The test load runs a multi-threaded ssh/sudo runtime load simulating 10 concurrent users, while light administration work is applied via the CLI and UI.  Over 10+ days, approx 800k runtime attempts consisting of pos/neg ssh/sudo use cases were applied against the test environment.  The UI admin load use case was to add/delete sudo rules from each IPA server simultaneously each day.

Observations:
-I'm not seeing the reported symptoms/results described in this defect. Not reproduced.

Comments:
-Requests have been made to also run this load again on the latest IPA code running under RHEL 6.4
Comment 13 baiesi 2012-11-13 10:11:05 EST
Test Type:
-System Test Defect Verification

Test Run Duration:
-6+ days

Test Env /Config:
-Red Hat Enterprise Linux Server release 6.4 Beta (Santiago)
-IPA server version 3.0.1.pre1. API version 2.44
-Installed 2012-10-29 10:49:25 -04:00 Job=324310
-Kernel Linux version 2.6.32-330.el6.x86_64
-Installed 2 Ipa Servers and 2 Ipa clients
-10K Users divided into 10 user groups
-11 sudo rules and approx 1500 sudo commands

Installed Packages:
ipa-server-3.0.0-106.20121101T0117zgit5ff4025.el6.x86_64
ipa-client-3.0.0-106.20121101T0117zgit5ff4025.el6.x86_64
libipa_hbac-python-1.9.90-0.20121101T0057zgitf46bf56.el6.x86_64
ipa-pki-ca-theme-9.0.3-7.20121101T0026z.el6.noarch
libipa_hbac-1.9.90-0.20121101T0057zgitf46bf56.el6.x86_64
ipa-pki-common-theme-9.0.3-7.20121101T0026z.el6.noarch
ipa-python-3.0.0-106.20121101T0117zgit5ff4025.el6.x86_64
ipa-admintools-3.0.0-106.20121101T0117zgit5ff4025.el6.x86_64
ipa-server-selinux-3.0.0-106.20121101T0117zgit5ff4025.el6.x86_64
sssd-client-1.9.90-0.20121101T0057zgitf46bf56.el6.x86_64
sssd-1.9.90-0.20121101T0057zgitf46bf56.el6.x86_64
389-ds-base-libs-1.2.11.15-2.el6.x86_64
389-ds-base-debuginfo-1.2.11.15-2.el6.x86_64
389-ds-base-1.2.11.15-2.el6.x86_64

Load Applied:
-The test environment has been subjected to the tests that previously triggered the defect.  The test load runs a multi-threaded ssh/sudo runtime load, cycling through 10k IPA users and simulating 10 concurrent users 24/7, while light administration work is applied via the CLI and UI.  Over 6+ days, pos/neg ssh/sudo use cases were applied against the test environment.  The UI admin load use case was to add/delete sudo rules from each IPA server simultaneously each day.

Observations:
-I'm not seeing the reported symptoms/results described in this defect.  Not able to reproduce it, which is a good thing.

Closing the defect as Verified.
Comment 14 errata-xmlrpc 2013-02-21 03:21:01 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0503.html
