Bug 863576
| Summary: | Dirsrv deadlock locking up IPA | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | baiesi |
| Component: | 389-ds-base | Assignee: | Noriko Hosoi <nhosoi> |
| Status: | CLOSED ERRATA | QA Contact: | Sankar Ramalingam <sramling> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 6.4 | CC: | jgalipea, nhosoi, nkinder, rmeggins |
| Target Milestone: | rc | | |
| Target Release: | 6.4 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Cause: When a simple paged results request is abandoned, the abandoning code tries to acquire the connection lock twice. Consequence: Since the connection lock is not re-entrant, the thread waits for the lock forever and the server stops responding. Fix: This patch eliminates the second lock acquisition. Result: Abandoning a simple paged results request no longer deadlocks the Directory Server. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2013-02-21 08:21:01 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Created attachment 622544 [details]
git patch file (389-ds-base-1.2.11 branch)
Bug Description: Abandoning a Simple Paged Results request causes a
self-deadlock. When a simple paged results request is abandoned, the
mutex for the connection object, c_mutex, is locked in do_abandon.
But to free a paged results message id, pagedresults_free_one_msgid,
called from do_abandon, tries to acquire the lock on c_mutex again.
The NSPR lock function PR_Lock is not re-entrant, so the server
hangs there due to the self-deadlock.
Fix Description: This patch removes the PR_Lock(c_mutex) call from
pagedresults_free_one_msgid and renames the function to
pagedresults_free_one_msgid_nolock. For consistency, the "_nolock"
suffix is added to the other pagedresults APIs that do not call
PR_Lock themselves. In addition, stricter locking on c_mutex is
added to pagedresults_parse_control_value to protect the paged
results fields in the connection object.
Upstream ticket: https://fedorahosted.org/389/ticket/485

QA Acked.

Hi Noriko,
Would it be possible to reproduce this issue with Directory Server?
Thanks,
-Sankar R.

(In reply to comment #4)
> Hi Noriko,
> Would it be possible to reproduce this issue with Directory Server?
> Thanks,
> -Sankar R.

It requires IPA testing. I think Bruce can help you.

Reviewed by Rich (Thank you!!)

Pushed to master.
$ git push
Counting objects: 19, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (10/10), done.
Writing objects: 100% (10/10), 1.78 KiB, done.
Total 10 (delta 8), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   cd48bbd..c19bb9d  master -> master

Cherry-picked and pushed to external 389-ds-base-1.2.11.
$ git cherry-pick -x -e c19bb9dd1e95ee98a53a06f3d7eefb4dce5bc0ef
$ git push origin 389-ds-base-1.2.11:389-ds-base-1.2.11
Counting objects: 19, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (10/10), done.
Writing objects: 100% (10/10), 1.83 KiB, done.
Total 10 (delta 8), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   009fd8c..4d82507  389-ds-base-1.2.11 -> 389-ds-base-1.2.11

This fix is going to be in 1.2.11.16.

(In reply to comment #4)
Hi Sankar R.
I've only seen this issue once, where the system locked up and became unusable. Once the fix is in, I will install, configure, and run the same runtime client load to see if I can reproduce the issue once more, hopefully not... Thanks Rich Megginson and Noriko Hosoi for being right on top of this issue...

With IPA server version 3.0.0.rc2, API version 2.44, on RHEL 6.3 installed in my test environment, I'm currently running system tests watching for this specific defect. My current plan is to run the test environment for a minimum of 7 days. The test env consists of 2 IPA servers and 2 IPA clients.

(In reply to comment #10)
> With IPA server version 3.0.0.rc2, API version 2.44, on RHEL 6.3 installed in
> my test environment, I'm currently running system tests watching for this
> specific defect. My current plan is to run the test environment for a
> minimum of 7 days. The test env consists of 2 IPA servers and 2 IPA clients.

Thank you soooo much, Bruce!

Test Type:
-System Test Defect Verification
Test Run Duration:
-10+ days
Test Env/Config:
-Installed 2012-10-12 13:28:23 -04:00 Job=315555
-IPA Server version 3.0.0.rc2, API version 2.44, on Red Hat Enterprise Linux Server release 6.3 (Santiago)
-Installed 2 IPA servers and 2 IPA clients
-10K users divided into 10 user groups
-11 sudo rules and approx. 1500 sudo commands
Installed Packages:
ipa-server-3.0.0-104.20121011T1837zgit9a305c5.el6.x86_64
ipa-server-selinux-3.0.0-104.20121011T1837zgit9a305c5.el6.x86_64
ipa-client-3.0.0-104.20121011T1837zgit9a305c5.el6.x86_64
389-ds-base-1.2.11.15-2.el6.x86_64
389-ds-base-libs-1.2.11.15-2.el6.x86_64
Load Applied:
-The test environment has been subjected to the tests that previously triggered the defect. The test load runs a multi-threaded ssh/sudo runtime load simulating 10 concurrent users while applying light administration work via CLI and UI. Over 10+ days, approx. 800k runtime attempts consisting of positive/negative ssh/sudo use cases were applied against the test environment. The UI admin load use case was to add/delete sudo rules from each IPA server simultaneously each day.
Observations:
-I'm not seeing the symptoms/results described in this defect. Not reproduced.
Comments:
-Requests have been made to also run this load again on the latest IPA code running under RHEL 6.4.

Test Type:
-System Test Defect Verification
Test Run Duration:
-6+ days
Test Env/Config:
-Red Hat Enterprise Linux Server release 6.4 Beta (Santiago)
-IPA server version 3.0.1.pre1, API version 2.44
-Installed 2012-10-29 10:49:25 -04:00 Job=324310
-Kernel Linux version 2.6.32-330.el6.x86_64
-Installed 2 IPA servers and 2 IPA clients
-10K users divided into 10 user groups
-11 sudo rules and approx. 1500 sudo commands
Installed Packages:
ipa-server-3.0.0-106.20121101T0117zgit5ff4025.el6.x86_64
ipa-client-3.0.0-106.20121101T0117zgit5ff4025.el6.x86_64
libipa_hbac-python-1.9.90-0.20121101T0057zgitf46bf56.el6.x86_64
ipa-pki-ca-theme-9.0.3-7.20121101T0026z.el6.noarch
libipa_hbac-1.9.90-0.20121101T0057zgitf46bf56.el6.x86_64
ipa-pki-common-theme-9.0.3-7.20121101T0026z.el6.noarch
ipa-python-3.0.0-106.20121101T0117zgit5ff4025.el6.x86_64
ipa-admintools-3.0.0-106.20121101T0117zgit5ff4025.el6.x86_64
ipa-server-selinux-3.0.0-106.20121101T0117zgit5ff4025.el6.x86_64
sssd-client-1.9.90-0.20121101T0057zgitf46bf56.el6.x86_64
sssd-1.9.90-0.20121101T0057zgitf46bf56.el6.x86_64
389-ds-base-libs-1.2.11.15-2.el6.x86_64
389-ds-base-debuginfo-1.2.11.15-2.el6.x86_64
389-ds-base-1.2.11.15-2.el6.x86_64
Load Applied:
-The test environment has been subjected to the tests that previously triggered the defect. The test load runs a multi-threaded ssh/sudo runtime load, cycling through 10k IPA users and simulating 10 concurrent users 24/7, while applying light administration work via CLI and UI. Over 6+ days, positive/negative ssh/sudo use cases were applied against the test environment. The UI admin load use case was to add/delete sudo rules from each IPA server simultaneously each day.
Observations:
-I'm not seeing the symptoms/results described in this defect. Not able to reproduce, which is a good thing. Closing the defect as Verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0503.html
Description of problem:
Dirsrv deadlock in abandon

Version-Release number of selected component (if applicable):
ipa-server-2.2.0-16.el6.x86_64
389-ds-base-1.2.11.13-1.el6.x86_64

How reproducible:
Have not reproduced yet...

Test Env:
2 IPA Master Servers
2 IPA Clients
10K users, 10 groups, 1k per group
Sudo command setup

Activities:
-Running manually triggered admin work through the UI (deleting sudo rules)
-Hourly running scheduled ssh/sudo runtime client load (10 threads)

The system was running the load fine but began failing in the early morning around 5:00 AM. The system became inoperative: unable to kinit, unable to restart. The cause was determined to be a DirSrv deadlock.

Additional info: Dev Noriko Hosoi indicates:
This is a self-deadlock in abandon... I see 3 threads waiting for the same mutex. The first one, thread 23, already acquired the mutex in do_abandon, but it tries to grab it again in pagedresults_free_one_msgid.

Thread 23 (Thread 0x7f446b458700 (LWP 22023)):
#0  0x0000003ffec0e054 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003ffec093be in _L_lock_995 () from /lib64/libpthread.so.0
#2  0x0000003ffec09326 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000003001c23f79 in PR_Lock () from /lib64/libnspr4.so
#4  0x00000030014889f9 in pagedresults_free_one_msgid (conn=0x7f4470e68670, msgid=31) at ldap/servers/slapd/pagedresults.c:272
#5  0x000000000040d4ad in do_abandon (pb=0xa56ee70) at ldap/servers/slapd/abandon.c:156
#6  0x0000000000414104 in connection_dispatch_operation () at ldap/servers/slapd/connection.c:646
#7  connection_threadmain () at ldap/servers/slapd/connection.c:2338
#8  0x0000003001c299e3 in ?? () from /lib64/libnspr4.so
#9  0x0000003ffec07851 in start_thread () from /lib64/libpthread.so.0
#10 0x0000003ffe8e767d in clone () from /lib64/libc.so.6

(gdb) p conn
$6 = (Connection *) 0x7f4470e68670
(gdb) p *conn
$7 = {c_sb = 0x88b1610, c_sd = 84, c_ldapversion = 3, c_dn = 0xa760d90 "fqdn=sti-high-3.testrelm.com,cn=computers,cn=accounts,dc=testrelm,dc=com", c_isroot = 0, c_isreplication_session = 0, c_authtype = 0x995b480 "SASL GSSAPI", c_external_dn = 0x0, c_external_authtype = 0x30014cf896 "none", cin_addr = 0xae24a80, cin_destaddr = 0xa607410, c_domain = 0x0, c_ops = 0xae9e640, c_gettingber = 0, c_currentber = 0x0, c_starttime = 1349254852, c_connid = 140400, c_opsinitiated = 59, c_opscompleted = 57, c_threadnumber = 2, c_refcnt = 3, c_mutex = 0x824f3e0, c_pdumutex = 0x89d5060, c_idlesince = 1349254862, c_private = 0x890d590, c_flags = 34, c_needpw = 0, c_client_cert = 0x0, c_prfd = 0x85bac50, c_ci = 84, c_fdi = -1, c_next = 0x7f4470e682c8, c_prev = 0x7f4470e69ed0, c_bi_backend = 0x0, c_extension = 0xb145440, c_sasl_conn = 0xb5e9990, c_local_ssf = 0, c_sasl_ssf = 56, c_ssl_ssf = 0, c_unix_local = 0, c_local_valid = 0, c_local_uid = 0, c_local_gid = 0, c_pagedresults = {prl_maxlen = 64, prl_count = 6, prl_list = 0x94f5c30}, c_push_io_layer_cb = 0, c_pop_io_layer_cb = 0, c_io_layer_cb_data = 0x0}

Thread 19 (Thread 0x7f4468c54700 (LWP 22027)):
#0  0x0000003ffec0e054 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003ffec093be in _L_lock_995 () from /lib64/libpthread.so.0
#2  0x0000003ffec09326 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000003001c23f79 in PR_Lock () from /lib64/libnspr4.so
#4  0x00000030014883fd in pagedresults_set_search_result (conn=0x7f4470e68670, sr=0x0, locked=0, index=0) at ldap/servers/slapd/pagedresults.c:360
#5  0x00007f447b1819fc in ldbm_back_next_search_entry_ext (pb=0xb02fca0, use_extension=0) at ldap/servers/slapd/back-ldbm/ldbm_search.c:1426
#6  0x00000030014854c1 in iterate (pb=0xb02fca0, be=<value optimized out>, pnentries=0x7f4468c51394, pagesize=1000, pr_statp=0x7f4468c51388, send_result=1) at ldap/servers/slapd/opshared.c:1250
#7  0x00000030014859d7 in send_results_ext (pb=0xb02fca0, nentries=0x7f4468c51394, pagesize=1000, pr_stat=0x7f4468c51388, send_result=1) at ldap/servers/slapd/opshared.c:1651
#8  0x00000030014866cb in op_shared_search (pb=0xb02fca0, send_result=1) at ldap/servers/slapd/opshared.c:828
#9  0x00000000004263a4 in do_search (pb=<value optimized out>) at ldap/servers/slapd/search.c:400
#10 0x000000000041426a in connection_dispatch_operation () at ldap/servers/slapd/connection.c:621
#11 connection_threadmain () at ldap/servers/slapd/connection.c:2338
#12 0x0000003001c299e3 in ?? () from /lib64/libnspr4.so
#13 0x0000003ffec07851 in start_thread () from /lib64/libpthread.so.0
#14 0x0000003ffe8e767d in clone () from /lib64/libc.so.6

(gdb) p conn
$4 = (Connection *) 0x7f4470e68670
(gdb) p *conn
$5 = {c_sb = 0x88b1610, c_sd = 84, c_ldapversion = 3, c_dn = 0xa760d90 "fqdn=sti-high-3.testrelm.com,cn=computers,cn=accounts,dc=testrelm,dc=com", c_isroot = 0, c_isreplication_session = 0, c_authtype = 0x995b480 "SASL GSSAPI", c_external_dn = 0x0, c_external_authtype = 0x30014cf896 "none", cin_addr = 0xae24a80, cin_destaddr = 0xa607410, c_domain = 0x0, c_ops = 0xae9e640, c_gettingber = 0, c_currentber = 0x0, c_starttime = 1349254852, c_connid = 140400, c_opsinitiated = 59, c_opscompleted = 57, c_threadnumber = 2, c_refcnt = 3, c_mutex = 0x824f3e0, c_pdumutex = 0x89d5060, c_idlesince = 1349254862, c_private = 0x890d590, c_flags = 34, c_needpw = 0, c_client_cert = 0x0, c_prfd = 0x85bac50, c_ci = 84, c_fdi = -1, c_next = 0x7f4470e682c8, c_prev = 0x7f4470e69ed0, c_bi_backend = 0x0, c_extension = 0xb145440, c_sasl_conn = 0xb5e9990, c_local_ssf = 0, c_sasl_ssf = 56, c_ssl_ssf = 0, c_unix_local = 0, c_local_valid = 0, c_local_uid = 0, c_local_gid = 0, c_pagedresults = {prl_maxlen = 64, prl_count = 6, prl_list = 0x94f5c30}, c_push_io_layer_cb = 0, c_pop_io_layer_cb = 0, c_io_layer_cb_data = 0x0}

Thread 1 (Thread 0x7f44818d87c0 (LWP 21797)):
#0  0x0000003ffec0e054 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003ffec093be in _L_lock_995 () from /lib64/libpthread.so.0
#2  0x0000003ffec09326 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000003001c23f79 in PR_Lock () from /lib64/libnspr4.so
#4  0x0000000000417ef6 in setup_pr_read_pds (ports=0x7fff13810910) at ldap/servers/slapd/daemon.c:1675
#5  slapd_daemon (ports=0x7fff13810910) at ldap/servers/slapd/daemon.c:1144
#6  0x000000000041f22f in main (argc=7, argv=0x7fff13810ca8) at ldap/servers/slapd/main.c:1253

(gdb) p c
$1 = (Connection *) 0x7f4470e68670
(gdb) p *c
$2 = {c_sb = 0x88b1610, c_sd = 84, c_ldapversion = 3, c_dn = 0xa760d90 "fqdn=sti-high-3.testrelm.com,cn=computers,cn=accounts,dc=testrelm,dc=com", c_isroot = 0, c_isreplication_session = 0, c_authtype = 0x995b480 "SASL GSSAPI", c_external_dn = 0x0, c_external_authtype = 0x30014cf896 "none", cin_addr = 0xae24a80, cin_destaddr = 0xa607410, c_domain = 0x0, c_ops = 0xae9e640, c_gettingber = 0, c_currentber = 0x0, c_starttime = 1349254852, c_connid = 140400, c_opsinitiated = 59, c_opscompleted = 57, c_threadnumber = 2, c_refcnt = 3, c_mutex = 0x824f3e0, c_pdumutex = 0x89d5060, c_idlesince = 1349254862, c_private = 0x890d590, c_flags = 34, c_needpw = 0, c_client_cert = 0x0, c_prfd = 0x85bac50, c_ci = 84, c_fdi = -1, c_next = 0x7f4470e682c8, c_prev = 0x7f4470e69ed0, c_bi_backend = 0x0, c_extension = 0xb145440, c_sasl_conn = 0xb5e9990, c_local_ssf = 0, c_sasl_ssf = 56, c_ssl_ssf = 0, c_unix_local = 0, c_local_valid = 0, c_local_uid = 0, c_local_gid = 0, c_pagedresults = {prl_maxlen = 64, prl_count = 6, prl_list = 0x94f5c30}, c_push_io_layer_cb = 0, c_pop_io_layer_cb = 0, c_io_layer_cb_data = 0x0}