Bug 684349 - slapd crashing when traffic replayed
Summary: slapd crashing when traffic replayed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: 389-ds-base
Version: 6.1
Hardware: i686
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Rich Megginson
QA Contact: Chandrasekar Kannan
URL:
Whiteboard:
Depends On: 683250
Blocks: 639035 389_1.2.8
Reported: 2011-03-11 21:34 UTC by Rich Megginson
Modified: 2015-01-04 23:47 UTC
CC List: 14 users

Fixed In Version: 389-ds-base-1.2.8-0.7.rc2.el6
Doc Type: Bug Fix
Doc Text:
Clone Of: 683250
Environment:
Last Closed: 2011-05-19 12:42:39 UTC
Target Upstream Version:
Embargoed:


Links
System: Red Hat Product Errata
ID: RHEA-2011:0533
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: new package: 389-ds-base
Last Updated: 2011-05-18 17:57:44 UTC

Description Rich Megginson 2011-03-11 21:34:09 UTC
+++ This bug was initially created as a clone of Bug #683250 +++

+++ This bug was initially created as a clone of Bug #668619 +++
--- Additional comment from jmates on 2011-02-10 13:41:41 EST ---

No hang on RHEL 5.5 with 1.2.8 a2, but instead segmentation faults when tested with replayed production traffic. Gathering gdb output...

--- Additional comment from jmates on 2011-02-10 16:05:13 EST ---

Created attachment 478119
gdb logs

gdb traces from two crashes of 389-ds-base-1.2.8-0.2.a2.el5 in response to replayed production traffic.

--- Additional comment from rmeggins on 2011-02-10 16:21:52 EST ---

(In reply to comment #11)
> Created attachment 478119
> gdb logs
> 
> gdb traces from two crashes of 389-ds-base-1.2.8-0.2.a2.el5 in response to
> replayed production traffic.

Thanks.  The stack looks corrupted.  Are you using SASL/GSSAPI Kerberos authentication?  Can you attach the last few lines from your errors and access logs before the crash?

--- Additional comment from jmates on 2011-02-10 16:51:54 EST ---

Yes, the servers are configured for Kerberos auth (via http://directory.fedoraproject.org/wiki/Howto:Kerberos). Our existing fedora-ds-1.0.4-1.RHEL4 production LDAP servers also use Kerberos auth, and do not exhibit any problems (they were setup well before my time here, though).

The errors log shows just various peer resets, nothing unique at the time of the crash.

[10/Feb/2011:18:18:01 +0000] - PR_Write(118) Netscape Portable Runtime error -5961 (TCP connection reset by peer.)
[10/Feb/2011:18:18:02 +0000] - PR_Write(79) Netscape Portable Runtime error -5961 (TCP connection reset by peer.)

Access shows nothing exciting, though that is buffered. Trying again with unbuffered logs and a higher error log level.

--- Additional comment from jmates on 2011-02-11 13:26:22 EST ---

Unbuffered logging appears to avoid the segmentation faults (for an overnight run of test traffic; buffered logging leads to crashes within minutes), though it hobbles performance to the log write speed. If production load is near or beyond that limit, client requests block or time out, which is not ideal. Shutting down the ns-slapd running with unbuffered logging took 10 (!) minutes after all test traffic was cut off.

--- Additional comment from rmeggins on 2011-02-11 13:36:17 EST ---

Yeah, that's why we buffer access logging by default.  You can use a named pipe script for the access log instead - http://directory.fedoraproject.org/wiki/Named_Pipe_Log_Script
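
To illustrate the named-pipe idea: the server writes its access log into a FIFO, and a small reader on the other end keeps only the most recent lines in memory instead of forcing every line to disk. The C fragment below is a minimal sketch of the core of such a reader. It is not the script at the URL above; line_ring, ring_fill, ring_dump, and the buffer sizes are hypothetical names and values chosen for illustration.

```c
/* Sketch only: remember the newest lines read from a stream such as a
 * named pipe (FIFO). All names here are hypothetical. */
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define RING_LINES   4     /* a real reader would keep far more lines */
#define LINE_MAX_LEN 256   /* longer lines are split across slots */

struct line_ring {
    char   lines[RING_LINES][LINE_MAX_LEN];
    size_t next;    /* slot the next line will overwrite */
    size_t count;   /* number of valid lines, at most RING_LINES */
};

/* Read lines from `in` until EOF, keeping only the newest ones. */
void ring_fill(struct line_ring *r, FILE *in)
{
    char buf[LINE_MAX_LEN];

    assert(r != NULL && in != NULL);
    while (fgets(buf, sizeof buf, in) != NULL) {
        strcpy(r->lines[r->next], buf);       /* buf always fits in a slot */
        r->next = (r->next + 1) % RING_LINES;
        if (r->count < RING_LINES)
            r->count++;
    }
}

/* Write the remembered lines to `out`, oldest first. */
void ring_dump(const struct line_ring *r, FILE *out)
{
    size_t start = (r->next + RING_LINES - r->count) % RING_LINES;

    for (size_t i = 0; i < r->count; i++)
        fputs(r->lines[(start + i) % RING_LINES], out);
}
```

A caller would create the FIFO with mkfifo(3), open it with fopen(path, "r"), call ring_fill() until the writer closes its end, and then ring_dump() the remembered lines to a file or to stderr.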

--- Additional comment from jmates on 2011-02-25 20:10:58 EST ---

Created attachment 481105
gdb trace from 1.2.8 a3 segfault

gdb trace from 1.2.8 a3 segfault added (alpha 3 is harder to segfault than alpha 2).

--- Additional comment from rmeggins on 2011-02-25 21:20:40 EST ---

CentOS 5?  32-bit or 64-bit?  How are you generating the load?  If you are using some scripts or a load client, is it possible we could get a copy?

--- Additional comment from nkinder on 2011-03-08 17:17:09 EST ---

Created attachment 483044
gdb logs

Attachment originally from Jeremy Mates:

gdb traces from two crashes of 389-ds-base-1.2.8-0.2.a2.el5 in response to
replayed production traffic.

--- Additional comment from nkinder on 2011-03-08 17:18:45 EST ---

Created attachment 483046
gdb trace from 1.2.8 a3 segfault

Attachment originally from Jeremy Mates:

gdb trace from 1.2.8 a3 segfault added (alpha 3 is harder to segfault than
alpha 2).

--- Additional comment from rmeggins on 2011-03-09 12:45:02 EST ---

Is it possible you could attach your tcpdump so we could replay it in our dev environment?  If that is not possible, would you be able to install a debug build in your testing environment?

--- Additional comment from jmates on 2011-03-09 14:11:53 EST ---

Sorry, the tcpdump contains student account names and other metadata that cannot be shared. I can easily install a debug build.

--- Additional comment from rmeggins on 2011-03-09 14:21:40 EST ---

Ok.  I'll build you an el5 32-bit package with full debugging enabled.

--- Additional comment from rmeggins on 2011-03-09 15:57:41 EST ---

Ok.  The new rpms are here:
http://rmeggins.fedorapeople.org/

Download the base and the -libs packages; you don't need the other two.  Install them using rpm -ivh (or upgrade using rpm -Uvh).  If/when it crashes and you run gdb, you'll have to use the gdb 'dir' command to tell it where to find the source code, since there is no debuginfo package.

(gdb) dir /usr/src/debug/389-ds-base-VERSION

you can use the older .a4 version of the source.

--- Additional comment from jmates on 2011-03-09 16:53:22 EST ---

Created attachment 483315
gdb trace

Crash from http://rmeggins.fedorapeople.org/ packages plus 389-ds-base-1.2.6.a4.tar.bz2 source tree.

--- Additional comment from rmeggins on 2011-03-09 21:09:40 EST ---

Thanks.  New packages for testing:
http://rmeggins.fedorapeople.org/

try these

--- Additional comment from rmeggins on 2011-03-10 13:10:39 EST ---

Created attachment 483531
0001-use-a-big-lock-in-saslbind.patch

--- Additional comment from rmeggins on 2011-03-10 18:38:29 EST ---

Created attachment 483607
0001-Bug-683250-slapd-crashing-when-traffic-replayed.patch

--- Additional comment from rmeggins on 2011-03-10 18:45:24 EST ---

Created attachment 483608
0001-Bug-683250-slapd-crashing-when-traffic-replayed.patch

missed a couple of places where I needed to Unlock

--- Additional comment from rmeggins on 2011-03-10 18:47:22 EST ---

Created attachment 483609
0001-Bug-683250-slapd-crashing-when-traffic-replayed.patch

have to call unlock before send_result

--- Additional comment from rmeggins on 2011-03-11 16:33:46 EST ---

To ssh://git.fedorahosted.org/git/389/ds.git
   34f2f30..2c8637c  master -> master
commit 2c8637c242ace8a7d61474913c861e336a7809cd
Author: Rich Megginson <rmeggins>
Date:   Wed Mar 9 18:27:05 2011 -0700
    Reviewed by: nkinder (Thanks!)
    Branch: master
    Fix Description: There was a race condition in the saslbind.c code if multiple
    threads and multiple connections were doing gssapi at the same time, with
    different points of failure.  The solution is to increase the size of the
    mutex section in saslbind.c so that all accesses to pb->pb_conn are protected.
    Thanks to Jeremy Mates <jmates> for finding this issue and for his
    assistance in testing.
    Platforms tested: RHEL6 x86_64, Fedora 14 i386
    Flag Day: no
    Doc impact: no
To ssh://git.fedorahosted.org/git/389/ds.git
   cc578f1..fb7547f  389-ds-base-1.2.8 -> 389-ds-base-1.2.8
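
For readers following the fix description, here is a minimal C sketch of the pattern it describes: every access to the shared connection state sits inside one mutex-protected section, every exit path reaches the unlock, and the lock is released before the result is sent. This is not the actual patch. struct conn, handle_sasl_bind, record_result, and last_sent are hypothetical names, and the reason given for unlocking first (the sender may block on I/O or need the lock itself) is an assumption, not something stated in this report.

```c
/* Sketch only: hypothetical types and names, not the real saslbind.c code. */
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

struct conn {
    pthread_mutex_t lock;
    char mech[32];       /* SASL mechanism recorded for this connection */
    int  bind_steps;     /* number of bind steps accepted so far */
};

/* Callback that delivers the result to the client. */
typedef void (*send_result_fn)(struct conn *c, int rc, const char *mech);

int handle_sasl_bind(struct conn *c, const char *mech, send_result_fn send_result)
{
    char mech_copy[sizeof c->mech];
    int rc;

    assert(c != NULL && mech != NULL && send_result != NULL);

    pthread_mutex_lock(&c->lock);
    /* All reads and writes of the shared connection state happen inside
     * this one critical section. */
    if (strlen(mech) >= sizeof c->mech) {
        rc = -1;                      /* error path still reaches the unlock */
        mech_copy[0] = '\0';
    } else {
        strcpy(c->mech, mech);
        c->bind_steps++;
        strcpy(mech_copy, c->mech);   /* private copy for use after unlock */
        rc = 0;
    }
    pthread_mutex_unlock(&c->lock);   /* release before sending the result */

    send_result(c, rc, mech_copy);
    return rc;
}

/* Demonstration sender: records its arguments and whether the connection
 * lock was free when it was called. */
struct sent { int rc; int lock_was_free; char mech[32]; };
struct sent last_sent;

void record_result(struct conn *c, int rc, const char *mech)
{
    last_sent.rc = rc;
    last_sent.lock_was_free = (pthread_mutex_trylock(&c->lock) == 0);
    if (last_sent.lock_was_free)
        pthread_mutex_unlock(&c->lock);
    snprintf(last_sent.mech, sizeof last_sent.mech, "%s", mech);
}
```

Copying the needed fields into locals while the lock is held means nothing after the unlock has to touch the shared structure again.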

Comment 3 Amita Sharma 2011-05-02 14:41:24 UTC
As per Rich "We could never reproduce the crash internally, and we don't have access to the
reporter's private data he used to reproduce the crash.  So I think we can just
run our SASL stress and long duration tests and confirm that this fix did not
introduce any regressions."

Comment 4 errata-xmlrpc 2011-05-19 12:42:39 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0533.html

